Journal Article

Residential Demand Response of Thermostatically Controlled Loads Using Batch Reinforcement Learning

TL;DR: The experiments show that batch RL techniques provide a valuable alternative to model-based controllers and that they can be used to construct both closed-loop and open-loop policies.
Abstract: Driven by recent advances in batch Reinforcement Learning (RL), this paper contributes to the application of batch RL to demand response. In contrast to conventional model-based approaches, batch RL techniques do not require a system identification step, making them more suitable for a large-scale implementation. This paper extends fitted Q-iteration, a standard batch RL technique, to the situation when a forecast of the exogenous data is provided. In general, batch RL techniques do not rely on expert knowledge about the system dynamics or the solution. However, if some expert knowledge is provided, it can be incorporated by using the proposed policy adjustment method. Finally, we tackle the challenge of finding an open-loop schedule required to participate in the day-ahead market. We propose a model-free Monte Carlo method that uses a metric based on the state-action value function or Q-function and we illustrate this method by finding the day-ahead schedule of a heat-pump thermostat. Our experiments show that batch RL techniques provide a valuable alternative to model-based controllers and that they can be used to construct both closed-loop and open-loop policies.

Summary (5 min read)

Introduction

  • The market share of these loads is expected to increase as a result of the electrification of heating and cooling [2], making them an interesting domain for demand response [1], [3]–[5].
  • As a result, different end users are expected to have different model parameters and even different models.
  • O’Neill et al. propose an automated energy management system based on Q-learning that learns how to make optimal decisions for the consumers.

II. BUILDING BLOCKS: MODEL-FREE APPROACH

  • Fig. 1 presents an overview of their model-free learning agent applied to a Thermostatically Controlled Load (TCL), where the gray building blocks correspond to the learning agent.
  • At the start of each day the learning agent uses a batch RL method to construct a control policy for the next day, given a batch of past interactions with its environment.
  • The temperature dynamics of the building are modeled using a second-order equivalent thermal parameter model [32].
  • The operation and settings of the backup controller are assumed to be unknown to the learning agent.
  • This feature extraction step can have two functions, namely to extract hidden state information or to reduce the dimensionality of the state vector.

III. MARKOV DECISION PROCESS FORMULATION

  • This section formulates the decision-making problem of the learning agent as a Markov decision process.
  • The Markov decision process is defined by its d-dimensional state space X ⊂ R^d, its action space U ⊂ R, its stochastic discrete-time transition function f, and its cost function ρ.
  • The Q-function is the expected cumulative return obtained by starting from state x, taking action u, and following h thereafter.
  • The optimal Q-function corresponds to the best Q-function that can be obtained by any policy: Q*(x, u) = min_h Q^h(x, u).

A. State Description

  • Following the notational style of [37], the state space X is spanned by a time-dependent component X_t, a controllable component X_ph, and an uncontrollable exogenous component X_ex: X = X_t × X_ph × X_ex (8).
  • The rationale is that most consumer behavior tends to be repetitive and tends to follow a diurnal pattern.
  • The state description of the uncontrollable exogenous state is split into two components: X_ex = X^ph_ex × X^c_ex (11). However, most physical processes, such as the outside temperature and solar radiation, exhibit a certain degree of autocorrelation, where the outcome of the next state depends on the current state.
  • This work assumes that deterministic forecasts of the exogenous state information related to the cost, x̂^c_ex, and related to the physical dynamics, i.e., outside temperature and solar radiation, x̂^ph_ex, are provided for the time span covering the optimization problem.

B. Backup Controller

  • This paper assumes that each TCL is equipped with an overrule mechanism that guarantees comfort and safety constraints.
  • The settings of the backup function B are unknown to the learning agent, but the resulting action u^ph_k can be measured by the learning agent (see the dashed arrow in Fig. 1).

C. Cost Function

  • In general, RL techniques do not require a description of the cost function.
  • This paper considers two typical cost functions related to demand response.
  • In the dynamic pricing scenario an external price profile is known deterministically at the start of the horizon.
  • The cost of the day-ahead consumption plan should be minimized based on day-ahead prices.
  • In addition, any deviation between the planned consumption and actual consumption should be avoided.

D. Reinforcement Learning for Demand Response

  • When the description of the transition function and cost function is available, techniques that make use of the Markov decision process framework, such as approximate dynamic programming [38] or direct policy search [30], can be used to find near-optimal policies.
  • In their implementation the authors assume that the transition function f , the backup controller B, and the underlying probability of the exogenous information w are unknown.
  • In addition, the authors assume that these are challenging to obtain in a residential setting.
  • For these reasons, the authors present a model-free batch RL approach that builds on previous theoretical work on RL, in particular fitted Q-iteration [23], expert knowledge [39], and the synthesis of artificial trajectories [21].

A. Fitted Q-Iteration Using a Forecast of the Exogenous Data

  • Here the authors demonstrate how fitted Q-iteration [23] can be extended to the situation when a forecast of the exogenous component is provided (Algorithm 1).
  • In the subsequent iterations, Q-values are updated with a value-iteration step based on the Q-function of the previous iteration.
  • It is important to note that x̂′_l denotes the successor state in F, where the observed exogenous information x′^ph_l,ex of the next state is replaced by its forecast x̂^ph_l,ex (line 3 in Algorithm 1).
  • By replacing the observed exogenous parts of the next state by their forecasts, the Q-function of the next state assumes that the exogenous information will follow its forecast.
  • The proposed algorithm is relevant for demand response applications that are influenced by exogenous weather data.

B. Expert Policy Adjustment

  • Given the Q-function from Algorithm 1, a near-optimal policy ĥ∗ can be constructed by solving (6) for every state in the state space.
  • In some cases, e.g., when F contains a limited number of observations, the resulting policy can be improved by using general prior knowledge about its shape.
  • The authors show how expert knowledge about the monotonicity of the policy can be exploited to regularize the policy (a minimal sketch of this idea follows this list).
  • In order to define a convex optimization problem the authors use a fuzzy model with triangular membership functions [30] to approximate the policy.
  • This partitioning leads to Nd state-dependent membership functions for each action.
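A minimal sketch of this policy adjustment idea is given below, assuming a one-dimensional state (the state of charge), triangular membership functions on a uniform grid, and cvxpy as the convex solver; the paper's own formulation covers the full state space and all actions, so this is only an illustration of how a monotonicity constraint can be imposed on a least-squares policy fit, not the authors' exact implementation.

import numpy as np
import cvxpy as cp

def adjust_policy_monotone(x_samples, u_samples, n_basis=10, decreasing=True):
    """Least-squares fit of policy samples with triangular (hat) membership
    functions, subject to a monotonicity constraint on the weights.

    x_samples : 1-D array of states (e.g., state of charge), scaled to [0, 1]
    u_samples : actions proposed by fitted Q-iteration in those states
    decreasing: expert knowledge, e.g., "request less power when the buffer is fuller"
    """
    x_samples = np.asarray(x_samples, dtype=float)
    u_samples = np.asarray(u_samples, dtype=float)
    centers = np.linspace(0.0, 1.0, n_basis)
    width = centers[1] - centers[0]
    # Triangular membership functions: phi_i(x) = max(0, 1 - |x - c_i| / width)
    Phi = np.maximum(0.0, 1.0 - np.abs(x_samples[:, None] - centers[None, :]) / width)

    theta = cp.Variable(n_basis)
    objective = cp.Minimize(cp.sum_squares(Phi @ theta - u_samples))
    # For hat functions on a uniform grid, the fitted policy is monotone
    # exactly when adjacent weights are ordered.
    if decreasing:
        constraints = [theta[i + 1] <= theta[i] for i in range(n_basis - 1)]
    else:
        constraints = [theta[i + 1] >= theta[i] for i in range(n_basis - 1)]
    cp.Problem(objective, constraints).solve()

    def adjusted_policy(x):
        phi = np.maximum(0.0, 1.0 - np.abs(x - centers) / width)
        return float(phi @ theta.value)

    return adjusted_policy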

C. Day-Ahead Consumption Plan

  • This section explains how to construct a day-ahead schedule starting from the Q-function obtained by Algorithm 1 and using cost function (14).
  • Finding a day-ahead schedule has a direct relation to two situations: a day-ahead market, where participants have to submit a day-ahead schedule one day in advance of the actual consumption [35]; a distributed optimization process, where two or more participants are coupled by a common constraint, e.g., congestion management [9].
  • These p sequences can be seen as a proxy of the actual trajectories that could be obtained by simulating the policy on the given control problem.
  • Each new transition is selected by minimizing a distance metric with the previously selected transition.
  • The motivation behind using Q-values instead of the Euclidean distance in X × U is that Q-values capture the dynamics of the system and, therefore, there is no need to select individual weights (a sketch of this trajectory synthesis follows this list).
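The sketch below illustrates the trajectory synthesis under several assumptions: a trained Q-function Q(x, u) is available (e.g., from extended FQI), the Q-based distance is taken as |Q(x, u) − Q(x_l, u_l)|, and the action grid is a made-up discretization of heat-pump power levels. It is a simplified reading of the model-free Monte Carlo method, not the paper's exact Algorithm 2.

import numpy as np

def mfmc_day_ahead_plan(F, Q, x_init, T, p=4, action_grid=np.linspace(0.0, 3.0, 7)):
    """Sketch: chain observed transitions into p artificial trajectories and
    average their consumption into a day-ahead plan.

    F           : list of observed transitions (x, u, x_next, u_ph)
    Q           : fitted Q-function, called as Q(x, u) -> float
    x_init      : state at the start of the planning horizon
    T           : number of control periods in the plan
    action_grid : assumed discretization of U (heat-pump power levels in kW)
    """
    plans = np.zeros((p, T))
    used = set()  # each observed transition is used at most once over all trajectories
    for j in range(p):
        x = x_init
        for k in range(T):
            # action suggested by the Q-function in the current proxy state
            u = min(action_grid, key=lambda a: Q(x, a))
            # pick the closest unused transition under the assumed Q-based metric
            candidates = [i for i in range(len(F)) if i not in used]
            i_best = min(candidates, key=lambda i: abs(Q(x, u) - Q(F[i][0], F[i][1])))
            used.add(i_best)
            _, _, x_next, u_ph = F[i_best]
            plans[j, k] = u_ph     # consumption along this artificial trajectory
            x = x_next             # continue from the observed successor state
    return plans.mean(axis=0)      # day-ahead consumption plan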

V. SIMULATIONS

  • This section presents the simulation results of three experiments and evaluates the performance of the proposed algorithms.
  • The authors focus on two examples of flexible loads, i.e., an electric water heater and a heat-pump thermostat.
  • The rationale behind using extended FQI for a heat-pump thermostat is that the temperature dynamics of a building are influenced by exogenous weather data, which is less the case for an electric water heater.
  • In the second experiment, the authors apply the policy adjustment method to an electric water heater.
  • The final experiment uses the model-free Monte Carlo method to find a day-ahead consumption plan for a heat-pump thermostat.

A. Thermostatically Controlled Loads

  • Here the authors describe the state definition and the settings of the backup controller of the electric water heater and the heatpump thermostat.
  • This work uses feature extraction to reduce the dimensionality of the controllable state space component by replacing it with the average sensor measurement.
  • A detailed description of the nonlinear tank model of the electric water heater and the calculation of its state of charge xsoc can be found in [31].
  • The authors' second application considers a heat-pump thermostat that can measure the indoor air temperature, the outdoor air temperature, and the solar radiation.
  • In the simulation section the authors define minimum and maximum comfort settings of 19 °C and 23 °C.

B. Experiment 1

  • The goal of the first experiment is to compare the performance of Fitted Q-Iteration (standard FQI) [23] to the performance of their extension of FQI (extended FQI), given by Algorithm 1.
  • The objective of the considered heating system is to minimize the electricity cost of the heat pump by responding to an external price signal.
  • The extended FQI controller uses the forecast values of the outside temperature and solar radiation to construct the next states in the batch, x̂′_l ← (x′^q_l,t, T′_l,in, T̂′_l,out, Ŝ′_l), where (̂·) denotes a forecast.
  • Since more interactions result in a better coverage of the state-action space, the exploration probability ε_d is decreased on a daily basis, according to the harmonic sequence 1/d^n, where n is set to 0.7 and d denotes the current day.
  • In order to compare the performance of the FQI controllers, the authors define the following metric: M = (c_fqi − c_d)/(c_o − c_d) (24), where c_fqi denotes the daily cost of the FQI controller, c_d the daily cost of the default controller, and c_o the daily cost of the optimal controller (a short snippet restating these formulas follows this list).
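For concreteness, the exploration decay and the performance metric used in this experiment can be restated in code; the snippet below only re-expresses the formulas from the bullets above and adds nothing from outside this summary.

def exploration_probability(d, n=0.7):
    """Harmonic decay of the exploration rate: eps_d = 1 / d**n for day d >= 1."""
    return 1.0 / d**n

def performance_metric(c_fqi, c_default, c_optimal):
    """M = (c_fqi - c_d) / (c_o - c_d): M = 1 means the FQI controller matches the
    optimal controller, M = 0 means it is no better than the default controller."""
    return (c_fqi - c_default) / (c_optimal - c_default)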

C. Experiment 2

  • The following experiment demonstrates the policy adjustment method for an electric water heater.
  • As stated in (19), the state space of an electric water heater consists of two dimensions, i.e., the time component and the average temperature.
  • These monotonicity constraints are added to the least-squares problem (15).
  • The original policies obtained with fitted Q-iteration after 7, 14, and 21 days are presented in the left column of the corresponding figure.
  • The adjusted policies obtained by the policy adjustment method are depicted in the right column.

D. Experiment 3

  • The final experiment demonstrates the Model-Free Monte Carlo (MFMC) method (Algorithm 2) to find the day-ahead consumption plan of a heat-pump thermostat.
  • The resulting Q-function is then used as a metric to build p distinct artificial trajectories (line 8 of Algorithm 2).
  • A day-ahead consumption plan was obtained by taking the average of these 4 artificial trajectories.
  • Moreover, the authors define a performance metric M = c_MFMC / c_o, where c_MFMC is the daily cost of the MFMC method and c_o is the daily cost of the optimal model-based controller.
  • The top left plot depicts the daily metric M of their MFMC method, where the metric of the optimal model-based controller corresponds to 1.

VI. CONCLUSION

  • Driven by the challenges presented by the system identification step of model-based controllers, this paper has contributed to the application of model-free batch Reinforcement Learning (RL) to demand response.
  • In addition, the authors have presented an expert policy adjustment method that can exploit available expert knowledge about the shape of the policy.
  • The authors build upon the theoretical work of [16], [23], and [24], and they demonstrate how fitted Q-iteration can be adapted to work in a demand response setting.
  • The intent of the experiment is to compare fitted Q-iteration with two other approaches, namely Q-learning and the optimal solution.
  • They demonstrate that the lower and upper bounds converge at least linearly towards the true return of the policy, when the size of the batch grows, i.e., new tuples are added to the batch.


Delft University of Technology
Published in: IEEE Transactions on Smart Grid, 8(5), 2149–2159 (2017). Document version: accepted author manuscript. DOI: 10.1109/TSG.2016.2517211
Citation (APA): Ruelens, F., Claessens, B. J., Vandael, S., De Schutter, B., Babuška, R., & Belmans, R. (2017). Residential demand response of thermostatically controlled loads using batch Reinforcement Learning. IEEE Transactions on Smart Grid, 8(5), 2149–2159. https://doi.org/10.1109/TSG.2016.2517211

Residential Demand Response of Thermostatically Controlled Loads Using Batch Reinforcement Learning

Frederik Ruelens, Bert J. Claessens, Stijn Vandael, Bart De Schutter, Senior Member, IEEE, Robert Babuška, and Ronnie Belmans, Fellow, IEEE
Abstract—Driven by recent advances in batch Reinforcement Learning (RL), this paper contributes to the application of batch RL to demand response. In contrast to conventional model-based approaches, batch RL techniques do not require a system identification step, making them more suitable for a large-scale implementation. This paper extends fitted Q-iteration, a standard batch RL technique, to the situation when a forecast of the exogenous data is provided. In general, batch RL techniques do not rely on expert knowledge about the system dynamics or the solution. However, if some expert knowledge is provided, it can be incorporated by using the proposed policy adjustment method. Finally, we tackle the challenge of finding an open-loop schedule required to participate in the day-ahead market. We propose a model-free Monte Carlo method that uses a metric based on the state-action value function or Q-function and we illustrate this method by finding the day-ahead schedule of a heat-pump thermostat. Our experiments show that batch RL techniques provide a valuable alternative to model-based controllers and that they can be used to construct both closed-loop and open-loop policies.

Index Terms—Batch reinforcement learning, demand response, electric water heater, fitted Q-iteration, heat pump.
Manuscript received April 3, 2015; revised July 2, 2015 and September 29, 2015; accepted December 7, 2015. This work was supported by the Flemish Institute for the Promotion of Scientific and Technological Research in Industry (IWT). Paper no. TSG-00382-2015.
F. Ruelens, S. Vandael, and R. Belmans are with the Department of Electrical Engineering, Katholieke Universiteit Leuven/EnergyVille, Leuven 3000, Belgium (e-mail: frederik.ruelens@esat.kuleuven.be).
B. J. Claessens is with the Energy Department, Flemish Institute for Technological Research, Mol 2400, Belgium.
B. De Schutter and R. Babuška are with the Delft Center for Systems and Control, Delft University of Technology, Delft 2600 AA, The Netherlands.
Digital Object Identifier 10.1109/TSG.2016.2517211

I. INTRODUCTION

THE INCREASING share of renewable energy sources introduces the need for flexibility on the demand side of the electricity system [1]. A prominent example of loads that offer flexibility at the residential level are thermostatically controlled loads, such as heat pumps, air conditioning units, and electric water heaters. These loads represent about 20% of the total electricity consumption at the residential level in the United States [2]. The market share of these loads is expected to increase as a result of the electrification of heating and cooling [2], making them an interesting domain for demand response [1], [3]–[5]. Demand response programs offer demand flexibility by motivating end users to adapt their consumption profile in response to changing electricity prices or other grid signals. The forecast uncertainty of renewable energy sources [6], combined with their limited controllability, has made demand response the topic of an extensive number of research projects [1], [7], [8] and scientific papers [3], [5], [9]–[11]. The traditional control paradigm defines the demand response problem as a model-based control problem [3], [7], [9], requiring a model of the demand response application, an optimizer, and a forecasting technique. A critical step in setting up a model-based controller includes selecting accurate models and estimating the model parameters. This step becomes more challenging considering the heterogeneity of the end users and their different patterns of behavior. As a result, different end users are expected to have different model parameters and even different models. As such, a large-scale implementation of model-based controllers requires a stable and robust approach that is able to identify the appropriate model and the corresponding model parameters. A detailed report of the implementation issues of a model predictive control strategy applied to the heating system of a building can be found in [12]. Moreover, the authors of [3] and [13] demonstrate a successful implementation of a model predictive control approach at an aggregated level to control a heterogeneous cluster of thermostatically controlled loads.

In contrast, Reinforcement Learning (RL) [14], [15] is a model-free technique that requires no system identification step and no a priori knowledge. Recent developments in the field of reinforcement learning show that RL techniques can either replace or supplement model-based techniques [16]. A number of recent papers provide examples of how a popular RL method, Q-learning [14], can be used for demand response [4], [10], [17], [18]. For example, in [10] O’Neill et al. propose an automated energy management system based on Q-learning that learns how to make optimal decisions for the consumers. In [17], Henze and Schoenmann investigate the potential of Q-learning for the operation of commercial cold stores, and in [4] Kara et al. use Q-learning to control a cluster of thermostatically controlled loads. Similarly, in [19] Liang et al. propose a Q-learning approach to minimize the electricity cost of the flexible demand and the disutility of the user. Furthermore, inspired by [20], Lee and Powell propose a bias-corrected form of Q-learning to operate battery charging in the presence of volatile prices [18].

Fig. 1. Building blocks of a model-free Reinforcement Learning (RL) agent (gray) applied to a Thermostatically Controlled Load (TCL).
While being a popular method, one of the fundamental drawbacks of Q-learning is its inefficient use of data, given that Q-learning discards the current data sample after every update. As a result, more observations are needed to propagate already known information through the state space. In order to overcome this drawback, batch RL techniques [21]–[24] can be used. In batch RL, a controller estimates a control policy based on a batch of experiences. These experiences can be a fixed set [23] or can be gathered online by interacting with the environment [25]. Given that batch RL algorithms can reuse past experiences, they converge faster compared to standard temporal difference methods like Q-learning [26] or SARSA [27]. This makes batch RL techniques suitable for practical implementations, such as demand response. For example, the authors of [28] combine Q-learning with eligibility traces in order to learn the consumer and time preferences of demand response applications. In [5], the authors use a batch RL technique to schedule a cluster of electric water heaters, and in [29] Vandael et al. use a batch RL technique to find a day-ahead consumption plan of a cluster of electric vehicles. An excellent overview of batch RL methods can be found in [25] and [30].

Inspired by the recent developments in batch RL, in particular fitted Q-iteration by Ernst et al. [16], this paper builds upon the existing batch RL literature and contributes to the application of batch RL techniques to residential demand response. The contributions of our paper can be summarized as follows: (1) we demonstrate how fitted Q-iteration can be extended to the situation when a forecast of the exogenous data is provided; (2) we propose a policy adjustment method that exploits general expert knowledge about monotonicity conditions of the control policy; (3) we introduce a model-free Monte Carlo method to find a day-ahead consumption plan by making use of a novel metric based on Q-values.

This paper is structured as follows: Section II defines the building blocks of our batch RL approach applied to demand response. Section III formulates the problem as a Markov decision process. Section IV describes our model-free batch RL techniques for demand response. Section V demonstrates the presented techniques in a realistic demand response setting. To conclude, Section VI summarizes the results and discusses further research.
II. BUILDING BLOCKS: MODEL-FREE APPROACH

Fig. 1 presents an overview of our model-free learning agent applied to a Thermostatically Controlled Load (TCL), where the gray building blocks correspond to the learning agent.

At the start of each day the learning agent uses a batch RL method to construct a control policy for the next day, given a batch of past interactions with its environment. The learning agent needs no a priori information on the model dynamics and considers its environment as a black box. Nevertheless, if a model of the exogenous variables, e.g., a forecast of the outside temperature, or a reward model is provided, the batch RL method can use this information to enrich its batch. Once a policy is found, an expert policy adjustment method can be used to shape the policy obtained with the batch RL method. During the day, the learning agent uses an exploration-exploitation strategy to interact with its environment and to collect new transitions that are added systematically to the given batch.

In this paper, the proposed learning agent is applied to two types of TCLs. The first type is a residential electric water heater with a stochastic hot-water demand. The dynamic behavior of the electric water heater is modeled using a nonlinear stratified thermal tank model as described in [31]. Our second TCL is a heat-pump thermostat for a residential building. The temperature dynamics of the building are modeled using a second-order equivalent thermal parameter model [32]. This model describes the temperature dynamics of the indoor air and of the building envelope. However, to develop a realistic implementation, this paper assumes that the learning agent cannot measure the temperature of the building envelope and considers it as a hidden state variable.

In addition, we assume that both TCLs are equipped with a backup controller that guarantees the comfort and safety settings of its users. The backup controller is a built-in overrule mechanism that turns the TCL 'on' or 'off' depending on the current state and a predefined switching logic. The operation and settings of the backup controller are assumed to be unknown to the learning agent. However, the learning agent can measure the overrule action of the backup controller (see the dashed arrow in Fig. 1).

The observable state information contains sensory input data of the state of the process and its environment. Before this information is sent to the batch RL algorithm, the learning agent can apply feature extraction [33]. This feature extraction step can have two functions, namely to extract hidden state information or to reduce the dimensionality of the state vector. For example, in the case of a heat-pump thermostat, this step can be used to extract a feature that represents the

hidden state information, e.g., the temperature of the building envelope. Alternatively, a feature extraction mapping can be used to find a low-dimensional representation of the sensory input data. For example, in the case of an electric water heater, the observed state vector consists of the temperature sensors that are installed along the hull of the buffer tank. When the number of temperature sensors is large, it can be interesting to map this high-dimensional state vector to a low-dimensional feature vector. This mapping can be the result of an auto-encoder network or principal component analysis. For example, in [34] Curran et al. indicate that when only a limited number of observations are available, a mapping to a low-dimensional state space can improve the convergence of the learning algorithm.
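As an illustration of such a feature-extraction mapping, the sketch below reduces a vector of tank-temperature sensors either to its average (the low-dimensional feature used for the electric water heater later in the paper) or to a few principal components; the use of scikit-learn's PCA and the shape of the sensor data are assumptions made here for illustration.

import numpy as np
from sklearn.decomposition import PCA

def average_temperature_feature(sensor_temps):
    """Replace the vector of tank-temperature sensors by its average, i.e., the
    low-dimensional controllable feature used for the electric water heater."""
    return float(np.mean(sensor_temps))

def fit_pca_features(sensor_history, n_components=2):
    """Alternative: learn a low-dimensional representation of the sensor vector
    with principal component analysis (an auto-encoder network could be used
    instead). `sensor_history` has one observed sensor vector per row."""
    pca = PCA(n_components=n_components).fit(sensor_history)
    return pca  # later: pca.transform(new_sensor_vector.reshape(1, -1))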
In this paper, the learning agent is applied to two relevant demand response business models: dynamic pricing and day-ahead scheduling [1], [35]. In dynamic pricing, the learning agent learns a control policy that minimizes its electricity cost by adapting its consumption profile in response to an external price signal. The solution of this optimal control problem is a closed-loop control policy that is a function of the current measurement of the state. The second business model relates to the participation in the day-ahead market. The learning agent constructs a day-ahead consumption plan and then tries to follow it during the day. The objective of the learning agent is to minimize its cost in the day-ahead market and to minimize any deviation between the day-ahead consumption plan and the actual consumption. In contrast to the solution of the dynamic pricing scheme, the day-ahead consumption plan is a feed-forward plan for the next day, i.e., an open-loop policy, which does not depend on future measurements of the state.
III. MARKOV DECISION PROCESS FORMULATION

This section formulates the decision-making problem of the learning agent as a Markov decision process. The Markov decision process is defined by its d-dimensional state space X ⊂ R^d, its action space U ⊂ R, its stochastic discrete-time transition function f, and its cost function ρ. The optimization horizon is considered finite, comprising T ∈ N \ {0} steps, where at each discrete time step k, the state evolves as follows:

x_{k+1} = f(x_k, u_k, w_k)   ∀k ∈ {1, ..., T − 1},   (1)

with w_k a realization of a random disturbance drawn from a conditional probability distribution p_W(·|x_k), u_k ∈ U the control action, and x_k ∈ X the state. Associated with each state transition, a cost c_k is given by:

c_k = ρ(x_k, u_k, w_k)   ∀k ∈ {1, ..., T}.   (2)

The goal is to find an optimal control policy h* : X → U that minimizes the expected T-stage return for any state in the state space. The expected T-stage return starting from x_1 and following a policy h is defined as follows:

J^h_T(x_1) = E_{w_k ∼ p_W(·|x_k)} [ Σ_{k=1}^{T} ρ(x_k, h(x_k), w_k) ].   (3)

A convenient way to characterize the policy h is by using a state-action value function or Q-function:

Q^h(x, u) = E_{w ∼ p_W(·|x)} [ ρ(x, u, w) + γ J^h_T(f(x, u, w)) ],   (4)

where γ ∈ (0, 1) is the discount factor. The Q-function is the expected cumulative return obtained by starting from state x, taking action u, and following h thereafter.

The optimal Q-function corresponds to the best Q-function that can be obtained by any policy:

Q*(x, u) = min_h Q^h(x, u).   (5)

Starting from an optimal Q-function for every state-action pair, the optimal policy is calculated as follows:

h*(x) ∈ arg min_{u ∈ U} Q*(x, u),   (6)

where Q* satisfies the Bellman optimality equation [36]:

Q*(x, u) = E_{w ∼ p_W(·|x)} [ ρ(x, u, w) + γ min_{u′ ∈ U} Q*(f(x, u, w), u′) ].   (7)

The next three paragraphs give a formal description of the state space, the backup controller, and the cost function tailored to demand response.
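To make (1)–(3) concrete, the sketch below estimates the expected T-stage return of a given policy by simulating a black-box environment; the env.step and policy interfaces are hypothetical stand-ins introduced here, since the paper treats the transition and cost functions as unknown to the agent.

import numpy as np

def estimate_T_stage_return(env, policy, x1, T, n_rollouts=100):
    """Monte Carlo estimate of the expected T-stage return J_T^h(x1) of eq. (3).

    `env.step(x, u)` is a hypothetical black-box interface returning
    (x_next, cost), i.e., it hides the transition function f of eq. (1) and the
    cost function rho of eq. (2); `policy(x)` returns a control action h(x).
    """
    returns = []
    for _ in range(n_rollouts):
        x, total = x1, 0.0
        for k in range(T):
            u = policy(x)
            x, cost = env.step(x, u)   # one realization of the disturbance w_k
            total += cost
        returns.append(total)
    return float(np.mean(returns))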
A. State Description

Following the notational style of [37], the state space X is spanned by a time-dependent component X_t, a controllable component X_ph, and an uncontrollable exogenous component X_ex:

X = X_t × X_ph × X_ex.   (8)

1) Timing: The time-dependent component X_t describes the part of the state space related to timing, i.e., it carries timing information that is relevant for the dynamics of the system:

X_t = X^q_t × X^d_t   with   X^q_t = {1, ..., 96},   X^d_t = {1, ..., 7},   (9)

where x^q_t ∈ X^q_t denotes the quarter in the day and x^d_t ∈ X^d_t denotes the day in the week. The rationale is that most consumer behavior tends to be repetitive and tends to follow a diurnal pattern.

2) Physical Representation: The controllable component X_ph represents the physical state information related to the quantities that are measured locally and that are influenced by the control actions, e.g., the indoor air temperature or the state of charge of an electric water heater:

x_ph ∈ X_ph   with   x̲_ph < x_ph < x̄_ph,   (10)

where x̲_ph and x̄_ph denote the lower and upper bounds, set to guarantee the comfort and safety of the end user.

3) Exogenous Information: The state description of the uncontrollable exogenous state is split into two components:

X_ex = X^ph_ex × X^c_ex.   (11)

When the random disturbance w_{k+1} is independent of w_k, there is no need to include exogenous variables in the state space. However, most physical processes, such as the outside temperature and solar radiation, exhibit a certain degree of autocorrelation, where the outcome of the next state depends on the current state. For this reason we include an exogenous component x^ph_ex ∈ X^ph_ex in our state space description. This exogenous component is related to the observable exogenous information that has an impact on the physical dynamics and that cannot be influenced by the control actions, e.g., the outside temperature.

The second exogenous component x^c_ex ∈ X^c_ex has no direct influence on the dynamics, but contains information to calculate the cost c_k. This work assumes that a deterministic forecast of the exogenous state information related to the cost, x̂^c_ex, and related to the physical dynamics, i.e., outside temperature and solar radiation, x̂^ph_ex, is provided for the time span covering the optimization problem.
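As an illustration of the state definition (8)–(11), the following sketch assembles a state vector for a heat-pump thermostat at quarter-hour resolution; the field names and units are illustrative choices made here, not notation from the paper.

from dataclasses import dataclass

@dataclass
class ThermostatState:
    quarter: int       # x_t^q in {1, ..., 96}: quarter-hour of the day
    day: int           # x_t^d in {1, ..., 7}: day of the week
    t_indoor: float    # x_ph: indoor air temperature (controllable component)
    t_outdoor: float   # x_ex^ph: outside temperature (exogenous, affects the dynamics)
    price: float       # x_ex^c: electricity price (exogenous, affects only the cost)

    def as_vector(self):
        """Flatten the state for the regression step of a batch RL algorithm."""
        return [self.quarter, self.day, self.t_indoor, self.t_outdoor, self.price]

# Example: 10:15-10:30 on a Tuesday, 20.5 degC indoors, 4 degC outside, 0.22 EUR/kWh
x = ThermostatState(quarter=42, day=2, t_indoor=20.5, t_outdoor=4.0, price=0.22)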
B. Backup Controller

This paper assumes that each TCL is equipped with an overrule mechanism that guarantees comfort and safety constraints. The backup function B : X × U → U^ph maps the requested control action u_k ∈ U taken in state x_k to a physical control action u^ph_k ∈ U^ph:

u^ph_k = B(x_k, u_k).   (12)

The settings of the backup function B are unknown to the learning agent, but the resulting action u^ph_k can be measured by the learning agent (see the dashed arrow in Fig. 1).
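A minimal sketch of what such a backup function B could look like for a heat-pump thermostat is given below; the switching logic and the 3 kW power limit are illustrative assumptions (the paper deliberately leaves B unknown to the learning agent), while the 19–23 °C band matches the comfort settings used later in the simulations.

def backup_controller(t_indoor, u_requested, t_min=19.0, t_max=23.0, p_max=3.0):
    """Overrule mechanism B(x, u) -> u_ph of eq. (12), sketched for a heat pump.

    The requested power u_requested (kW) is overruled whenever the indoor
    temperature leaves the comfort band [t_min, t_max]; otherwise it is only
    clipped to the physical power range. All settings here are hypothetical.
    """
    if t_indoor < t_min:
        return p_max        # too cold: switch the heat pump on at full power
    if t_indoor > t_max:
        return 0.0          # too warm: switch the heat pump off
    return min(max(u_requested, 0.0), p_max)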
C. Cost Function

In general, RL techniques do not require a description of the cost function. However, for most demand response business models a cost function is available. This paper considers two typical cost functions related to demand response.

1) Dynamic Pricing: In the dynamic pricing scenario an external price profile is known deterministically at the start of the horizon. The cost function is described as:

c_k = u^ph_k x̂^c_{k,ex} Δt,   (13)

where x̂^c_{k,ex} is the electricity price at time step k and Δt is the length of a control period.

2) Day-Ahead Scheduling: The objective of the second business case is to determine a day-ahead consumption plan and to follow this plan during operation. The cost of the day-ahead consumption plan should be minimized based on day-ahead prices. In addition, any deviation between the planned consumption and the actual consumption should be avoided. As such, the cost function can be written as:

c_k = u_k x̂^c_{k,ex} Δt + α | u_k Δt − u^ph_k Δt |,   (14)

where u_k is the planned consumption, u^ph_k is the actual consumption, x̂^c_{k,ex} is the forecasted day-ahead price, and α > 0 is a penalty. The first part of (14) is the cost for buying energy at the day-ahead market, whereas the second part penalizes any deviation between the planned consumption and the actual consumption.
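The two cost functions translate directly into code; the sketch below assumes power in kW, Δt in hours, and prices in EUR/kWh, which are unit choices made here for illustration only.

def cost_dynamic_pricing(u_ph, price, dt):
    """Eq. (13): cost of the physically consumed power u_ph at the external price."""
    return u_ph * price * dt

def cost_day_ahead(u_planned, u_ph, price_day_ahead, dt, alpha):
    """Eq. (14): day-ahead energy cost plus a penalty (alpha > 0) on any deviation
    between the planned and the actually consumed energy."""
    return u_planned * price_day_ahead * dt + alpha * abs(u_planned * dt - u_ph * dt)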
D. Reinforcement Learning for Demand Response

When the description of the transition function and cost function is available, techniques that make use of the Markov decision process framework, such as approximate dynamic programming [38] or direct policy search [30], can be used to find near-optimal policies. However, in our implementation we assume that the transition function f, the backup controller B, and the underlying probability of the exogenous information w are unknown. In addition, we assume that they are challenging to obtain in a residential setting. For these reasons, we present a model-free batch RL approach that builds on previous theoretical work on RL, in particular fitted Q-iteration [23], expert knowledge [39], and the synthesis of artificial trajectories [21].
IV. ALGORITHMS

Typically, batch RL techniques construct policies based on a batch of tuples of the form F = {(x_l, u_l, x′_l, c_l)}_{l=1}^{#F}, where x_l = (x^q_{l,t}, x^d_{l,t}, x_{l,ph}, x^ph_{l,ex}) denotes the state at time step l and x′_l denotes the state at time step l + 1. However, for most demand response applications, the cost function ρ is given a priori, and of the form ρ(x̂^c_{l,ex}, u^ph_l). As such, this paper considers tuples of the form (x_l, u_l, x′_l, u^ph_l).
A. Fitted Q-Iteration Using a Forecast of the Exogenous Data

Here we demonstrate how fitted Q-iteration [23] can be extended to the situation when a forecast of the exogenous component is provided (Algorithm 1).

Algorithm 1 Fitted Q-Iteration Using a Forecast of the Exogenous Data (Extended FQI)
Input: F = {(x_l, u_l, x′_l, u^ph_l)}_{l=1}^{#F}, {(x̂^ph_{l,ex}, x̂^c_{l,ex})}_{l=1}^{#F}, T
 1: let Q̂_0 be zero everywhere on X × U
 2: for l = 1, ..., #F do
 3:   x̂′_l ← (x′^q_{l,t}, x′^d_{l,t}, x′_{l,ph}, x̂^ph_{l,ex})   ▷ replace the observed exogenous part of the next state by its forecast x̂^ph_{l,ex}
 4: end for
 5: for N = 1, ..., T do
 6:   for l = 1, ..., #F do
 7:     c_l ← ρ(x̂^c_{l,ex}, u^ph_l)
 8:     Q_{N,l} ← c_l + min_{u ∈ U} Q̂_{N−1}(x̂′_l, u)
 9:   end for
10:   use regression to obtain Q̂_N from T_reg = {((x_l, u_l), Q_{N,l}), l = 1, ..., #F}
11: end for
Ensure: Q̂* = Q̂_T
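Algorithm 1 maps naturally onto an off-the-shelf regressor; the sketch below is a minimal Python rendering that uses scikit-learn's ExtraTreesRegressor, a discretized action set, and the dynamic-pricing cost (13), all of which are assumptions made here rather than choices stated in this excerpt.

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def extended_fqi(F, forecasts, actions, T):
    """Sketch of Algorithm 1 (extended fitted Q-iteration).

    F         : list of tuples (x, u, x_next, u_ph), with states as 1-D arrays
    forecasts : list of tuples (x_hat_ph_ex, x_hat_c_ex) aligned with F
    actions   : discretized action set U (an assumption; the excerpt leaves U abstract)
    T         : optimization horizon, i.e., the number of Q-iterations
    """
    # Line 3: successor states whose exogenous part is replaced by its forecast.
    X_hat_next = []
    for (x, u, x_next, u_ph), (x_hat_ph_ex, _) in zip(F, forecasts):
        x_hat = np.array(x_next, dtype=float)
        x_hat[-1] = x_hat_ph_ex            # assumed layout: exogenous part stored last
        X_hat_next.append(x_hat)

    XU = np.array([np.append(x, u) for (x, u, _, _) in F])
    Q = None                               # Q_0 is zero everywhere on X x U
    for _ in range(T):
        targets = []
        for (x, u, x_next, u_ph), (_, price_hat), x_hat in zip(F, forecasts, X_hat_next):
            c = u_ph * price_hat           # line 7: cost evaluated with the forecasted price, as in (13)
            if Q is None:
                q_next = 0.0               # first iteration: Q_0 = 0
            else:
                q_next = min(Q.predict([np.append(x_hat, a)])[0] for a in actions)
            targets.append(c + q_next)     # line 8
        Q = ExtraTreesRegressor(n_estimators=50).fit(XU, targets)   # line 10: regression
    return Q                               # approximate Q*, used to derive the policy via (6)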

Citations
Posted Content
TL;DR: This work discusses core RL elements, including value function, in particular, Deep Q-Network (DQN), policy, reward, model, planning, and exploration, and important mechanisms for RL, including attention and memory, unsupervised learning, transfer learning, multi-agent RL, hierarchical RL, and learning to learn.
Abstract: We give an overview of recent exciting achievements of deep reinforcement learning (RL). We discuss six core elements, six important mechanisms, and twelve applications. We start with background of machine learning, deep learning and reinforcement learning. Next we discuss core RL elements, including value function, in particular, Deep Q-Network (DQN), policy, reward, model, planning, and exploration. After that, we discuss important mechanisms for RL, including attention and memory, unsupervised learning, transfer learning, multi-agent RL, hierarchical RL, and learning to learn. Then we discuss various applications of RL, including games, in particular, AlphaGo, robotics, natural language processing, including dialogue systems, machine translation, and text generation, computer vision, neural architecture design, business management, finance, healthcare, Industry 4.0, smart grid, intelligent transportation systems, and computer systems. We mention topics not reviewed yet, and list a collection of RL resources. After presenting a brief summary, we close with discussions. Please see Deep Reinforcement Learning, arXiv:1810.06339, for a significant update.

935 citations

Journal ArticleDOI
TL;DR: In this paper, a review of the use of reinforcement learning for demand response applications in the smart grid is presented, and the authors identify a need to further explore reinforcement learning to coordinate multi-agent systems that can participate in demand response programs under demand-dependent electricity prices.

429 citations

Journal ArticleDOI
TL;DR: Deep learning, reinforcement learning, and their combination, deep reinforcement learning, are representative and relatively mature methods in the family of AI 2.0; the article summarizes their potential for application in smart grids and provides an overview of the research work on their application.
Abstract: Smart grids are the developmental trend of power systems and they have attracted much attention all over the world. Due to their complexities, and the uncertainty of the smart grid and high volume of information being collected, artificial intelligence techniques represent some of the enabling technologies for its future development and success. Owing to the decreasing cost of computing power, the profusion of data, and better algorithms, AI has entered into its new developmental stage and AI 2.0 is developing rapidly. Deep learning (DL), reinforcement learning (RL) and their combination-deep reinforcement learning (DRL) are representative methods and relatively mature methods in the family of AI 2.0. This article introduces the concept and status quo of the above three methods, summarizes their potential for application in smart grids, and provides an overview of the research work on their application in smart grids.

322 citations

Journal ArticleDOI
TL;DR: Simulation results show that this proposed DR algorithm can promote SP profitability, reduce energy costs for CUs, balance energy supply and demand in the electricity market, and improve the reliability of electric power systems, which can be regarded as a win-win strategy for both SP and CUs.

312 citations

References
Book
01 Jan 1988
TL;DR: This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, ranging from the history of the field's intellectual foundations to the most recent developments and applications.
Abstract: Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives when interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the key ideas and algorithms of reinforcement learning. Their discussion ranges from the history of the field's intellectual foundations to the most recent developments and applications. The only necessary mathematical background is familiarity with elementary concepts of probability. The book is divided into three parts. Part I defines the reinforcement learning problem in terms of Markov decision processes. Part II provides basic solution methods: dynamic programming, Monte Carlo methods, and temporal-difference learning. Part III presents a unified view of the solution methods and incorporates artificial neural networks, eligibility traces, and planning; the two final chapters present case studies and consider the future of reinforcement learning.

37,989 citations


"Residential Demand Response of Ther..." refers background or methods in this paper

  • ...For example, the authors of [28] combine Q-learning with eligibility traces in order to learn the consumer and time preferences of demand response applications....


  • ...how a popular RL method, Q-learning [14], can be used for demand response [4], [10], [17], [18]....


  • ...In contrast, Reinforcement Learning (RL) [14], [15] is...


  • ...A number of recent papers provide examples of how a popular RL method, Q-learning [14], can be used for demand response [4], [10], [17], [18]....


  • ...In this experiment, we implemented Q-learning [26] and we applied it to a heat-pump thermostat with a time-varying price profile as described in Section V....


Journal ArticleDOI
28 Jul 2006-Science
TL;DR: In this article, an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data is described.
Abstract: High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. Gradient descent can be used for fine-tuning the weights in such "autoencoder" networks, but this works well only if the initial weights are close to a good solution. We describe an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data.

16,717 citations

Book
21 Oct 1957
TL;DR: The more the authors study the information processing aspects of the mind, the more perplexed and impressed they become, and it will be a very long time before they understand these processes sufficiently to reproduce them.
Abstract: From the Publisher: An introduction to the mathematical theory of multistage decision processes, this text takes a functional equation approach to the discovery of optimum policies. Written by a leading developer of such policies, it presents a series of methods, uniqueness and existence theorems, and examples for solving the relevant equations. The text examines existence and uniqueness theorems, the optimal inventory equation, bottleneck problems in multistage production processes, a new formalism in the calculus of variation, strategies behind multistage games, and Markovian decision processes. Each chapter concludes with a problem set that Eric V. Denardo of Yale University, in his informative new introduction, calls a rich lode of applications and research topics. 1957 edition. 37 figures.

14,187 citations

Journal ArticleDOI
TL;DR: This paper presents and proves in detail a convergence theorem for Q-learning based on that outlined in Watkins (1989), showing that Q-learning converges to the optimum action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action-values are represented discretely.
Abstract: Q-learning (Watkins, 1989) is a simple way for agents to learn how to act optimally in controlled Markovian domains. It amounts to an incremental method for dynamic programming which imposes limited computational demands. It works by successively improving its evaluations of the quality of particular actions at particular states. This paper presents and proves in detail a convergence theorem for Q-learning based on that outlined in Watkins (1989). We show that Q-learning converges to the optimum action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action-values are represented discretely. We also sketch extensions to the cases of non-discounted, but absorbing, Markov environments, and where many Q values can be changed each iteration, rather than just one.

8,450 citations

Journal ArticleDOI
TL;DR: Sequence alignment methods often use something called a 'dynamic programming' algorithm, which can be a good idea or a bad idea, depending on the method used.
Abstract: Sequence alignment methods often use something called a 'dynamic programming' algorithm. What is dynamic programming and how does it work?

5,348 citations

Frequently Asked Questions (12)
Q1. What is the way to find near-optimal policies?

When the description of the transition function and cost function is available, techniques that make use of the Markov decision process framework, such as approximate dynamic programming [38] or direct policy search [30], can be used to find near-optimal policies. 

Driven by recent advances in batch Reinforcement Learning (RL), this paper contributes to the application of batch RL to demand response. This paper extends fitted Q-iteration, a standard batch RL technique, to the situation when a forecast of the exogenous data is provided. The authors propose a model-free Monte Carlo method that uses a metric based on the state-action value function or Q-function and they illustrate this method by finding the day-ahead schedule of a heat-pump thermostat.

In a final experiment, spanning 100 days, the authors have successfully tested this method to find the day-ahead consumption plan of a residential heat pump. Their future research in this area will focus on employing the presented algorithms in a realistic lab environment. The authors are currently testing the expert policy adjustment method on a converted electric water heater and an air conditioning unit with promising results. The preliminary findings of the lab experiments indicate that the expert policy adjustment and extended fitted Q-iteration can be successfully used in a real-world demand response setting. 

In this paper, the learning agent is applied to two relevant demand response business models: dynamic pricing and dayahead scheduling [1], [35]. 

The method enforces monotonicity conditions by using convex optimization to approximate the policy, where expert knowledge is included in the form of extra constraints. 

A final challenge for model-free batch RL techniques is that of finding a consumption plan for the next day, i.e., an open-loop solution. 

The results indicate that when the number of tuples in F is small, the expert policy adjustment method can be used to improve the performance of standard fitted Q-iteration.

The results of an experiment with an electric water heater have indicated that the policy adjustment method was able to reduce the cost objective by 11% compared to fitted Q-iteration without expert knowledge. 

In [21], Fonteneau et al. propose the following distance metric in X × U: ((x, u), (x′, u′)) ↦ ‖x − x′‖ + ‖u − u′‖, where ‖·‖ denotes the Euclidean norm.

The observed state information of both FQI controllers is defined by (22), where a handcrafted feature is used to represent the temperature of the building envelope (nr set to 3). 

When the number of temperature sensors is large, it can be interesting to map this high-dimensional state vector to a low-dimensional feature vector.

Batch RL techniques construct policies based on a batch of tuples of the form F = {(x_l, u_l, x′_l, c_l)}_{l=1}^{#F}, where x_l = (x^q_{l,t}, x^d_{l,t}, x_{l,ph}, x^ph_{l,ex}) denotes the state at time step l and x′_l denotes the state at time step l + 1.