Journal Article

Residential Demand Response of Thermostatically Controlled Loads Using Batch Reinforcement Learning

TL;DR: The experiments show that batch RL techniques provide a valuable alternative to model-based controllers and that they can be used to construct both closed-loop and open-loop policies.
Abstract: Driven by recent advances in batch Reinforcement Learning (RL), this paper contributes to the application of batch RL to demand response. In contrast to conventional model-based approaches, batch RL techniques do not require a system identification step, making them more suitable for a large-scale implementation. This paper extends fitted Q-iteration, a standard batch RL technique, to the situation when a forecast of the exogenous data is provided. In general, batch RL techniques do not rely on expert knowledge about the system dynamics or the solution. However, if some expert knowledge is provided, it can be incorporated by using the proposed policy adjustment method. Finally, we tackle the challenge of finding an open-loop schedule required to participate in the day-ahead market. We propose a model-free Monte Carlo method that uses a metric based on the state-action value function or Q-function and we illustrate this method by finding the day-ahead schedule of a heat-pump thermostat. Our experiments show that batch RL techniques provide a valuable alternative to model-based controllers and that they can be used to construct both closed-loop and open-loop policies.

Summary (5 min read)

Introduction

  • The market share of these loads is expected to increase as a result of the electrification of heating and cooling [2], making them an interesting domain for demand response [1], [3]–[5].
  • As a result, different end users are expected to have different model parameters and even different models.
  • O’Neill et al. propose an automated energy management system based on Q-learning that learns how to make optimal decisions for the consumers.

II. BUILDING BLOCKS: MODEL-FREE APPROACH

  • Fig. 1 presents an overview of their model-free learning agent applied to a Thermostatically Controlled Load (TCL), where the gray building blocks correspond to the learning agent.
  • At the start of each day the learning agent uses a batch RL method to construct a control policy for the next day, given a batch of past interactions with its environment.
  • The temperature dynamics of the building are modeled using a second-order equivalent thermal parameter model [32].
  • The operation and settings of the backup controller are assumed to be unknown to the learning agent.
  • This feature extraction step can have two functions, namely to extract hidden state information or to reduce the dimensionality of the state vector.

III. MARKOV DECISION PROCESS FORMULATION

  • This section formulates the decision-making problem of the learning agent as a Markov decision process.
  • The Markov decision process is defined by its d-dimensional state space X ⊂ R^d, its action space U ⊂ R, its stochastic discrete-time transition function f, and its cost function ρ.
  • The Q-function is the expected cumulative return obtained by starting from state x, taking action u, and following h thereafter.
  • The optimal Q-function corresponds to the best Q-function that can be obtained by any policy: Q*(x, u) = min_h Q^h(x, u).

A. State Description

  • Following the notational style of [37], the state space X is spanned by a time-dependent component X_t, a controllable component X_ph, and an uncontrollable exogenous component X_ex: X = X_t × X_ph × X_ex (8).
  • The rationale is that most consumer behavior tends to be repetitive and tends to follow a diurnal pattern.
  • The state description of the uncontrollable exogenous state is split into two components: X_ex = X^ph_ex × X^c_ex (11). However, most physical processes, such as the outside temperature and solar radiation, exhibit a certain degree of autocorrelation, where the outcome of the next state depends on the current state.
  • This work assumes that deterministic forecasts of the exogenous state information related to the cost, x̂^c_ex, and related to the physical dynamics, i.e., outside temperature and solar radiation, x̂^ph_ex, are provided for the time span covering the optimization problem.

B. Backup Controller

  • This paper assumes that each TCL is equipped with an overrule mechanism that guarantees comfort and safety constraints.
  • The settings of the backup function B are unknown to the learning agent, but the resulting action u^ph_k can be measured by the learning agent (see the dashed arrow in Fig. 1).

C. Cost Function

  • In general, RL techniques do not require a description of the cost function.
  • This paper considers two typical cost functions related to demand response.
  • In the dynamic pricing scenario an external price profile is known deterministically at the start of the horizon.
  • The cost of the day-ahead consumption plan should be minimized based on day-ahead prices.
  • In addition, any deviation between the planned consumption and actual consumption should be avoided.

D. Reinforcement Learning for Demand Response

  • When the description of the transition function and cost function is available, techniques that make use of the Markov decision process framework, such as approximate dynamic programming [38] or direct policy search [30], can be used to find near-optimal policies.
  • In their implementation the authors assume that the transition function f , the backup controller B, and the underlying probability of the exogenous information w are unknown.
  • In addition, the authors assume that these are challenging to obtain in a residential setting.
  • For these reasons, the authors present a model-free batch RL approach that builds on previous theoretical work on RL, in particular fitted Q-iteration [23], expert knowledge [39], and the synthesis of artificial trajectories [21].

A. Fitted Q-Iteration Using a Forecast of the Exogenous Data

  • Here the authors demonstrate how fitted Q-iteration [23] can be extended to the situation when a forecast of the exogenous component is provided (Algorithm 1).
  • In the subsequent iterations, Q-values are updated with a value-iteration step based on the Q-function of the previous iteration.
  • It is important to note that x̂′_l denotes the successor state in F, where the observed exogenous information x′^ph_l,ex of the next state is replaced by its forecast x̂^ph_l,ex (line 3 in Algorithm 1).
  • By replacing the observed exogenous parts of the next state by their forecasts, the Q-function of the next state assumes that the exogenous information will follow its forecast.
  • The proposed algorithm is relevant for demand response applications that are influenced by exogenous weather data.

B. Expert Policy Adjustment

  • Given the Q-function from Algorithm 1, a near-optimal policy ĥ∗ can be constructed by solving (6) for every state in the state space.
  • In some cases, e.g., when F contains a limited number of observations, the resulting policy can be improved by using general prior knowledge about its shape.
  • The authors show how expert knowledge about the monotonicity of the policy can be exploited to regularize the policy (a minimal sketch of this idea follows this list).
  • In order to define a convex optimization problem the authors use a fuzzy model with triangular membership functions [30] to approximate the policy.
  • This partitioning leads to Nd state-dependent membership functions for each action.
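A minimal sketch of this policy adjustment idea is given below, assuming a one-dimensional state (the state of charge), triangular membership functions on a uniform grid, and cvxpy as the convex solver; the paper's own formulation covers the full state space and all actions, so this is only an illustration of how a monotonicity constraint can be imposed on a least-squares policy fit, not the authors' exact implementation.

import numpy as np
import cvxpy as cp

def adjust_policy_monotone(x_samples, u_samples, n_basis=10, decreasing=True):
    """Least-squares fit of policy samples with triangular (hat) membership
    functions, subject to a monotonicity constraint on the weights.

    x_samples : 1-D array of states (e.g., state of charge), scaled to [0, 1]
    u_samples : actions proposed by fitted Q-iteration in those states
    decreasing: expert knowledge, e.g., "request less power when the buffer is fuller"
    """
    x_samples = np.asarray(x_samples, dtype=float)
    u_samples = np.asarray(u_samples, dtype=float)
    centers = np.linspace(0.0, 1.0, n_basis)
    width = centers[1] - centers[0]
    # Triangular membership functions: phi_i(x) = max(0, 1 - |x - c_i| / width)
    Phi = np.maximum(0.0, 1.0 - np.abs(x_samples[:, None] - centers[None, :]) / width)

    theta = cp.Variable(n_basis)
    objective = cp.Minimize(cp.sum_squares(Phi @ theta - u_samples))
    # For hat functions on a uniform grid, the fitted policy is monotone
    # exactly when adjacent weights are ordered.
    if decreasing:
        constraints = [theta[i + 1] <= theta[i] for i in range(n_basis - 1)]
    else:
        constraints = [theta[i + 1] >= theta[i] for i in range(n_basis - 1)]
    cp.Problem(objective, constraints).solve()

    def adjusted_policy(x):
        phi = np.maximum(0.0, 1.0 - np.abs(x - centers) / width)
        return float(phi @ theta.value)

    return adjusted_policy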

C. Day-Ahead Consumption Plan

  • This section explains how to construct a day-ahead schedule starting from the Q-function obtained by Algorithm 1 and using cost function (14).
  • Finding a day-ahead schedule has a direct relation to two situations: a day-ahead market, where participants have to submit a day-ahead schedule one day in advance of the actual consumption [35]; a distributed optimization process, where two or more participants are coupled by a common constraint, e.g., congestion management [9].
  • These p sequences can be seen as a proxy of the actual trajectories that could be obtained by simulating the policy on the given control problem.
  • Each new transition is selected by minimizing a distance metric with the previously selected transition.
  • The motivation behind using Q-values instead of the Euclidean distance in X × U is that Q-values capture the dynamics of the system and, therefore, there is no need to select individual weights (a sketch of this trajectory synthesis follows this list).
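The sketch below illustrates the trajectory synthesis under several assumptions: a trained Q-function Q(x, u) is available (e.g., from extended FQI), the Q-based distance is taken as |Q(x, u) − Q(x_l, u_l)|, and the action grid is a made-up discretization of heat-pump power levels. It is a simplified reading of the model-free Monte Carlo method, not the paper's exact Algorithm 2.

import numpy as np

def mfmc_day_ahead_plan(F, Q, x_init, T, p=4, action_grid=np.linspace(0.0, 3.0, 7)):
    """Sketch: chain observed transitions into p artificial trajectories and
    average their consumption into a day-ahead plan.

    F           : list of observed transitions (x, u, x_next, u_ph)
    Q           : fitted Q-function, called as Q(x, u) -> float
    x_init      : state at the start of the planning horizon
    T           : number of control periods in the plan
    action_grid : assumed discretization of U (heat-pump power levels in kW)
    """
    plans = np.zeros((p, T))
    used = set()  # each observed transition is used at most once over all trajectories
    for j in range(p):
        x = x_init
        for k in range(T):
            # action suggested by the Q-function in the current proxy state
            u = min(action_grid, key=lambda a: Q(x, a))
            # pick the closest unused transition under the assumed Q-based metric
            candidates = [i for i in range(len(F)) if i not in used]
            i_best = min(candidates, key=lambda i: abs(Q(x, u) - Q(F[i][0], F[i][1])))
            used.add(i_best)
            _, _, x_next, u_ph = F[i_best]
            plans[j, k] = u_ph     # consumption along this artificial trajectory
            x = x_next             # continue from the observed successor state
    return plans.mean(axis=0)      # day-ahead consumption plan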

V. SIMULATIONS

  • This section presents the simulation results of three experiments and evaluates the performance of the proposed algorithms.
  • The authors focus on two examples of flexible loads, i.e., an electric water heater and a heat-pump thermostat.
  • The rationale behind using extended FQI for a heat-pump thermostat is that the temperature dynamics of a building are influenced by exogenous weather data, which is less the case for an electric water heater.
  • In the second experiment, the authors apply the policy adjustment method to an electric water heater.
  • The final experiment uses the model-free Monte Carlo method to find a day-ahead consumption plan for a heat-pump thermostat.

A. Thermostatically Controlled Loads

  • Here the authors describe the state definition and the settings of the backup controller of the electric water heater and the heatpump thermostat.
  • This work uses feature extraction to reduce the dimensionality of the controllable state space component by replacing it with the average sensor measurement.
  • A detailed description of the nonlinear tank model of the electric water heater and the calculation of its state of charge xsoc can be found in [31].
  • The authors' second application considers a heat-pump thermostat that can measure the indoor air temperature, the outdoor air temperature, and the solar radiation.
  • In the simulation section the authors define minimum and maximum comfort settings of 19 °C and 23 °C.

B. Experiment 1

  • The goal of the first experiment is to compare the performance of Fitted Q-Iteration (standard FQI) [23] to the performance of their extension of FQI (extended FQI), given by Algorithm 1.
  • The objective of the considered heating system is to minimize the electricity cost of the heat pump by responding to an external price signal.
  • The extended FQI controller uses the forecast values of the outside temperature and solar radiation to construct the next states in the batch, x̂′_l ← (x′^q_l,t, T′_l,in, T̂′_l,out, Ŝ′_l), where (̂·) denotes a forecast.
  • Since more interactions result in a better coverage of the state-action space, the exploration probability ε_d is decreased on a daily basis, according to the harmonic sequence 1/d^n, where n is set to 0.7 and d denotes the current day.
  • In order to compare the performance of the FQI controllers, the authors define the following metric: M = (c_fqi − c_d)/(c_o − c_d) (24), where c_fqi denotes the daily cost of the FQI controller, c_d the daily cost of the default controller, and c_o the daily cost of the optimal controller (a short snippet restating these formulas follows this list).
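For concreteness, the exploration decay and the performance metric used in this experiment can be restated in code; the snippet below only re-expresses the formulas from the bullets above and adds nothing from outside this summary.

def exploration_probability(d, n=0.7):
    """Harmonic decay of the exploration rate: eps_d = 1 / d**n for day d >= 1."""
    return 1.0 / d**n

def performance_metric(c_fqi, c_default, c_optimal):
    """M = (c_fqi - c_d) / (c_o - c_d): M = 1 means the FQI controller matches the
    optimal controller, M = 0 means it is no better than the default controller."""
    return (c_fqi - c_default) / (c_optimal - c_default)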

C. Experiment 2

  • The following experiment demonstrates the policy adjustment method for an electric water heater.
  • As stated in (19), the state space of an electric water heater consists of two dimensions, i.e., the time component and the average temperature.
  • These monotonicity constraints are added to the least-squares problem (15).
  • The original policies obtained with fitted Q-iteration after 7, 14, and 21 days are presented in the left column of the corresponding figure.
  • The adjusted policies obtained by the policy adjustment method are depicted in the right column.

D. Experiment 3

  • The final experiment demonstrates the Model-Free Monte Carlo (MFMC) method (Algorithm 2) to find the day-ahead consumption plan of a heat-pump thermostat.
  • The resulting Q-function is then used as a metric to build p distinct artificial trajectories (line 8 of Algorithm 2).
  • A day-ahead consumption plan was obtained by taking the average of these 4 artificial trajectories.
  • Moreover, the authors define a performance metric M = c_MFMC / c_o, where c_MFMC is the daily cost of the MFMC method and c_o is the daily cost of the optimal model-based controller.
  • The top left plot depicts the daily metric M of their MFMC method, where the metric of the optimal model-based controller corresponds to 1.

VI. CONCLUSION

  • Driven by the challenges presented by the system identification step of model-based controllers, this paper has contributed to the application of model-free batch Reinforcement Learning (RL) to demand response.
  • In addition, the authors have presented an expert policy adjustment method that can exploit available expert knowledge about the shape of the policy.
  • The authors build upon the theoretical work of [16], [23], and [24], and they demonstrate how fitted Q-iteration can be adapted to work in a demand response setting.
  • The intent of the experiment is to compare fitted Q-iteration with two other approaches, namely Q-learning and the optimal solution.
  • They demonstrate that the lower and upper bounds converge at least linearly towards the true return of the policy, when the size of the batch grows, i.e., new tuples are added to the batch.


Delft University of Technology
Published in: IEEE Transactions on Smart Grid, 8(5), 2149–2159 (2017). Document version: accepted author manuscript. DOI: 10.1109/TSG.2016.2517211
Citation (APA): Ruelens, F., Claessens, B. J., Vandael, S., De Schutter, B., Babuška, R., & Belmans, R. (2017). Residential demand response of thermostatically controlled loads using batch Reinforcement Learning. IEEE Transactions on Smart Grid, 8(5), 2149–2159. https://doi.org/10.1109/TSG.2016.2517211

Residential Demand Response of Thermostatically Controlled Loads Using Batch Reinforcement Learning

Frederik Ruelens, Bert J. Claessens, Stijn Vandael, Bart De Schutter, Senior Member, IEEE, Robert Babuška, and Ronnie Belmans, Fellow, IEEE
Abstract—Driven by recent advances in batch Reinforcement Learning (RL), this paper contributes to the application of batch RL to demand response. In contrast to conventional model-based approaches, batch RL techniques do not require a system identification step, making them more suitable for a large-scale implementation. This paper extends fitted Q-iteration, a standard batch RL technique, to the situation when a forecast of the exogenous data is provided. In general, batch RL techniques do not rely on expert knowledge about the system dynamics or the solution. However, if some expert knowledge is provided, it can be incorporated by using the proposed policy adjustment method. Finally, we tackle the challenge of finding an open-loop schedule required to participate in the day-ahead market. We propose a model-free Monte Carlo method that uses a metric based on the state-action value function or Q-function and we illustrate this method by finding the day-ahead schedule of a heat-pump thermostat. Our experiments show that batch RL techniques provide a valuable alternative to model-based controllers and that they can be used to construct both closed-loop and open-loop policies.

Index Terms—Batch reinforcement learning, demand response, electric water heater, fitted Q-iteration, heat pump.
Manuscript received April 3, 2015; revised July 2, 2015 and September 29, 2015; accepted December 7, 2015. This work was supported by the Flemish Institute for the Promotion of Scientific and Technological Research in Industry (IWT). Paper no. TSG-00382-2015.
F. Ruelens, S. Vandael, and R. Belmans are with the Department of Electrical Engineering, Katholieke Universiteit Leuven/EnergyVille, Leuven 3000, Belgium (e-mail: frederik.ruelens@esat.kuleuven.be).
B. J. Claessens is with the Energy Department, Flemish Institute for Technological Research, Mol 2400, Belgium.
B. De Schutter and R. Babuška are with the Delft Center for Systems and Control, Delft University of Technology, Delft 2600 AA, The Netherlands.
Digital Object Identifier 10.1109/TSG.2016.2517211

I. INTRODUCTION

THE INCREASING share of renewable energy sources introduces the need for flexibility on the demand side of the electricity system [1]. A prominent example of loads that offer flexibility at the residential level are thermostatically controlled loads, such as heat pumps, air conditioning units, and electric water heaters. These loads represent about 20% of the total electricity consumption at the residential level in the United States [2]. The market share of these loads is expected to increase as a result of the electrification of heating and cooling [2], making them an interesting domain for demand response [1], [3]–[5]. Demand response programs offer demand flexibility by motivating end users to adapt their consumption profile in response to changing electricity prices or other grid signals. The forecast uncertainty of renewable energy sources [6], combined with their limited controllability, has made demand response the topic of an extensive number of research projects [1], [7], [8] and scientific papers [3], [5], [9]–[11]. The traditional control paradigm defines the demand response problem as a model-based control problem [3], [7], [9], requiring a model of the demand response application, an optimizer, and a forecasting technique. A critical step in setting up a model-based controller includes selecting accurate models and estimating the model parameters. This step becomes more challenging considering the heterogeneity of the end users and their different patterns of behavior. As a result, different end users are expected to have different model parameters and even different models. As such, a large-scale implementation of model-based controllers requires a stable and robust approach that is able to identify the appropriate model and the corresponding model parameters. A detailed report of the implementation issues of a model predictive control strategy applied to the heating system of a building can be found in [12]. Moreover, the authors of [3] and [13] demonstrate a successful implementation of a model predictive control approach at an aggregated level to control a heterogeneous cluster of thermostatically controlled loads.

In contrast, Reinforcement Learning (RL) [14], [15] is a model-free technique that requires no system identification step and no a priori knowledge. Recent developments in the field of reinforcement learning show that RL techniques can either replace or supplement model-based techniques [16]. A number of recent papers provide examples of how a popular RL method, Q-learning [14], can be used for demand response [4], [10], [17], [18]. For example, in [10] O’Neill et al. propose an automated energy management system based on Q-learning that learns how to make optimal decisions for the consumers. In [17], Henze and Schoenmann investigate the potential of Q-learning for the operation of commercial cold stores, and in [4] Kara et al. use Q-learning to control a cluster of thermostatically controlled loads. Similarly, in [19] Liang et al. propose a Q-learning approach to minimize the electricity cost of the flexible demand and the disutility of the user. Furthermore, inspired by [20], Lee and Powell propose a bias-corrected form of Q-learning to operate battery charging in the presence of volatile prices [18].

Fig. 1. Building blocks of a model-free Reinforcement Learning (RL) agent (gray) applied to a Thermostatically Controlled Load (TCL).
While being a popular method, one of the fundamental drawbacks of Q-learning is its inefficient use of data, given that Q-learning discards the current data sample after every update. As a result, more observations are needed to propagate already known information through the state space. In order to overcome this drawback, batch RL techniques [21]–[24] can be used. In batch RL, a controller estimates a control policy based on a batch of experiences. These experiences can be a fixed set [23] or can be gathered online by interacting with the environment [25]. Given that batch RL algorithms can reuse past experiences, they converge faster compared to standard temporal difference methods like Q-learning [26] or SARSA [27]. This makes batch RL techniques suitable for practical implementations, such as demand response. For example, the authors of [28] combine Q-learning with eligibility traces in order to learn the consumer and time preferences of demand response applications. In [5], the authors use a batch RL technique to schedule a cluster of electric water heaters, and in [29] Vandael et al. use a batch RL technique to find a day-ahead consumption plan of a cluster of electric vehicles. An excellent overview of batch RL methods can be found in [25] and [30].

Inspired by the recent developments in batch RL, in particular fitted Q-iteration by Ernst et al. [16], this paper builds upon the existing batch RL literature and contributes to the application of batch RL techniques to residential demand response. The contributions of our paper can be summarized as follows: (1) we demonstrate how fitted Q-iteration can be extended to the situation when a forecast of the exogenous data is provided; (2) we propose a policy adjustment method that exploits general expert knowledge about monotonicity conditions of the control policy; (3) we introduce a model-free Monte Carlo method to find a day-ahead consumption plan by making use of a novel metric based on Q-values.

This paper is structured as follows: Section II defines the building blocks of our batch RL approach applied to demand response. Section III formulates the problem as a Markov decision process. Section IV describes our model-free batch RL techniques for demand response. Section V demonstrates the presented techniques in a realistic demand response setting. To conclude, Section VI summarizes the results and discusses further research.
II. BUILDING BLOCKS: MODEL-FREE APPROACH

Fig. 1 presents an overview of our model-free learning agent applied to a Thermostatically Controlled Load (TCL), where the gray building blocks correspond to the learning agent.

At the start of each day the learning agent uses a batch RL method to construct a control policy for the next day, given a batch of past interactions with its environment. The learning agent needs no a priori information on the model dynamics and considers its environment as a black box. Nevertheless, if a model of the exogenous variables, e.g., a forecast of the outside temperature, or a reward model is provided, the batch RL method can use this information to enrich its batch. Once a policy is found, an expert policy adjustment method can be used to shape the policy obtained with the batch RL method. During the day, the learning agent uses an exploration-exploitation strategy to interact with its environment and to collect new transitions that are added systematically to the given batch.

In this paper, the proposed learning agent is applied to two types of TCLs. The first type is a residential electric water heater with a stochastic hot-water demand. The dynamic behavior of the electric water heater is modeled using a nonlinear stratified thermal tank model as described in [31]. Our second TCL is a heat-pump thermostat for a residential building. The temperature dynamics of the building are modeled using a second-order equivalent thermal parameter model [32]. This model describes the temperature dynamics of the indoor air and of the building envelope. However, to develop a realistic implementation, this paper assumes that the learning agent cannot measure the temperature of the building envelope and considers it as a hidden state variable.

In addition, we assume that both TCLs are equipped with a backup controller that guarantees the comfort and safety settings of its users. The backup controller is a built-in overrule mechanism that turns the TCL 'on' or 'off' depending on the current state and a predefined switching logic. The operation and settings of the backup controller are assumed to be unknown to the learning agent. However, the learning agent can measure the overrule action of the backup controller (see the dashed arrow in Fig. 1).

The observable state information contains sensory input data of the state of the process and its environment. Before this information is sent to the batch RL algorithm, the learning agent can apply feature extraction [33]. This feature extraction step can have two functions, namely to extract hidden state information or to reduce the dimensionality of the state vector. For example, in the case of a heat-pump thermostat, this step can be used to extract a feature that represents the

hidden state information, e.g., the temperature of the building envelope. Alternatively, a feature extraction mapping can be used to find a low-dimensional representation of the sensory input data. For example, in the case of an electric water heater, the observed state vector consists of the temperature sensors that are installed along the hull of the buffer tank. When the number of temperature sensors is large, it can be interesting to map this high-dimensional state vector to a low-dimensional feature vector. This mapping can be the result of an auto-encoder network or principal component analysis. For example, in [34] Curran et al. indicate that when only a limited number of observations are available, a mapping to a low-dimensional state space can improve the convergence of the learning algorithm.
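As an illustration of such a feature-extraction mapping, the sketch below reduces a vector of tank-temperature sensors either to its average (the low-dimensional feature used for the electric water heater later in the paper) or to a few principal components; the use of scikit-learn's PCA and the shape of the sensor data are assumptions made here for illustration.

import numpy as np
from sklearn.decomposition import PCA

def average_temperature_feature(sensor_temps):
    """Replace the vector of tank-temperature sensors by its average, i.e., the
    low-dimensional controllable feature used for the electric water heater."""
    return float(np.mean(sensor_temps))

def fit_pca_features(sensor_history, n_components=2):
    """Alternative: learn a low-dimensional representation of the sensor vector
    with principal component analysis (an auto-encoder network could be used
    instead). `sensor_history` has one observed sensor vector per row."""
    pca = PCA(n_components=n_components).fit(sensor_history)
    return pca  # later: pca.transform(new_sensor_vector.reshape(1, -1))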
In this paper, the learning agent is applied to two relevant demand response business models: dynamic pricing and day-ahead scheduling [1], [35]. In dynamic pricing, the learning agent learns a control policy that minimizes its electricity cost by adapting its consumption profile in response to an external price signal. The solution of this optimal control problem is a closed-loop control policy that is a function of the current measurement of the state. The second business model relates to the participation in the day-ahead market. The learning agent constructs a day-ahead consumption plan and then tries to follow it during the day. The objective of the learning agent is to minimize its cost in the day-ahead market and to minimize any deviation between the day-ahead consumption plan and the actual consumption. In contrast to the solution of the dynamic pricing scheme, the day-ahead consumption plan is a feed-forward plan for the next day, i.e., an open-loop policy, which does not depend on future measurements of the state.
III. MARKOV DECISION PROCESS FORMULATION

This section formulates the decision-making problem of the learning agent as a Markov decision process. The Markov decision process is defined by its d-dimensional state space X ⊂ R^d, its action space U ⊂ R, its stochastic discrete-time transition function f, and its cost function ρ. The optimization horizon is considered finite, comprising T ∈ N \ {0} steps, where at each discrete time step k, the state evolves as follows:

x_{k+1} = f(x_k, u_k, w_k)   ∀k ∈ {1, ..., T − 1},   (1)

with w_k a realization of a random disturbance drawn from a conditional probability distribution p_W(·|x_k), u_k ∈ U the control action, and x_k ∈ X the state. Associated with each state transition, a cost c_k is given by:

c_k = ρ(x_k, u_k, w_k)   ∀k ∈ {1, ..., T}.   (2)

The goal is to find an optimal control policy h* : X → U that minimizes the expected T-stage return for any state in the state space. The expected T-stage return starting from x_1 and following a policy h is defined as follows:

J^h_T(x_1) = E_{w_k ∼ p_W(·|x_k)} [ Σ_{k=1}^{T} ρ(x_k, h(x_k), w_k) ].   (3)

A convenient way to characterize the policy h is by using a state-action value function or Q-function:

Q^h(x, u) = E_{w ∼ p_W(·|x)} [ ρ(x, u, w) + γ J^h_T(f(x, u, w)) ],   (4)

where γ ∈ (0, 1) is the discount factor. The Q-function is the expected cumulative return obtained by starting from state x, taking action u, and following h thereafter.

The optimal Q-function corresponds to the best Q-function that can be obtained by any policy:

Q*(x, u) = min_h Q^h(x, u).   (5)

Starting from an optimal Q-function for every state-action pair, the optimal policy is calculated as follows:

h*(x) ∈ arg min_{u ∈ U} Q*(x, u),   (6)

where Q* satisfies the Bellman optimality equation [36]:

Q*(x, u) = E_{w ∼ p_W(·|x)} [ ρ(x, u, w) + γ min_{u′ ∈ U} Q*(f(x, u, w), u′) ].   (7)

The next three paragraphs give a formal description of the state space, the backup controller, and the cost function tailored to demand response.
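To make (1)–(3) concrete, the sketch below estimates the expected T-stage return of a given policy by simulating a black-box environment; the env.step and policy interfaces are hypothetical stand-ins introduced here, since the paper treats the transition and cost functions as unknown to the agent.

import numpy as np

def estimate_T_stage_return(env, policy, x1, T, n_rollouts=100):
    """Monte Carlo estimate of the expected T-stage return J_T^h(x1) of eq. (3).

    `env.step(x, u)` is a hypothetical black-box interface returning
    (x_next, cost), i.e., it hides the transition function f of eq. (1) and the
    cost function rho of eq. (2); `policy(x)` returns a control action h(x).
    """
    returns = []
    for _ in range(n_rollouts):
        x, total = x1, 0.0
        for k in range(T):
            u = policy(x)
            x, cost = env.step(x, u)   # one realization of the disturbance w_k
            total += cost
        returns.append(total)
    return float(np.mean(returns))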
A. State Description

Following the notational style of [37], the state space X is spanned by a time-dependent component X_t, a controllable component X_ph, and an uncontrollable exogenous component X_ex:

X = X_t × X_ph × X_ex.   (8)

1) Timing: The time-dependent component X_t describes the part of the state space related to timing, i.e., it carries timing information that is relevant for the dynamics of the system:

X_t = X^q_t × X^d_t   with   X^q_t = {1, ..., 96},   X^d_t = {1, ..., 7},   (9)

where x^q_t ∈ X^q_t denotes the quarter in the day and x^d_t ∈ X^d_t denotes the day in the week. The rationale is that most consumer behavior tends to be repetitive and tends to follow a diurnal pattern.

2) Physical Representation: The controllable component X_ph represents the physical state information related to the quantities that are measured locally and that are influenced by the control actions, e.g., the indoor air temperature or the state of charge of an electric water heater:

x_ph ∈ X_ph   with   x̲_ph < x_ph < x̄_ph,   (10)

where x̲_ph and x̄_ph denote the lower and upper bounds, set to guarantee the comfort and safety of the end user.

3) Exogenous Information: The state description of the uncontrollable exogenous state is split into two components:

X_ex = X^ph_ex × X^c_ex.   (11)

When the random disturbance w_{k+1} is independent of w_k, there is no need to include exogenous variables in the state space. However, most physical processes, such as the outside temperature and solar radiation, exhibit a certain degree of autocorrelation, where the outcome of the next state depends on the current state. For this reason we include an exogenous component x^ph_ex ∈ X^ph_ex in our state space description. This exogenous component is related to the observable exogenous information that has an impact on the physical dynamics and that cannot be influenced by the control actions, e.g., the outside temperature.

The second exogenous component x^c_ex ∈ X^c_ex has no direct influence on the dynamics, but contains information to calculate the cost c_k. This work assumes that a deterministic forecast of the exogenous state information related to the cost, x̂^c_ex, and related to the physical dynamics, i.e., outside temperature and solar radiation, x̂^ph_ex, is provided for the time span covering the optimization problem.
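As an illustration of the state definition (8)–(11), the following sketch assembles a state vector for a heat-pump thermostat at quarter-hour resolution; the field names and units are illustrative choices made here, not notation from the paper.

from dataclasses import dataclass

@dataclass
class ThermostatState:
    quarter: int       # x_t^q in {1, ..., 96}: quarter-hour of the day
    day: int           # x_t^d in {1, ..., 7}: day of the week
    t_indoor: float    # x_ph: indoor air temperature (controllable component)
    t_outdoor: float   # x_ex^ph: outside temperature (exogenous, affects the dynamics)
    price: float       # x_ex^c: electricity price (exogenous, affects only the cost)

    def as_vector(self):
        """Flatten the state for the regression step of a batch RL algorithm."""
        return [self.quarter, self.day, self.t_indoor, self.t_outdoor, self.price]

# Example: 10:15-10:30 on a Tuesday, 20.5 degC indoors, 4 degC outside, 0.22 EUR/kWh
x = ThermostatState(quarter=42, day=2, t_indoor=20.5, t_outdoor=4.0, price=0.22)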
B. Backup Controller

This paper assumes that each TCL is equipped with an overrule mechanism that guarantees comfort and safety constraints. The backup function B : X × U → U^ph maps the requested control action u_k ∈ U taken in state x_k to a physical control action u^ph_k ∈ U^ph:

u^ph_k = B(x_k, u_k).   (12)

The settings of the backup function B are unknown to the learning agent, but the resulting action u^ph_k can be measured by the learning agent (see the dashed arrow in Fig. 1).
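A minimal sketch of what such a backup function B could look like for a heat-pump thermostat is given below; the switching logic and the 3 kW power limit are illustrative assumptions (the paper deliberately leaves B unknown to the learning agent), while the 19–23 °C band matches the comfort settings used later in the simulations.

def backup_controller(t_indoor, u_requested, t_min=19.0, t_max=23.0, p_max=3.0):
    """Overrule mechanism B(x, u) -> u_ph of eq. (12), sketched for a heat pump.

    The requested power u_requested (kW) is overruled whenever the indoor
    temperature leaves the comfort band [t_min, t_max]; otherwise it is only
    clipped to the physical power range. All settings here are hypothetical.
    """
    if t_indoor < t_min:
        return p_max        # too cold: switch the heat pump on at full power
    if t_indoor > t_max:
        return 0.0          # too warm: switch the heat pump off
    return min(max(u_requested, 0.0), p_max)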
C. Cost Function

In general, RL techniques do not require a description of the cost function. However, for most demand response business models a cost function is available. This paper considers two typical cost functions related to demand response.

1) Dynamic Pricing: In the dynamic pricing scenario an external price profile is known deterministically at the start of the horizon. The cost function is described as:

c_k = u^ph_k x̂^c_{k,ex} Δt,   (13)

where x̂^c_{k,ex} is the electricity price at time step k and Δt is the length of a control period.

2) Day-Ahead Scheduling: The objective of the second business case is to determine a day-ahead consumption plan and to follow this plan during operation. The cost of the day-ahead consumption plan should be minimized based on day-ahead prices. In addition, any deviation between the planned consumption and the actual consumption should be avoided. As such, the cost function can be written as:

c_k = u_k x̂^c_{k,ex} Δt + α | u_k Δt − u^ph_k Δt |,   (14)

where u_k is the planned consumption, u^ph_k is the actual consumption, x̂^c_{k,ex} is the forecasted day-ahead price, and α > 0 is a penalty. The first part of (14) is the cost for buying energy at the day-ahead market, whereas the second part penalizes any deviation between the planned consumption and the actual consumption.
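The two cost functions translate directly into code; the sketch below assumes power in kW, Δt in hours, and prices in EUR/kWh, which are unit choices made here for illustration only.

def cost_dynamic_pricing(u_ph, price, dt):
    """Eq. (13): cost of the physically consumed power u_ph at the external price."""
    return u_ph * price * dt

def cost_day_ahead(u_planned, u_ph, price_day_ahead, dt, alpha):
    """Eq. (14): day-ahead energy cost plus a penalty (alpha > 0) on any deviation
    between the planned and the actually consumed energy."""
    return u_planned * price_day_ahead * dt + alpha * abs(u_planned * dt - u_ph * dt)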
D. Reinforcement Learning for Demand Response

When the description of the transition function and cost function is available, techniques that make use of the Markov decision process framework, such as approximate dynamic programming [38] or direct policy search [30], can be used to find near-optimal policies. However, in our implementation we assume that the transition function f, the backup controller B, and the underlying probability of the exogenous information w are unknown. In addition, we assume that they are challenging to obtain in a residential setting. For these reasons, we present a model-free batch RL approach that builds on previous theoretical work on RL, in particular fitted Q-iteration [23], expert knowledge [39], and the synthesis of artificial trajectories [21].
IV. ALGORITHMS

Typically, batch RL techniques construct policies based on a batch of tuples of the form F = {(x_l, u_l, x′_l, c_l)}_{l=1}^{#F}, where x_l = (x^q_{l,t}, x^d_{l,t}, x_{l,ph}, x^ph_{l,ex}) denotes the state at time step l and x′_l denotes the state at time step l + 1. However, for most demand response applications, the cost function ρ is given a priori, and of the form ρ(x̂^c_{l,ex}, u^ph_l). As such, this paper considers tuples of the form (x_l, u_l, x′_l, u^ph_l).
A. Fitted Q-Iteration Using a Forecast of the Exogenous Data

Here we demonstrate how fitted Q-iteration [23] can be extended to the situation when a forecast of the exogenous component is provided (Algorithm 1).

Algorithm 1 Fitted Q-Iteration Using a Forecast of the Exogenous Data (Extended FQI)
Input: F = {(x_l, u_l, x′_l, u^ph_l)}_{l=1}^{#F}, {(x̂^ph_{l,ex}, x̂^c_{l,ex})}_{l=1}^{#F}, T
 1: let Q̂_0 be zero everywhere on X × U
 2: for l = 1, ..., #F do
 3:   x̂′_l ← (x′^q_{l,t}, x′^d_{l,t}, x′_{l,ph}, x̂^ph_{l,ex})   ▷ replace the observed exogenous part of the next state by its forecast x̂^ph_{l,ex}
 4: end for
 5: for N = 1, ..., T do
 6:   for l = 1, ..., #F do
 7:     c_l ← ρ(x̂^c_{l,ex}, u^ph_l)
 8:     Q_{N,l} ← c_l + min_{u ∈ U} Q̂_{N−1}(x̂′_l, u)
 9:   end for
10:   use regression to obtain Q̂_N from T_reg = {((x_l, u_l), Q_{N,l}), l = 1, ..., #F}
11: end for
Ensure: Q̂* = Q̂_T
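Algorithm 1 maps naturally onto an off-the-shelf regressor; the sketch below is a minimal Python rendering that uses scikit-learn's ExtraTreesRegressor, a discretized action set, and the dynamic-pricing cost (13), all of which are assumptions made here rather than choices stated in this excerpt.

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def extended_fqi(F, forecasts, actions, T):
    """Sketch of Algorithm 1 (extended fitted Q-iteration).

    F         : list of tuples (x, u, x_next, u_ph), with states as 1-D arrays
    forecasts : list of tuples (x_hat_ph_ex, x_hat_c_ex) aligned with F
    actions   : discretized action set U (an assumption; the excerpt leaves U abstract)
    T         : optimization horizon, i.e., the number of Q-iterations
    """
    # Line 3: successor states whose exogenous part is replaced by its forecast.
    X_hat_next = []
    for (x, u, x_next, u_ph), (x_hat_ph_ex, _) in zip(F, forecasts):
        x_hat = np.array(x_next, dtype=float)
        x_hat[-1] = x_hat_ph_ex            # assumed layout: exogenous part stored last
        X_hat_next.append(x_hat)

    XU = np.array([np.append(x, u) for (x, u, _, _) in F])
    Q = None                               # Q_0 is zero everywhere on X x U
    for _ in range(T):
        targets = []
        for (x, u, x_next, u_ph), (_, price_hat), x_hat in zip(F, forecasts, X_hat_next):
            c = u_ph * price_hat           # line 7: cost evaluated with the forecasted price, as in (13)
            if Q is None:
                q_next = 0.0               # first iteration: Q_0 = 0
            else:
                q_next = min(Q.predict([np.append(x_hat, a)])[0] for a in actions)
            targets.append(c + q_next)     # line 8
        Q = ExtraTreesRegressor(n_estimators=50).fit(XU, targets)   # line 10: regression
    return Q                               # approximate Q*, used to derive the policy via (6)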

Citations
Posted Content
TL;DR: This work discusses core RL elements, including value function, in particular, Deep Q-Network (DQN), policy, reward, model, planning, and exploration, and important mechanisms for RL, including attention and memory, unsupervised learning, transfer learning, multi-agent RL, hierarchical RL, and learning to learn.
Abstract: We give an overview of recent exciting achievements of deep reinforcement learning (RL). We discuss six core elements, six important mechanisms, and twelve applications. We start with background of machine learning, deep learning and reinforcement learning. Next we discuss core RL elements, including value function, in particular, Deep Q-Network (DQN), policy, reward, model, planning, and exploration. After that, we discuss important mechanisms for RL, including attention and memory, unsupervised learning, transfer learning, multi-agent RL, hierarchical RL, and learning to learn. Then we discuss various applications of RL, including games, in particular, AlphaGo, robotics, natural language processing, including dialogue systems, machine translation, and text generation, computer vision, neural architecture design, business management, finance, healthcare, Industry 4.0, smart grid, intelligent transportation systems, and computer systems. We mention topics not reviewed yet, and list a collection of RL resources. After presenting a brief summary, we close with discussions. Please see Deep Reinforcement Learning, arXiv:1810.06339, for a significant update.

935 citations

Journal ArticleDOI
TL;DR: In this paper, a review of the use of reinforcement learning for demand response applications in the smart grid is presented, and the authors identify a need to further explore reinforcement learning to coordinate multi-agent systems that can participate in demand response programs under demand-dependent electricity prices.

429 citations

Journal ArticleDOI
TL;DR: Deep learning, reinforcement learning, and their combination, deep reinforcement learning, are representative and relatively mature methods in the family of AI 2.0; the article summarizes their potential for application in smart grids and provides an overview of the research work on their application.
Abstract: Smart grids are the developmental trend of power systems and they have attracted much attention all over the world. Due to their complexities, and the uncertainty of the smart grid and high volume of information being collected, artificial intelligence techniques represent some of the enabling technologies for its future development and success. Owing to the decreasing cost of computing power, the profusion of data, and better algorithms, AI has entered into its new developmental stage and AI 2.0 is developing rapidly. Deep learning (DL), reinforcement learning (RL) and their combination-deep reinforcement learning (DRL) are representative methods and relatively mature methods in the family of AI 2.0. This article introduces the concept and status quo of the above three methods, summarizes their potential for application in smart grids, and provides an overview of the research work on their application in smart grids.

322 citations

Journal ArticleDOI
TL;DR: Simulation results show that this proposed DR algorithm can promote SP profitability, reduce energy costs for CUs, balance energy supply and demand in the electricity market, and improve the reliability of electric power systems, which can be regarded as a win-win strategy for both SP and CUs.

312 citations

References
Book
01 Jan 1988
TL;DR: This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, ranging from the history of the field's intellectual foundations to the most recent developments and applications.
Abstract: Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives when interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the key ideas and algorithms of reinforcement learning. Their discussion ranges from the history of the field's intellectual foundations to the most recent developments and applications. The only necessary mathematical background is familiarity with elementary concepts of probability. The book is divided into three parts. Part I defines the reinforcement learning problem in terms of Markov decision processes. Part II provides basic solution methods: dynamic programming, Monte Carlo methods, and temporal-difference learning. Part III presents a unified view of the solution methods and incorporates artificial neural networks, eligibility traces, and planning; the two final chapters present case studies and consider the future of reinforcement learning.

37,989 citations


"Residential Demand Response of Ther..." refers background or methods in this paper

  • ...For example, the authors of [28] combine Q-learning with eligibility traces in order to learn the consumer and time preferences of demand response applications....


  • ...how a popular RL method, Q-learning [14], can be used for demand response [4], [10], [17], [18]....


  • ...In contrast, Reinforcement Learning (RL) [14], [15] is...


  • ...A number of recent papers provide examples of how a popular RL method, Q-learning [14], can be used for demand response [4], [10], [17], [18]....


  • ...In this experiment, we implemented Q-learning [26] and we applied it to a heat-pump thermostat with a time-varying price profile as described in Section V....


Journal ArticleDOI
28 Jul 2006-Science
TL;DR: In this article, an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data is described.
Abstract: High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. Gradient descent can be used for fine-tuning the weights in such "autoencoder" networks, but this works well only if the initial weights are close to a good solution. We describe an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data.

16,717 citations

Book
21 Oct 1957
TL;DR: The more the authors study the information processing aspects of the mind, the more perplexed and impressed they become, and it will be a very long time before they understand these processes sufficiently to reproduce them.
Abstract: From the Publisher: An introduction to the mathematical theory of multistage decision processes, this text takes a functional equation approach to the discovery of optimum policies. Written by a leading developer of such policies, it presents a series of methods, uniqueness and existence theorems, and examples for solving the relevant equations. The text examines existence and uniqueness theorems, the optimal inventory equation, bottleneck problems in multistage production processes, a new formalism in the calculus of variation, strategies behind multistage games, and Markovian decision processes. Each chapter concludes with a problem set that Eric V. Denardo of Yale University, in his informative new introduction, calls a rich lode of applications and research topics. 1957 edition. 37 figures.

14,187 citations

Journal ArticleDOI
TL;DR: This paper presents and proves in detail a convergence theorem for Q-learning based on that outlined in Watkins (1989), showing that Q-learning converges to the optimum action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action-values are represented discretely.
Abstract: Q-learning (Watkins, 1989) is a simple way for agents to learn how to act optimally in controlled Markovian domains. It amounts to an incremental method for dynamic programming which imposes limited computational demands. It works by successively improving its evaluations of the quality of particular actions at particular states. This paper presents and proves in detail a convergence theorem for Q-learning based on that outlined in Watkins (1989). We show that Q-learning converges to the optimum action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action-values are represented discretely. We also sketch extensions to the cases of non-discounted, but absorbing, Markov environments, and where many Q values can be changed each iteration, rather than just one.

8,450 citations

Journal ArticleDOI
TL;DR: Sequence alignment methods often use something called a 'dynamic programming' algorithm, which can be a good idea or a bad idea, depending on the method used.
Abstract: Sequence alignment methods often use something called a 'dynamic programming' algorithm. What is dynamic programming and how does it work?

5,348 citations

Frequently Asked Questions (12)
Q1. What is the way to find near-optimal policies?

When the description of the transition function and cost function is available, techniques that make use of the Markov decision process framework, such as approximate dynamic programming [38] or direct policy search [30], can be used to find near-optimal policies. 

Driven by recent advances in batch Reinforcement Learning (RL), this paper contributes to the application of batch RL to demand response. This paper extends fitted Q-iteration, a standard batch RL technique, to the situation when a forecast of the exogenous data is provided. The authors propose a model-free Monte Carlo method that uses a metric based on the state-action value function or Q-function and they illustrate this method by finding the day-ahead schedule of a heat-pump thermostat.

In a final experiment, spanning 100 days, the authors have successfully tested this method to find the day-ahead consumption plan of a residential heat pump. Their future research in this area will focus on employing the presented algorithms in a realistic lab environment. The authors are currently testing the expert policy adjustment method on a converted electric water heater and an air conditioning unit with promising results. The preliminary findings of the lab experiments indicate that the expert policy adjustment and extended fitted Q-iteration can be successfully used in a real-world demand response setting. 

In this paper, the learning agent is applied to two relevant demand response business models: dynamic pricing and dayahead scheduling [1], [35]. 

The method enforces monotonicity conditions by using convex optimization to approximate the policy, where expert knowledge is included in the form of extra constraints. 

A final challenge for model-free batch RL techniques is that of finding a consumption plan for the next day, i.e., an open-loop solution. 

The results indicate that when the number of tuples in F is small, the expert policy adjustment method can be used to improve the performance of standard fitted Q-iteration.

The results of an experiment with an electric water heater have indicated that the policy adjustment method was able to reduce the cost objective by 11% compared to fitted Q-iteration without expert knowledge. 

In [21], Fonteneau et al. propose the following distance metric in X × U: ((x, u), (x′, u′)) ↦ ‖x − x′‖ + ‖u − u′‖, where ‖·‖ denotes the Euclidean norm.

The observed state information of both FQI controllers is defined by (22), where a handcrafted feature is used to represent the temperature of the building envelope (nr set to 3). 

When the number of temperature sensors is large, it can be interesting to map this high-dimensional state vector to a low-dimensional feature vector.

Batch RL techniques construct policies based on a batch of tuples of the form F = {(x_l, u_l, x′_l, c_l)}_{l=1}^{#F}, where x_l = (x^q_{l,t}, x^d_{l,t}, x_{l,ph}, x^ph_{l,ex}) denotes the state at time step l and x′_l denotes the state at time step l + 1.