Data Science for Building Energy Management: a review
Miguel Molina-Solana (a,b), María Ros (a,*), M. Dolores Ruiz (a), Juan Gómez-Romero (a), M.J. Martin-Bautista (a)
(a) Department of Computer Science and Artificial Intelligence, Universidad de Granada
(b) Data Science Institute, Imperial College London
Abstract
The energy consumption of residential and commercial buildings has risen steadily in recent years, an
increase largely due to their HVAC systems. Expected energy loads, transportation, and storage as well
as user behavior influence the quantity and quality of the energy consumed daily in buildings. However,
technology is now available that can accurately monitor, collect, and store the huge amount of data involved
in this process. Furthermore, this technology is capable of analyzing and exploiting such data in meaningful
ways. Not surprisingly, the use of data science techniques to increase energy efficiency is currently attracting
a great deal of attention and interest. This paper reviews how Data Science has been applied to address the
most difficult problems faced by practitioners in the field of Energy Management, especially in the building
sector. The work also discusses the challenges and opportunities that will arise with the advent of fully
connected devices and new computational technologies.
1. Introduction
There is a general consensus in the world today that human activities are having a negative impact
on the environment and have accelerated both global warming and climate change. These environmental
threats have been intensified by the emissions produced by the energy required for the lighting and HVAC
(heating, ventilation and air-conditioning) systems in buildings. According to the International
Energy Agency (IEA), residential and commercial buildings are responsible for up to 32% of the total final
energy consumption. In fact, in most IEA countries, they account for approximately 40% of the primary
energy consumption. Similar statistics are given by the World Business Council for Sustainable Development
(WBCSD) within the framework of its Energy Efficiency in Buildings (EEB) project (see footnote 1). A
comprehensive review [1] of the state of the art in building energy use (with a primary focus on energy
demand) is also available.
These data indicate that inefficient energy management in aging buildings combined with rising construction
activity in developed countries will cause energy consumption to soar in the near future and heighten the
negative impacts associated with this consumption. Moreover, variable energy costs call for the implementation
of more intelligent strategies to adapt and reduce energy consumption as well as to find alternative
and sustainable energy sources. The relevance of these issues is clearly reflected in the research priorities of
the European Union, as stated in its Horizon 2020 Societal Challenge “Secure, Clean and Efficient Energy”.
This work program targets a significant reduction in energy consumption by 2020 in the transportation and
building sectors, both of which have great potential for energy savings.
Increasing energy efficiency is a two-fold process. Not only does it involve the use of affordable energy
sources, but also the improvement of current energy management procedures and infrastructures. The
latter includes the optimization of energy generation and transportation based on user demand [2], one of
the most important issues for energy companies. In this regard, computer-aided approaches have recently
come into the spotlight. More specifically, increased data awareness in companies has led to the development
of solutions based on Data Mining, a research area that studies how to automatically discover non-trivial
knowledge from data, and Data Science, which encompasses a wide range of techniques and more complex
datasets.
* Corresponding author
Email addresses: miguelmolina@imperial.ac.uk (Miguel Molina-Solana), marosiz@decsai.ugr.es (María Ros),
mdruiz@decsai.ugr.es (M. Dolores Ruiz), jgomez@decsai.ugr.es (Juan Gómez-Romero), mbautis@decsai.ugr.es (M.J.
Martin-Bautista)
Footnote 1: http://www.wbcsd.org/web/eeb.htm
Preprint submitted to Renewable & Sustainable Energy Reviews, June 25, 2017
In the area of building energy management, Data Science is now used to address problems such as the
following: (i) the prediction of energy demand in order to adapt production and distribution; (ii) the analysis
of building operations as well as of equipment status and failures to optimize operation and maintenance
costs; (iii) the detection of energy consumption patterns to create customized commercial offers and to
detect fraud. This requires collecting data pertaining to building operation and user behavior. These data
must also be interpreted to implement adapted energy management policies. The information collected may
come from very heterogeneous sources ranging from on-site sensors (located in the equipment and in the
immediate environment) to external parameters (e.g. weather, energy costs). These advances have also
signified a shift in the perception of who owns these data and who benefits from them [3]. Customers are
increasingly aware of the importance of their actions and the value of the data that they generate. In this
sense, they have become actors with a key role in the energy efficiency landscape.
This paper reviews different data science techniques and explains how they have been employed to deal
with the difficult challenges faced by building energy management. As reflected in recent literature on the
topic, classification and clustering methods are frequently used for this purpose, but there is still room
for improvement in relatively underexplored areas, such as frequent and temporal pattern discovery for
load prediction. Also discussed are future trends in Data Science, which will lead to new methods and
tools capable of more intelligent processing of large amounts of data collected from multiple distributed
devices. Although there are other reviews on automatic techniques for building efficiency assessment [4, 5],
and on classification methods for load and energy consumption prediction [6], this work examines and
discusses a broader set of data science techniques, and their applications to the different aspects of building
energy management.
The paper is structured as follows. After an introduction to data science techniques (Section 2), Section
3 summarizes recent work in Energy Data Science and situates it in the context of the current requirements
and needs of building energy managers. Section 4 discusses the data science techniques employed in various
fields related to building energy management. Finally, Section 5 provides an overview of new approaches
that are expected to lead to research advances, and concludes with recommendations and guidelines for the
future.
2. Data Science
Over the years, technological tools have benefited a wide range of domains, and Energy Efficiency and
Management is no exception. Developments in various areas of Information and Communications Technology
(ICT), such as Control and Automation, Smart Metering, Real-time Monitoring, and Data Science, have
had a tremendous impact on this field. As is well known, Data Science builds systems and algorithms to
discover knowledge, detect patterns, and generate useful insights and predictions from large-scale data. It
encompasses the whole data analysis process, which begins with data extraction and cleaning, and extends
to data analysis, description and summarization. The result is the prediction of new values and their
visualization. Data Science thus involves mathematical and statistical analysis, combined with information
technology tools.
However, deriving insights from data is not only achieved by using such techniques. The expert must
also manage and interpret the data in order to obtain valuable knowledge. As shown in Figure 1, the process
starts with the collection of raw data. After that, it is necessary to clean the data and select the subset that
has the relevant information. For that purpose, the expert applies filters to the data or formulates queries
that will eliminate irrelevant information. It is also at this step that additional sources of information
might be integrated and fused with the original data to provide further knowledge. Once the data are
prepared for use, an exploratory analysis (including visualization tools) can help decide which methods or
algorithms are most effective to obtain the desired knowledge. The final process will lead to a set of results
that guide decision-making, which, again, might rely on visualization.
[Figure 1: Data science process. Diagram labels: Collection of Raw Data; Data Pre-Processing; Data Cleaning;
Data Selection; Data Filtering; Data Querying; Data Aggregation; Exploratory Analysis & Visualization of Data;
Data Processing; Models & Algorithms; Results: Data description & prediction; Visualization of Results;
Reports; Decision Making; Revision.]
Based on the preliminary outcomes, the whole process might need to be tuned to obtain better results.
This could entail setting new parameter values or adding/discarding new sets of data. Since such decisions
cannot be made automatically, the participation of the expert in the analysis of the results is a crucial factor.
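The cleaning, selection, and description steps described above can be sketched in a few lines of Python. This is only an illustrative sketch: the readings, the zone names, and the helper functions `clean`, `select`, and `describe` are hypothetical and do not come from the reviewed literature.

```python
from statistics import mean

# Hypothetical raw meter readings: (hour, zone, kWh); None marks a failed reading.
raw = [
    (0, "office", 1.2), (0, "lobby", 0.4),
    (1, "office", None), (1, "lobby", 0.5),
    (2, "office", 1.6), (2, "lobby", 0.3),
]

def clean(rows):
    """Data cleaning: drop rows whose measurement is missing."""
    return [r for r in rows if r[2] is not None]

def select(rows, zone):
    """Data selection: keep only the rows relevant to the question asked."""
    return [r for r in rows if r[1] == zone]

def describe(rows):
    """Results: a simple numerical description of the selected data."""
    values = [kwh for _, _, kwh in rows]
    return {"n": len(values), "mean_kwh": mean(values), "max_kwh": max(values)}

summary = describe(select(clean(raw), "office"))
print(summary)
```

In line with the text, the expert would inspect `summary` and, if needed, revise the filters or add further data sources before repeating the process.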
From a more technical perspective, Data Science comprises a set of techniques and tools which pursue
different goals and start from different situations. Some of the most popular techniques are classification,
clustering, regression and association rule mining. Although these techniques have been the most frequently
applied in Energy Efficiency and Management, others, which are not so well known (e.g. sequence analysis
and anomaly detection), are also useful in providing solutions for building energy problems.
Classification When classifying a set of objects, the objective is to predict the class of each one on the
basis of its attributes. Decision trees (i.e. a kind of flowchart for the classification of new data) are
a common way of performing and visualizing that classification [7]. Decision trees can be generated by
many different algorithms, though the best known are CLS, ID3, C4.5, C5.0, and CART. Random
Forest is another classification technique that constructs a set of decision trees and then predicts the
class by aggregating the values obtained with each tree (e.g. by using the mode or mean). This method
mitigates overfitting (when the models from the learning algorithm perform very well on the training set,
at the cost of an increased error on the validation set), a common practical difficulty in decision trees.
Support Vector Machine (SVM) [8] is a technique that is also used for classification. SVMs perform
classification tasks by constructing a hyperplane (or a set of hyperplanes) in a multidimensional space
to separate the data (regarded as points in the space) into classes. Once the hyperplane is constructed,
new examples are classified according to the side of the decision boundary on which they fall.
Bayesian classification, genetic algorithms, and Neural Networks have also been employed in
classification tasks. Various approaches use probabilistic classifiers based on Bayes'
theorem, but they rely on strong independence assumptions between the variables
involved [9]. Class prediction with genetic programming algorithms [10] is based on chromosome-like
structures that can be combined with other chromosomes and/or mutated to create new individuals.
Neural Networks (NNs) are able to predict new observations from existing ones by means of
interconnected elements called neurons [11]. The main advantage of NNs is that they are robust and tolerant
of errors. A self-organizing map (SOM) is a type of artificial neural network that is trained by
unsupervised learning to produce low-dimensional views of high-dimensional data. Another well-known
classification method is that of k-Nearest Neighbors, which classifies an object by the majority vote
of its k nearest neighbors. In other words, an object is assigned to the most common category among
its k nearest neighbors [12].
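As an illustration of the majority-vote idea behind k-Nearest Neighbors, the following minimal sketch classifies a new observation from a handful of labelled points. The training data (outdoor temperature, occupancy) and the labels "high" and "low" are hypothetical, and the function name `knn_predict` is an assumption made for this example.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical training set: (outdoor temperature in C, occupants) -> load class.
train = [
    ((5.0, 10), "high"), ((7.0, 12), "high"), ((6.0, 9), "high"),
    ((20.0, 3), "low"), ((22.0, 2), "low"), ((19.0, 4), "low"),
]

print(knn_predict(train, (8.0, 11)))
print(knn_predict(train, (21.0, 3)))
```

The sketch uses plain Euclidean distance on unscaled features; in practice, attributes measured in different units would usually be normalized first.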
Regression The main objective of regression analysis is to numerically estimate the relationship between
variables. This involves ascertaining whether variables are independent. When they are not, it is
then necessary to discover the type of dependence between them [13]. Regression analysis is widely
used in prediction and forecasting as well as to understand how the value of a dependent variable
changes when one independent variable is varied while the others remain fixed. Linear and non-linear
(polynomial, logistic, etc.) regression methods are mainly used for this purpose. In linear regression,
the model assumes that the dependent variable is a linear combination of the parameters. Examples of
linear regression methods are linear least squares, Bayesian linear regression, and generalized linear
models (GLM). Nevertheless, linear models often do not provide a good fit to reality, and non-linear
models are then required. In this case, classification-based techniques, such as support vector
regression or k-Nearest Neighbors, can also be used for regression. For time series in particular, ARMA
(Autoregressive Moving Average) and ARIMA (Autoregressive Integrated Moving Average) models are capable
of predicting future values based on past values. The relationship between variables can also be
statistically measured by means
of the standard deviation, Pearson correlation, and other correlation coefficients.
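To make the linear least squares and Pearson correlation ideas concrete, the sketch below fits a straight line to a small sample and measures the strength of the linear relationship. The temperature and load values are invented for illustration, and the function names `fit_line` and `pearson` are assumptions of this example.

```python
from math import sqrt

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: outdoor temperature (C) and cooling load (kW).
temperature = [20.0, 22.0, 24.0, 26.0, 28.0]
load = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(temperature, load)
print(slope, intercept, pearson(temperature, load))
print(slope * 30.0 + intercept)  # predicted load at 30 C under the fitted line
```

Because the invented data lie exactly on a line, the correlation is at its maximum; real measurements would give a weaker fit and might call for one of the non-linear methods mentioned above.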
Clustering Clustering is the separation of objects into groups (clusters) based on their degree of similarity
[14]. It is unsupervised, because there is no previous knowledge of the classes to which the objects can
be assigned. Depending on the criterion used to measure similarity, there are different models of cluster
analysis: (i) connectivity models, based on distance connectivity (e.g. hierarchical clustering); (ii)
centroid models, which are constructed by assigning objects to the nearest cluster center (e.g. k-means
or k-medians); (iii) distribution models using statistical distributions (e.g. expectation-maximization
algorithm); (iv) density models where clusters are defined based on high-density areas in the data set;
(v) graph-based models in which the data are expressed as graphs. A further distinction can be made
between hierarchical and non-hierarchical models. Hierarchical models take the form of a hierarchy
of clusters (e.g. hierarchical tree or agglomerative hierarchical clustering), whereas non-hierarchical
models produce a flat set of clusters with no relations between them: they group
a set of units into a pre-determined number of groups, using an iterative algorithm that optimizes a
chosen criterion.
Clustering techniques are often a first step in a classification problem when there is no information
about the classes. In an initial phase, clustering is used to identify groups of objects with similar
features. Classification techniques are then applied to assign new objects to these groups. When there
is no previous information about the objects, clustering techniques can also be used for classification
purposes.
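The centroid model can be illustrated with a minimal one-dimensional k-means (Lloyd's algorithm). This is a sketch under simplifying assumptions: the consumption values are invented, the initial centers are supplied by the caller rather than chosen at random, and the function name `kmeans_1d` is hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Lloyd's algorithm in one dimension with caller-supplied initial centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical daily consumption values (kWh) for six days.
daily_kwh = [10.0, 11.0, 12.0, 30.0, 31.0, 32.0]
centers, clusters = kmeans_1d(daily_kwh, centers=[0.0, 50.0])
print(centers)
print(clusters)
```

New observations could then be assigned to the nearest of the resulting centers, which corresponds to the two-phase use of clustering followed by classification described above.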
Association rules (ARs) Association rules are a useful tool for the representation of new information
extracted from raw data and comprehensively expressed for decision-making in the form of implication
rules of the type A → B [15]. These rules depict the frequent co-occurrence of attributes with a
high reliability in a database. For example “most transactions containing beer also contain diapers”
is an association rule that could be found in a supermarket database. The Apriori algorithm and its
adaptations (e.g. generalized rule induction algorithm) are the most widely used, though there are
others, such as the FP-Growth and ECLAT algorithms, which improve scalability in very large datasets
[16, 17]. Association rules now have more sophisticated versions that not only capture correlations
but other kinds of association as well. Examples include the following: (i) generalized ARs, which use
a concept hierarchy to obtain rules relating the different granularities of items; (ii) quantitative ARs,
which deal with categorical and quantitative data; (iii) gradual dependence rules, which capture data
tendencies by obtaining rules of the type “the more/less A the more/less B”; (iv) sequential rules,
which identify relationships between items while considering some ordering criterion (e.g. time).
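The two measures usually attached to a rule A → B, support (how often A and B occur together) and confidence (how often B occurs when A does), can be computed directly. The sketch below does only that for the supermarket example; it does not implement Apriori or FP-Growth candidate generation, and the transactions are invented.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated probability of `consequent` given `antecedent`."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical supermarket transactions.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "bread"},
    {"milk", "bread"},
    {"diapers", "milk"},
]

print(support(transactions, {"beer", "diapers"}))
print(confidence(transactions, {"beer"}, {"diapers"}))
```

A rule mining algorithm would keep only the rules whose support and confidence exceed user-defined thresholds.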
Sequence discovery Sequence discovery comprises techniques that identify statistically relevant patterns
in data, whose values are distributed in order [18]. Frequent problems in sequence analysis include
the following: (i) the extraction of sequence information using techniques such as Motif Mining (MM);
(ii) the detection of frequently occurring patterns; (iii) the search for similar sequences with a time
lag by means of autocorrelation methods such as the ACF (Autocorrelation Function) and PACF
(Partial Autocorrelation Function); (iv) the recovery of missing sequence members. Many of the other
previously explained techniques are also capable of dealing with this kind of data.
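As an example of item (iii), the sample autocorrelation function (ACF) at a given lag can be computed as follows. The series is an invented signal that repeats every four steps, and the function name `acf` is an assumption of this sketch.

```python
def acf(series, lag):
    """Sample autocorrelation of `series` at the given non-negative lag."""
    n = len(series)
    m = sum(series) / n
    denom = sum((x - m) ** 2 for x in series)
    num = sum((series[t] - m) * (series[t + lag] - m) for t in range(n - lag))
    return num / denom

# Hypothetical load deviations that repeat every four time steps.
series = [0.0, 1.0, 0.0, -1.0] * 6

for lag in (0, 2, 4):
    print(lag, acf(series, lag))
```

A strongly positive value at lag 4 and a strongly negative value at lag 2 reflect the four-step cycle in the data, which is the kind of time-lagged similarity the ACF is meant to reveal.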
Anomaly or outlier detection The objective of detecting anomalies is to identify items, events, or
observations that deviate from expected patterns or from the usual behavior of other data items [19].
The discovery of anomalous items is crucial in the detection of bank fraud, medical diagnosis, errors
in data transmission, noise, etc. Since the previously described techniques are based on the
identification/classification of similar items, most frequent patterns, etc., variations of these methods can also
be employed for anomaly discovery. Methods used for this purpose include the following: density-based
techniques, correlation, clustering, the search for deviations from association rules, and combinations of
diverse techniques using, for example, feature bagging or score normalization.
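The notion of deviation from the usual behavior of other data items can be illustrated with a very simple rule based on the median and the median absolute deviation (MAD). This is not one of the specific methods listed above; it is a swapped-in baseline chosen for brevity, and the readings, the threshold of 3.5, and the function name `mad_outliers` are assumptions of this sketch.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return indices of values lying far from the median, in units of the MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to compare against; nothing is flagged
    return [i for i, v in enumerate(values) if abs(v - med) / mad > threshold]

# Hypothetical hourly readings (kWh) with one abnormal spike.
readings = [5.0, 5.2, 4.9, 5.1, 5.0, 12.0, 5.1]
print(mad_outliers(readings))
```

The median is used instead of the mean so that the spike itself does not distort the estimate of typical behavior.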
Time series analysis Time series analysis is performed on time-series data (i.e. data points that are
recorded over time) in order to model data and then use the model to predict or monitor future
values of the time series [20]. The most frequently used methods include the following: (i) methods for
exploratory analysis (e.g. autocorrelation, trend analysis, wavelets, etc.); (ii) prediction and forecasting
techniques (e.g. regression methods, signal estimation, etc.); (iii) classification methods which assign
a category to patterns in the series; (iv) segmentation which aims to identify a sequence of points
sharing specific properties (e.g. ARMA or ARIMA).
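As a heavily simplified member of the ARMA family mentioned above, a first-order autoregressive model, x[t] = phi * x[t-1] + noise, can be fitted by least squares and then iterated forward to forecast. The sketch assumes a zero-mean series with no intercept and no moving-average term; the data and the function names `fit_ar1` and `forecast` are hypothetical.

```python
def fit_ar1(series):
    """Least-squares estimate of phi in x[t] = phi * x[t-1] + noise (no intercept)."""
    num = sum(series[t] * series[t - 1] for t in range(1, len(series)))
    den = sum(series[t - 1] ** 2 for t in range(1, len(series)))
    return num / den

def forecast(series, phi, steps=1):
    """Iterate the fitted recurrence forward from the last observed value."""
    value = series[-1]
    predictions = []
    for _ in range(steps):
        value = phi * value
        predictions.append(value)
    return predictions

# Hypothetical deviations of a load from its long-run mean, decaying over time.
deviations = [8.0, 4.0, 2.0, 1.0, 0.5]
phi = fit_ar1(deviations)
print(phi, forecast(deviations, phi, steps=2))
```

Full ARMA and ARIMA models add moving-average terms and differencing, and are normally estimated with a dedicated statistical library rather than by hand.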
Most of the previously mentioned techniques have a fuzzy extension that allows them to deal with
imprecise and uncertain data in various domains [21]. Fuzzy logic allows a non-strict representation of
object membership to a set, thus avoiding the problem of hard boundaries that are often present in basic
techniques, such as clustering and classification methods. For example, fuzzy k-means is a clustering method
that has proved effective in many scenarios since it permits the assignment of data elements to one or more
clusters [22]. Fuzzy approaches also allow a more human-friendly representation of the extracted knowledge,
since fuzzy association rules are easier to interpret than purely numerical rules [23].
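The soft assignment used by fuzzy k-means (also known as fuzzy c-means) can be sketched with the standard membership formula, in which the degree of membership of a point in a cluster decreases with its relative distance to that cluster's center. The centers, the fuzzifier m = 2, and the function name `fuzzy_memberships` are assumptions of this example; the centers are taken as given and assumed distinct, so the iterative update of the centers is not shown.

```python
def fuzzy_memberships(value, centers, m=2.0):
    """Membership degrees of a 1-D value in each cluster (standard FCM formula)."""
    dists = [abs(value - c) for c in centers]
    if 0.0 in dists:
        # The value coincides with a center: full membership in that cluster.
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]

# Hypothetical cluster centers for low and high daily consumption (kWh).
centers = [11.0, 31.0]
print(fuzzy_memberships(21.0, centers))  # halfway between the two centers
print(fuzzy_memberships(12.0, centers))  # close to the first center
```

Unlike the hard boundaries of ordinary k-means, a value halfway between two centers belongs equally to both clusters, and the memberships of any value sum to one.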
3. Applications of Data Science for Building Energy Management
Data science techniques have been frequently used to support and improve basic aspects of Energy
Efficiency and Management. Accordingly, this section focuses on applications of Data Science that are
capable of doing the following: (1) predicting the energy demand required for the efficient operation of a
building; (2) optimizing building operation; (3) enabling building retrofitting; (4) verifying the operational
status and failures of building equipment and networks; (5) analyzing the economic and commercial impact
of user energy consumption; (6) detecting and preventing energy fraud.
3.1. Prediction of building energy load
Energy demand, or energy load, refers to the amount of energy required at a certain time instant or
interval. In particular, HVAC systems focus on thermal loads, which refer to the quantity of heating and
cooling energy that must be added or removed from the building to keep its occupants comfortable. Thermal
loads can be classified as internal loads, when heat transfer/influence is produced by elements (e.g. lighting,