LASSO vector autoregression structures for very short-term wind power forecasting

doi:10.1002/WE.2029

WIND ENERGY

Wind Energ. 2017; 20:657–675

Published online 19 September 2016 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/we.2029

RESEARCH ARTICLE


Laura Cavalcante¹, Ricardo J. Bessa¹, Marisa Reis¹ and Jethro Browell²

¹ INESC Technology and Science (INESC TEC), Campus da FEUP, Rua Dr. Roberto Frias, Porto 4200-465, Portugal

² Royal College Building, University of Strathclyde, 204 George Street, Glasgow, Scotland

ABSTRACT

The deployment of smart grids and renewable energy dispatch centers motivates the development of forecasting techniques that take advantage of near real-time measurements collected from geographically distributed sensors. This paper describes a forecasting methodology that explores a set of different sparse structures for the vector autoregression (VAR) model using the least absolute shrinkage and selection operator (LASSO) framework. The alternating direction method of multipliers is applied to fit the different LASSO-VAR variants and create a scalable forecasting method supported by parallel computing and fast convergence, which can be used by system operators and renewable power plant operators. A test case with 66 wind power plants is used to show the improvement in forecasting skill from exploring distributed sparse structures. The proposed solution outperformed the conventional autoregressive and vector autoregressive models, as well as a sparse VAR

KEYWORDS

wind power; vector autoregression; scalability; sparse; renewable energy; parallel computing

Correspondence

Ricardo J. Bessa, INESC Technology and Science (INESC TEC), Campus da FEUP, Rua Dr. Roberto Frias, Porto 4200-465, Portugal.

E-mail: ricardo.j.bessa@inesctec.pt

Received 8 March 2016; Revised 11 July 2016; Accepted 23 August 2016

1. INTRODUCTION

Operating a power system with high integration levels of wind power is challenging and demands continuous improvement of wind power forecast tools [1, 2]. Furthermore, the participation of wind power in the electricity market also requires accurate forecasts in order to mitigate financial risks associated with energy imbalances [3, 4].

The recent advent of smart grid technologies will increase the monitoring capability of the electric power system [5]. Furthermore, the investment in renewable energy dispatch centers enables real-time acquisition of time series measurements from wind power plants (WPPs) [6]. The availability of the most recent WPP measurements improves the forecast skill during the first lead times, commonly called the very short-term horizon [7].

For this time horizon, it is generally established that statistical models are more accurate than physical models, while for longer time horizons the most relevant inputs come from numerical weather prediction (NWP) models [7]. Even recent advances in physical models, such as the high-resolution rapid refresh model developed by the U.S. National Oceanic and Atmospheric Administration, are outperformed by statistical models that use recent WPP observations [8].

In the state of the art, a broad family of statistical models is available for the very short-term horizon. Two examples are the conditional parametric autoregression (AR) and regime-switching models that incorporate online observed local variables (i.e., wind speed and direction) to reduce the wind power forecast error for 10 min-ahead forecasting [9]. Another example is the use of automatic self-tuning Kalman filters that incorporate NWP information [10].

In this context, information from WPP time series distributed in space can be used to improve the forecast skill of each WPP. The first results were presented by Gneiting et al. for 2 h-ahead wind speed forecasting [11]. The authors showed that a regime-switching space-time diurnal model that takes advantage of temporal and spatial correlation from geographically dispersed meteorological stations as off-site predictors can have a root mean square error (RMSE) 28.6% lower than the persistence forecasts. Expert knowledge and empirical results were used to select the predictors. Hering and Genton [12] proposed two additional statistical models: the trigonometric direction diurnal model and the bivariate skew-t model. These results were generalized by Tastu et al. by studying the spatiotemporal propagation of wind power forecast errors [13]. The authors showed evidence of cross-correlation functions with significant dependency at lags of a few hours.

Figure 1. Groups of models from the state of the art.

These works motivated recent research that explores information from neighboring WPPs. Figure 1 groups the state-of-the-art methods applied to wind power by category.

The first group consists of machine learning methods, such as artificial neural networks. To the authors’ knowledge, there is little research concerning the application of machine learning models to this problem. Kou et al. [14] describe an online sparse Bayesian model based on a warped Gaussian process to generate probabilistic wind power forecasts. A sparsification strategy is used to reduce the computational cost, and the model includes wind speed observations from nearby WPPs and NWP data. Also in this category, but applied to solar power forecasting, Vaz et al. [15] use multilayer perceptron neural networks to combine measurements of neighboring PV systems, and Bessa et al. [16] use component-wise gradient boosting to explore PV observations from a smart grid.

The following limitations were identified for this first group: (i) a separate model is fitted to each location, which increases the computational time; (ii) the scalability of the solution decreases when the number of predictors increases and (iii) with the exception of the sparse Bayesian model, the others do not provide a sparse vector of coefficients.

The second group consists of random fields. To the authors’ knowledge, the only work that explores this theory is from Wytock and Kolter [17]. The model is based on a sparse Gaussian conditional random field and uses a new second-order active set method to solve the problem. The main limitation of the method is that it requires a copula transformation in order to have a Gaussian marginal distribution, which might not solve the boundary problem of variables with limited support (e.g., wind power between zero and rated power). Moreover, the computational time for a solution with high accuracy is around 160 min for a case study with seven WPPs [18].

The third group is related to classical time series theory. Tastu et al. extended their previous work [13] to the multivariate framework [19], i.e., from an AR to a vector AR (VAR) model. The VAR coefficients are allowed to vary with external variables, average wind direction in this case. The main limitation is a non-sparse matrix of coefficients, since feature selection is not performed. A similar methodology was applied in Tastu et al. [20] to generate probabilistic forecasts based on geographically distributed sensors. Also in this case, the predictors are manually selected based on cross-correlation analysis.

He et al. present a two-stage approach [21]: (i) offline spatial–temporal analysis carried out on historical data with multiple finite-state Markov chains and (ii) online forecasting by feeding a Markov chain with real-time measurements of the wind turbines. Similar to previous works, different sparse structures of the spatial–temporal relations are not fully explored. The same authors in He et al. [22] propose a different approach based on VAR model fitting with sparsity-constrained maximum likelihood. The main limitation of this approach is that the sparse coefficients are not automatically defined; instead, expert knowledge and partial correlation analysis are employed.

Aiming to generate forecasts on a large spatial scale, e.g., hundreds of locations, Dowell and Pinson proposed the sparse-VAR (sVAR) approach for 5 min-ahead forecasts [23]. The sVAR method generates probabilistic forecasts based on the logit-normal distribution [24], whose mean is estimated with a VAR model and whose variance is estimated by a modified exponential smoothing. A state-of-the-art technique from Davis et al. [25] is employed to fit a VAR model with a sparse coefficient matrix. The work proposed in the present paper is closely related to the sVAR and provides the following original contributions:


1. Explores a set of different sparse structures for the VAR framework using the least absolute shrinkage and selection operator (LASSO) framework [26];

2. Applies the alternating direction method of multipliers (ADMM) [27] to fit the different LASSO-VAR variants;

3. Proposes a scalable forecasting method based on parallel computing, a fast-convergence optimization algorithm and matrix calculations.

The proposed method will be compared with the sVAR approach in terms of advantages and limitations, applied to a case study with 66 WPPs located in the same control area. It should be stressed that the proposed approach is compatible with previous works from the literature. For instance, it can be used for spatial–temporal correction of forecast errors [13], extended to conditional VAR [19], or used to generate probabilistic forecasts based on the logit-normal distribution [23].

The paper is organized as follows. Section 2 presents the different sparse structures for the VAR model. Section 3 describes the application of the ADMM method to fit the VAR model in its different LASSO variants. The test case results are presented in Section 4. Section 5 presents the conclusion and future work.

2. SPARSE STRUCTURES FOR THE VAR MODEL

The VAR model allows a simultaneous forecast of the wind generation at several neighboring sites by combining time series information. However, forecasting with VAR models may be intractable for high-dimensional data, since the non-sparse coefficient matrix grows quadratically with the number of series included in the model. In order to overcome this limitation, Hsu et al. [28] proposed the combination of the LASSO and VAR frameworks, which is further explored in this paper for very short-term forecasting of wind power.

2.1. Formulation of the forecasting problem

The VAR model allows us to model the joint dynamic behavior of a collection of WPPs by capturing the linear interdependencies between their time series. In this multivariate (or spatiotemporal) framework, the future trajectory of output from each WPP in the model is based on its own past values (lagged values) and the past values of the other WPPs included in the model.

Suppose $y_{i,t}$ is the time series containing the average power measured at WPP $i$ and time interval $t$. Using an autoregressive (AR) process of order $p$ (AR[$p$]) it is possible to describe a future trajectory based on its past observations as

$$y_{i,t} = \nu + \sum_{l=1}^{p} \beta^{(l)} y_{i,t-l} + \varepsilon_t \tag{1}$$

where $\beta^{(1)}, \ldots, \beta^{(p)}$ are the model coefficients, $\nu$ is a constant (or intercept) term, $p$ is the order of the AR model, and $\varepsilon_t$ is a contemporaneous white noise (or residual) term with zero mean and constant variance $\sigma_{\varepsilon}^{2}$.

Let $\{Y_t\} = \{(y_{1,t}, y_{2,t}, \ldots, y_{k,t})'\}$ denote a $k$-dimensional vector time series. Modeling it as a vector AR process of order $p$ (VAR$_k$[$p$]), we obtain an expression relating the future observations at each of the $k$ WPPs to the past observations of all WPPs in the model, given by

$$Y_t = \nu + \sum_{l=1}^{p} B^{(l)} Y_{t-l} + e_t \tag{2}$$

in which $\nu$ is a vector of constant terms, each $B^{(l)} \in \mathbb{R}^{k \times k}$ represents a coefficient matrix related to the lag $l$ and $e_t \sim (0, \Sigma_e)$ denotes a white noise disturbance term.

In order to obtain a compact matrix notation, let $Y = (Y_1, Y_2, \ldots, Y_T)$ define the $k \times T$ response matrix, $B = \left(B^{(1)}, B^{(2)}, \ldots, B^{(p)}\right)$ the $k \times kp$ matrix of coefficients, $Z = (Z_1, Z_2, \ldots, Z_T)$ the $kp \times T$ matrix of explanatory variables (or predictors), in which $Z_t = (Y'_{t-1}, Y'_{t-2}, \ldots, Y'_{t-p})'$, and $E = (e_1, e_2, \ldots, e_T)$ the $k \times T$ error matrix. To simplify the notation, consider $m = kp$. Then it is possible to express (2) as

$$Y = \nu \mathbf{1}' + BZ + E \tag{3}$$

with $\mathbf{1}$ denoting a $T \times 1$ vector of ones.
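As an illustration of the compact notation, the following Python/NumPy sketch arranges a k × N array of observations into the response matrix Y and the lagged predictor matrix Z. The function name `build_var_matrices` and the synthetic data are assumptions introduced here for illustration only; they are not taken from the original study.

```python
import numpy as np

def build_var_matrices(series, p):
    """Arrange a k x N series into Y (k x T) and Z (kp x T), with T = N - p."""
    k, n = series.shape
    # Responses: the observations that have p past values available.
    Y = series[:, p:]
    # Predictors: lag 1 stacked on top, then lag 2, ..., lag p,
    # so that column t of Z is (Y'_{t-1}, ..., Y'_{t-p})'.
    Z = np.vstack([series[:, p - l:n - l] for l in range(1, p + 1)])
    return Y, Z

rng = np.random.default_rng(0)
series = rng.random((3, 10))   # synthetic data: k = 3 sites, N = 10 time steps
Y, Z = build_var_matrices(series, p=2)
print(Y.shape, Z.shape)
```

With k = 3 and p = 2, Z has m = kp = 6 rows, and each column of Z holds the two most recent observations of all three sites preceding the matching column of Y.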

The matrix of unknown coefficients needs to be correctly estimated to obtain the model that ‘best’ characterizes the data. Commonly, this is achieved using the least squares statistical methodology by choosing the coefficients that minimize the sum of squared errors. The predictor that will be deduced gives, for a given sample, the in-sample forecasts of the variable of interest.


Usually this methodology is applied with centered variables instead of the original ones. This simplifies the calculation, since the model can be handled without an intercept term; the intercept can be easily estimated after the model has been fitted. As a result, and assuming centered variables $Y$ and $Z$, $\nu$ will no longer appear in the least squares objective function.
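The centering step can be sketched as follows in Python/NumPy, under the assumption of synthetic data generated from a known coefficient matrix (`B_true`, `nu_true` and all sizes are illustrative choices, not values from the paper). The coefficients are estimated by ordinary least squares on the centered variables, and the intercept is recovered afterwards from the row means.

```python
import numpy as np

rng = np.random.default_rng(1)
k, m, T = 2, 4, 200                       # illustrative sizes
B_true = rng.normal(scale=0.3, size=(k, m))
nu_true = np.array([0.5, -1.0])
Z = rng.normal(size=(m, T))               # synthetic predictors
Y = nu_true[:, None] + B_true @ Z + 0.01 * rng.normal(size=(k, T))

# Center each row so the intercept drops out of the objective.
Y_mean = Y.mean(axis=1, keepdims=True)
Z_mean = Z.mean(axis=1, keepdims=True)
Yc, Zc = Y - Y_mean, Z - Z_mean

# Least squares for min ||Yc - B Zc||^2, solved as Zc' B' = Yc'.
B_hat = np.linalg.lstsq(Zc.T, Yc.T, rcond=None)[0].T

# Recover the intercept after fitting.
nu_hat = (Y_mean - B_hat @ Z_mean).ravel()
print(np.round(nu_hat, 2))
```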

The multi-period forecasts can be generated with two alternative strategies: the iterative or the direct approach [29]. In this paper, a direct approach, in which a specific model is created for each lead time, is adopted to generate 6 h-ahead wind power forecasts.
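A minimal sketch of the direct strategy, assuming Python/NumPy: for each lead time h, the targets are shifted h steps ahead of the most recent predictor, and a separate model would then be fitted to each (Y, Z) pair. The helper `direct_design`, the toy series and the set of horizons are hypothetical choices made for illustration.

```python
import numpy as np

def direct_design(series, p, h):
    """Pairs for a horizon-h direct model: predictors end at time t, target is at t + h."""
    k, n = series.shape
    first_t = p - 1            # earliest t with p past values available
    last_t = n - 1 - h         # latest t whose target t + h exists
    Y = series[:, first_t + h:last_t + h + 1]
    # Most recent observation first, then progressively older ones.
    Z = np.vstack([series[:, first_t - l:last_t - l + 1] for l in range(p)])
    return Y, Z

# Toy series whose values equal their time index, so the alignment is easy to read.
series = np.arange(10.0)[None, :]
Y3, Z3 = direct_design(series, p=2, h=3)
print(Y3)   # targets at times 4..9
print(Z3)   # predictors at times 1..6 (row 0) and 0..5 (row 1)

# One design per lead time; each would be fitted independently.
designs = {h: direct_design(series, p=2, h=h) for h in range(1, 4)}
```

For h = 1 this reduces to the one-step-ahead arrangement of equation (3).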

2.2. Sparse structures with LASSO

This section presents a set of different sparse structures for the LASSO-VAR model, inspired by Nicholson et al. [30], to capture the dynamics of the underlying system.

The LASSO framework is powerful and convenient to use when handling high-dimensional data. The loss function is a regularized version of least squares that introduces an $L_1$ penalty on the coefficients. The penalty function shrinks some of the coefficients to zero, performing variable selection and producing a sparse solution. Instead of assuming that all the predictors contribute to the model, this framework extracts the most important predictors, i.e., those with the strongest contribution to the prediction of the target variable.

Let $\|\cdot\|_r$ represent both vector and matrix $L_r$ norms. The standard LASSO-VAR (sLV) loss function is expressed as [30]

$$\frac{1}{2}\|Y - BZ\|_2^2 + \lambda \|B\|_1 \tag{4}$$

where $\lambda > 0$ is a scalar regularization (or penalty) parameter controlling the amount of shrinkage.

The $L_1$ penalty works as a sparsity-inducing term over the individual entries of the coefficient matrix $B$, zeroing some of them in an element-wise manner.
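The element-wise zeroing can be made concrete with the soft-thresholding operator, which is the proximal operator of the L1 norm and is the building block through which proximal and ADMM-type solvers produce exact zeros. The Python/NumPy sketch below evaluates the sLV objective and applies soft-thresholding to a small vector; the function names and the numbers are illustrative assumptions.

```python
import numpy as np

def slv_loss(Y, Z, B, lam):
    """Standard LASSO-VAR objective: 0.5 * ||Y - B Z||_2^2 + lam * ||B||_1."""
    resid = Y - B @ Z
    return 0.5 * np.sum(resid ** 2) + lam * np.sum(np.abs(B))

def soft_threshold(x, tau):
    """Element-wise soft-thresholding, the proximal operator of tau * ||x||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

x = np.array([-2.0, -0.3, 0.1, 1.5])
shrunk = soft_threshold(x, 0.5)
print(shrunk)   # entries with |x| <= 0.5 become zero, the others move 0.5 toward zero
```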

Since the same predictors are available for each target variable (each WPP), the VAR coefficients can be estimated with ordinary least squares applied independently to the regression of each individual target variable [31]. The problem is then reformulated for each row of the matrix $Y$, with a different penalization parameter for each, resulting in a separable loss function for each variable.

The main advantage of this approach, here called row LASSO-VAR (rLV), is the possibility of distributed computing, since each equation can be solved in parallel. Its loss function can be expressed as

$$\frac{1}{2}\|Y_i - B_i Z\|_2^2 + \lambda \|B_i\|_1 \tag{5}$$

where $Y_i$ and $B_i$, $i = 1, \ldots, k$, correspond to the $i$th rows of the $Y$ and $B$ matrices, respectively.
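The row-wise separability can be sketched in Python/NumPy as follows. Note the assumptions: each row problem is solved here with a plain proximal gradient (ISTA) loop rather than ADMM, a single penalty value `lam` is shared by all rows for simplicity, and the data are synthetic. Because the rows are independent, the list comprehension could be replaced by any parallel map.

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def fit_row_lasso(y, Z, lam, n_iter=500):
    """ISTA for 0.5 * ||y - b Z||_2^2 + lam * ||b||_1, with b a length-m row."""
    L = np.linalg.norm(Z, 2) ** 2          # Lipschitz constant of the gradient
    b = np.zeros(Z.shape[0])
    for _ in range(n_iter):
        grad = -(y - b @ Z) @ Z.T          # gradient of the least squares term
        b = soft_threshold(b - grad / L, lam / L)
    return b

def row_loss(y, Z, b, lam):
    return 0.5 * np.sum((y - b @ Z) ** 2) + lam * np.sum(np.abs(b))

rng = np.random.default_rng(2)
k, m, T, lam = 3, 6, 300, 5.0              # illustrative sizes and penalty
B_true = np.zeros((k, m))
B_true[0, 0] = 0.8
B_true[1, 1] = -0.6
B_true[2, 2] = 0.5
B_true[2, 5] = 0.4
Z = rng.normal(size=(m, T))
Y = B_true @ Z + 0.05 * rng.normal(size=(k, T))

# Each row is an independent problem.
B_hat = np.vstack([fit_row_lasso(Y[i], Z, lam) for i in range(k)])

# With a shared penalty, the row losses add up to the sLV loss of equation (4).
total = 0.5 * np.sum((Y - B_hat @ Z) ** 2) + lam * np.sum(np.abs(B_hat))
rows_sum = sum(row_loss(Y[i], Z, B_hat[i], lam) for i in range(k))
print(np.round(B_hat, 2))
```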

An alternative to dealing with the model’s coefficients individually, which results in an unstructured sparsity pattern, is to make some simple modifications to the sLV penalty in order to capture different sparsity patterns according to the inherent structure of the VAR [30]. These modifications produce more interpretable models that offer great flexibility in the detection of the true underlying dynamics of the system, which is especially fruitful in the high-dimensional context.

To take into account characteristics such as lag selection, within-group sparsity and delineation between a component’s own lags and those of another component, and to evaluate which variables add forecast improvement, the following LASSO-VAR sparse structures are explored: lag-group LASSO-VAR (lLV), lag-sparse-group LASSO-VAR (lsLV), own/other-group LASSO-VAR (ooLV) and causality-group LASSO-VAR (cLV). These LASSO schemes search for sparsity in distinct group structures, trying to find the ideal sparsity pattern.

The lLV model considers the coefficients grouped by their time lags and looks for time lags that add forecast improvement. Its objective function is

$$\frac{1}{2}\|Y - BZ\|_2^2 + \lambda \sum_{l=1}^{p} \|B^{(l)}\|_2 \tag{6}$$

where each $B^{(l)}$ is a sub-matrix containing the lag $l$ coefficients.

This structure can be relevant if the interest is to perform lag selection. However, although it is advantageous when all time series tend to exhibit similar dynamics, it might be too restrictive for certain applications, since all the coefficients of a discarded lag are excluded from the prediction, and sometimes inefficient, since an entire lag is included even if only a few of its coefficients are significant.
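The all-or-nothing behavior of the lag groups can be illustrated with the group soft-thresholding operator, the proximal operator of the (unsquared) L2 norm of a block. In the Python/NumPy sketch below, the helper names, the coefficient matrix and the threshold are assumptions chosen for illustration: the weak lag-2 block is removed entirely, while the lag-1 block is only shrunk.

```python
import numpy as np

def split_lags(B, k, p):
    """Split the k x kp matrix B into its p lag blocks B^(1), ..., B^(p)."""
    return [B[:, l * k:(l + 1) * k] for l in range(p)]

def llv_penalty(B, k, p):
    """Sum over lags of the entry-wise L2 (Frobenius) norm of each block."""
    return sum(np.linalg.norm(Bl) for Bl in split_lags(B, k, p))

def group_soft_threshold(G, tau):
    """Shrink a whole block toward zero; the block vanishes if its norm is <= tau."""
    norm = np.linalg.norm(G)
    if norm <= tau:
        return np.zeros_like(G)
    return (1.0 - tau / norm) * G

k, p = 2, 2
B = np.array([[0.9, 0.2, 0.05, 0.00],
              [0.1, 0.8, 0.00, 0.05]])   # strong lag 1, weak lag 2

B_prox = np.hstack([group_soft_threshold(Bl, 0.2) for Bl in split_lags(B, k, p)])
print(B_prox)
```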

In an attempt to overcome some of these limitations, the lsLV model adds within-group (or lag) sparsity to the lLV through the loss function

$$\frac{1}{2k}\|Y - BZ\|_2^2 + \lambda (1 - \alpha) \sum_{l=1}^{p} \|B^{(l)}\|_2 + \lambda \alpha \|B\|_1 \tag{7}$$

where $0 \le \alpha \le 1$ is a parameter regulating the trade-off between the group and within-group importance.

Figure 2. Example of sparsity patterns produced by LASSO-VAR structures.

As can be easily seen, the lLV and the sLV are obtained by setting $\alpha = 0$ and $\alpha = 1$, respectively. Here, as proposed in Nicholson et al. [30], the within-group sparsity is set based on the number of time series (variables) as $\alpha = 1/(k + 1)$. In this sense, as the number of variables increases, the group-wise sparsity becomes greater and the within-group sparsity smaller. This variant allows exploring the significance of each lag and, at the same time, assessing the importance of each coefficient within each lag.
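A small Python/NumPy sketch of the lsLV penalty term, with the default mixing weight α = 1/(k + 1), is given below. The function name `lslv_penalty` and the example matrix are assumptions for illustration; only the penalty is computed, not the full loss. Setting α to 0 or 1 recovers the lag-group and the element-wise L1 penalties, respectively.

```python
import numpy as np

def lslv_penalty(B, k, p, lam, alpha=None):
    """lam * ((1 - alpha) * sum_l ||B^(l)||_2 + alpha * ||B||_1)."""
    if alpha is None:
        alpha = 1.0 / (k + 1)            # default weight based on the number of series
    group = sum(np.linalg.norm(B[:, l * k:(l + 1) * k]) for l in range(p))
    return lam * ((1.0 - alpha) * group + alpha * np.sum(np.abs(B)))

k, p = 2, 2
B = np.array([[0.9, 0.2, 0.05, 0.00],
              [0.1, 0.8, 0.00, 0.05]])

group_only = lslv_penalty(B, k, p, lam=1.0, alpha=0.0)   # lag-group penalty
l1_only = lslv_penalty(B, k, p, lam=1.0, alpha=1.0)      # element-wise L1 penalty
mixed = lslv_penalty(B, k, p, lam=1.0)                   # alpha = 1/3 for k = 2
print(group_only, l1_only, mixed)
```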

The ooLV model addresses the possibility that, in many settings, the prediction of a variable is more influenced by its own past observations than by past observations of other variables. To address this in a lag context, the coefficients of each $B^{(l)}$ are grouped into the diagonal entries, representing each variable’s own lags, and the off-diagonal entries, representing cross dependencies with other variables, using the loss function

$$\frac{1}{2}\|Y - BZ\|_2^2 + \lambda \sqrt{k} \sum_{l=1}^{p} \left\|\operatorname{diag}\left(B^{(l)}\right)\right\|_2 + \lambda \sqrt{k(k-1)} \sum_{l=1}^{p} \left\|B^{(l)-}\right\|_2 \tag{8}$$

where $B^{(l)-} = \{[B^{(l)}]_{ij} : i \neq j\}$. Since the groups differ in cardinality, it is necessary to weight the penalty accordingly to avoid favoring the larger groups of off-diagonal entries.
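The own/other grouping can be sketched in Python/NumPy as follows. The function name `oolv_penalty` and the two test matrices are illustrative assumptions; the weights √k and √(k(k − 1)) are the square roots of the number of diagonal and off-diagonal entries in each lag block.

```python
import numpy as np

def oolv_penalty(B, k, p, lam):
    """Own/other-group penalty: diagonal and off-diagonal entries of each lag block."""
    off_diag = ~np.eye(k, dtype=bool)
    total = 0.0
    for l in range(p):
        Bl = B[:, l * k:(l + 1) * k]
        own = np.diag(Bl)            # a variable's own lag-l coefficients
        other = Bl[off_diag]         # cross dependencies with other variables
        total += np.sqrt(k) * np.linalg.norm(own)
        total += np.sqrt(k * (k - 1)) * np.linalg.norm(other)
    return lam * total

own_only = np.array([[3.0, 0.0],
                     [0.0, 4.0]])    # only own-lag coefficients (k = 2, p = 1)
cross_only = np.array([[0.0, 3.0],
                       [4.0, 0.0]])  # only cross coefficients

print(oolv_penalty(own_only, k=2, p=1, lam=1.0))
print(oolv_penalty(cross_only, k=2, p=1, lam=1.0))
```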

If not all time series share the same dynamics, one may be interested in finding which of them do. Recent studies have been addressing this question by considering causal structures in multivariate series, also called Granger causality. The idea is that a time series $y_i$ is Granger-caused by another time series $y_j$ if knowing the past values of $y_j$ helps to improve the prediction of $y_i$ [32].

With the intention of learning causal structure from the data, the cLV model [33] groups the coefficients by the corresponding variables (that they affect). Its loss function is

$$\frac{1}{2}\|Y - BZ\|_2^2 + \lambda \sum_{i \neq j} \left\| \left( [B^{(1)}]_{ij}, [B^{(2)}]_{ij}, \ldots, [B^{(p)}]_{ij} \right) \right\|_2 \tag{9}$$

The $L_2$ norm of the $p$-tuple of $[B^{(l)}]_{ij}$ is a composite penalty that forces all $p$ matrices $B^{(l)}$ to share the same sparsity pattern, as can be observed in Figure 2. This structure can be useful to detect which locations can improve the forecasts at a given location.
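The shared sparsity pattern can be illustrated with a short Python/NumPy sketch, in which each off-diagonal p-tuple is shrunk as a single group while diagonal entries are left unpenalized, in line with the sum over i ≠ j in equation (9). The helper names, the example coefficients and the threshold are assumptions made for illustration, and the group shrinkage step is the generic proximal operator of the L2 norm rather than a solver taken from the paper.

```python
import numpy as np

def stack_lags(B, k, p):
    """Return an array S of shape (p, k, k) with S[l, i, j] = [B^(l+1)]_ij."""
    return np.stack([B[:, l * k:(l + 1) * k] for l in range(p)])

def clv_penalty(B, k, p, lam):
    norms = np.linalg.norm(stack_lags(B, k, p), axis=0)   # k x k norms of the p-tuples
    return lam * norms[~np.eye(k, dtype=bool)].sum()

def clv_prox(B, k, p, tau):
    """Group soft-thresholding of every off-diagonal p-tuple."""
    S = stack_lags(B, k, p)
    norms = np.linalg.norm(S, axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        shrink = np.maximum(1.0 - tau / norms, 0.0)
    shrink = np.where(norms > 0, shrink, 0.0)
    scale = np.ones((k, k))
    off_diag = ~np.eye(k, dtype=bool)
    scale[off_diag] = shrink[off_diag]
    return np.hstack(list(S * scale))

k, p = 2, 2
B = np.array([[0.50, 0.30, 0.20, 0.40],
              [0.01, 0.60, 0.02, 0.10]])
B_prox = clv_prox(B, k, p, tau=0.1)
pattern_lag1 = B_prox[:, :k] != 0
pattern_lag2 = B_prox[:, k:] != 0
print(B_prox)
```

In this example the weak link from series 1 to series 2 is removed at both lags at once, so the two lag blocks end up with the same pattern of zeros.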

For a better understanding of the presented LASSO-VAR variants, Figure 2 illustrates an example of the sparsity patterns generated by each of them.
