Journal Article

Out-of-Sample Forecast Tests Robust to the Choice of Window Size

Barbara Rossi¹, Atsushi Inoue²

Pompeu Fabra University¹, North Carolina State University²

01 Aug 2011-Journal of Business & Economic Statistics (Taylor & Francis Group)-Vol. 30, Iss: 3, pp 432-453

TL;DR: In this paper, the authors proposed new methodologies for evaluating economic models' out-of-sample forecasting performance that are robust to the choice of the estimation window size, and evaluated the predictive ability of forecasting models over a wide range of window sizes.


Abstract: This article proposes new methodologies for evaluating economic models’ out-of-sample forecasting performance that are robust to the choice of the estimation window size. The methodologies involve evaluating the predictive ability of forecasting models over a wide range of window sizes. The study shows that the tests proposed in the literature may lack the power to detect predictive ability and might be subject to data snooping across different window sizes if used repeatedly. An empirical application shows the usefulness of the methodologies for evaluating exchange rate models’ forecasting ability.


Summary (3 min read)


1 Introduction

This paper proposes new methodologies for evaluating the out-of-sample forecasting performance of economic models.
The novelty of the methodologies that the authors propose is that they are robust to the choice of the estimation and evaluation window size.
The choice of the estimation window size has always been a concern for practitioners, since the use of different window sizes may lead to different empirical results in practice.
The procedures that the authors propose achieve this robustness by evaluating the models’ forecasting performance for a variety of estimation window sizes, and then taking summary statistics of this sequence.
Rather than improving the forecasting model or estimating an ideal window size, the paper proposes to take summary statistics of tests of predictive ability computed over several estimation window sizes.

2 Robust Tests of Predictive Accuracy When the Window Size is Large

The authors assume that the researcher is interested in evaluating the performance of $h$-steps-ahead direct forecasts for the scalar variable $y_{t+h}$ using a vector of predictors $x_t$ under either a rolling, recursive or fixed window direct forecast scheme.
The methods proposed in this paper can be applied to out-of-sample tests of equal predictive ability, forecast rationality and unbiasedness.
First, if the researcher tries several window sizes and then reports only the results for the window size that gives the strongest evidence in favor of predictive ability, the test may be oversized.
A proposition in the paper states the general intuition behind the approach for each of the cases that the authors consider.
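The oversizing concern can be illustrated with a small simulation. The sketch below is not the authors' Monte Carlo design: the data generating process, the grid of window sizes, the no-intercept rolling OLS and the plain t-type statistic (no HAC or parameter-estimation correction) are all assumptions chosen for brevity. Two non-nested models have equal population accuracy, and the code compares how often a 5% two-sided test rejects for one pre-chosen window with how often the largest statistic across windows exceeds the same critical value.

```python
import numpy as np


def rolling_errors(x_lag, y_next, R):
    """One-step-ahead errors from a no-intercept OLS slope re-estimated
    on the R most recent (x_lag, y_next) pairs before each forecast."""
    n = len(y_next)
    sxy = np.concatenate(([0.0], np.cumsum(x_lag * y_next)))
    sxx = np.concatenate(([0.0], np.cumsum(x_lag ** 2)))
    beta = (sxy[R:n] - sxy[:n - R]) / (sxx[R:n] - sxx[:n - R])
    return y_next[R:] - beta * x_lag[R:]


def dm_stat(e1, e2):
    """Standardized mean difference of squared errors.
    Simplification: no HAC variance, no estimation-uncertainty correction."""
    d = e1 ** 2 - e2 ** 2
    return np.sqrt(len(d)) * d.mean() / d.std(ddof=1)


def rejection_rates(n_sims=300, n=200, windows=(30, 50, 70, 90, 110),
                    crit=1.96, seed=0):
    """Rejection frequency for one fixed window versus the best of several."""
    rng = np.random.default_rng(seed)
    fixed = searched = 0
    for _ in range(n_sims):
        # Hypothetical DGP: each model omits one equally important predictor,
        # so both have the same population mean squared error.
        x1, x2, e = rng.standard_normal((3, n))
        y = 0.5 * x1 + 0.5 * x2 + e
        stats = np.array([dm_stat(rolling_errors(x1, y, R),
                                  rolling_errors(x2, y, R)) for R in windows])
        fixed += abs(stats[0]) > crit            # window chosen in advance
        searched += np.abs(stats).max() > crit   # window chosen after a search
    return fixed / n_sims, searched / n_sims


if __name__ == "__main__":
    print(rejection_rates())
```

By construction the searched rejection rate can never fall below the fixed-window rate, because the maximum includes the pre-chosen window; how far above it lies depends on the assumed design.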

2.1 Non-Nested Model Comparisons

Traditionally, researchers interested in doing inference about the relative forecasting performance of competing, non-nested models rely on Diebold and Mariano’s (1995), West’s (1996) and McCracken’s (2000) test statistics.
The test statistic that they propose relies on the sample average of the sequence of standardized out-of-sample loss differences, eq. (1):

$$\Delta L_T(R) \equiv \frac{1}{\hat{\sigma}_R} P^{-1/2} \sum_{t=R}^{T} \Delta L_{t+h}(\hat{\theta}_{t,R}, \hat{\gamma}_{t,R}), \qquad (5)$$

where $\hat{\sigma}_R^2$ is a consistent estimate of the long run variance matrix of the out-of-sample loss differences.
Consistent estimates of $\sigma^2$ that take into account parameter estimation uncertainty in recursive windows are provided by West (1996) and in rolling and fixed windows are provided by McCracken (2000, p. 203, eqs. 5 and 6).
In particular, a leading case where (6) can be used is when the same loss function is used for estimation and evaluation.
The asymptotic normality result does not hinge on whether or not two models are nested but rather on whether or not the disturbance terms of the two models are numerically identical in population under the null hypothesis.
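To make the construction concrete, the sketch below computes the standardized loss-difference statistic of eq. (5) for each window size and then summarizes the sequence with its largest and its average absolute value. The function names, the Bartlett-kernel long-run variance and the choice of a sup-type and an average-type summary are illustrative assumptions; the variance estimator here does not include the parameter-estimation adjustments of West (1996) or McCracken (2000), and the resulting summaries must be compared with the paper's tabulated critical values (Table 1), not with standard normal quantiles.

```python
import numpy as np


def long_run_variance(d, lags=0):
    """Bartlett-kernel (Newey-West type) long-run variance of a series."""
    d = np.asarray(d, dtype=float)
    d = d - d.mean()
    P = len(d)
    lrv = d @ d / P
    for k in range(1, lags + 1):
        gamma_k = d[k:] @ d[:-k] / P
        lrv += 2.0 * (1.0 - k / (lags + 1.0)) * gamma_k
    return lrv


def standardized_loss_diff(d, lags=0):
    """P^{-1/2} * sum(d) / sigma_hat for one window size."""
    d = np.asarray(d, dtype=float)
    return np.sqrt(len(d)) * d.mean() / np.sqrt(long_run_variance(d, lags))


def summarize_over_windows(loss_diffs_by_R, lags=0):
    """Map {window size R: loss differences} to per-window statistics
    plus sup-type and average-type summaries of their absolute values."""
    stats = {R: standardized_loss_diff(d, lags)
             for R, d in loss_diffs_by_R.items()}
    abs_vals = np.abs(list(stats.values()))
    return stats, abs_vals.max(), abs_vals.mean()
```

A caller would build `loss_diffs_by_R` by re-running the out-of-sample exercise once per window size and storing the resulting loss differences under that window's key.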

2.2 Nested Models Comparison

For the case of nested models comparison, the authors follow Clark and McCracken (2001).
Let Model 1 be the parsimonious model, and Model 2 be the larger model that nests Model 1.
Let $y_{t+h}$ denote the variable to be forecast and let the period-$t$ forecasts of $y_{t+h}$ from the two models be denoted by $\hat{y}_{1,t+h}$ and $\hat{y}_{2,t+h}$: the first ("small") model uses $k_1$ regressors $x_{1,t}$ and the second ("large") model uses $k_1 + k_2 = k$ regressors $x_{1,t}$ and $x_{2,t}$.
In particular, their assumptions hold for one-step-ahead forecast errors (h = 1) from linear, homoskedastic models, OLS estimation, and MSE loss function (as discussed in Clark and McCracken (2001), the loss function used for estimation has to be the same as the loss function used for evaluation).
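For orientation, the encompassing statistic of Clark and McCracken (2001), usually written ENC-NEW, is $P \cdot \frac{P^{-1}\sum (\hat{u}_{1,t+1}^2 - \hat{u}_{1,t+1}\hat{u}_{2,t+1})}{P^{-1}\sum \hat{u}_{2,t+1}^2}$, where $\hat{u}_1$ and $\hat{u}_2$ are the one-step-ahead forecast errors of the small and large model. The sketch below evaluates it for several rolling window sizes. The helper names and the rolling OLS loop are assumptions made for illustration; rows of the regressor matrices are assumed to be already aligned so that row `t` holds the predictors used to forecast `y[t]`.

```python
import numpy as np


def enc_new(u1, u2):
    """ENC-NEW statistic from one-step-ahead errors of the small (u1)
    and large (u2) nested models."""
    u1 = np.asarray(u1, dtype=float)
    u2 = np.asarray(u2, dtype=float)
    P = len(u1)
    return P * np.mean(u1 ** 2 - u1 * u2) / np.mean(u2 ** 2)


def rolling_forecast_errors(X, y, R):
    """Errors from OLS re-estimated on the R most recent observations.
    Assumption: X[t] contains predictors already known when forecasting y[t]."""
    errors = []
    for t in range(R, len(y)):
        beta, *_ = np.linalg.lstsq(X[t - R:t], y[t - R:t], rcond=None)
        errors.append(y[t] - X[t] @ beta)
    return np.array(errors)


def enc_new_by_window(X1, X2, y, windows):
    """ENC-NEW for each rolling window size; the large model uses [X1, X2]."""
    X_large = np.column_stack([X1, X2])
    return {R: enc_new(rolling_forecast_errors(X1, y, R),
                       rolling_forecast_errors(X_large, y, R))
            for R in windows}
```

The per-window values returned by `enc_new_by_window` are the kind of sequence the robust procedure summarizes; the relevant critical values are those reported in the paper's Table 2, which are not reproduced here.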

2.3 Regression-Based Tests of Predictive Ability

Under the widely used MSFE loss, optimal forecasts have a variety of properties.
The following are special cases of regression-based tests of predictive ability: (i) Forecast Unbiasedness Tests: $\hat{\mathcal{L}}_{t+h} = \hat{v}_{t+h}$; (ii) Mincer and Zarnowitz’s (1969) Tests (or Efficiency Tests): $\hat{\mathcal{L}}_{t+h} = \hat{v}_{t+h} X_t$, where $X_t$ is a vector of predictors known at time $t$ (see also Chao, Corradi and Swanson, 2001).
Which are similar to those discussed for eq. (5).
West and McCracken (1998) have shown that it is very important to allow for a general variance estimator that takes into account estimation uncertainty and/or correcting the statistics by the necessary adjustments.
The procedures that the authors propose can also be applied to Patton and Timmermann’s (2007) generalized forecast error.
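A regression-based test of this kind can be sketched as a Wald statistic on the sample mean of the product of the forecast error and the chosen predictors. The code below is a simplified illustration, assuming that $\hat{v}_{t+h}$ denotes the forecast error: it uses the plain sample covariance as the variance estimator, so it ignores both serial correlation and the parameter-estimation corrections that West and McCracken (1998) emphasize. The function name is hypothetical.

```python
import numpy as np


def regression_based_wald(v, X=None):
    """Wald statistic for H0: E[v_t * X_t] = 0.

    v : forecast errors, length P.
    X : (P, k) predictors known at the forecast origin; if omitted, a
        constant is used, which gives a forecast unbiasedness test.
    Simplification: uncorrected sample covariance, no HAC adjustment.
    """
    v = np.asarray(v, dtype=float)
    P = len(v)
    if X is None:
        X = np.ones((P, 1))
    X = np.asarray(X, dtype=float).reshape(P, -1)
    L = v[:, None] * X                 # moment contributions v_t * X_t
    L_bar = L.mean(axis=0)
    centered = L - L_bar
    omega = centered.T @ centered / P  # sample covariance of the moments
    return float(P * L_bar @ np.linalg.solve(omega, L_bar))
```

Passing a matrix whose columns are a constant and the forecast itself gives a Mincer-Zarnowitz style efficiency check; repeating the calculation for every window size produces the sequence that the robust procedure then summarizes.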

3 Robust Tests of Predictive Accuracy When the Window Size is Small

All the tests considered so far rely on the assumption that the window is a fixed fraction of the total sample size, asymptotically.
When the window size diverges to infinity, the correlation between the rolling regression estimator and the regressor vanishes even when the regressor is not strictly exogenous.
When x1t is null, the second term on the right-hand side of equation (20) is zero even when x2t is not strictly exogenous, and their adjustment term and theirs become identical.

5 Empirical evidence

The poor forecasting ability of economic models of exchange rate determination has been recognized since the works by Meese and Rogoff (1983a,b), who established that a random walk forecasts exchange rates better than any economic models in the short run.
Let $\pi_t$ denote the inflation rate in the home country, $\pi_t^*$ denote the inflation rate in the foreign country, $\bar{\pi}$ denote the target level of inflation in each country, $y_t^{gap}$ denote the output gap in the home country and $y_t^{gap*}$ denote the output gap in the foreign country.
The benchmark model, against which the forecasts of both models (27) and (28) are evaluated, is the random walk, according to which the exchange rate changes are forecast to be zero.
Data on interest rates were incomplete for Portugal and the Netherlands, so the authors do not report UIRP results for these countries.
This suggests that the empirical evidence in favor of predictive ability may be driven by the existence of instabilities in the predictive ability, for which rolling windows of small size are advantageous.

6 Conclusions

This paper proposes new methodologies for evaluating economic models’ forecasting performance that are robust to the choice of the estimation window size.
These methodologies are noteworthy since they allow researchers to reach empirical conclusions that do not depend on a specific estimation window size.
The authors show that tests traditionally used by forecasters suffer from size distortions if researchers report, in reality, the best empirical result over various window sizes, but without taking into account the search procedure when doing inference in practice.
Traditional tests may also lack power to detect predictive ability when implemented for an "ad-hoc" choice of the window size.
Finally, their empirical results demonstrate that the recent empirical evidence in favor of exchange rate predictability is even stronger when allowing a wider search over window sizes.


Figures (13)

Table 1. Critical Values for Non-Nested Model Comparisons

Table 2. Critical Values for Nested Model Comparisons Using ENCNEW

Table 11. Data Mining Asymptotic Approximation Results

Table 10. Rejection Frequencies of Regression-Based Tests of Predictive Ability DGP3

Table 8. Rejection Frequencies of Nested Model Comparison Tests DGP1

Table 9. Rejection Frequencies of Non-nested Model Comparison Tests DGP2

Figure 1 plots the estimated Clark and McCracken (2001) ENCNEW test statistic for comparing the UIRP model with the random walk for the window sizes we consider (reported on the x-axis), together with 5% and 10% critical values of the RET test statistic. The test rejects when the largest value of Clark and McCracken’s (2001) test is above the critical value line. Countries are Canada (CAN), France (FRA), United Kingdom (GBP), Germany (GER), Italy (ITA), Japan (JAP).

Table 6. Size of Regression-Based Tests of Predictive Ability DGP3

Table 7. Size of Fixed Window Tests DGP 4

Table 5. Size of Non-Nested Model Comparison Tests DGP2

Table 4. Size of Nested Model Comparison Tests DGP1

Table 3. Critical Values for Regression-Based Forecasts Tests


Out-of-Sample Forecast Tests Robust to the Choice of Window Size

Barbara Rossi and Atsushi Inoue

(ICREA,UPF,CREI,BGSE,Duke) (NC State)

April 1, 2012

Abstract

This paper proposes new methodologies for evaluating out-of-sample forecasting performance that are robust to the choice of the estimation window size. The methodologies involve evaluating the predictive ability of forecasting models over a wide range of window sizes. We show that the tests proposed in the literature may lack the power to detect predictive ability and might be subject to data snooping across different window sizes if used repeatedly. An empirical application shows the usefulness of the methodologies for evaluating exchange rate models’ forecasting ability.

Keywords: Predictive Ability Testing, Forecast Evaluation, Estimation Window.

Acknowledgments: We thank the editor, the associate editor, two referees as well as S. Burke, M.W. McCracken, J. Nason, A. Patton, K. Sill, D. Thornton and seminar participants at the 2010 Econometrics Workshop at the St. Louis Fed, Bocconi University, U. of Arizona, Pompeu Fabra U., Michigan State U., the 2010 Triangle Econometrics Conference, the 2011 SNDE Conference, the 2011 Conference in honor of Hal White, the 2011 NBER Summer Institute and the 2011 Joint Statistical Meetings for useful comments and suggestions. This research was supported by National Science Foundation grants SES-1022125 and SES-1022159 and North Carolina Agricultural Research Service Project NC02265.

J.E.L. Codes: C22, C52, C53

1 Introduction

This paper proposes new methodologies for evaluating the out-of-sample forecasting performance of economic models. The novelty of the methodologies that we propose is that they are robust to the choice of the estimation and evaluation window size. The choice of the estimation window size has always been a concern for practitioners, since the use of different window sizes may lead to different empirical results in practice. In addition, arbitrary choices of window sizes have consequences about how the sample is split into in-sample and out-of-sample portions. Notwithstanding the importance of the problem, no satisfactory solution has been proposed so far, and in the forecasting literature it is common to only report empirical results for one window size. For example, to illustrate the differences in the window sizes, we draw on the literature on forecasting exchange rates (the empirical application we will focus on): Meese and Rogoff (1983a) use a window of 93 observations in monthly data, Chinn (1991) a window size equal to 45 in quarterly data, Qi and Wu (2003) use a window of 216 observations in monthly data, Cheung et al. (2005) consider windows of 42 and 59 observations in quarterly data, Clark and West’s (2007) window is 120 observations in monthly data, Gourinchas and Rey (2007) consider a window of 104 observations in quarterly data, and Molodtsova and Papell (2009) consider a window size of 120 observations in monthly data. This common practice raises two concerns. A first concern is that the “ad hoc” window size used by the researcher may not detect significant predictive ability even if there would be significant predictive ability for some other window size choices. A second concern is the possibility that satisfactory results were obtained simply by chance, after data snooping over window sizes. That is, the successful evidence in favor of predictive ability might have been found after trying many window sizes, although only the results for the successful window size were reported and the search process was not taken into account when evaluating their statistical significance. Only rarely do researchers check the robustness of the empirical results to the choice of the window size by reporting results for a selected choice of window sizes. Ultimately, however, the size of the estimation window is not a parameter of interest for the researcher: the objective is rather to test predictive ability and, ideally, researchers would like to reach empirical conclusions that are robust to the choice of the estimation window size.

This paper views the estimation window as a “nuisance parameter”: we are not interested in selecting the “best” window; rather we would like to propose predictive ability tests that are “robust” to the choice of the estimation window size. The procedures that we propose ensure that this is the case by evaluating the models’ forecasting performance for a variety of estimation window sizes, and then taking summary statistics of this sequence. Our methodology can be applied to most tests of predictive ability that have been proposed in the literature, such as Diebold and Mariano (1995), West (1996), McCracken (2000) and Clark and McCracken (2001). We also propose methodologies that can be applied to Mincer and Zarnowitz’s (1969) tests of forecast efficiency, as well as more general tests of forecast optimality. Our methodologies allow for both rolling and recursive window estimation schemes and let the window size be large relative to the total sample size. Finally, we also discuss methodologies that can be used in Giacomini and White’s (2005) and Clark and West’s (2007) frameworks, where the estimation scheme is based on a rolling window with fixed size.

This paper is closely related to the works by Pesaran and Timmermann (2007) and Clark and McCracken (2009), and more distantly related to Pesaran, Pettenuzzo and Timmermann (2006) and Giacomini and Rossi (2010). Pesaran and Timmermann (2007) propose cross validation and forecast combination methods that identify the "ideal" window size using sample information. In other words, Pesaran and Timmermann (2007) extend forecast averaging procedures to deal with the uncertainty over the size of the estimation window, for example, by averaging forecasts computed from the same model but over various estimation window sizes. Their main objective is to improve the model’s forecast. Similarly, Clark and McCracken (2009) combine rolling and recursive forecasts in the attempt to improve the forecasting model. Our paper instead proposes to take summary statistics of tests of predictive ability computed over several estimation window sizes. Our objective is not to improve the forecasting model nor to estimate the ideal window size. Rather, our objective is to assess the robustness of conclusions of predictive ability tests to the choice of the estimation window size. Pesaran, Pettenuzzo and Timmermann (2006) have exploited the existence of multiple breaks to improve forecasting ability; in order to do so, they need to estimate the process driving the instability in the data. An attractive feature of the procedure we propose is that it does not need to impose nor determine when the structural breaks have happened. Giacomini and Rossi (2010) propose techniques to evaluate the relative performance of competing forecasting models in unstable environments, assuming a “given” estimation window size. In this paper, our goal is instead to ensure that forecasting ability tests be robust to the choice of the estimation window size. That is, the procedures that we propose in this paper are designed for determining whether findings of predictive ability are robust to the choice of the window size, not to determine which point in time the predictive ability shows up: the latter is a very different issue, important as well, and was discussed in Giacomini and Rossi (2010). Finally, this paper is linked to the literature on data snooping: if researchers report empirical results for just one window size (or a couple of them) when they actually considered many possible window sizes prior to reporting their results, their inference will be incorrect. This paper provides a way to account for data snooping over several window sizes and removes the arbitrary decision of the choice of the window length.

After the first version of this paper was submitted, we became aware of independent work by Hansen and Timmermann (2011). Hansen and Timmermann (2011) propose a sup-type test similar to ours, although they focus on p-values of the Diebold and Mariano’s (1995) test statistic estimated via a recursive window estimation procedure for nested models’ comparisons. They provide analytic power calculations for the test statistic. Our approach is more generally applicable: it can be used for inference on out-of-sample models’ forecast comparisons and to test forecast optimality where the estimation scheme can be either rolling, fixed or recursive, and the window size can be either a fixed fraction of the total sample size or finite. Also, Hansen and Timmermann (2011) do not consider the effects of time-varying predictive ability on the power of the test.

We show the usefulness of our methods in an empirical analysis. The analysis re-evaluates the predictive ability of models of exchange rate determination by verifying the robustness of the recent empirical evidence in favor of models of exchange rate determination (e.g., Molodtsova and Papell, 2009, and Engel, Mark and West, 2007) to the choice of the window size. Our results reveal that the forecast improvements found in the literature are much stronger when allowing for a search over several window sizes. As shown by Pesaran and Timmermann (2005), the choice of the window size depends on the nature of the possible model instability and the timing of the possible breaks. In particular, a large window is preferable if the data generating process is stationary but comes at the cost of lower power, since there are fewer observations in the evaluation window. Similarly, a shorter window may be more robust to structural breaks, although it may not provide as precise an estimation as larger windows if the data are stationary. The empirical evidence shows that instabilities are widespread for exchange rate models (see Rossi, 2006), which might justify why in several cases we find improvements in economic models’ forecasting ability relative to the random walk for small window sizes.

The paper is organized as follows. Section 2 proposes a framework for tests of predictive ability when the window size is a fixed fraction of the total sample size. Section 3 presents tests of predictive ability when the window size is a fixed constant relative to the total sample size. Section 4 shows some Monte Carlo evidence on the performance of our procedures in small samples, and Section 5 presents the empirical results. Section 6 concludes.

2 Robust Tests of Predictive Accuracy When the Window Size is Large

Let $h \geq 1$ denote the (finite) forecast horizon. We assume that the researcher is interested in evaluating the performance of $h$-steps-ahead direct forecasts for the scalar variable $y_{t+h}$ using a vector of predictors $x_t$ using either a rolling, recursive or fixed window direct forecast scheme. We assume that the researcher has $P$ out-of-sample predictions available, where the first prediction is made based on an estimate from a sample $1, 2, \ldots, R$, such that the last out-of-sample prediction is made based on an estimate from a sample of $T-R+1, \ldots, R+P-1 = T$ where $R+P+h-1 = T+h$ is the size of the available sample. The methods proposed in this paper can be applied to out-of-sample tests of equal predictive ability, forecast rationality and unbiasedness.

In order to present the main idea underlying the methods proposed in this paper, let us focus on the case where researchers are interested in evaluating the forecasting performance of two competing models: Model 1, involving parameters $\theta$, and Model 2, involving parameters $\gamma$. The parameters can be estimated either with a rolling, fixed or a recursive window estimation scheme. In the rolling window forecast method, the true but unknown model’s parameters $\theta^*$ and $\gamma^*$ are estimated by $\hat{\theta}_{t,R}$ and $\hat{\gamma}_{t,R}$ using samples of $R$ observations dated $t-R+1, \ldots, t$, for $t = R, R+1, \ldots, T$. In the recursive window estimation method, the model’s parameters are instead estimated using samples of $t$ observations dated $1, \ldots, t$, for $t = R, R+1, \ldots, T$. In the fixed window estimation method, the model’s parameters are estimated only once using observations dated $1, \ldots, R$. Let $\{L^{(1)}_{t+h}(\hat{\theta}_{t,R})\}_{t=R}^{T}$ and $\{L^{(2)}_{t+h}(\hat{\gamma}_{t,R})\}_{t=R}^{T}$ denote the sequence of loss functions of models 1 and 2 evaluating $h$-steps-ahead relative out-of-sample forecast errors, and let $\{\Delta L_{t+h}(\hat{\theta}_{t,R}, \hat{\gamma}_{t,R})\}_{t=R}^{T}$ denote their difference.
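The three estimation schemes differ only in which observation dates feed each parameter estimate. The short sketch below spells this out using the 1-based dating of the text; the function names are hypothetical and the code only enumerates dates, it does not estimate anything.

```python
def estimation_sample(scheme, t, R):
    """Dates (1-based) used to estimate parameters at forecast origin t."""
    if t < R:
        raise ValueError("the first forecast origin is t = R")
    if scheme == "rolling":      # the R most recent observations
        return range(t - R + 1, t + 1)
    if scheme == "recursive":    # all observations up to t
        return range(1, t + 1)
    if scheme == "fixed":        # estimated once on the first R observations
        return range(1, R + 1)
    raise ValueError(f"unknown scheme: {scheme}")


def forecast_origins(T, R):
    """Forecast origins t = R, ..., T; there are P = T - R + 1 of them."""
    return range(R, T + 1)


if __name__ == "__main__":
    for scheme in ("rolling", "recursive", "fixed"):
        print(scheme, list(estimation_sample(scheme, t=7, R=5)))
```

With $R = 5$ and origin $t = 7$, the rolling scheme uses dates 3 to 7, the recursive scheme dates 1 to 7, and the fixed scheme dates 1 to 5, which matches the definitions in the paragraph above.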

Typically, researchers rely on the Diebold and Mariano (1995), West (1996), McCracken (2000) or Clark and McCracken’s (2001) test statistics for inference on the forecast error loss differences. For example, in the case of the Diebold and Mariano’s (1995) and West’s (1996) test, researchers evaluate the two models using the sample average of the sequence of


Frequently Asked Questions (11)

Q1. What contributions have the authors mentioned in the paper "Out-of-sample forecast tests robust to the choice of window size" ?

This paper proposes new methodologies for evaluating out-of-sample forecasting performance that are robust to the choice of the estimation window size. The authors show that the tests proposed in the literature may lack the power to detect predictive ability and might be subject to data snooping across different window sizes if used repeatedly.

Q2. What are the methods proposed in this paper?

The methods proposed in this paper can be applied to out-of-sample tests of equal predictive ability, forecast rationality and unbiasedness.

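Q3. How can data snooping over the choices of $\underline{\mu}$ and $\overline{\mu}$ be avoided?

To avoid data snooping over the choices of $\underline{\mu}$ and $\overline{\mu}$, the authors recommend that researchers impose symmetry by fixing $\overline{\mu} = 1 - \underline{\mu}$, and use $\underline{\mu} = 0.15$ in practice.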

Q4. What is the novelty of the methodologies that the authors propose?

The novelty of the methodologies that the authors propose is that they are robust to the choice of the estimation and evaluation window size.

Q5. What is the framework for evaluating forecast errors?

The framework allows for linear and non-linear models estimated by any extremum estimator (e.g. OLS, GMM and MLE), the data to have serial correlation and heteroskedasticity as long as stationarity is satisfied (which rules out unit roots and structural breaks), and forecast errors (which can be either one period or multi-period) evaluated using continuously differentiable loss functions, such as MSE.

Q6. What condition is needed for t+h(R) to have zero mean?

Assumption (a) is necessary for t+h(R) to have zero mean and is satisfied under the assumption discussed by Clark and West ($x_{1t}$ is not null) or under the assumption that $x_{2t}$ is strictly exogenous.

Q7. What is the asymptotic normality result of a model?

The asymptotic normality result does not hinge on whether or not two models are nested but rather on whether or not the disturbance terms of the two models are numerically identical in population under the null hypothesis.

Q8. What is the significance of the variance estimator?

West and McCracken (1998) have shown that it is very important to allow for a general variance estimator that takes into account estimation uncertainty and/or correcting the statistics by the necessary adjustments.

Q9. What does the evidence in favor of predictive ability suggest?

This suggests that the empirical evidence in favor of predictive ability may be driven by the existence of instabilities in the predictive ability, for which rolling windows of small size are advantageous.

Q10. What assumption does the setup impose on the regressors?

Before the authors get into details, a word of caution: their setup requires strict exogeneity of the regressors, which is a very strong assumption in time series applications.

Q11. What is the evidence for the ad-hoc window size?

The evidence highlights the sharp sensitivity of power of all the tests to the timing of the break relative to the forecast evaluation window, and shows that, in the presence of instabilities, their proposed tests tend to be more powerful than some of the tests based on an ad-hoc window size, whose power properties crucially depend on the window size.
