
Validation of species–climate impact models under
climate change
MIGUEL B. ARAÚJO*†, RICHARD G. PEARSON*‡, WILFRIED THUILLER§ and MARKUS ERHARD¶
*Biodiversity Research Group, School of Geography and Environment, University of Oxford, Mansfield Road, Oxford OX1 3TD, UK, †Biogeography and Conservation Laboratory, Natural History Museum, Cromwell Road, London SW7 5BD, UK, ‡Macroecology and Conservation Unit, University of Évora, Estrada dos Leões, 7000-730 Évora, Portugal, §Climate Change Research Group, Kirstenbosch Research Centre, South African National Biodiversity Institute, Private Bag x7, Claremont 7735, Cape Town, South Africa, ¶Institute for Meteorology and Climate Research, Forschungszentrum Karlsruhe, Postfach 3640, 76021 Karlsruhe, Germany
Abstract
Increasing concern over the implications of climate change for biodiversity has led to the
use of species–climate envelope models to project species extinction risk under climate-
change scenarios. However, recent studies have demonstrated significant variability in
model predictions and there remains a pressing need to validate models and to reduce
uncertainties. Model validation is problematic as predictions are made for events that
have not yet occurred. Resubstituition and data partitioning of present-day data sets are,
therefore, commonly used to test the predictive performance of models. However, these
approaches suffer from the problems of spatial and temporal autocorrelation in the
calibration and validation sets. Using observed distribution shifts among 116 British
breeding-bird species over the past 20 years, we are able to provide a first
independent validation of four envelope modelling techniques under climate change.
Results showed good to fair predictive performance on independent validation,
although rules used to assess model performance are difficult to interpret in a
decision-planning context. We also showed that measures of performance on
nonindependent data provided optimistic estimates of models’ predictive ability on
independent data. Artificial neural networks and generalized additive models provided
generally more accurate predictions of species range shifts than generalized linear
models or classification tree analysis. Data for independent model validation and
replication of this study are rare and we argue that perfect validation may not in fact be
conceptually possible. We also note that the usefulness of models is contingent on both the
questions being asked and the techniques used. Implementations of species–climate
envelope models for testing hypotheses and predicting future events may prove wrong,
while being potentially useful if put into appropriate context.
Keywords: bioclimatic-envelope models, breeding birds, Britain, climate change, model accuracy,
uncertainty, validation
Received 3 November 2004; revised version received 24 January 2005; accepted 8 March 2005
Introduction
Attempts to predict climate-change impacts on biodi-
versity have often relied on the species–climate
‘envelope’ modelling approach (also known as ecolo-
gical niche models), whereby present day distributions
of species are combined with environmental variables
to project distributions of species under future climates
(for review, see Pearson & Dawson, 2003). In spite of the
Correspondence: Miguel B. Araújo, Departamento de Biodiversidad y Biologia Evolutiva, Museo Nacional de Ciencias Naturales, CSIC, C/Jose Gutierrez Abascal, 2, 28006 Madrid, Spain, tel. +34 91411328, fax +34 915645078, e-mail: maraujo@mncn.csic.es
Global Change Biology (2005) 11, 1504–1513, doi: 10.1111/j.1365-2486.2005.001000.x
1504 © 2005 Blackwell Publishing Ltd

inherent limitations of correlative models (for review,
see Guisan & Zimmermann, 2000), projections arising
from species–climate envelope models have been used
to support estimates of species’ extinction risk under
climate change for a variety of taxa and parts of the
world (e.g. Bakkenes et al., 2002; Erasmus et al., 2002;
Midgley et al., 2002; Peterson et al., 2002; Thomas et al.,
2004a). The impact of these estimates within political
and public debate is potentially high, yet there is a great deal of scope for misrepresenting the science behind
such studies (Ladle et al., 2004). Recent studies have
reported that projections arising from species–climate
models may be highly sensitive to the assumptions,
algorithms and parameterizations of different methods
(e.g. Thuiller, 2004; Thuiller et al., 2004a; Pearson et al.,
2005). These studies have raised a number of metho-
dological issues that lead to a degree of uncertainty
which has been underestimated, or simply overlooked,
in previous assessments of climate impacts on biodi-
versity. We argue that when results of a particular
analysis contribute to the discussion of the weight of
evidence required to support important societal deci-
sions, the demand that models’ predictive accuracy be
assessed is eminently reasonable.
Nevertheless, validation (also referred to as evalua-
tion) of species–climate envelope models under climate
change remains poorly explored. The reason is that the events being predicted have either been poorly documented or have not yet occurred. Consequently,
assessments of accuracy are usually limited to a process
of ‘resubstitution’, in which the data used to calibrate
(or train) models are also used to validate (test) them
(Fig. 1a; for review, see Table 1). A problem with the
resubstituition approach is that models may overfit to
the calibration data, leaving users unable to judge
whether high accuracy on nonindependent data reflects
good predictive accuracy on independent data sets.
Some authors also caution against possible bias in
estimates of model-prediction errors as the models are
optimized to deal with the ‘noise’ in the data and might
consequently lose generality outside the original data
(for discussion, see Olden & Jackson, 2000; Olden et al.,
2002). To address these problems, a growing number of
studies have used data partitioning methods for the
allocation of cases to calibration and validation data
sets. The most familiar technique is one-time data-
splitting, whereby data are split into calibration and
validation samples by random process (Fig. 1b, Table 1).
There are alternative techniques including grouped
cross-validation (also known as k-fold partitioning, hold-out, or external method), bootstrapping, and jackknifing (also known as leave-one-out) (for discussion,
see Harrell, 2001), but they all share the assumption
that randomly selected samples from original data
constitute independent observations, hence suitable for
model validation. Although these validation strategies
have generally been accepted to provide more robust
measures of predictive success than resubstitution (e.g.
Fielding & Bell, 1997), they may not avoid two of the
most important pitfalls of correlative models. The first
is that of spatial autocorrelation in the distribution of
species and environmental variables (e.g. Hampe,
2004). This is a problem because modelling techniques
assume that modelled events are independent, which is
not true in the case of spatially autocorrelated data. This
problem is not overridden by resampling the original
data randomly, nor is it by carrying additional field
sampling for testing models within the modelled
region, because any of these validation strategies would
use test data that is spatially autocorrelated with data to
calibrate models. The second is that of temporal
correlation in biological and environmental phenom-
ena. This is another form of autocorrelation in the data,
and implies that observations in time series are
[Figure: calibration, evaluation and projection flow of the environmental envelope, projected to the same region, a new region, a new resolution, or a new time.]
Fig. 1 Species–climate envelope modelling framework under three calibration and validation strategies: (a) resubstitution; (b) data splitting; and (c) independent validation.

nonrandom because of lack of independence between
data points that are adjacent in time. Consequently,
projections of observed current distributions closer in
time are likely to be more similar than projections made
further apart. The interplay of spatial and temporal autocorrelation makes it conceptually difficult to discard the possibility that models’ goodness-of-fit to the data represents an over-optimistic estimate of their predictive
ability outside the initial spatial and temporal condi-
tions defining the training set (e.g. Beutel et al., 1999).
Thus, the number of degrees of freedom is over-
estimated, causing unrealistically small estimates of
the standard errors of the model outputs. In addition,
as temporal autocorrelation can introduce slow changes
(i.e. low-frequency variability) in the time series, it can
affect the estimate of the degree of estimated changes.
It may be argued that the predictive accuracy of
species–climate envelope models can only be fully
tested by means of validation studies using direct
comparison of model predictions with independent
empirical observations (Fig. 1c). Attempts to perform
such tests are relatively rare. A limited number of
studies have attempted independent validation using
known distributions in different regions (Beerling et al.,
1995; Fielding & Haworth, 1995; Peterson, 2003a), data
at different resolutions (Pearson et al., 2004; Araújo
et al., 2005a), field observations in previously un-
sampled regions where species’ occurrences are pre-
dicted (Raxworthy et al., 2003), fossil records of
mammal distributions under Pleistocene climates
(Martinez-Meyer et al., 2004), and visual comparison
between simulated and observed range changes for
butterflies in the UK over the 20th century (Hill et al.,
1999). However, statistical validation using indepen-
dent data describing range shifts under recent climate
change has not previously been undertaken.
As models projecting species’ distributional shifts
under future climate change are unlikely to be
validated in most circumstances because of data
limitations, it is important to improve understanding
Table 1 Four approaches used to validate species–climate envelope models under climate change

Reference (each marked + under one of: Resubstitution, Bootstrap, Data-splitting, Independent validation)
Araújo et al. (2004) +
Bakkenes et al. (2002) +
Beaumont & Hughes (2002) *
Berry et al. (2002) +
Burns et al. (2003) +
Erasmus et al. (2002) +
Guisan & Theurillat (2000) +
Huntley (1995) +
Huntley et al. (1995) +
Huntley et al. (2004) +
Iverson & Prasad (1998) +
Iverson et al. (1999) +
Martinez-Meyer et al. (2004) +
Midgley et al. (2002) +
Midgley et al. (2003) +
Miles et al. (2004) +
Pearson et al. (2002) +
Pearson et al. (2005) +
Peterson (2003b) +
Peterson et al. (2002) +
Peterson et al. (2001) +
Saetersdal et al. (1998) *
Skov & Svenning (2004) +
Sykes et al. (1996) +
Teixeira & Arntzen (2002) +
Thuiller (2003) +
Thuiller (2004) +
Thuiller et al. (2004a) +
Thuiller et al. (2004b) +

Few studies (*) have not attempted to validate the predictive accuracy of their models.

of the underlying characteristics of data and methods
that contribute uncertainty to predictions. Because most
model evaluations assess accuracy against the calibration, or nonindependent validation data (also referred to as
verification), it is important to investigate the degree to
which these measures correlate with proper validations
on independent data sets. These questions can be
addressed only when independent data adequate for
model validation are available and this is a rare
circumstance for climate-change impact assessments.
We make a first attempt to address these problems
using British-breeding bird distributional records in
two periods between the 1960s and the 1990s. We
assume these are independent events, although we
acknowledge that some degree of nonindependence
may arise given that data were recorded in the same
region and in two periods of time only ~20 years apart.
However, they do constitute a rare record of observed
range shifts, and one of the few examples of species
range-shift data that allow direct comparison between
observations in each recording period, without the need
to correct for sampling bias. Furthermore, they also
have the advantage of including species reported to
shift northward in apparent response to recent regional
climate changes (Thomas & Lennon, 1999). The
unprecedented quality of these data allows researchers
to explore issues of bioclimate envelope model valida-
tion that have not yet been addressed in the literature.
In particular, we ask: (1) how well do models perform
on an independent validation dataset? (2) does valida-
tion using nonindependent distribution data provide a
good surrogate for accuracy on independent data? (3)
do particular modelling techniques perform consis-
tently better than others?
Data and methods
Species data
We used distributional records in Britain for 116 native
breeding-bird species recorded during the periods
1968–1972 (t1) and 1988–1991 (t2) (Sharrock, 1976; Gibbons et al., 1993). Volunteer recorders achieved 100% cover of the 2831 British 10 km squares, with the total number of nonduplicate 10 km square records received for the second period being within 1% of the 217 615 records received for 1968–1972.
This has allowed researchers to make comparisons
between occupancy of squares in each recording
period, without the need to correct for sampling bias
(e.g. Thomas & Lennon, 1999; Thomas et al., 2004b). Our
analyses of bird distributions did not include marine,
waterfowl, and aquatic shorebirds. Species with fewer than 20 records in the first recording period were also
excluded from analysis to avoid problems related to
modelling data with excessively small sample sizes
(e.g. Stockwell & Peterson, 2002). The minimum
number of records for a species in this period was 25,
the median number was 1560, and the maximum was
2405.
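As a minimal illustration of this filtering step (the species names and counts below are invented; the study used the full atlas data):

```python
# Sketch of the record-count filter described above. `records_t1` maps each
# species to its number of occupied 10 km squares in 1968-1972 (hypothetical
# values chosen to echo the minimum and median reported in the text).
records_t1 = {
    "species A": 25,     # the study's reported minimum retained count
    "species B": 1560,   # the study's reported median count
    "species C": 12,     # would be excluded (< 20 records)
}

# Exclude species with fewer than 20 records to avoid modelling with
# excessively small sample sizes (Stockwell & Peterson, 2002).
modelled_species = sorted(sp for sp, n in records_t1.items() if n >= 20)
```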
Climate data
A set of aggregated climate parameters were derived
from an updated version of the CRU (Climate Research
Unit at the University of East Anglia, UK) monthly
climate data (New et al., 2000). The updated data set
provides monthly values for the years 1901–2000 at 10′ × 10′ spatial resolution (Mitchell et al., 2004).
Average monthly temperature, precipitation and cloud cover of 1416 grid cells covering the area of the UK (7°30′E–11°40′W and 50°N–61°N) were used to calculate
mean values of six different climate parameters in two
different time slices (1967–1972, 1987–1991). Variables
include mean annual temperature within time slices (°C), mean temperature of the coldest month (°C), mean temperature of the warmest month (°C), mean annual summed precipitation (mm), mean summed precipitation between July and September (mm), and growing season, defined as the temperature sum of all consecutive days with mean temperature greater than 5°C. The six variables were selected on the basis that
they are known to impose constraints upon species
distributions as a result of widely shared physiological
limitations (Crick, 2004).
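As an illustration of this aggregation step, the six parameters can be sketched as follows (a simplified assumption, not the authors' code: inputs are monthly series already averaged over a time slice, and the growing-season degree sum, defined in the text from daily temperatures, is approximated here by weighting each month by its length):

```python
# Derive the six climate parameters from 12 monthly values per grid cell.
# `monthly_temp`: mean temperatures (deg C), `monthly_prec`: precipitation
# sums (mm), January to December, averaged over a slice such as 1967-1972.

MONTH_DAYS = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

def climate_parameters(monthly_temp, monthly_prec):
    return {
        "mean_annual_temp": sum(monthly_temp) / 12.0,
        "temp_coldest_month": min(monthly_temp),
        "temp_warmest_month": max(monthly_temp),
        "annual_precipitation": sum(monthly_prec),
        "summer_precipitation": sum(monthly_prec[6:9]),  # July-September
        # Growing season: temperature sum of days with mean temperature
        # above 5 deg C, approximated here at monthly resolution.
        "growing_season": sum(t * d for t, d in zip(monthly_temp, MONTH_DAYS)
                              if t > 5.0),
    }
```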
Species–climate modelling
Breeding bird species distribution records in Britain
were modelled using S-PLUS-based BIOMOD (Thuiller,
2003). Modelling procedures included (1) generalized
linear models (GLM) with linear, quadratic and poly-
nomial terms (second and third order). A stepwise
procedure using the AIC criterion was used to select the
most significant variables (Akaike, 1974); (2) general-
ized additive models (GAM) with cubic-smooth
splines. The degree of smoothness was bounded to
four for each variable. As for GLM, a stepwise
procedure was used to select the most parsimonious
model; (3) classification tree analysis (CTA) using a 10-
fold cross-validation to select the best trade-off between
the number of leaves of the tree and the explained
deviance; and (4) feed-forward artificial neural net-
works (ANN) with seven hidden units in a single layer
and with weight decay equal to 0.03. Because of the
heuristic nature of ANNs, models were run 10 times and the mean prediction was used. This procedure of
averaging predictions over the collection of networks

is often preferred to using the solution giving the
lowest error (Ripley, 1996).
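The averaging step is easy to reproduce in outline. In the sketch below, `fit_network` is a hypothetical stand-in for one BIOMOD feed-forward network fit (seven hidden units, weight decay 0.03); its seed-dependent perturbation merely mimics the stochastic training that motivates averaging:

```python
import random

def fit_network(X, y, seed):
    """Illustrative stub for a single stochastic ANN fit: returns a predictor
    whose output is perturbed by a seed-dependent bias, standing in for the
    run-to-run variability of real network training."""
    rng = random.Random(seed)
    bias = rng.uniform(-0.05, 0.05)
    return lambda x: min(1.0, max(0.0, sum(x) / len(x) + bias))

def ensemble_prediction(X, y, x_new, n_runs=10):
    """Run the model 10 times and average the predictions, as in the text,
    rather than keeping only the single lowest-error run."""
    networks = [fit_network(X, y, seed) for seed in range(n_runs)]
    return sum(net(x_new) for net in networks) / n_runs
```

Averaging over the collection of runs damps the idiosyncrasies of any single fit, which is why it is often preferred to picking the lowest-error network.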
Two runs were made with each modelling technique. In the first run, models were calibrated on a 70% random sample of the original time t1 data and predictive accuracy was evaluated on the remaining 30% of the data (Fig. 1b). The size of the calibration set was determined by application of a commonly used heuristic for identifying the ratio of training and cross-validation sets in presence and absence models: [1 + (p − 1)^(1/2)]^(−1), where p is the number of predictor (here climate) variables (Fielding & Bell, 1997). In the second run, models were calibrated using 100% of the original time t1 data and evaluated on the original time t2 data (Fig. 1c). In each run, we tested agreement between observed and projected distributions by calculating Cohen’s kappa statistic of similarity (κ) and the area under the curve (AUC) of the receiver operating characteristic (ROC) approach (Fielding & Bell, 1997).
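For the six climate variables used here, the Fielding & Bell heuristic yields a hold-out fraction of roughly 30%, which is where the 70/30 split comes from:

```python
import math

def validation_fraction(p):
    """Fielding & Bell (1997) heuristic: hold out 1 / (1 + sqrt(p - 1))
    of the cases for validation, where p is the number of predictors."""
    return 1.0 / (1.0 + math.sqrt(p - 1))

f = validation_fraction(6)  # six climate predictors, as in this study
# f is about 0.31, i.e. ~30% validation / ~70% calibration
```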
We used the κ approach after maximising the statistic over a range of thresholds above which model outputs are considered to represent species’ presence. We calculated AUC using the nonparametric method based on the derivation of the Wilcoxon statistic (Fielding & Bell, 1997). Values of AUC range from 0.5 for models with no predictive ability to 1.0 for models giving perfect predictions. κ values range from 0.0 (no predictive ability) to 1.0 (perfect predictive ability). There are a number of rules-of-thumb available to help interpret measures of agreement between observed and projected events. For example, when using the κ statistic approach, Landis & Koch (1977) suggest the following ranges of agreement: excellent κ > 0.75; good 0.40 < κ < 0.75; and poor κ < 0.40. When using the ROC procedure, Swets (1988) recommends interpreting range values as: excellent AUC > 0.90; good 0.80 < AUC < 0.90; fair 0.70 < AUC < 0.80; poor 0.60 < AUC < 0.70; fail 0.50 < AUC < 0.60.
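Both statistics can be computed from first principles. A minimal sketch (an illustration, not BIOMOD's implementation), assuming binary presence/absence observations and continuous model outputs:

```python
def auc_wilcoxon(observed, scores):
    """AUC via the Wilcoxon/Mann-Whitney statistic: the probability that a
    randomly chosen presence receives a higher score than a randomly chosen
    absence (ties count one half)."""
    pos = [s for o, s in zip(observed, scores) if o == 1]
    neg = [s for o, s in zip(observed, scores) if o == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def cohen_kappa(observed, predicted):
    """Cohen's kappa: observed agreement corrected for chance agreement
    (assumes both classes are present, so chance agreement < 1)."""
    n = len(observed)
    p_obs = sum(o == p for o, p in zip(observed, predicted)) / n
    p1_obs, p1_pred = sum(observed) / n, sum(predicted) / n
    p_chance = p1_obs * p1_pred + (1 - p1_obs) * (1 - p1_pred)
    return (p_obs - p_chance) / (1 - p_chance)

def max_kappa(observed, scores, thresholds):
    """Maximise kappa over candidate presence thresholds, as in the text."""
    return max(cohen_kappa(observed, [int(s >= t) for s in scores])
               for t in thresholds)
```

With a perfectly discriminating model (every presence outscoring every absence) both statistics reach 1.0.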
Results
How well do models perform on an independent validation
dataset?
Our results demonstrate that models’ predictive accuracy on independent validation was good around median values with the AUC assessment (i.e. 0.80 < AUC < 0.90, except for CTA and GLM), but only fair near the lower quartile of the distribution of accuracy values (i.e. 0.70 < AUC < 0.80, Table 2). With the κ assessment, models also provided good agreement around median values (i.e. 0.40 < κ < 0.75, except for GLM), while lower quartile values of accuracy were classified as poor (i.e. κ < 0.40). In both cases, upper quartile accuracy values were below ‘excellent’ threshold values (i.e. AUC < 0.90 and κ < 0.75).
Does validation on nonindependent distribution data
provide a good surrogate for accuracy on independent
data?
As most assessments of model accuracy use nonindependent data, it is useful to estimate the degree to which predictive accuracy measured with nonindependent t1 distribution data provides a good surrogate for accuracy on t2 independent data. Our results show that model accuracy evaluated on the nonindependent 30% subset of t1 data was always higher than accuracy on
Table 2 Predictive accuracy of different modelling techniques (ANN, CTA, GAM and GLM), calibrated with 70% data from time t1 and verified against the remaining 30% of time t1 data (Fig. 1b), or calibrated with 100% of time t1 data and validated against 100% of time t2 data (Fig. 1c)

British breeding birds
       Calibration 70% t1   Validation 30% t1   Δ       Calibration 100% t1   Validation 100% t2   Δ
κ
ANN    0.59 (0.48, 0.70)    0.59 (0.43, 0.69)    0.00   0.60 (0.47, 0.69)     0.46 (0.26, 0.56)    -0.14
CTA    0.57 (0.47, 0.67)    0.53 (0.38, 0.62)   -0.04   0.57 (0.45, 0.66)     0.40 (0.25, 0.53)    -0.17
GAM    0.53 (0.41, 0.66)    0.58 (0.40, 0.67)    0.05   0.53 (0.42, 0.66)     0.43 (0.29, 0.54)    -0.10
GLM    0.53 (0.42, 0.66)    0.57 (0.41, 0.67)    0.04   0.54 (0.42, 0.66)     0.37 (0.22, 0.50)    -0.17
AUC
ANN    0.92 (0.87, 0.94)    0.90 (0.85, 0.93)   -0.02   0.92 (0.87, 0.94)     0.84 (0.78, 0.88)    -0.08
CTA    0.88 (0.82, 0.91)    0.86 (0.78, 0.89)   -0.02   0.87 (0.81, 0.91)     0.77 (0.70, 0.83)    -0.10
GAM    0.91 (0.85, 0.94)    0.90 (0.85, 0.93)   -0.01   0.91 (0.85, 0.94)     0.82 (0.75, 0.89)    -0.09
GLM    0.91 (0.85, 0.93)    0.90 (0.85, 0.93)   -0.01   0.91 (0.86, 0.93)     0.78 (0.68, 0.85)    -0.13

Values correspond to median (lower quartile, upper quartile) accuracy measures (κ and AUC) obtained for selected British breeding birds (n = 116); Δ values correspond to the difference between median accuracy measured on the 30% randomly chosen t1 data or 100% time t2 validation sets and median accuracy measured on calibration sets.

References

Akaike H (1974) A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723.
Hastie T, Tibshirani R, Friedman J (2001) The Elements of Statistical Learning. Springer-Verlag, New York.
Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.
Swets JA (1988) Measuring the accuracy of diagnostic systems. Science, 240, 1285–1293.
Frequently Asked Questions

Q1. What are the contributions mentioned in the paper "Validation of species–climate impact models under climate change"?

Increasing concern over the implications of climate change for biodiversity has led to the use of species–climate envelope models to project species extinction risk under climate-change scenarios. Using observed distribution shifts among 116 British breeding-bird species over the past ~20 years, the authors are able to provide a first independent validation of four envelope modelling techniques under climate change. The authors also showed that measures of performance on nonindependent data provided optimistic estimates of models’ predictive ability on independent data. Data for independent model validation and replication of this study are rare and the authors argue that perfect validation may not in fact be conceptually possible. Implementations of species–climate envelope models for testing hypotheses and predicting future events may prove wrong, while being potentially useful if put into appropriate context.

The high performance of complex nonlinear techniques suggests that relatively unexplored methodologies such as multivariate adaptive regression splines, adaptive logistic regression (boosting) and generalized multiplicative models (for review see Hastie et al., 2001) might deserve future testing. Many studies have used good model fits on nonindependent validation data to support results pertaining to the potential impacts of future climate change on biodiversity (see references in Table 1). There are many reasons, in addition to the effects of autocorrelation in the data, why good model fits on present-day distribution data (i.e. nonindependent validation data) do not necessarily translate into good predictions of future ranges. There are clearly limits to the ability of any model to predict the future distribution of species under climate change, and model validation thus becomes a conceptually difficult problem.

Such factors may include the presence of spurious correlations between response (i.e. species) and predictor (i.e. climate) variables, which may translate into poor predictions on independent validation data (e.g. Guisan & Zimmermann, 2000).

This pattern of performance across modelling techniques is consistent with previous assessments of performance of species–climate envelope models with nonindependent data (for reviews see Olden & Jackson, 2002; Segurado & Araújo, 2004), and suggests that modelling techniques capable of summarising complex nonlinear relationships are more likely to provide useful projections of species responses to climate change.

This is because the effect of inflated performance arising from modelling spatially and temporally autocorrelated data should decrease as observed and modelled events become increasingly independent from each other.