
Presence-only and presence-absence data for comparing species distribution modeling methods

TLDR
In this paper, point location records for 226 anonymised species from six regions of the world, with accompanying predictor variables in raster (grid) and point formats, are published as a benchmark for modeling approaches and for testing new ways to evaluate the accuracy of SDMs.
Abstract
Species distribution models (SDMs) are widely used to predict and study distributions of species. Many different modeling methods and associated algorithms are used and continue to emerge. It is important to understand how different approaches perform, particularly when applied to species occurrence records that were not gathered in structured surveys (e.g. opportunistic records). This need motivated a large-scale, collaborative effort, published in 2006, that aimed to create objective comparisons of algorithm performance. As a benchmark, and to facilitate future comparisons of approaches, here we publish that dataset: point location records for 226 anonymised species from six regions of the world, with accompanying predictor variables in raster (grid) and point formats. A particularly interesting characteristic of this dataset is that independent presence-absence survey data are available for evaluation alongside the presence-only species occurrence data intended for modeling. The dataset is available on Open Science Framework and as an R package and can be used as a benchmark for modeling approaches and for testing new ways to evaluate the accuracy of SDMs.


Biodiversity Informatics, 15, 2020, pp. 69-80

1 School of BioSciences, University of Melbourne, Australia.
2 Swiss Federal Research Institute WSL, CH-8903 Birmensdorf, Switzerland.
3 CSIRO Land and Water, Cairns, Queensland, Australia.
4 CSIRO Land and Water, Canberra, Australian Capital Territory (ACT), Australia.
5 CSIRO Land and Water, Tropical Forest Research Centre, Atherton, Queensland, Australia.
6 University of Lausanne, 1015 Lausanne, Switzerland.
7 University of California, Davis, USA.
8 EWHALE Lab, Institute of Arctic Biology, Biology & Wildlife Department, University of Alaska Fairbanks, Fairbanks, Alaska 99775, USA.
9 Universidade de São Paulo, Brazil.
10 College of Agricultural and Life Sciences, University of Florida, USA.
11 Research School of Biology & Center for Biodiversity Analysis, Australian National University, Australia.
12 Manaaki Whenua – Landcare Research, Hamilton, New Zealand (current address: PANTHERA, Floor 18, 8 West 40 St, New York, USA 10018).
13 Biodiversity Institute, University of Kansas, Lawrence, Kansas 66045, USA.
14 Center for Biodiversity and Conservation, American Museum of Natural History, New York, USA.
15 Department of Geography, Planning and Environment, Concordia University, Montreal, Canada.
16 Centre for Tropical Environmental and Sustainability Science, James Cook University, Townsville, Australia.
17 Manaaki Whenua – Landcare Research, Lincoln, New Zealand.
*Corresponding Author: j.elith@unimelb.edu.au
environmental data and can be used to predict distri-


-
  



     
     
       -
-
    
-
     
    
       

      -


Jane Elith et al. – Presence-only and Presence-absence Data for Comparing Species Distribution Modeling Methods
for PA sites were provided, so modelers could predict


  
  


-

(detailed in Supplementary Information 1,
more transparent and repeatable (National Academy
-


   
-
-






 
  
  
-

-

-





model trained and

      





none are problem-free, and SDM evaluation remains

-

in some areas simply leads to a more or less precise
     
-




be used to calculate a broader suite of evaluation sta-





-


-



-


-


-



  
       -










2
-

   
 




-

spatial resolution (smallest raster cell size) available,

marised in Table 1, and details of variables in Sup-

3
Records span species of


-


to 19 120 PA evaluation sites (Table 1 and detailed

 
-



  
      
      
-
-





-






-
      
     -
3
http://hdl.handle.net/1808/30582


below, are available openly
4
 OSF data
   -
 

 561 MB in total and many users will not want to
-



-











1. Environmental rasters
  /data/Environment folder at
       

  

 /data/Environment   

and details of coordinate reference systems, units and

2. Presence-only data: locations and environmental samples
/data/Records/train_po fold-
-
        

-


3. Background data: locations and environmental samples

/data/Records/train_bg folder at
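(The original table also has an "Area location" column of maps, in which red polygons show locations within countries / continents; the maps are not reproduced here. PO = presence-only, PA = presence-absence.)

| Code | Region details | Area ('000 km²) | No. env vars (no. categorical) | Approx. grid cell resolution (m) | Biological groups & number of species | Mean no. records per species: PO | Mean no. records per species: PA | No. sites: PA |
|------|----------------|-----------------|--------------------------------|----------------------------------|----------------------------------------|------|-------|--------|
| AWT | Australian Wet Tropics, Queensland, Australia | 23.97 | 13 (0) | 80 | b: birds: 20 | 155 | 97 | 340 |
| | | | | | p: vascular plants: 20 | 35 | 30 | 102 |
| CAN | Ontario, Canada | 979.34 | 11 (1) | 1 000 | birds: 30 | 253 | 1 282 | 14 571 |
| NSW | North-east New South Wales, Australia | 76.18 | 13 (1) | 100 | db: diurnal birds: 8 | 189 | 57 | 702 |
| | | | | | nb: nocturnal birds: 2 | 134 | 142 | 1 137 |
| | | | | | ot: open-forest trees: 8 | 42 | 164 | 2 075 |
| | | | | | ou: open-forest understorey vascular plants: 8 | 21 | 358 | 1 309 |
| | | | | | rt: rainforest trees: 7 | 9 | 212 | 1 036 |
| | | | | | ru: rainforest understorey vascular plants: 6 | 18 | 93 | 909 |
| | | | | | sr: small reptiles: 8 | 84 | 63 | 1 008 |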


Frequently Asked Questions (4)
Q1. What contributions have the authors mentioned in the paper "Presence-only and presence-absence data for comparing species distribution modeling methods"?

A particularly interesting characteristic of this dataset is that independent presence-absence survey data are available for evaluation alongside the presence-only species occurrence data intended for modeling. The authors of this current paper are the subset of the NCEAS working group who gathered and processed the data described here, alongside suppliers of those data; referred to here as "the NCEAS data group." The data come from six regions of the world (Fig. 1). The authors generated random locations for each study region, referred to as "background" (or elsewhere "pseudo-absence") samples. The NCEAS working group designed a "baseline" study to compare 16 modeling algorithms (Elith et al. 2006), and also several experimental treatments that manipulated the datasets to explore the effects of sample size (Wisz et al. 2008), spatial resolution (grain) of environmental data (Guisan et al. 2007), error in presence-only (PO) location (Graham et al. 2008), bias in records (Dudik and Phillips 2009; Phillips et al. 2009) and treatment of background (BG) data (Phillips et al. 2009) on model performance. The environmental data for presence-absence (PA) sites were provided, so modelers could predict environmental suitability for all species at these sites. [...] 2008; Amano and Sutherland 2013; Isaac and Pocock 2015) and thus, may not be representative of the species distribution in the study area. [...] wrongly emphasising the suitability of some environments and under-reporting the suitability of others. Some of the data preparation methods were reported in the original baseline modeling paper (Elith et al. 2006), but the authors describe them here in full detail, to gather all the information in one place, and to ensure the descriptions are adequate for data re-use.
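
The "background" samples mentioned above are random locations within a study region. As a minimal sketch of that idea only, the following draws random raster cells that have environmental data and returns their centre coordinates. The function name, the mask layout, the use of cell centres and sampling without replacement are all assumptions made for illustration; they are not taken from the paper.

```python
import random
from typing import List, Sequence, Tuple


def sample_background(
    valid_mask: Sequence[Sequence[bool]],
    n: int,
    x_min: float,
    y_max: float,
    cell_size: float,
    seed: int = 0,
) -> List[Tuple[float, float]]:
    """Draw n distinct raster cells at random and return their centre coordinates.

    valid_mask[row][col] is True where the region has environmental data.
    Row 0 is assumed to be the northernmost row (hypothetical layout).
    """
    cells = [
        (row, col)
        for row, values in enumerate(valid_mask)
        for col, ok in enumerate(values)
        if ok
    ]
    if n > len(cells):
        raise ValueError("asked for more background points than valid cells")
    rng = random.Random(seed)  # fixed seed keeps the sample repeatable
    chosen = rng.sample(cells, n)  # sampling without replacement (an assumption)
    return [
        (x_min + (col + 0.5) * cell_size, y_max - (row + 0.5) * cell_size)
        for row, col in chosen
    ]


# Hypothetical 2 x 2 region in which one cell lacks environmental data.
mask = [[True, False], [True, True]]
points = sample_background(mask, n=3, x_min=0.0, y_max=200.0, cell_size=100.0)
```

Restricting the draw to cells with environmental data means every background point can be paired with predictor values, which is what a presence-background model needs.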
This manuscript and the accompanying metadata should be treated as the authoritative description of the data supplied here. All datasets were cleaned by JE and CG to these common properties agreed to by the group: (a) all data projected to a common projection for that region; (b) all raster data for a region aligned to the same extent and resolution, and only rasters with close to complete coverage in the region of interest retained; (c) species records reduced to a maximum of one record per raster cell using the following protocol: for PO data, if there is at least one presence record in a cell, retain one presence record for that cell; for PA data, reduce to one record per cell using the rule that if presence(s) and absence(s) both occur in the same cell, one presence is retained; (d) records checked and rectified if necessary to ensure that PO and PA locations do not co-occur in a grid cell; (e) species records from locations with no environmental data removed.

Many SDMs contrast the environment at locations of known occurrence of a species to that at a set of random locations in the study region (background, quadrature, or pseudo-absence points: (Phillips et al. [...]

The authors sourced datasets from six regions of the world (Figure 1 and Table 1); the regions are hereafter referred to by the initials provided in Figure 1 and in column 1, Table 1. This provided a diverse and representative dataset for the NCEAS studies (Supplementary Information 1), and a benchmark set that the authors anticipate being broadly useful into the future. The different data sources used different sampling designs and methods, which can provide insights into how data quality influences model outcomes/accuracy. These variations are typical of what is seen in ecological datasets, further making this dataset a useful benchmark for SDM modelers.
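
Rule (c) of the cleaning protocol above can be sketched in a few lines. This is an illustrative reading of the rule, not the code used by the NCEAS data group: the `Record` type, the function names and the raster origin convention (top-left corner at `x_min`, `y_max`) are hypothetical. PO records are treated as records with `present=True`, so the same function covers both cases.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Record(NamedTuple):
    x: float
    y: float
    present: bool  # always True for presence-only (PO) records


def cell_index(x: float, y: float, x_min: float, y_max: float,
               cell_size: float) -> Tuple[int, int]:
    """Return (row, col) of the raster cell containing the point (x, y)."""
    col = int((x - x_min) // cell_size)
    row = int((y_max - y) // cell_size)
    return row, col


def thin_to_one_per_cell(records: Iterable[Record], x_min: float,
                         y_max: float, cell_size: float) -> List[Record]:
    """Keep at most one record per cell; a presence replaces an absence."""
    kept: Dict[Tuple[int, int], Record] = {}
    for rec in records:
        key = cell_index(rec.x, rec.y, x_min, y_max, cell_size)
        current = kept.get(key)
        if current is None or (rec.present and not current.present):
            kept[key] = rec
    return list(kept.values())


# Hypothetical PA records on a grid with 100 m cells.
records = [
    Record(10, 90, False),   # cell (0, 0), absence
    Record(15, 95, True),    # cell (0, 0), presence: replaces the absence
    Record(150, 50, False),  # cell (0, 1), absence
    Record(160, 55, False),  # cell (0, 1), duplicate absence: dropped
]
thinned = thin_to_one_per_cell(records, x_min=0, y_max=100, cell_size=100)
```

Which of several presences in a cell is retained is not specified in the text; this sketch simply keeps the first one encountered.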

In the publications shown in the table in Supplementary Information 1, the PA evaluation (test) data were kept independent as a “blind evaluation” set, that is, they were not used to tune models. 
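
To illustrate what evaluating against an independent PA set involves, the sketch below computes the area under the ROC curve (AUC) from predicted suitabilities and observed presence/absence at evaluation sites. AUC is used here only as one familiar example of an evaluation statistic; the function name and the toy values are hypothetical, and the pairwise formulation is chosen for clarity rather than speed.

```python
from typing import Sequence


def auc(predictions: Sequence[float], observed: Sequence[int]) -> float:
    """Probability that a random presence site is ranked above a random absence site.

    Ties between a presence and an absence count as half a win.
    """
    pos = [p for p, o in zip(predictions, observed) if o == 1]
    neg = [p for p, o in zip(predictions, observed) if o == 0]
    if not pos or not neg:
        raise ValueError("need at least one presence and one absence")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predictions at four PA evaluation sites.
predicted = [0.9, 0.4, 0.5, 0.1]
observed = [1, 1, 0, 0]
score = auc(predicted, observed)  # 3 of 4 presence/absence pairs ranked correctly
```

Because the PA sites play no part in model fitting or tuning, a statistic computed this way reflects performance on data the model has not seen.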

txt file adds authors responsible for data preparation, and details of coordinate reference systems, units and raster cell sizes. 

The authors kindly request that each user (even students within teaching exercises) download the data or R package individually because some data providers would like to track data downloads, to enable reporting on data usage as required by their funding agencies.