Presence-only and presence-absence data for comparing species distribution modeling methods
Jane Elith, Catherine H. Graham, Roozbeh Valavi, Meinrad Abegg, Caroline Bruce, Andrew Ford, Antoine Guisan, Robert J. Hijmans, Falk Huettmann, Lúcia G. Lohmann, Bette A. Loiselle, Craig Moritz, Jake J.M. Overton, A. Townsend Peterson, Steven J. Phillips, Karen Richardson, Stephen E. Williams, Susan K. Wiser, Thomas Wohlgemuth, Niklaus E. Zimmermann
Abstract:
Species distribution models (SDMs) are widely used to predict and study distributions of species. Many different modeling methods and associated algorithms are used and continue to emerge. It is important to understand how different approaches perform, particularly when applied to species occurrence records that were not gathered in structured surveys (e.g. opportunistic records). This need motivated a large-scale, collaborative effort, published in 2006, that aimed to create objective comparisons of algorithm performance. As a benchmark, and to facilitate future comparisons of approaches, here we publish that dataset: point location records for 226 anonymised species from six regions of the world, with accompanying predictor variables in raster (grid) and point formats. A particularly interesting characteristic of this dataset is that independent presence-absence survey data are available for evaluation alongside the presence-only species occurrence data intended for modeling. The dataset is available on Open Science Framework and as an R package and can be used as a benchmark for modeling approaches and for testing new ways to evaluate the accuracy of SDMs.
Biodiversity Informatics, 15, 2020, pp. 69-80
1 School of BioSciences, University of Melbourne, Australia.
2 Swiss Federal Research Institute WSL, CH-8903 Birmensdorf, Switzerland.
3 CSIRO Land and Water, Cairns, Queensland, Australia.
4 CSIRO Land and Water, Canberra, Australian Capital Territory (ACT), Australia.
5 CSIRO Land and Water, Tropical Forest Research Centre, Atherton, Queensland, Australia.
6 University of Lausanne, 1015 Lausanne, Switzerland.
7 University of California, Davis, USA.
8 EWHALE Lab, Institute of Arctic Biology, Biology & Wildlife Department, University of Alaska Fairbanks, Fairbanks, Alaska 99775, USA.
9 Universidade de São Paulo, Brazil.
10 College of Agricultural and Life Sciences, University of Florida, USA.
11 Research School of Biology & Center for Biodiversity Analysis, Australian National University, Australia.
12 Manaaki Whenua – Landcare Research, Hamilton, New Zealand (current address: PANTHERA, Floor 18, 8 West 40 St, New York, USA 10018).
13 Biodiversity Institute, University of Kansas, Lawrence, Kansas 66045, USA.
14 Center for Biodiversity and Conservation, American Museum of Natural History, New York, USA.
15 Department of Geography, Planning and Environment, Concordia University, Montreal, Canada.
16 Centre for Tropical Environmental and Sustainability Science, James Cook University, Townsville, Australia.
17 Manaaki Whenua – Landcare Research, Lincoln, New Zealand.
*Corresponding Author: j.elith@unimelb.edu.au
environmental data and can be used to predict distri-
Jane Elith et al. – Presence-only and Presence-absence Data for Comparing Species Distribution Modeling Methods
for PA sites were provided, so modelers could predict
(detailed in Supplementary Information 1,
more transparent and repeatable (National Academy
model trained and
none are problem-free, and SDM evaluation remains
in some areas simply leads to a more or less precise
be used to calculate a broader suite of evaluation sta-
spatial resolution (smallest raster cell size) available,
marised in Table 1, and details of variables in Sup-
Records span species of
to 19 120 PA evaluation sites (Table 1 and detailed
3
http://hdl.handle.net/1808/30582
below, are available openly
OSF data
561 MB in total and many users will not want to
1. Environmental rasters
/data/Environment folder at
/data/Environment
and details of coordinate reference systems, units and
2. Presence-only data: locations and environmental samples
/data/Records/train_po fold-
3. Background data: locations and environmental samples
/data/Records/train_bg folder at
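The folder layout named in the list above can be checked after download with a short script. This is a minimal sketch, not part of the published dataset or its R package: it assumes the OSF archive has been unpacked under a local root directory of your choosing, and the helper names `locate_folders` and `missing_folders` are hypothetical. Only the three relative folder paths come from the text.

```python
from pathlib import Path

# Relative folder paths as named in the text; everything else is illustrative.
SUBFOLDERS = {
    "environment": "data/Environment",    # environmental rasters
    "train_po": "data/Records/train_po",  # presence-only records
    "train_bg": "data/Records/train_bg",  # background records
}


def locate_folders(root):
    """Map each data component to its expected folder under `root`."""
    root = Path(root)
    return {name: root / rel for name, rel in SUBFOLDERS.items()}


def missing_folders(root):
    """Return the names of expected folders that do not exist under `root`."""
    return [name for name, path in locate_folders(root).items()
            if not path.is_dir()]


if __name__ == "__main__":
    # "." stands in for wherever the archive was unpacked (an assumption).
    for name, path in locate_folders(".").items():
        status = "found" if path.is_dir() else "missing"
        print(f"{name}: {path} ({status})")
```

Because the download is large, a check like this makes it easy to confirm which components are present when only part of the archive has been fetched.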
| Code | Region details | Area ('000 km²) | No. env vars (no. categorical) | Approx. grid cell resolution (m) | Biological groups & number of species | Mean no. records per species: PO | Mean no. records per species: PA | No. sites: PA |
|---|---|---|---|---|---|---|---|---|
| AWT | Australian Wet Tropics, Queensland, Australia | 23.97 | 13 (0) | 80 | b: birds: 20 | 155 | 97 | 340 |
| | | | | | p: vascular plants: 20 | 35 | 30 | 102 |
| CAN | Ontario, Canada | 979.34 | 11 (1) | 1 000 | birds: 30 | 253 | 1 282 | 14 571 |
| NSW | North-east New South Wales, Australia | 76.18 | 13 (1) | 100 | ba: bats: 7 | 27 | 76 | 570 |
| | | | | | db: diurnal birds: 8 | 189 | 57 | 702 |
| | | | | | nb: nocturnal birds: 2 | 134 | 142 | 1 137 |
| | | | | | ot: open-forest trees: 8 | 42 | 164 | 2 075 |
| | | | | | ou: open-forest understorey vascular plants: 8 | 21 | 358 | 1 309 |
| | | | | | rt: rainforest trees: 7 | 9 | 212 | 1 036 |
| | | | | | ru: rainforest understorey vascular plants: 6 | 18 | 93 | 909 |
| | | | | | sr: small reptiles: 8 | 84 | 63 | 1 008 |

Area location (map column; red polygons show locations within countries / continents): maps not reproduced.
Frequently Asked Questions (4)
Q2. What was the purpose of the PA evaluation data?
In the publications shown in the table in Supplementary Information 1, the PA evaluation (test) data were kept independent as a “blind evaluation” set, that is, they were not used to tune models.
Q3. What is the name of the txt file?
txt file adds authors responsible for data preparation, and details of coordinate reference systems, units and raster cell sizes.
Q4. What is the purpose of this article?
The authors kindly request that each user (even students within teaching exercises) download the data or R package individually because some data providers would like to track data downloads, to enable reporting on data usage as required by their funding agencies.
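The blind-evaluation idea described above, in which a model is fitted to presence-only data and then scored against independent presence-absence (PA) sites, can be illustrated with a small scoring function. This is a minimal sketch under stated assumptions: the area under the ROC curve (AUC) is used here only as one commonly used evaluation statistic, the function name `auc_rank` is hypothetical, and the observations and predictions are invented toy values rather than records from the dataset.

```python
from typing import Sequence


def auc_rank(observed: Sequence[int], predicted: Sequence[float]) -> float:
    """AUC as the probability that a randomly chosen presence site receives
    a higher prediction than a randomly chosen absence site (ties count 0.5).
    """
    presences = [p for o, p in zip(observed, predicted) if o == 1]
    absences = [p for o, p in zip(observed, predicted) if o == 0]
    if not presences or not absences:
        raise ValueError("need at least one presence and one absence site")

    wins = 0.0
    for p in presences:
        for a in absences:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5
    return wins / (len(presences) * len(absences))


# Toy PA evaluation sites (1 = present, 0 = absent) and model predictions.
# These values are made up for illustration only.
observed = [1, 1, 0, 0, 1, 0]
predicted = [0.9, 0.6, 0.4, 0.7, 0.8, 0.1]

if __name__ == "__main__":
    print(f"AUC = {auc_rank(observed, predicted):.3f}")
```

The pairwise loop is quadratic in the number of sites, which is fine for a sketch; a rank-based formulation would be preferable for regions with many thousands of PA sites.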