Journal Article

Spatial bias in the GBIF database and its effect on modeling species' geographic distributions

01 Jan 2014-Ecological Informatics (Elsevier)-Vol. 19, pp 10-15
TL;DR: Using a subsampling routine on an exemplar taxon, the authors provide evidence that range model quality decreases because distributional records in GBIF are spatially clustered, and show that data with less spatial bias produce better predictive models even though they are based on less input data.
About: This article was published in Ecological Informatics on 2014-01-01. It has received 424 citations to date.
Citations
Journal Article
12 May 2014-PLOS ONE
TL;DR: The ability of methods to correct the initial sampling bias varied greatly depending on bias type, bias intensity and species, but the simple systematic sampling of records consistently ranked among the best performing across the range of conditions tested, whereas other methods performed more poorly in most cases.
Abstract: MAXENT is now a common species distribution modeling (SDM) tool used by conservation practitioners for predicting the distribution of a species from a set of records and environmental predictors. However, datasets of species occurrence used to train the model are often biased in the geographical space because of unequal sampling effort across the study area. This bias may be a source of strong inaccuracy in the resulting model and could lead to incorrect predictions. Although a number of sampling bias correction methods have been proposed, there is no consensual guideline to account for it. We compared here the performance of five methods of bias correction on three datasets of species occurrence: one “virtual” derived from a land cover map, and two actual datasets for a turtle (Chrysemys picta) and a salamander (Plethodon cylindraceus). We subjected these datasets to four types of sampling biases corresponding to potential types of empirical biases. We applied five correction methods to the biased samples and compared the outputs of distribution models to unbiased datasets to assess the overall correction performance of each method. The results revealed that the ability of methods to correct the initial sampling bias varied greatly depending on bias type, bias intensity and species. However, the simple systematic sampling of records consistently ranked among the best performing across the range of conditions tested, whereas other methods performed more poorly in most cases. The strong effect of initial conditions on correction performance highlights the need for further research to develop a step-by-step guideline to account for sampling bias. However, this method seems to be the most efficient in correcting sampling bias and should be advised in most cases.

775 citations
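
The systematic sampling of records mentioned in the abstract above can be illustrated with a small grid-thinning sketch. It is only one plausible reading, not the cited authors' procedure: the function name `thin_to_grid`, the cell size, and the choice to keep the first record in each cell (rather than a random one) are assumptions made for illustration, and the coordinates are invented.

```python
import math
from typing import Dict, Iterable, List, Tuple

Point = Tuple[float, float]  # (longitude, latitude) in decimal degrees


def thin_to_grid(records: Iterable[Point], cell_deg: float) -> List[Point]:
    """Keep at most one record per cell of a regular lon/lat grid.

    Assumption: the first record seen in a cell is kept; a random pick
    per cell would be an equally valid variant.
    """
    if cell_deg <= 0:
        raise ValueError("cell_deg must be positive")
    kept: Dict[Tuple[int, int], Point] = {}
    for lon, lat in records:
        # Integer cell index along each axis; floor handles negative coordinates.
        cell = (math.floor(lon / cell_deg), math.floor(lat / cell_deg))
        kept.setdefault(cell, (lon, lat))
    return list(kept.values())


# Invented example: three clustered records and one isolated record.
records = [(8.01, 47.02), (8.03, 47.04), (8.04, 47.01), (9.55, 46.80)]
thinned = thin_to_grid(records, cell_deg=0.5)
print(len(records), "records thinned to", len(thinned))
```

The cell size controls how aggressively clusters are reduced, so a thinned dataset is smaller but more evenly spread, which is the trade-off the abstract describes.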

Journal Article
TL;DR: In this article, the authors used the Essential Biodiversity Variable framework to describe the range of biodiversity data needed to track progress towards global biodiversity targets, and assessed strengths and gaps in geographical and taxonomic coverage.

460 citations

Journal Article
TL;DR: Open databases and integrative bioinformatic tools allow a rapid approximation of large‐scale patterns of biodiversity across space and altitudinal ranges, and it is found that geographic inaccuracy affects diversity patterns more than taxonomic uncertainties.
Abstract: Aim: Massive digitalization of natural history collections is now leading to a steep accumulation of publicly available species distribution data. However, taxonomic errors and geographical uncertainty of species occurrence records are now acknowledged by the scientific community, putting into question to what extent such data can be used to unveil correct patterns of biodiversity and distribution. We explore this question through quantitative and qualitative analyses of uncleaned versus manually verified datasets of species distribution records across different spatial scales. Location: The American tropics. Methods: As test case we used the plant tribe Cinchoneae (Rubiaceae). We compiled four datasets of species occurrences: one created manually and verified through classical taxonomic work, and the rest derived from GBIF under different cleaning and filling schemes. We used new bioinformatic tools to code species into grids, ecoregions, and biomes following WWF's classification. We analysed species richness and altitudinal ranges of the species. Results: Altitudinal ranges for species and genera were correctly inferred even without manual data cleaning and filling. However, erroneous records affected spatial patterns of species richness. They led to an overestimation of species richness in certain areas outside the centres of diversity in the clade. The location of many of these areas comprised the geographical midpoint of countries and political subdivisions, assigned long after the specimens had been collected. Main conclusion: Open databases and integrative bioinformatic tools allow a rapid approximation of large-scale patterns of biodiversity across space and altitudinal ranges. We found that geographic inaccuracy affects diversity patterns more than taxonomic uncertainties, often leading to false positives, i.e. overestimating species richness in relatively species poor regions. Public databases for species distribution are valuable and should be more explored, but under scrutiny and validation by taxonomic experts. We suggest that database managers implement easy ways of community feedback on data quality.

264 citations


Cites background from "Spatial bias in the GBIF database a..."

  • ...…Biodiversity Information Facility (GBIF, http://www.gbif.org/) is at the moment one of the largest and most widely used biodiversity databases (Beck et al., 2012, 2014; Jetz et al., 2012), with the objective to ‘make the world’s primary data on biodiversity freely and universally available…...

    [...]

Journal Article
TL;DR: It is found that occurrence records are complementary to presence-absence data and using both in combination yields considerably higher recall of 96% along with improved ranking metrics, and a spatio-temporal prior can substantially expedite the overall identification problem.
Abstract: Predicting a list of plant taxa most likely to be observed at a given geographical location and time is useful for many scenarios in biodiversity informatics. Since efficient plant species identification is impeded mainly by the large number of possible candidate species, providing a shortlist of likely candidates can help significantly expedite the task. Whereas species distribution models heavily rely on geo-referenced occurrence data, such information still remains largely unused for plant taxa identification tools. In this paper, we conduct a study on the feasibility of computing a ranked shortlist of plant taxa likely to be encountered by an observer in the field. We use the territory of Germany as case study with a total of 7.62M records of freely available plant presence-absence data and occurrence records for 2.7k plant taxa. We systematically study achievable recommendation quality based on two types of source data: binary presence-absence data and individual occurrence records. Furthermore, we study strategies for aggregating records into a taxa recommendation based on location and date of an observation. We evaluate recommendations using 28k geo-referenced and taxa-labeled plant images hosted on the Flickr website as an independent test dataset. Relying on location information from presence-absence data alone results in an average recall of 82%. However, we find that occurrence records are complementary to presence-absence data and using both in combination yields considerably higher recall of 96% along with improved ranking metrics. Ultimately, by reducing the list of candidate taxa by an average of 62%, a spatio-temporal prior can substantially expedite the overall identification problem.

257 citations


Cites background from "Spatial bias in the GBIF database a..."

  • ...Prediction results were found to strongly depend on sampling bias [17], sampling size [18, 19], and location uncertainty [20] decreasing the confidence in SDM results [21, 22]....

    [...]
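
The recall figures quoted in the abstract above (82% and 96%) measure how often the observed taxon appears in the recommended shortlist. A minimal sketch of such a metric follows; the function name `mean_recall_at_k`, the cutoff `k`, and the toy shortlists are assumptions for illustration, and the cited study's exact evaluation protocol may differ.

```python
from typing import Sequence


def mean_recall_at_k(
    shortlists: Sequence[Sequence[str]], observed: Sequence[str], k: int
) -> float:
    """Fraction of observations whose true taxon is in the top-k of its shortlist."""
    if len(shortlists) != len(observed):
        raise ValueError("need exactly one shortlist per observation")
    if not observed:
        raise ValueError("no observations to score")
    hits = sum(
        1 for ranked, truth in zip(shortlists, observed) if truth in ranked[:k]
    )
    return hits / len(observed)


# Invented example: four observations, each with a ranked shortlist of taxa.
shortlists = [
    ["Bellis perennis", "Taraxacum officinale", "Urtica dioica"],
    ["Urtica dioica", "Bellis perennis", "Trifolium repens"],
    ["Trifolium repens", "Taraxacum officinale", "Bellis perennis"],
    ["Taraxacum officinale", "Urtica dioica", "Trifolium repens"],
]
observed = [
    "Taraxacum officinale",
    "Urtica dioica",
    "Bellis perennis",
    "Bellis perennis",
]
print(mean_recall_at_k(shortlists, observed, k=2))
print(mean_recall_at_k(shortlists, observed, k=3))
```

A longer shortlist raises recall but narrows the candidate set less, which is why the abstract reports recall alongside the average reduction in candidate taxa.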

Journal Article
22 Jan 2021
TL;DR: The authors analyzed publicly available worldwide occurrence records from the Global Biodiversity Information Facility spanning over a century and found that after the 1990s the number of collected bee species declined steeply, such that approximately 25% fewer species were reported between 2006 and 2015 than before the 1990s.
Abstract: Summary Wild and managed bees are key pollinators, ensuring or enhancing the reproduction of a large fraction of the world's wild flowering plants and the yield of ∼85% of all cultivated crops. Recent reports of wild bee decline and its potential consequences are thus worrisome. However, evidence is mostly based on local or regional studies; the global status of bee decline has not been assessed yet. To fill this gap, we analyzed publicly available worldwide occurrence records from the Global Biodiversity Information Facility spanning over a century. We found that after the 1990s, the number of collected bee species declines steeply such that approximately 25% fewer species were reported between 2006 and 2015 than before the 1990s. Although these trends must be interpreted cautiously given the heterogeneous nature of the dataset and potential biases in data collection and reporting, results suggest the need for swift actions to avoid further pollinator decline.

185 citations

References
Journal Article
TL;DR: This paper introduces the use of the maximum entropy method (Maxent), a general-purpose machine learning method with a simple and precise mathematical formulation, for modeling species geographic distributions with presence-only data.

13,120 citations


"Spatial bias in the GBIF database a..." refers methods in this paper

  • ...As a second, independent metric of SDM prediction quality for Switzerland, A. Erhardt, an expert for the ecology, behaviour and distribution of the Swiss Lepidoptera visually interpreted and evaluated predictive maps (Maxent logistic output, identical colour scheme for all maps), applying the Swiss highschool grading system (1-6, in steps of 0.5; best grade is 6)....

    [...]

  • ...All models and AUCMaxent values presented are averages from 5 replicate runs with different random separations of data into “test” and “training”....

    [...]

  • ...Because AUC calculation for presence-only data, as provided by Maxent, replaces missing commission error data with predicted area size, we denote this as AUCMaxent to distinguish it from true AUC (Brown & Davis, 2006; see below)....

    [...]

  • ...Using fewer points (i.e., removing spatial bias) led to a decrease of model quality as measured internally by AUCMaxent (linear regression of records vs. AUCMaxent: r = 0.765, p < 0.0001)....

    [...]

  • ...2.2 Distribution models and internal evaluation. To create SDMs, we used the most widely utilized method and software, Maxent (v. 3.3.2; Phillips et al., 2006; Phillips & Dudík, 2008; see also Joppa et al., 2013)....

    [...]

Journal Article
TL;DR: This work compared 16 modelling methods over 226 species from 6 regions of the world, creating the most comprehensive set of model comparisons to date and found that presence-only data were effective for modelling species' distributions for many species and regions.
Abstract: Prediction of species' distributions is central to diverse applications in ecology, evolution and conservation science. There is increasing electronic access to vast sets of occurrence records in museums and herbaria, yet little effective guidance on how best to use this information in the context of numerous approaches for modelling distributions. To meet this need, we compared 16 modelling methods over 226 species from 6 regions of the world, creating the most comprehensive set of model comparisons to date. We used presence-only data to fit models, and independent presence-absence data to evaluate the predictions. Along with well-established modelling methods such as generalised additive models and GARP and BIOCLIM, we explored methods that either have been developed recently or have rarely been applied to modelling species' distributions. These include machine-learning methods and community models, both of which have features that may make them particularly well suited to noisy or sparse information, as is typical of species' occurrence data. Presence-only data were effective for modelling species' distributions for many species and regions. The novel methods consistently outperformed more established methods. The results of our analysis are promising for the use of data from museums and herbaria, especially as methods suited to the noise inherent in such data improve.

7,589 citations


"Spatial bias in the GBIF database a..." refers methods in this paper

  • ...Maxent was found to be a very good SDM method in critical comparisons of presence-only data modelling approaches (Elith et al., 2006; but see Fitzpatrick et al., 2013)....

    [...]

Journal Article
TL;DR: This paper presents a tuning method that uses presence-only data for parameter tuning, and introduces several concepts that improve the predictive accuracy and running time of Maxent and describes a new logistic output format that gives an estimate of probability of presence.
Abstract: Accurate modeling of geographic distributions of species is crucial to various applications in ecology and conservation. The best performing techniques often require some parameter tuning, which may be prohibitively time-consuming to do separately for each species, or unreliable for small or biased datasets. Additionally, even with the abundance of good quality data, users interested in the application of species models need not have the statistical knowledge required for detailed tuning. In such cases, it is desirable to use "default settings", tuned and validated on diverse datasets. Maxent is a recently introduced modeling technique, achieving high predictive accuracy and enjoying several additional attractive properties. The performance of Maxent is influenced by a moderate number of parameters. The first contribution of this paper is the empirical tuning of these parameters. Since many datasets lack information about species absence, we present a tuning method that uses presence-only data. We evaluate our method on independently collected high-quality presence-absence data. In addition to tuning, we introduce several concepts that improve the predictive accuracy and running time of Maxent. We introduce "hinge features" that model more complex relationships in the training data; we describe a new logistic output format that gives an estimate of probability of presence; finally we explore "background sampling" strategies that cope with sample selection bias and decrease model-building time. 
Our evaluation, based on a diverse dataset of 226 species from 6 regions, shows: 1) default settings tuned on presence-only data achieve performance which is almost as good as if they had been tuned on the evaluation data itself; 2) hinge features substantially improve model performance; 3) logistic output improves model calibration, so that large differences in output values correspond better to large differences in suitability; 4) "target-group" background sampling can give much better predictive performance than random background sampling; 5) random background sampling results in a dramatic decrease in running time, with no decrease in model performance.

5,314 citations


"Spatial bias in the GBIF database a..." refers methods in this paper

  • ...2.2 Distribution models and internal evaluation. To create SDMs, we used the most widely utilized method and software, Maxent (v. 3.3.2; Phillips et al., 2006; Phillips & Dudík, 2008; see also Joppa et al., 2013)....

    [...]

Journal Article
TL;DR: Species distribution models (SDMs) are numerical tools that combine observations of species occurrence or abundance with environmental estimates; they are used to gain ecological and evolutionary insights and to predict distributions across landscapes, sometimes requiring extrapolation in space and time.
Abstract: Species distribution models (SDMs) are numerical tools that combine observations of species occurrence or abundance with environmental estimates. They are used to gain ecological and evolutionary insights and to predict distributions across landscapes, sometimes requiring extrapolation in space and time. SDMs are now widely used across terrestrial, freshwater, and marine realms. Differences in methods between disciplines reflect both differences in species mobility and in “established use.” Model realism and robustness is influenced by selection of relevant predictors and modeling method, consideration of scale, how the interplay between environmental and geographic factors is handled, and the extent of extrapolation. Current linkages between SDM practice and ecological theory are often weak, hindering progress. Remaining challenges include: improvement of methods for modeling presence-only data and for model selection and evaluation; accounting for biotic interactions; and assessing model uncertainty.

5,076 citations


"Spatial bias in the GBIF database a..." refers background in this paper

  • ...Ecological niche modelling or species distribution modelling (SDM; Elith & Leathwick, 2009) is a quantitative way of estimating species geographic ranges from occurrence records and the environmental conditions found there....

    [...]

Journal Article
TL;DR: The area under the receiver operating characteristic (ROC) curve, known as the AUC, is currently considered the standard method for assessing the accuracy of predictive distribution models; this paper reviews the measure and questions its reliability for comparing accuracy between models.
Abstract: The area under the receiver operating characteristic (ROC) curve, known as the AUC, is currently considered to be the standard method to assess the accuracy of predictive distribution models. It avoids the supposed subjectivity in the threshold selection process, when continuous probability derived scores are converted to a binary presence‐absence variable, by summarizing overall model performance over all possible thresholds. In this manuscript we review some of the features of this measure and bring into question its reliability as a comparative measure of accuracy between model results. We do not recommend using AUC for five reasons: (1) it ignores the predicted probability values and the goodness-of-fit of the model; (2) it summarises the test performance over regions of the ROC space in which one would rarely operate; (3) it weights omission and commission errors equally; (4) it does not give information about the spatial distribution of model errors; and, most importantly, (5) the total extent to which models are carried out highly influences the rate of well-predicted absences and the AUC scores.

2,711 citations


"Spatial bias in the GBIF database a..." refers methods in this paper

  • ...…by modelling extent, for being not consistent with other evaluation criteria, and (if applied to presence-only data) for not representing “true” AUC (Lobo et al., 2008; Peterson et al., 2008; Jiménez-Valverde, 2011; Barve et al., 2011; L. Ballesteros-Mejia, I.J. Kitching & J. Beck, unpubl.)....

    [...]
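
The criticism that AUC depends on modelling extent, and that a presence-only AUC is not a "true" AUC, can be made concrete with a small sketch. It computes AUC as the probability that a randomly chosen presence site scores higher than a randomly chosen background site, with background points standing in for absences. The function name `auc_presence_vs_background` and all scores are invented for illustration; this is not Maxent's own implementation.

```python
from typing import Sequence


def auc_presence_vs_background(
    presence_scores: Sequence[float], background_scores: Sequence[float]
) -> float:
    """Rank-based AUC: P(presence score > background score), ties count 0.5."""
    if not presence_scores or not background_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in presence_scores:
        for b in background_scores:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(presence_scores) * len(background_scores))


# Invented suitability scores predicted at presence and background sites.
presence = [0.9, 0.8, 0.4]
background = [0.7, 0.4, 0.2, 0.1]
print(auc_presence_vs_background(presence, background))

# Enlarging the study extent adds background sites that are clearly unsuitable.
# The model itself is unchanged, yet the AUC goes up.
background_wide = background + [0.0] * 8
print(auc_presence_vs_background(presence, background_wide))
```

Because the second value is higher although the predictions at the original sites are identical, AUC values computed over different extents are not directly comparable.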