SoilGrids250m: Global gridded soil information based on machine learning

Abstract
This paper describes the technical development and accuracy assessment of the most recent and improved version of the SoilGrids system at 250 m resolution (June 2016 update). SoilGrids provides global predictions for standard numeric soil properties (organic carbon, bulk density, cation exchange capacity (CEC), pH, soil texture fractions and coarse fragments) at seven standard depths (0, 5, 15, 30, 60, 100 and 200 cm), in addition to predictions of depth to bedrock and the distribution of soil classes based on the World Reference Base (WRB) and USDA classification systems (ca. 280 raster layers in total). Predictions were based on ca. 150,000 soil profiles used for training and a stack of 158 remote sensing-based soil covariates (primarily derived from MODIS land products, SRTM DEM derivatives, climatic images and global landform and lithology maps), which were used to fit an ensemble of machine learning methods (random forest and gradient boosting and/or multinomial logistic regression) as implemented in the R packages ranger, xgboost, nnet and caret. The results of 10-fold cross-validation show that the ensemble models explain between 56% (coarse fragments) and 83% (pH) of variation, with an overall average of 61%. Improvements in relative accuracy, measured as the amount of variation explained and compared with the previous version of SoilGrids at 1 km spatial resolution, range from 60 to 230%. These improvements can be attributed to: (1) the use of machine learning instead of linear regression, (2) considerable investments in preparing finer resolution covariate layers and (3) the insertion of additional soil profiles. Further development of SoilGrids could include refinement of methods to incorporate input uncertainties and derivation of posterior probability distributions (per pixel), and further automation of spatial modeling so that soil maps can be generated for potentially hundreds of soil variables. Another area of future research is the development of methods for multiscale merging of SoilGrids predictions with local and/or national gridded soil products (e.g. up to 50 m spatial resolution) so that increasingly accurate, complete and consistent global soil information can be produced. SoilGrids are available under the Open Database License (ODbL).


RESEARCH ARTICLE
SoilGrids250m: Global gridded soil
information based on machine learning
Tomislav Hengl¹*, Jorge Mendes de Jesus¹, Gerard B. M. Heuvelink¹, Maria Ruiperez Gonzalez¹, Milan Kilibarda², Aleksandar Blagotić³, Wei Shangguan⁴, Marvin N. Wright⁵, Xiaoyuan Geng⁶, Bernhard Bauer-Marschallinger⁷, Mario Antonio Guevara⁸, Rodrigo Vargas⁸, Robert A. MacMillan⁹, Niels H. Batjes¹, Johan G. B. Leenaars¹, Eloi Ribeiro¹, Ichsani Wheeler¹⁰, Stephan Mantel¹, Bas Kempen¹
1 ISRIC – World Soil Information, Wageningen, the Netherlands, 2 Faculty of Civil Engineering, University of Belgrade, Belgrade, Serbia, 3 GILab Ltd, Belgrade, Serbia, 4 School of Atmospheric Sciences, Sun Yat-sen University, Guangzhou, China, 5 Institut für Medizinische Biometrie und Statistik, Lübeck, Germany, 6 Agriculture and Agri-Food Canada, Ottawa (Ontario), Canada, 7 Department of Geodesy and Geoinformation, Vienna University of Technology, Vienna, Austria, 8 University of Delaware, Newark (DE), United States of America, 9 LandMapper Environmental Solutions Inc., Edmonton (Alberta), Canada, 10 Envirometrix Inc., Wageningen, the Netherlands
* tom.hengl@isric.org
PLOS ONE | DOI:10.1371/journal.pone.0169748 February 16, 2017 1 / 40
OPEN ACCESS
Citation: Hengl T, Mendes de Jesus J, Heuvelink GBM, Ruiperez Gonzalez M, Kilibarda M, Blagotić A, et al. (2017) SoilGrids250m: Global gridded soil information based on machine learning. PLoS ONE 12(2): e0169748. doi:10.1371/journal.pone.0169748
Editor: Ben Bond-Lamberty, Pacific Northwest
National Laboratory, UNITED STATES
Received: August 1, 2016
Accepted: December 21, 2016
Published: February 16, 2017
Copyright: © 2017 Hengl et al. This is an open
access article distributed under the terms of the
Creative Commons Attribution License, which
permits unrestricted use, distribution, and
reproduction in any medium, provided the original
author and source are credited.
Data Availability Statement: SoilGrids are available under the Open Database License (ODbL) v1.0 and can be downloaded from www.soilgrids.org and/or ftp.soilgrids.org without restrictions. SoilGrids250m data was released in July 2016 (see: http://www.isric.org/content/isric-releases-upgraded-soilgrids-system-two-times-improved-accuracy-predictions). Access to SoilGrids maps is provided via a soil web mapping portal at SoilGrids.org; through a Web Coverage Service (WCS); and via the SoilInfo App, hence access to data is without restrictions. All the code

Introduction
There is a growing demand for detailed soil information, especially for global estimation of soil organic carbon [1–3] and for modeling agricultural productivity [4, 5]. Spatial information about soil water parameters is likely to become increasingly critical in areas affected by climate change [6]. Soils and soil information are also particularly relevant for the Sustainable Development Goal target 15.3 of achieving Land Degradation Neutrality (LDN), as specified by the United Nations Convention to Combat Desertification (UNCCD; http://www.unccd.int), and are one of the main areas of interest of the FAO’s Global Soil Partnership initiative [7]. Folberth et al. [8] recently found that accurate soil information might be the key to predicting either buffering or amplifying impacts of climate change on food production.
To reduce the gap between soil data demand and availability, ISRIC (International Soil Reference Information Centre) – World Soil Information released a Global Soil Information system called “SoilGrids”. The first version of SoilGrids (predictions at 1 km spatial resolution, released in 2014) was, at the time, a ‘proof of concept’ demonstrating that global compilations of soil profiles can be used in an automated framework to produce complete and consistent spatial predictions of soil properties and classes [9]. Since the launch of the system in 2014, several colleagues have recognized and reported some of the limitations of the first version. Mulder et al. [10] observed, using more detailed soil profile data and maps, that SoilGrids likely overestimated all low values for organic carbon content in France. Likewise, Griffiths et al. [11] reported underestimation of the pH in comparison to UK national data. The overestimation of low values was mainly an effect of limited fitting success, which smooths out both high and low values. In addition, many of the artifacts visible in the Harmonized World Soil Database (HWSD) [12], which was used as one of the covariates to produce the first version of SoilGrids, e.g. country borders, were propagated to SoilGrids1km. Some users have also expressed concerns that the first version of SoilGrids did not provide predictions for arid and desert areas and hence can be considered an incomplete product [13].
To address these criticisms and concerns, we have re-designed and re-implemented SoilGrids with a particular emphasis on addressing methodological limitations of SoilGrids1km. Hence, our main objective was to build a more robust system with improved output data quality, especially considering spatial detail and attribute accuracy of spatial predictions. We implemented the following six key improvements:
1. We replaced linear models with tree-based, non-linear machine learning models to account for non-linear relationships (especially for modeling soil property–depth relationships), but also to be able to better represent local soil–covariate relationships. Predictions are now primarily data-driven. Much less time is spent on choosing models, which also reduces the complexity of producing updates.
2. We replaced single prediction models with an ensemble framework, i.e. we use at least two methods for each soil variable to reduce overshooting effects.
used to generate SoilGrids250m predictions is fully documented via: https://github.com/ISRICWorldSoil/SoilGrids250m/.
Funding: ISRIC is a non-profit organization
primarily funded by the Dutch government. The
funders had no role in study design, data collection
and analysis, decision to publish, or preparation of
the manuscript. GILAB DOO provided support in
the form of salaries for author AB, but did not have
any additional role in the study design, data
collection and analysis, decision to publish, or
preparation of the manuscript. The specific roles of
this author are articulated in the ‘author
contributions’ section.
Competing interests: Aleksandar Blagotić is an employee and web developer of GILAB DOO. There are no patents, products in development or marketed products to declare. This does not alter our adherence to all the PLOS ONE policies on sharing data and materials.

3. We extended the initial list of covariates to include a wider diversity of MODIS land products and to better represent factors of soil formation. The spatial resolution of covariates was increased from 1 km to 250 m with the expectation that finer resolution will help increase the prediction accuracy.
4. We re-implemented the global soil mask using state-of-the-art land cover products [14]. The current soil mask now includes all previously excluded dryland and sand dune areas so that most of the land mask (> 95%) is represented.
5. The global compilation of soil profiles and samples used for model training was also extended. We added extra points for the Russian Federation, Brazil, Mexico and the Arctic circle, and re-visited data harmonization issues.
6. We created and inserted expert-based pseudo-points for a selection of parameters to minimize extrapolation effects in undersampled geographic areas lacking field observations, such as deserts, semi-deserts, glaciers and permafrost areas.
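The ensemble idea in improvement 2 can be illustrated with a minimal sketch. This excerpt does not state how the individual models are combined, so the weighted mean below, the function name `ensemble_predict`, and all numbers are assumptions chosen only to show how averaging two learners dampens an extreme prediction from either one.

```python
def ensemble_predict(predictions, weights):
    """Weighted mean of per-model predictions.

    predictions: list of equally long lists, one per model (hypothetical values).
    weights: one non-negative weight per model (assumed, e.g. a skill score).
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_points = len(predictions[0])
    return [
        sum(w * p[i] for p, w in zip(predictions, weights)) / total
        for i in range(n_points)
    ]


# Hypothetical pH predictions at three locations from two learners.
random_forest = [5.2, 6.1, 7.0]
gradient_boosting = [5.6, 5.9, 7.4]

# Equal weights reduce to the plain average of the two models.
combined = ensemble_predict([random_forest, gradient_boosting], [1.0, 1.0])
print([round(v, 2) for v in combined])
```

Unequal weights, for instance derived from cross-validation accuracy, would pull the combined value toward the better-performing model.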
We present here the technical development and accuracy assessment of the updated SoilGrids system at 250 m resolution. In the following sections we describe the workflows used to generate spatial predictions and report results of model fitting and accuracy assessment based on 10-fold cross-validation. We conclude the article by suggesting some possible applications of this new data set and identifying possible future improvements. SoilGrids250m map layers are available for download via www.SoilGrids.org under the Open Database License (ODbL). GeoTiffs can also be obtained from ftp://ftp.soilgrids.org/data/.
Methods and materials
Target variables
SoilGrids provides predictions for the following list of standard soil properties and classes [9]:
Soil organic carbon content (g kg⁻¹),
Soil pH in H₂O and KCl solution,
Sand, silt and clay (weight %),
Bulk density (kg m⁻³) of the fine earth fraction (< 2 mm),
Cation-exchange capacity (cmol+/kg) of the fine earth fraction,
Coarse fragments (volumetric %),
Depth to bedrock (cm) and occurrence of R horizon,
World Reference Base (WRB) class: at present, we map 118 unique soil classes, e.g. Plinthic Acrisols, Albic Arenosols, Haplic Cambisols (Chromic), Calcic Gleysols and similar [15]. This is about four times as many classes as in the previous version of SoilGrids,
United States Department of Agriculture (USDA) Soil Taxonomy suborders, i.e. 67 soil classes [16].
We generated predictions at seven standard depths for all numeric soil properties (except for depth to bedrock and soil organic carbon stock): 0 cm, 5 cm, 15 cm, 30 cm, 60 cm, 100 cm and 200 cm, following the vertical discretisation as specified in the GlobalSoilMap specifications [17]. Averages over (standard) depth intervals, e.g. 0–5 cm or 0–30 cm, can be derived by taking a weighted average of the predictions within the depth interval using numerical
integration, such as the trapezoidal rule:

$$\frac{1}{b-a}\int_a^b f(x)\,\mathrm{d}x \approx \frac{1}{(b-a)}\,\frac{1}{2}\sum_{k=1}^{N-1}\left(x_{k+1}-x_k\right)\left(f(x_k)+f(x_{k+1})\right) \qquad (1)$$

where $a$ and $b$ are the upper and lower limits of the depth interval, $N$ is the number of depths, $x_k$ is the k-th depth and $f(x_k)$ is the value of the target variable (i.e., soil property) at depth $x_k$. For example, for the 0–30 cm depth interval, with soil pH values at the first four standard depths equal to 4.5, 5.0, 5.3 and 5.0, the pH is estimated as

$$\frac{1}{30}\cdot\frac{1}{2}\left[(5-0)(4.5+5.0)+(15-5)(5.0+5.3)+(30-15)(5.3+5.0)\right] = 5.083$$

(Fig 1).
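The depth-interval averaging of Eq (1) can be sketched in a few lines of Python. The function name `depth_interval_mean` is hypothetical and not part of the SoilGrids code base; the sketch assumes the depths are sorted in ascending order and that the interval limits coincide with two of the prediction depths, as they do for the standard intervals.

```python
def depth_interval_mean(depths, values, upper, lower):
    """Average of `values` over [upper, lower] cm using the trapezoidal rule (Eq 1)."""
    # Keep only the prediction depths that fall inside the requested interval.
    pairs = [(d, v) for d, v in zip(depths, values) if upper <= d <= lower]
    if len(pairs) < 2 or pairs[0][0] != upper or pairs[-1][0] != lower:
        raise ValueError("interval limits must coincide with prediction depths")
    total = 0.0
    for (x0, f0), (x1, f1) in zip(pairs, pairs[1:]):
        total += (x1 - x0) * (f0 + f1)
    return 0.5 * total / (lower - upper)


# Worked example from the text: pH at 0, 5, 15 and 30 cm.
depths_cm = [0, 5, 15, 30]
ph_values = [4.5, 5.0, 5.3, 5.0]
ph_0_30 = depth_interval_mean(depths_cm, ph_values, 0, 30)
print(round(ph_0_30, 3))  # 5.083
```

Because the depth increments act as weights, the thicker 15–30 cm layer contributes more to the result than the thin 0–5 cm layer.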
Based on predictions of soil organic carbon content, bulk density, and coarse fragments, we also derived soil organic carbon stock (t ha⁻¹) for the six GlobalSoilMap standard depth intervals following the standard approach [9, 18]. Fig 2 shows an example of observed vs predicted values and corresponding derived soil organic carbon stock for 0–1 m and 1–2 m depths.
Model fitting and spatial prediction of depth to bedrock is also based on water well drilling data and is explained in detail in Shangguan et al. [19].
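To make the derivation of soil organic carbon stock concrete, here is a minimal sketch of one common formulation: carbon content times bulk density times layer thickness, reduced by the coarse fragment fraction. The exact procedure used for SoilGrids is the one defined in [9, 18]; the function name and the input values below are assumptions for illustration only.

```python
def organic_carbon_stock_t_ha(oc_g_kg, bulk_density_kg_m3, coarse_fragments_pct, thickness_cm):
    """Soil organic carbon stock (t/ha) of one layer; a common formulation, assumed here."""
    # Volume share of fine earth once coarse fragments are excluded.
    fine_earth_fraction = 1.0 - coarse_fragments_pct / 100.0
    stock_kg_m2 = (
        (oc_g_kg / 1000.0)         # g/kg -> kg/kg
        * bulk_density_kg_m3       # kg of fine earth per m^3
        * (thickness_cm / 100.0)   # cm -> m
        * fine_earth_fraction
    )
    return stock_kg_m2 * 10.0      # 1 kg/m^2 equals 10 t/ha


# Hypothetical 0-30 cm layer: 20 g/kg carbon, 1300 kg/m^3, 10% coarse fragments.
stock = organic_carbon_stock_t_ha(20.0, 1300.0, 10.0, 30.0)
print(round(stock, 1))  # 70.2
```

In practice the content, density and coarse fragment inputs would be the depth-interval averages obtained with Eq (1), and the stocks of successive layers would be summed to obtain totals such as 0–100 cm.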
Fig 1. Standard soil depths following the GlobalSoilMap.net specifications and example of numerical integration following the trapezoidal rule.
doi:10.1371/journal.pone.0169748.g001

We set the reference soil surface at the air/soil boundary, as per FAO [20], hence all soil material is included. Some national soil survey teams (and also earlier versions of the FAO standards) define 0 cm depth at the start of the mineral soil, i.e. just below the O or the P (peat) horizon. Consider for example the following sample soil profile from Canada [21]:
hor   top   bottom   bd     orgcarb
LFH   -12   0        0.07   48.1
Ae    0     11       1.3    0.6
AB    11    25       1.53   0.4
Bt    25    44       1.62   0.4
which shows that the vertical coordinates of the organic layer of this soil site are negative (LFH indicates Litter, Fermentation, Humus; orgcarb is the soil organic carbon, bd is the bulk density, and top and bottom are the upper and lower horizon depths in cm). Therefore, to avoid vertical mismatches between different national systems, all systems that put the zero level at the start of the mineral soil have been adjusted to a reference with the zero level at the air/soil boundary. For the example soil profile from Canada this means that 12 cm was added to all top and bottom values, so that the LFH layer spans 0–12 cm and the Ae horizon 12–23 cm. Note the sharp discontinuity in organic carbon in this example, which drops from 48.1% to 0.6% within 12 cm of depth.
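The depth adjustment described above amounts to shifting every horizon by the thickness of the material lying above the mineral-soil zero level. The following sketch applies that shift to the Canadian example profile; the function name `shift_to_surface` and the tuple layout are illustrative assumptions, not the harmonization code used for SoilGrids.

```python
def shift_to_surface(horizons):
    """Re-reference horizon depths so that 0 cm is the air/soil boundary.

    horizons: list of (name, top_cm, bottom_cm) tuples.
    """
    offset = -min(top for _, top, _ in horizons)
    if offset <= 0:
        # No horizon starts above the zero level, so nothing needs shifting.
        return list(horizons)
    return [(name, top + offset, bottom + offset) for name, top, bottom in horizons]


# Example profile from the text, with the organic LFH layer above the mineral soil.
profile = [("LFH", -12, 0), ("Ae", 0, 11), ("AB", 11, 25), ("Bt", 25, 44)]

for name, top, bottom in shift_to_surface(profile):
    print(name, top, bottom)
# LFH 0 12
# Ae 12 23
# AB 23 37
# Bt 37 56
```

Profiles that already use the air/soil boundary as zero pass through unchanged, which makes the function safe to apply to a mixed compilation.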
Fig 2. Example of soil variable-depth curves: Original sampled soil profiles (black rectangles) vs predicted SoilGrids values at seven standard depths (broken red line), and predicted soil organic carbon stock for depth intervals 0–100 and 100–200 cm. Locations of points from the USDA National Cooperative Soil Survey Soil Characterization database: mineral soil S1991CA055001 (-122.37°W, 38.25°N), and an organic soil profile S2012CA067002 (-121.62°W, 38.13°N).
doi:10.1371/journal.pone.0169748.g002
