SoilGrids250m: Global gridded soil information based on machine learning

doi:10.1371/JOURNAL.PONE.0169748

RESEARCH ARTICLE

SoilGrids250m: Global gridded soil information based on machine learning

Tomislav Hengl¹*, Jorge Mendes de Jesus¹, Gerard B. M. Heuvelink¹, Maria Ruiperez Gonzalez¹, Milan Kilibarda², Aleksandar Blagotić³, Wei Shangguan⁴, Marvin N. Wright⁵, Xiaoyuan Geng⁶, Bernhard Bauer-Marschallinger⁷, Mario Antonio Guevara⁸, Rodrigo Vargas⁸, Robert A. MacMillan⁹, Niels H. Batjes¹, Johan G. B. Leenaars¹, Eloi Ribeiro¹, Ichsani Wheeler¹⁰, Stephan Mantel¹, Bas Kempen¹

1 ISRIC - World Soil Information, Wageningen, the Netherlands, 2 Faculty of Civil Engineering, University of Belgrade, Belgrade, Serbia, 3 GILab Ltd, Belgrade, Serbia, 4 School of Atmospheric Sciences, Sun Yat-sen University, Guangzhou, China, 5 Institut für Medizinische Biometrie und Statistik, Lübeck, Germany, 6 Agriculture and Agri-Food Canada, Ottawa (Ontario), Canada, 7 Department of Geodesy and Geoinformation, Vienna University of Technology, Vienna, Austria, 8 University of Delaware, Newark (DE), United States of America, 9 LandMapper Environmental Solutions Inc., Edmonton (Alberta), Canada, 10 Envirometrix Inc., Wageningen, the Netherlands

* tom.hengl@isric.org

Abstract

This paper describes the technical development and accuracy assessment of the most recent and improved version of the SoilGrids system at 250 m resolution (June 2016 update). SoilGrids provides global predictions for standard numeric soil properties (organic carbon, bulk density, Cation Exchange Capacity (CEC), pH, soil texture fractions and coarse fragments) at seven standard depths (0, 5, 15, 30, 60, 100 and 200 cm), in addition to predictions of depth to bedrock and distribution of soil classes based on the World Reference Base (WRB) and USDA classification systems (ca. 280 raster layers in total). Predictions were based on ca. 150,000 soil profiles used for training and a stack of 158 remote sensing-based soil covariates (primarily derived from MODIS land products, SRTM DEM derivatives, climatic images and global landform and lithology maps), which were used to fit an ensemble of machine learning methods (random forest and gradient boosting and/or multinomial logistic regression) as implemented in the R packages ranger, xgboost, nnet and caret. The results of 10-fold cross-validation show that the ensemble models explain between 56% (coarse fragments) and 83% (pH) of variation with an overall average of 61%. Improvements in the relative accuracy considering the amount of variation explained, in comparison to the previous version of SoilGrids at 1 km spatial resolution, range from 60 to 230%. Improvements can be attributed to: (1) the use of machine learning instead of linear regression, (2) considerable investments in preparing finer resolution covariate layers and (3) the insertion of additional soil profiles. Further development of SoilGrids could include refinement of methods to incorporate input uncertainties and derivation of posterior probability distributions (per pixel), and further automation of spatial modeling so that soil maps can be generated for potentially hundreds of soil variables. Another area of future research is the development of methods for multiscale merging of SoilGrids predictions with local and/or national gridded soil products (e.g. up to 50 m spatial resolution) so that increasingly more accurate, complete and consistent global soil information can be produced. SoilGrids are available under the Open Database License.

PLOS ONE | DOI:10.1371/journal.pone.0169748 February 16, 2017 1 / 40

OPEN ACCESS

Citation: Hengl T, Mendes de Jesus J, Heuvelink GBM, Ruiperez Gonzalez M, Kilibarda M, Blagotić A, et al. (2017) SoilGrids250m: Global gridded soil information based on machine learning. PLoS ONE 12(2): e0169748. doi:10.1371/journal.pone.0169748

Editor: Ben Bond-Lamberty, Pacific Northwest National Laboratory, UNITED STATES

Received: August 1, 2016

Accepted: December 21, 2016

Published: February 16, 2017

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability Statement: SoilGrids are available under the Open Database License (ODbL) v1.0 and can be downloaded from www.soilgrids.org and/or ftp.soilgrids.org without restrictions. SoilGrids250m data were released in July 2016 (see: http://www.isric.org/content/isric-releases-upgraded-soilgrids-system-two-times-improved-accuracy-predictions). Access to SoilGrids maps is provided via a soil web mapping portal at SoilGrids.org, through a Web Coverage Service (WCS), and via the SoilInfo App; hence access to data is without restrictions. All the code

Introduction

There is a growing demand for detailed soil information, especially for global estimation of soil organic carbon [1–3] and for modeling agricultural productivity [4, 5]. Spatial information about soil water parameters is likely to become increasingly critical in areas affected by climate change [6]. Soils and soil information are also particularly relevant for the Sustainable Development Goal target 15.3 of achieving Land Degradation Neutrality (LDN), as specified by the United Nations Convention to Combat Desertification (UNCCD; http://www.unccd.int), and are one of the main areas of interest of the FAO’s Global Soil Partnership initiative [7]. Folberth et al. [8] have recently found that accurate soil information might be the key to predicting either buffering or amplifying impacts of climate change on food production.

To reduce the gap between soil data demand and availability, ISRIC (International Soil Reference Information Centre) - World Soil Information released a global soil information system called “SoilGrids”. The first version of SoilGrids (predictions at 1 km spatial resolution, released in 2014) was, at the time, a ‘proof of concept’ demonstrating that global compilations of soil profiles can be used in an automated framework to produce complete and consistent spatial predictions of soil properties and classes [9]. Since the launch of the system in 2014, several colleagues have recognized and reported some of the limitations of the first version. Mulder et al. [10] observed, using more detailed soil profile data and maps, that SoilGrids likely overestimated all low values for organic carbon content in France. Likewise, Griffiths et al. [11] reported underestimation of the pH in comparison to UK national data. The overestimation of low values happened mainly as an effect of limited fitting success, which smooths out both high and low values. In addition, many of the artifacts visible in the Harmonized World Soil Database (HWSD) [12], which was used as one of the covariates to produce the first version of SoilGrids, e.g. country borders, were propagated to SoilGrids1km. Some users have also expressed concerns that the first version of SoilGrids did not provide predictions for arid and desert areas and hence can be considered an incomplete product [13].

To address these criticisms and concerns, we have re-designed and re-implemented SoilGrids with a particular emphasis on addressing methodological limitations of SoilGrids1km. Hence, our main objective was to build a more robust system with improved output data quality, especially considering spatial detail and attribute accuracy of spatial predictions. We implemented the following six key improvements:

1. We replaced linear models with tree-based, non-linear machine learning models to account for non-linear relationships, especially soil property–depth relationships, and to better represent local soil–covariate relationships. Predictions are now primarily data-driven. Much less time is spent on choosing models, which also reduces the complexity of producing updates.

2. We replaced single prediction models with an ensemble framework, i.e. we use at least two methods for each soil variable to reduce overshooting effects.

3. We extended the initial list of covariates to include a wider diversity of MODIS land products and to better represent factors of soil formation. The spatial resolution of covariates was increased from 1 km to 250 m with the expectation that finer resolution will help increase the prediction accuracy.

4. We re-implemented the global soil mask using state-of-the-art land cover products [14]. The current soil mask now includes all previously excluded dryland and sand dune areas so that most of the land mask (> 95%) is represented.

5. The global compilation of soil profiles and samples used for model training was also extended. We added extra points for the Russian Federation, Brazil, Mexico and the Arctic Circle, and revisited data harmonization issues.

6. We created and inserted expert-based pseudo-points for a selection of parameters to minimize extrapolation effects in undersampled geographic areas lacking field observations, such as deserts, semi-deserts, glaciers and permafrost areas.

used to generate SoilGrids250m predictions is fully documented via: https://github.com/ISRICWorldSoil/SoilGrids250m/.

Funding: ISRIC is a non-profit organization primarily funded by the Dutch government. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. GILAB DOO provided support in the form of salaries for author AB, but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of this author are articulated in the ‘author contributions’ section.

Competing interests: Aleksandar Blagotić is an employee and web developer of GILAB DOO. There are no patents, products in development or marketed products to declare. This does not alter our adherence to all the PLOS ONE policies on sharing data and materials.

We present here the technical development and accuracy assessment of the updated SoilGrids system at 250 m resolution. In the following sections we describe the workflows used to generate spatial predictions and report results of model fitting and accuracy assessment based on 10-fold cross-validation. We conclude the article by suggesting some possible applications of this new data set and identifying possible future improvements. SoilGrids250m map layers are available for download via www.SoilGrids.org under the Open Database License (ODbL). GeoTiffs can also be obtained from ftp://ftp.soilgrids.org/data/.

Methods and materials

Target variables

SoilGrids provides predictions for the following list of standard soil properties and classes [9]:

• Soil organic carbon content in ‰ (g kg⁻¹),

• Soil pH in H₂O and KCl solution,

• Sand, silt and clay (weight %),

• Bulk density (kg m⁻³) of the fine earth fraction (< 2 mm),

• Cation-exchange capacity (cmol+/kg) of the fine earth fraction,

• Coarse fragments (volumetric %),

• Depth to bedrock (cm) and occurrence of R horizon,

• World Reference Base (WRB) class: at present, we map 118 unique soil classes, e.g. Plinthic Acrisols, Albic Arenosols, Haplic Cambisols (Chromic), Calcic Gleysols and similar [15]. This is about four times as many classes as in the previous version of SoilGrids,

• United States Department of Agriculture (USDA) Soil Taxonomy suborders, i.e. 67 soil classes [16].

We generated predictions at seven standard depths for all numeric soil properties (except for depth to bedrock and soil organic carbon stock): 0 cm, 5 cm, 15 cm, 30 cm, 60 cm, 100 cm and 200 cm, following the vertical discretisation specified in the GlobalSoilMap specifications [17]. Averages over (standard) depth intervals, e.g. 0–5 cm or 0–30 cm, can be derived by taking a weighted average of the predictions within the depth interval using numerical integration, such as the trapezoidal rule:

$$\frac{1}{b-a}\int_a^b f(x)\,dx \approx \frac{1}{(b-a)}\,\frac{1}{2}\sum_{k=1}^{N-1}\left(x_{k+1}-x_k\right)\left(f(x_k)+f(x_{k+1})\right) \qquad (1)$$

where $N$ is the number of depths, $x_k$ is the $k$-th depth and $f(x_k)$ is the value of the target variable (i.e., soil property) at depth $x_k$. For example, for the 0–30 cm depth interval, with soil pH values at the first four standard depths equal to 4.5, 5.0, 5.3 and 5.0, the pH is estimated as

$$\frac{1}{30 \cdot 2}\left[(5-0)\,(4.5+5.0) + (15-5)\,(5.0+5.3) + (30-15)\,(5.3+5.0)\right] = 5.083$$

(Fig 1).
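The depth-interval averaging in Eq (1) can be sketched in a few lines of Python. This is a minimal illustration, not the code used to produce SoilGrids; the function name `depth_interval_mean` and its arguments are hypothetical and were chosen only for this example.

```python
from typing import Sequence


def depth_interval_mean(depths_cm: Sequence[float], values: Sequence[float]) -> float:
    """Trapezoidal-rule average of a soil property between the first and last depth.

    depths_cm: strictly increasing depths x_1..x_N (cm)
    values:    property values f(x_1)..f(x_N) at those depths
    """
    if len(depths_cm) != len(values) or len(depths_cm) < 2:
        raise ValueError("need at least two depths and one value per depth")
    total = 0.0
    for k in range(len(depths_cm) - 1):
        thickness = depths_cm[k + 1] - depths_cm[k]
        if thickness <= 0:
            raise ValueError("depths must be strictly increasing")
        # (x_{k+1} - x_k) * (f(x_k) + f(x_{k+1})), as in Eq (1)
        total += thickness * (values[k] + values[k + 1])
    # Multiply by 1/2 and divide by the interval length (b - a)
    return 0.5 * total / (depths_cm[-1] - depths_cm[0])


# Worked example from the text: pH at 0, 5, 15 and 30 cm
ph_0_30 = depth_interval_mean([0, 5, 15, 30], [4.5, 5.0, 5.3, 5.0])
print(round(ph_0_30, 3))  # 5.083
```

Passing only the depths that bound the interval of interest (here 0 to 30 cm) keeps the function simple; averaging over 0–5 cm would use just the first two depths.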

Based on predictions of soil organic carbon content, bulk density, and coarse fragments, we also derived soil organic carbon stock (t ha⁻¹) for the six GlobalSoilMap standard depth intervals following the standard approach [9, 18]. Fig 2 shows an example of observed vs predicted values and the corresponding derived soil organic carbon stock for the 0–1 m and 1–2 m depths.
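As an illustration of how a stock value follows from the three predicted properties, the sketch below uses the commonly applied formulation in which the carbon concentration is multiplied by layer thickness, bulk density and the non-coarse volume fraction. The text only refers to "the standard approach [9, 18]", so the exact formula, the function name `soc_stock_t_ha` and the example input values are assumptions made for this example.

```python
def soc_stock_t_ha(orc_g_kg: float, bld_kg_m3: float,
                   crf_vol_pct: float, thickness_cm: float) -> float:
    """Soil organic carbon stock (t/ha) of one layer (assumed standard formula).

    orc_g_kg:     soil organic carbon content (g/kg, i.e. permille)
    bld_kg_m3:    bulk density of the fine earth fraction (kg/m^3)
    crf_vol_pct:  coarse fragments (volumetric %)
    thickness_cm: layer thickness (cm)
    """
    fine_earth_fraction = (100.0 - crf_vol_pct) / 100.0
    stock_kg_m2 = (
        (orc_g_kg / 1000.0)        # g/kg -> kg/kg
        * (thickness_cm / 100.0)   # cm -> m
        * bld_kg_m3
        * fine_earth_fraction
    )
    return stock_kg_m2 * 10.0      # 1 kg/m^2 = 10 t/ha


# Hypothetical layer: 50 g/kg carbon, 1500 kg/m^3, 10% coarse fragments, 30 cm thick
print(round(soc_stock_t_ha(50, 1500, 10, 30), 1))  # 202.5
```

In practice the inputs would be depth-interval averages of each property, so the stock for 0–30 cm would combine three averaged values rather than point predictions.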

Model fitting and spatial prediction of depth to bedrock also draw on water well drilling data and are explained in detail in Shangguan et al. [19].

Fig 1. Standard soil depths following the GlobalSoilMap.net specifications and example of numerical integration following the trapezoidal rule.

doi:10.1371/journal.pone.0169748.g001

We set the reference soil surface at the air/soil boundary, as per FAO [20], hence all soil material is included. Some national soil survey teams (and also earlier versions of the FAO standards) define 0 cm depth at the start of the mineral soil, i.e. just below the O or the P (peat) horizon. Consider for example the following sample soil profile from Canada [21]:

    hor  top  bottom  bd    orgcarb
    LFH  -12  0       0.07  48.1
    Ae   0    11      1.3   0.6
    AB   11   25      1.53  0.4
    Bt   25   44      1.62  0.4

which shows that the vertical coordinates of the organic layer of this soil site are negative (LFH indicates Litter-Fermentation-Humus; orgcarb is the soil organic carbon, bd is the bulk density, and top and bottom are the upper and lower horizon depths in cm). Therefore, to avoid vertical mismatches between different national systems, all systems that put the zero level at the start of the mineral soil have been adjusted to a reference with the zero level at the air/soil boundary. For the example soil profile from Canada, this means that 12 cm was added to all top and bottom values. Note that in this example there is a sharp discontinuity in organic carbon, which drops from 48.1% to 0.6% within 12 cm of depth.
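The adjustment to the air/soil reference amounts to shifting every horizon by the thickness of the material above the mineral soil. The following sketch applies it to the example profile; the `Horizon` type and the function `shift_to_air_soil_boundary` are hypothetical names introduced for illustration and do not come from the SoilGrids code base.

```python
from typing import List, NamedTuple


class Horizon(NamedTuple):
    hor: str        # horizon designation
    top: float      # upper horizon depth (cm)
    bottom: float   # lower horizon depth (cm)
    bd: float       # bulk density
    orgcarb: float  # soil organic carbon (%)


def shift_to_air_soil_boundary(profile: List[Horizon]) -> List[Horizon]:
    """Move the zero level to the air/soil boundary if any horizon starts above 0 cm."""
    offset = -min(h.top for h in profile)
    if offset <= 0:
        # Profile already referenced to the air/soil boundary
        return list(profile)
    return [h._replace(top=h.top + offset, bottom=h.bottom + offset) for h in profile]


# Example profile from Canada, as listed in the text
profile = [
    Horizon("LFH", -12, 0, 0.07, 48.1),
    Horizon("Ae", 0, 11, 1.3, 0.6),
    Horizon("AB", 11, 25, 1.53, 0.4),
    Horizon("Bt", 25, 44, 1.62, 0.4),
]

for h in shift_to_air_soil_boundary(profile):
    print(h.hor, h.top, h.bottom)
# LFH 0 12
# Ae 12 23
# AB 23 37
# Bt 37 56
```

Deriving the offset from the shallowest `top` value, rather than hard-coding 12 cm, lets the same function handle profiles with organic layers of any thickness.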

Fig 2. Example of soil variable-depth curves: Original sampled soil profiles (black rectangles) vs predicted SoilGrids values at seven standard depths (broken red line), and predicted soil organic carbon stock for depth intervals 0–100 and 100–200 cm. Locations of points from the USDA National Cooperative Soil Survey Soil Characterization database: mineral soil S1991CA055001 (122.37°W, 38.25°N), and an organic soil profile S2012CA067002 (121.62°W, 38.13°N).

doi:10.1371/journal.pone.0169748.g002

