
Showing papers in "Scientific Data in 2020"


Journal ArticleDOI
TL;DR: This paper describes the construction of a major new version of CRU TS (Climatic Research Unit gridded Time Series), updated to span 1901–2018 through the inclusion of additional station observations; it will be updated annually.
Abstract: CRU TS (Climatic Research Unit gridded Time Series) is a widely used climate dataset on a 0.5° latitude by 0.5° longitude grid over all land domains of the world except Antarctica. It is derived by the interpolation of monthly climate anomalies from extensive networks of weather station observations. Here we describe the construction of a major new version, CRU TS v4. It is updated to span 1901-2018 by the inclusion of additional station observations, and it will be updated annually. The interpolation process has been changed to use angular-distance weighting (ADW), and the production of secondary variables has been revised to better suit this approach. This implementation of ADW provides improved traceability between each gridded value and the input observations, and allows more informative diagnostics that dataset users can utilise to assess how dataset quality might vary geographically.

1,689 citations
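
To make the interpolation step concrete, the sketch below shows angular-distance weighting at a single grid point: inverse-exponential distance weights are boosted by an angular-isolation term, so a station alone on its side of the grid point counts more than one inside a cluster. This is a minimal Python illustration; the decay distance `cdd`, exponent `m`, and function name are placeholders, not the published CRU TS v4 configuration.

```python
import numpy as np

def adw_interpolate(grid_pt, stn_xy, stn_anom, cdd=450.0, m=4.0):
    """ADW mean of station anomalies at one grid point (needs >= 2 stations).

    Distance weights decay with a correlation decay distance (cdd, km);
    each weight is then scaled by 1 + a_k, where a_k measures how angularly
    isolated station k is from the other contributing stations.
    """
    d = np.linalg.norm(stn_xy - grid_pt, axis=1)        # station distances (km)
    w = np.exp(-d / cdd) ** m                           # distance-only weights
    theta = np.arctan2(stn_xy[:, 1] - grid_pt[1],
                       stn_xy[:, 0] - grid_pt[0])       # bearing of each station
    a = np.empty_like(w)
    for k in range(len(w)):
        others = np.arange(len(w)) != k
        a[k] = (w[others] * (1 - np.cos(theta[k] - theta[others]))).sum() \
               / w[others].sum()
    w *= 1 + a                                          # angular-isolation boost
    return (w * stn_anom).sum() / w.sum()

stations = np.array([[10.0, 2.0], [12.0, 3.0], [-8.0, -5.0]])   # km offsets
anomalies = np.array([0.4, 0.6, -0.1])                          # degC anomalies
print(adw_interpolate(np.zeros(2), stations, anomalies))
```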


Journal ArticleDOI
Gilberto Pastorello, Carlo Trotta, E. Canfora, Housen Chu, and 300 more authors (119 institutions)
TL;DR: The FLUXNET2015 dataset provides ecosystem-scale data on CO2, water, and energy exchange between the biosphere and the atmosphere, and other meteorological and biological measurements, from 212 sites around the globe, and is detailed in this paper.
Abstract: The FLUXNET2015 dataset provides ecosystem-scale data on CO2, water, and energy exchange between the biosphere and the atmosphere, and other meteorological and biological measurements, from 212 sites around the globe (over 1500 site-years, up to and including year 2014). These sites, independently managed and operated, voluntarily contributed their data to create global datasets. Data were quality controlled and processed using uniform methods, to improve consistency and intercomparability across sites. The dataset is already being used in a number of applications, including ecophysiology studies, remote sensing studies, and development of ecosystem and Earth system models. FLUXNET2015 includes derived-data products, such as gap-filled time series, ecosystem respiration and photosynthetic uptake estimates, estimation of uncertainties, and metadata about the measurements, presented for the first time in this paper. In addition, 206 of these sites are for the first time distributed under a Creative Commons (CC-BY 4.0) license. This paper details this enhanced dataset and the processing methods, now made available as open-source codes, making the dataset more accessible, transparent, and reproducible.

681 citations
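
As a usage sketch: FLUXNET2015 site files are CSVs in which missing values are flagged as -9999 and half-hourly timestamps are encoded as YYYYMMDDHHMM. The site file name and the NEE_VUT_REF variable below follow the dataset's naming conventions but should be treated as illustrative assumptions.

```python
import pandas as pd

# Hypothetical half-hourly FULLSET file for one site.
path = "FLX_IT-Ro1_FLUXNET2015_FULLSET_HH_2000-2014_1-3.csv"

df = pd.read_csv(path, na_values=[-9999])   # -9999 flags missing data
df["time"] = pd.to_datetime(df["TIMESTAMP_START"], format="%Y%m%d%H%M")
df = df.set_index("time")

# Daily mean net ecosystem exchange (variable-USTAR-threshold reference product).
nee_daily = df["NEE_VUT_REF"].resample("D").mean()
print(nee_daily.head())
```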


Journal ArticleDOI
TL;DR: The China Meteorological Forcing Dataset (CMFD) is the first high spatial-temporal resolution gridded near-surface meteorological dataset developed specifically for studies of land surface processes in China and is one of the most widely-used climate datasets for China.
Abstract: The China Meteorological Forcing Dataset (CMFD) is the first high spatial-temporal resolution gridded near-surface meteorological dataset developed specifically for studies of land surface processes in China. The dataset was made through fusion of remote sensing products, reanalysis datasets and in-situ station data. Its record begins in January 1979 and is ongoing (currently up to December 2018) with a temporal resolution of three hours and a spatial resolution of 0.1°. Seven near-surface meteorological elements are provided in the CMFD: 2-meter air temperature, surface pressure, specific humidity, 10-meter wind speed, downward shortwave radiation, downward longwave radiation and precipitation rate. Validations against observations measured at independent stations show that the CMFD is superior in quality to the GLDAS (Global Land Data Assimilation System); this is because a larger number of stations are used to generate the CMFD than are utilised in the GLDAS. Due to its continuous temporal coverage and consistent quality, the CMFD is one of the most widely-used climate datasets for China.

583 citations
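
A minimal sketch of consuming one CMFD element with xarray, e.g. aggregating the 3-hourly 2-meter air temperature to daily means; the file and variable names here are assumptions for illustration, not the dataset's documented layout.

```python
import xarray as xr

# Hypothetical NetCDF file holding the 3-hourly 2-meter air temperature.
ds = xr.open_dataset("cmfd_temp_197901-201812.nc")

t2m_daily = ds["temp"].resample(time="1D").mean() - 273.15  # 3-hourly K -> daily degC
beijing = t2m_daily.sel(lon=116.4, lat=39.9, method="nearest")
print(beijing.isel(time=slice(0, 5)).values)
```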


Journal ArticleDOI
TL;DR: This dataset of individual tree-core characteristics including ring-width series and whole-core wood density was collected for seven ecologically and economically important European tree species, covering most of the geographical and climatic range occupied by the selected species.
Abstract: The dataset presented here was collected by the GenTree project (EU-Horizon 2020), which aims to improve the use of forest genetic resources across Europe by better understanding how trees adapt to their local environment. This dataset of individual tree-core characteristics including ring-width series and whole-core wood density was collected for seven ecologically and economically important European tree species: silver birch (Betula pendula), European beech (Fagus sylvatica), Norway spruce (Picea abies), European black poplar (Populus nigra), maritime pine (Pinus pinaster), Scots pine (Pinus sylvestris), and sessile oak (Quercus petraea). Tree-ring width measurements were obtained from 3600 trees in 142 populations and whole-core wood density was measured for 3098 trees in 125 populations. This dataset covers most of the geographical and climatic range occupied by the selected species. It will be highly valuable for assessing ecological and evolutionary responses to environmental conditions, as well as for model development and parameterization to predict adaptability under climate change scenarios.

467 citations


Journal ArticleDOI
TL;DR: This study constructs the most up-to-date CO2 emission inventories for China and its 30 provinces, as well as their energy inventories, for the years 2016 and 2017 and provides key updates and supplements to the previous emission dataset for 1997–2015.
Abstract: Despite China's emissions having plateaued in 2013, it is still the world's leading energy consumer and CO2 emitter, accounting for approximately 30% of global emissions. Detailed CO2 emission inventories by energy and sector have great significance to China's carbon policies as well as to achieving global climate change mitigation targets. This study constructs the most up-to-date CO2 emission inventories for China and its 30 provinces, as well as their energy inventories, for the years 2016 and 2017. The newly compiled inventories provide key updates and supplements to our previous emission dataset for 1997-2015. Emissions are calculated based on the IPCC (Intergovernmental Panel on Climate Change) administrative territorial scope, which covers all anthropogenic emissions generated within an administrative boundary due to energy consumption (i.e. energy-related emissions from 17 fossil fuel types) and industrial production (i.e. process-related emissions from cement production). The inventories are constructed for 47 economic sectors consistent with the national economic accounting system. The data can be used as inputs to climate and integrated assessment models and for analysis of emission patterns of China and its regions.

397 citations
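
The sectoral accounting described above reduces, per fuel, to activity data multiplied by an emission factor (net calorific value x carbon content x oxidation rate), with carbon converted to CO2 by the ratio 44/12. A worked toy example follows; all numbers are illustrative placeholders, not the paper's factors.

```python
# activity (kt), net calorific value (TJ/kt), carbon content (tC/TJ), oxidation rate
fuels = {
    "raw coal":   (1000.0, 20.9, 26.3, 0.94),
    "diesel oil": ( 200.0, 42.6, 20.2, 0.98),
}

total_co2_t = 0.0
for fuel, (ad, ncv, cc, ox) in fuels.items():
    carbon_t = ad * ncv * cc * ox        # tonnes of carbon emitted
    co2_t = carbon_t * 44.0 / 12.0       # convert C to CO2
    total_co2_t += co2_t
    print(f"{fuel}: {co2_t / 1e3:,.1f} kt CO2")

print(f"total: {total_co2_t / 1e3:,.1f} kt CO2")
```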


Journal ArticleDOI
TL;DR: This new database brings together official data on the extent of PCR testing over time for 94 countries and aims to facilitate the incorporation of this crucial information into epidemiological studies, as well as track a key component of countries’ responses to COVID-19.
Abstract: Our understanding of the evolution of the COVID-19 pandemic is built upon data concerning confirmed cases and deaths. This data, however, can only be meaningfully interpreted alongside an accurate understanding of the extent of virus testing in different countries. This new database brings together official data on the extent of PCR testing over time for 94 countries. We provide a time series for the daily number of tests performed, or people tested, together with metadata describing data quality and comparability issues needed for the interpretation of the time series. The database is updated regularly through a combination of automated scraping and manual collection and verification, and is entirely replicable, with sources provided for each observation. In providing accessible cross-country data on testing output, it aims to facilitate the incorporation of this crucial information into epidemiological studies, as well as track a key component of countries' responses to COVID-19.

359 citations
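
As a usage sketch, a testing series like this is typically smoothed and combined with case counts to estimate short-run positivity; the file and column names below are hypothetical.

```python
import pandas as pd

df = pd.read_csv("covid-testing.csv", parse_dates=["date"]).set_index("date")

tests_7d = df["new_tests"].rolling(7, min_periods=1).mean()   # 7-day mean tests
cases_7d = df["new_cases"].rolling(7, min_periods=1).mean()   # 7-day mean cases
positive_rate = cases_7d / tests_7d                           # smoothed positivity
print(positive_rate.tail())
```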


Journal ArticleDOI
TL;DR: To aid the analysis and tracking of the COVID-19 epidemic, individual-level data from national, provincial, and municipal health reports, as well as additional information from online reports are collected and curated.
Abstract: Cases of a novel coronavirus were first reported in Wuhan, Hubei province, China, in December 2019 and have since spread across the world. Epidemiological studies have indicated human-to-human transmission in China and elsewhere. To aid the analysis and tracking of the COVID-19 epidemic, we collected and curated individual-level data from national, provincial, and municipal health reports, as well as additional information from online reports. All data are geo-coded and, where available, include symptoms, key dates (date of onset, admission, and confirmation), and travel history. Detailed, real-time, and robust data for emerging disease outbreaks are important and can help to generate the evidence needed to support and inform public health decision making.

349 citations
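
One immediately useful quantity derivable from such a line list is the delay between symptom onset and case confirmation; a sketch with hypothetical file and column names:

```python
import pandas as pd

ll = pd.read_csv("linelist.csv", parse_dates=["date_onset", "date_confirmation"])

# Onset-to-confirmation delay per case, dropping records missing either date.
delay = (ll["date_confirmation"] - ll["date_onset"]).dt.days.dropna()
print(delay.describe())   # the median delay is a key input to epidemic nowcasting
```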


Journal ArticleDOI
TL;DR: PTB-XL is put forward, the largest freely accessible clinical 12-lead ECG-waveform dataset to date, comprising 21,837 ten-second records from 18,885 patients; its annotations and metadata turn the dataset into a rich resource for the development and evaluation of automatic ECG interpretation algorithms.
Abstract: Electrocardiography (ECG) is a key non-invasive diagnostic tool for cardiovascular diseases which is increasingly supported by algorithms based on machine learning. Major obstacles for the development of automatic ECG interpretation algorithms are the lack of both public datasets and well-defined benchmarking procedures to allow comparisons of different algorithms. To address these issues, we put forward PTB-XL, the largest freely accessible clinical 12-lead ECG-waveform dataset to date, comprising 21,837 ten-second records from 18,885 patients. The ECG-waveform data was annotated by up to two cardiologists as a multi-label dataset, where diagnostic labels were further aggregated into super- and subclasses. The dataset covers a broad range of diagnostic classes including, in particular, a large fraction of healthy records. The combination with additional metadata on demographics, additional diagnostic statements, diagnosis likelihoods, manually annotated signal properties, as well as suggested folds for splitting training and test sets, turns the dataset into a rich resource for the development and the evaluation of automatic ECG interpretation algorithms. Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.12098055

322 citations
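
PTB-XL records are stored in WFDB format next to a per-record metadata table; a minimal loading sketch with the `wfdb` package follows. The metadata file and column names reflect the dataset description but should be treated as assumptions here.

```python
import pandas as pd
import wfdb  # pip install wfdb

meta = pd.read_csv("ptbxl_database.csv", index_col="ecg_id")  # per-record metadata
rec = meta.iloc[0]

# One 10 s, 12-lead record at 500 Hz -> signal array of shape (5000, 12).
signal, fields = wfdb.rdsamp(rec["filename_hr"])
print(fields["sig_name"], signal.shape)
```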


Journal ArticleDOI
TL;DR: This work creates a harmonized emission temporal distribution to be applied to any emission database as input for atmospheric models, thus promoting homogeneity in inter-comparison exercises.
Abstract: Emissions into the atmosphere from human activities show marked temporal variations, from inter-annual to hourly levels. The consolidated practice of calculating yearly emissions follows the same temporal allocation as the underlying annual statistics. However, yearly emissions might not reflect heavy pollution episodes, seasonal trends, or any time-dependent atmospheric process. This study develops high-time-resolution profiles for air pollutants and greenhouse gases co-emitted by anthropogenic sources in support of atmospheric modelling, Earth observation communities and decision makers. The key novelties of the Emissions Database for Global Atmospheric Research (EDGAR) temporal profiles are the development of (i) country/region- and sector-specific yearly profiles for all sources, (ii) time-dependent yearly profiles for sources with inter-annual variability in their seasonal pattern, (iii) country-specific weekly and daily profiles to represent hourly emissions, and (iv) a flexible system to compute hourly emissions including input from different users. This work creates a harmonized emission temporal distribution to be applied to any emission database as input for atmospheric models, thus promoting homogeneity in inter-comparison exercises. Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.12052887

287 citations
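
The profile system boils down to multiplying an annual total by nested, mean-one factors; a toy downscaling to a single hour (all factor values invented for illustration):

```python
E_year = 1.0e6      # annual emission for one country-sector (t/yr)
f_month = 1.3       # January factor (12 monthly factors averaging to 1)
f_weekday = 1.05    # Monday factor (7 daily factors averaging to 1)
f_hour = 1.8        # 08:00 factor (24 hourly factors averaging to 1)

E_hour = E_year / 8760.0 * f_month * f_weekday * f_hour
print(f"{E_hour:.1f} t emitted in that hour")
```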


Journal ArticleDOI
TL;DR: A colocalization analysis is applied to identify genes underlying the GWAS association peaks for schizophrenia, identifying a potentially novel gene colocalization with the lncRNA RP11-677M14.2.
Abstract: The availability of high-quality RNA-sequencing and genotyping data of post-mortem brain collections from consortia such as CommonMind Consortium (CMC) and the Accelerating Medicines Partnership for Alzheimer’s Disease (AMP-AD) Consortium enable the generation of a large-scale brain cis-eQTL meta-analysis. Here we generate cerebral cortical eQTL from 1433 samples available from four cohorts (identifying >4.1 million significant eQTL for >18,000 genes), as well as cerebellar eQTL from 261 samples (identifying 874,836 significant eQTL for >10,000 genes). We find substantially improved power in the meta-analysis over individual cohort analyses, particularly in comparison to the Genotype-Tissue Expression (GTEx) Project eQTL. Additionally, we observed differences in eQTL patterns between cerebral and cerebellar brain regions. We provide these brain eQTL as a resource for use by the research community. As a proof of principle for their utility, we apply a colocalization analysis to identify genes underlying the GWAS association peaks for schizophrenia and identify a potentially novel gene colocalization with lncRNA RP11-677M14.2 (posterior probability of colocalization 0.975).

279 citations
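
For readers unfamiliar with the colocalization posterior quoted above (0.975), the sketch below enumerates the five coloc-style hypotheses from per-SNP Bayes factors; the Bayes factors and priors are toy values, and the real analysis derives approximate Bayes factors from GWAS/eQTL summary statistics.

```python
import numpy as np

bf1 = np.array([1.0, 50.0, 2.0])   # per-SNP Bayes factors, trait 1 (GWAS)
bf2 = np.array([1.2, 40.0, 1.5])   # per-SNP Bayes factors, trait 2 (eQTL)
p1, p2, p12 = 1e-4, 1e-4, 1e-5     # priors: trait-1-only, trait-2-only, shared SNP

s0 = 1.0                                                     # H0: no association
s1 = p1 * bf1.sum()                                          # H1: trait 1 only
s2 = p2 * bf2.sum()                                          # H2: trait 2 only
s3 = p1 * p2 * (bf1.sum() * bf2.sum() - (bf1 * bf2).sum())   # H3: distinct SNPs
s4 = p12 * (bf1 * bf2).sum()                                 # H4: one shared SNP

pp4 = s4 / (s0 + s1 + s2 + s3 + s4)
print(f"PP4 (colocalization) = {pp4:.3f}")
```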


Journal ArticleDOI
TL;DR: A particle swarm optimization-back propagation algorithm was employed to unify the scale of DMSP/OLS and NPP/VIIRS satellite imagery and estimate the CO2 emissions in 2,735 Chinese counties during 1997–2017, and the county-level carbon sequestration value of terrestrial vegetation was calculated.
Abstract: With the implementation of China's top-down CO2 emissions reduction strategy, regional differences should be considered. As the most basic governmental unit in China, counties can better capture regional heterogeneity than provinces and prefecture-level cities, and county-level CO2 emissions can be used for the development of strategic policies tailored to local conditions. However, most previous accounts of CO2 emissions in China have focused only on the national, provincial, or city levels, owing to limited methods and smaller-scale data. In this study, a particle swarm optimization-back propagation (PSO-BP) algorithm was employed to unify the scale of DMSP/OLS and NPP/VIIRS satellite imagery and estimate the CO2 emissions of 2,735 Chinese counties during 1997-2017. Moreover, as vegetation has a significant ability to sequester carbon and reduce CO2 emissions, we calculated the county-level carbon sequestration value of terrestrial vegetation. The results presented here can help fill existing data gaps and enable the development of strategies to reduce CO2 emissions in China.

Journal ArticleDOI
TL;DR: Daily time-series of three different aggregated mobility metrics: the origin-destination movements between Italian provinces, the radius of gyration, and the average degree of a spatial proximity network are presented to monitor the impact of the lockdown on the epidemic trajectory and inform future public health decision making.
Abstract: Italy has been severely affected by the COVID-19 pandemic, reporting the highest death toll in Europe as of April 2020. Following the identification of the first infections, on February 21, 2020, national authorities have put in place an increasing number of restrictions aimed at containing the outbreak and delaying the epidemic peak. On March 12, the government imposed a national lockdown. To aid the evaluation of the impact of interventions, we present daily time-series of three different aggregated mobility metrics: the origin-destination movements between Italian provinces, the radius of gyration, and the average degree of a spatial proximity network. All metrics were computed by processing a large-scale dataset of anonymously shared positions of about 170,000 de-identified smartphone users before and during the outbreak, at the sub-national scale. This dataset can help to monitor the impact of the lockdown on the epidemic trajectory and inform future public health decision making.
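
Of the three metrics, the radius of gyration has the most compact definition: the root-mean-square distance of a user's recorded positions from their centre of mass. A minimal sketch (planar approximation with invented positions; a real pipeline would work on geodesic coordinates):

```python
import numpy as np

def radius_of_gyration(xy):
    """r_g = sqrt(mean ||r_i - r_cm||^2) over one user's positions (km)."""
    cm = xy.mean(axis=0)                                  # centre of mass
    return np.sqrt(np.mean(np.sum((xy - cm) ** 2, axis=1)))

positions = np.array([[0.0, 0.0], [1.0, 0.5], [0.2, 0.1]])  # one user's day (km)
print(f"{radius_of_gyration(positions):.2f} km")
```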

Journal ArticleDOI
TL;DR: The HyperKvasir dataset is presented, the largest image and video dataset of the gastrointestinal tract available today and can play a valuable role in developing better algorithms and computer-assisted examination systems not only for gastro- and colonoscopy, but also for other fields in medicine.
Abstract: Artificial intelligence is currently a hot topic in medicine. However, medical data is often sparse and hard to obtain due to legal restrictions and the lack of medical personnel for the cumbersome and tedious process of manually labelling training data. These constraints make it difficult to develop systems for automatic analysis, like detecting disease or other lesions. In this respect, this article presents HyperKvasir, the largest image and video dataset of the gastrointestinal tract available today. The data was collected during real gastro- and colonoscopy examinations at Baerum Hospital in Norway and partly labeled by experienced gastrointestinal endoscopists. The dataset contains 110,079 images and 374 videos, and represents anatomical landmarks as well as pathological and normal findings. The total number of images and video frames together is around 1 million. Initial experiments demonstrate the potential benefits of artificial intelligence-based computer-assisted diagnosis systems. The HyperKvasir dataset can play a valuable role in developing better algorithms and computer-assisted examination systems not only for gastro- and colonoscopy, but also for other fields in medicine.
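
Because the labelled images are organised by finding class, a typical starting point for the computer-assisted diagnosis experiments mentioned above is an off-the-shelf image-classification loader; the directory layout below is an assumption.

```python
import torchvision
from torchvision import transforms
from torch.utils.data import DataLoader

# Assumed layout: hyper-kvasir/labeled-images/<finding-class>/<image>.jpg
tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
ds = torchvision.datasets.ImageFolder("hyper-kvasir/labeled-images", transform=tfm)
loader = DataLoader(ds, batch_size=32, shuffle=True)
print(len(ds), ds.classes[:5])   # image count and first few finding classes
```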

Journal ArticleDOI
TL;DR: The generated global DMSP NTL time-series data (1992–2018) show consistent temporal trends and provide valuable support for various studies related to human activities such as electricity consumption and urban extent dynamics.
Abstract: Nighttime light (NTL) data from the Defense Meteorological Satellite Program (DMSP)/Operational Linescan System (OLS) and the Visible Infrared Imaging Radiometer Suite (VIIRS) on the Suomi National Polar-orbiting Partnership satellite provide a great opportunity for monitoring human activities from regional to global scales. Despite the valuable records of nightscape from DMSP (1992-2013) and VIIRS (2012-2018), the potential of the historical archive of NTL observations has not been fully explored because of the severe inconsistency between DMSP and VIIRS. In this study, we generated an integrated and consistent NTL dataset at the global scale by harmonizing the inter-calibrated NTL observations from the DMSP data and the simulated DMSP-like NTL observations from the VIIRS data. The generated global DMSP NTL time-series data (1992-2018) show consistent temporal trends. This temporally extended DMSP NTL dataset provides valuable support for various studies related to human activities such as electricity consumption and urban extent dynamics.
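
At its core, the harmonization maps VIIRS-derived composites onto the DMSP scale using the years when both sensors flew; a toy linear inter-calibration is sketched below (the paper's procedure is more sophisticated), using the fact that DMSP digital numbers saturate at 63.

```python
import numpy as np

# Illustrative mean DN values for matched pixels in the 2012-2013 overlap.
dmsp_overlap = np.array([5.0, 20.0, 40.0, 55.0, 63.0])  # observed DMSP DN
viirs_like = np.array([4.1, 18.5, 43.0, 60.0, 70.0])    # simulated DMSP-like DN

a, b = np.polyfit(viirs_like, dmsp_overlap, 1)          # fit linear calibration
harmonized = np.clip(a * viirs_like + b, 0, 63)         # DMSP DN range is 0-63
print(harmonized.round(1))
```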

Journal ArticleDOI
TL;DR: A technical evaluation of the bias-correction method using a 'perfect sibling' framework shows that it reduces climate model bias by 50–70%.
Abstract: Projections of climate change are available at coarse scales (70–400 km). But agricultural and species models typically require finer-scale climate data to model climate change impacts. Here, we present a global database of future climates developed by applying the delta method, a method for climate model bias correction. We performed a technical evaluation of the bias-correction method using a 'perfect sibling' framework and show that it reduces climate model bias by 50–70%. The data include monthly maximum and minimum temperatures, monthly total precipitation, and a set of bioclimatic indices, and can be used for assessing impacts of climate change on agriculture and biodiversity. The data are publicly available in the World Data Center for Climate (WDCC; cera-www.dkrz.de), as well as in the CCAFS-Climate data portal (http://ccafs-climate.org). The database has been used to date in more than 350 studies of ecosystem and agricultural impact assessment. Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.11353664
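
The delta method itself is one line of arithmetic: the GCM's simulated change is added to an observed high-resolution baseline (additively for temperature; precipitation is usually scaled multiplicatively). A minimal sketch with invented numbers:

```python
import numpy as np

obs_tmax = np.array([22.1, 23.4])    # observed baseline Tmax climatology (degC)
gcm_hist = np.array([20.0, 21.0])    # coarse GCM, historical climatology (degC)
gcm_fut = np.array([23.5, 24.2])     # coarse GCM, future climatology (degC)

tmax_future = obs_tmax + (gcm_fut - gcm_hist)   # additive delta for temperature

pr_obs, pr_hist, pr_fut = 80.0, 60.0, 48.0      # monthly precipitation (mm)
pr_future = pr_obs * pr_fut / pr_hist           # multiplicative delta for precip
print(tmax_future, pr_future)
```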

Journal ArticleDOI
TL;DR: This paper presents the development of the global database through systematic digitisation of satellite imagery by a small team and highlights the various approaches to bias estimation and to validation of the data.
Abstract: By presenting the most comprehensive GlObal geOreferenced Database of Dams (GOODD) to date, containing more than 38,000 dams as well as their associated catchments, we enable new and improved global analyses of the impact of dams on society and environment and the impact of environmental change (for example land use and climate change) on the catchments of dams. This paper presents the development of the global database through systematic digitisation of satellite imagery by a small team and highlights the various approaches to bias estimation and to validation of the data. The following datasets are provided: (a) raw digitised coordinates for the location of dam walls (which may be useful, for example, in machine learning approaches to dam identification from imagery), and (b) a global vector file of the watershed for each dam. Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.10538486

Journal ArticleDOI
TL;DR: Five different statistical methods were applied to reconstruct the GMST of the past 12,000 years (Holocene) and the results were aggregated to generate a multi-method ensemble of plausible GMST and latitudinal-zone temperature reconstructions with a realistic range of uncertainties.
Abstract: An extensive new multi-proxy database of paleo-temperature time series (Temperature 12k) enables a more robust analysis of global mean surface temperature (GMST) and associated uncertainties than was previously available. We applied five different statistical methods to reconstruct the GMST of the past 12,000 years (Holocene). Each method used different approaches to averaging the globally distributed time series and to characterizing various sources of uncertainty, including proxy temperature, chronology and methodological choices. The results were aggregated to generate a multi-method ensemble of plausible GMST and latitudinal-zone temperature reconstructions with a realistic range of uncertainties. The warmest 200-year-long interval took place around 6500 years ago, when GMST was 0.7 °C (0.3, 1.8) warmer than the 19th Century (median, 5th, 95th percentiles). Following the Holocene global thermal maximum, GMST cooled at an average rate of −0.08 °C per 1000 years (−0.24, −0.05). The multi-method ensembles and the code used to generate them highlight the utility of the Temperature 12k database, and they are now available for future use by studies aimed at understanding Holocene evolution of the Earth system.
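
The quoted summary statistics (median with 5th/95th percentiles) come from pooling ensemble members across the five methods; a toy sketch of that aggregation with random placeholder data:

```python
import numpy as np

rng = np.random.default_rng(0)
# methods x ensemble members x time steps (placeholder reconstruction anomalies)
ensemble = rng.normal(0.7, 0.3, size=(5, 500, 120))

pooled = ensemble.reshape(-1, ensemble.shape[-1])     # pool members across methods
median = np.median(pooled, axis=0)
p5, p95 = np.percentile(pooled, [5, 95], axis=0)      # 90% uncertainty band
```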

Journal ArticleDOI
TL;DR: Materials Cloud is a platform designed to enable open and seamless sharing of resources for computational science, driven by applications in materials modelling. It hosts archival and dissemination services for raw and curated data, together with their provenance graph, modelling services and virtual machines, tools for data analytics, and educational materials.
Abstract: Materials Cloud is a platform designed to enable open and seamless sharing of resources for computational science, driven by applications in materials modelling. It hosts (1) archival and dissemination services for raw and curated data, together with their provenance graph, (2) modelling services and virtual machines, (3) tools for data analytics, and pre-/post-processing, and (4) educational materials. Data is citable and archived persistently, providing a comprehensive embodiment of entire simulation pipelines (calculations performed, codes used, data generated) in the form of graphs that allow retracing and reproducing any computed result. When an AiiDA database is shared on Materials Cloud, peers can browse the interconnected record of simulations, download individual files or the full database, and start their research from the results of the original authors. The infrastructure is agnostic to the specific simulation codes used and can support diverse applications in computational science that transcend its initial materials domain.

Journal ArticleDOI
TL;DR: The World Settlement Footprint 2015 (WSF2015) is a 10 m resolution global dataset of human settlements on Earth for the year 2015, generated by means of an advanced classification system which jointly exploits open-and-free optical and radar satellite imagery.
Abstract: Human settlements are the cause and consequence of most environmental and societal changes on Earth; however, their location and extent is still under debate. We provide here a new 10 m resolution (0.32 arc sec) global map of human settlements on Earth for the year 2015, namely the World Settlement Footprint 2015 (WSF2015). The raster dataset has been generated by means of an advanced classification system which, for the first time, jointly exploits open-and-free optical and radar satellite imagery. The WSF2015 has been validated against 900,000 samples labelled by crowdsourced photointerpretation of very high resolution Google Earth imagery and outperforms all other similar existing layers; in particular, it considerably improves the detection of very small settlements in rural regions and better outlines scattered suburban areas. The dataset can be used at any scale of observation in support of all applications requiring detailed and accurate information on human presence (e.g., socioeconomic development, population distribution, risk assessment, etc.). Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.12424970

Journal ArticleDOI
TL;DR: This newly inaugurated research database for 12-lead electrocardiogram signals was created under the auspices of Chapman University and Shaoxing People’s Hospital and aims to enable the scientific community to conduct new studies on arrhythmia and other cardiovascular conditions.
Abstract: This newly inaugurated research database for 12-lead electrocardiogram signals was created under the auspices of Chapman University and Shaoxing People’s Hospital (Shaoxing Hospital Zhejiang University School of Medicine) and aims to enable the scientific community to conduct new studies on arrhythmia and other cardiovascular conditions. Certain types of arrhythmias, such as atrial fibrillation, have a pronounced negative impact on public health, quality of life, and medical expenditures. As a non-invasive test, long-term ECG monitoring is a major and vital diagnostic tool for detecting these conditions. This practice, however, generates large amounts of data, the analysis of which requires considerable time and effort by human experts. Modern machine learning and statistical tools can be trained on large, high-quality data to achieve exceptional levels of automated diagnostic accuracy. Thus, we collected and disseminated this novel database that contains 12-lead ECGs of 10,646 patients with a 500 Hz sampling rate that features 11 common rhythms and 67 additional cardiovascular conditions, all labeled by professional experts. The dataset consists of 10-second, 12-dimension ECGs and labels for rhythms and other conditions for each subject. The dataset can be used to design, compare, and fine-tune new and classical statistical and machine learning techniques in studies focused on arrhythmia and other cardiovascular conditions. Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.11698521

Journal ArticleDOI
Darrell S. Kaufman1, Nicholas P. McKay1, Cody C. Routson1, M. P. Erb1, Basil A. S. Davis2, Oliver Heiri3, Samuel L Jaccard4, Jessica E. Tierney5, Christoph Dätwyler6, Yarrow Axford7, Thomas Brussel8, Olivier Cartapanis4, Brian M. Chase9, Andria Dawson10, Anne de Vernal11, Stefan Engels12, Lukas Jonkers13, Jeremiah Marsicek14, Paola Moffa-Sanchez15, Carrie Morrill16, Anais Orsi17, Kira Rehfeld18, Krystyna M. Saunders19, Philipp Sommer2, Elizabeth K. Thomas20, Marcela Sandra Tonello21, Mónika Tóth, Richard S. Vachula22, Andrei Andreev23, Sebastien Bertrand24, Boris K. Biskaborn23, Manuel Bringué25, Stephen J. Brooks26, Magaly Caniupán27, Manuel Chevalier2, Les C. Cwynar28, Julien Emile-Geay29, John M. Fegyveresi1, Angelica Feurdean30, Walter Finsinger9, Marie Claude Fortin31, Louise C. Foster32, Louise C. Foster33, Mathew Fox5, Konrad Gajewski31, Martin Grosjean6, Sonja Hausmann, Markus Heinrichs34, Naomi Holmes35, Boris P. Ilyashuk36, Elena A. Ilyashuk36, Steve Juggins32, Deborah Khider29, Karin A. Koinig36, Peter G. Langdon37, Isabelle Larocque-Tobler, Jianyong Li38, André F. Lotter4, Tomi P. Luoto39, Anson W. Mackay40, Enikö Magyari41, Steven B. Malevich5, Bryan G. Mark42, Julieta Massaferro43, Vincent Montade9, Larisa Nazarova44, Elena Novenko45, Petr Pařil46, Emma J. Pearson32, Matthew Peros47, Reinhard Pienitz48, Mateusz Płóciennik49, David F. Porinchu50, Aaron P. Potito51, Andrew P. Rees52, Scott A. Reinemann53, Stephen J. Roberts33, Nicolas Rolland54, Sakari Salonen39, Angela Self55, Heikki Seppä39, Shyhrete Shala56, Jeannine Marie St-Jacques57, Barbara Stenni58, Liudmila Syrykh59, Pol Tarrats60, Karen J. Taylor61, Karen J. Taylor51, Valerie van den Bos52, Gaute Velle, Eugene R. Wahl62, Ian R. Walker63, Janet M. Wilmshurst64, Enlou Zhang65, Snezhana Zhilich66 
Northern Arizona University1, University of Lausanne2, University of Basel3, University of Bern4, University of Arizona5, Oeschger Centre for Climate Change Research6, Northwestern University7, University of Utah8, Centre national de la recherche scientifique9, Mount Royal University10, Université du Québec à Montréal11, Birkbeck, University of London12, University of Bremen13, University of Wisconsin-Madison14, Durham University15, Cooperative Institute for Research in Environmental Sciences16, Université Paris-Saclay17, Heidelberg University18, Australian Nuclear Science and Technology Organisation19, University at Buffalo20, National University of Mar del Plata21, Brown University22, Alfred Wegener Institute for Polar and Marine Research23, Ghent University24, Geological Survey of Canada25, American Museum of Natural History26, University of Concepción27, University of New Brunswick28, University of Southern California29, Goethe University Frankfurt30, University of Ottawa31, Newcastle University32, British Antarctic Survey33, Okanagan College34, Sheffield Hallam University35, University of Innsbruck36, University of Southampton37, Northwest University (China)38, University of Helsinki39, University College London40, Eötvös Loránd University41, Ohio State University42, National Scientific and Technical Research Council43, University of Potsdam44, Moscow State University45, Masaryk University46, Bishop's University47, Laval University48, University of Łódź49, University of Georgia50, National University of Ireland, Galway51, Victoria University of Wellington52, Sinclair Community College53, Fisheries and Oceans Canada54, Natural History Museum55, Stockholm University56, Concordia University Wisconsin57, Ca' Foscari University of Venice58, Pedagogical University59, University of Barcelona60, University College Cork61, National Oceanic and Atmospheric Administration62, University of British Columbia63, Landcare Research64, Chinese Academy of Sciences65, Russian Academy of Sciences66
TL;DR: A global compilation of quality-controlled, published, temperature-sensitive proxy records extending back 12,000 years through the Holocene, which can be used to reconstruct the spatiotemporal evolution of Holocene temperature at global to regional scales, is presented.
Abstract: A comprehensive database of paleoclimate records is needed to place recent warming into the longer-term context of natural climate variability. We present a global compilation of quality-controlled, published, temperature-sensitive proxy records extending back 12,000 years through the Holocene. Data were compiled from 679 sites where time series cover at least 4000 years, are resolved at sub-millennial scale (median spacing of 400 years or finer) and have at least one age control point every 3000 years, with cut-off values slackened in data-sparse regions. The data derive from lake sediment (51%), marine sediment (31%), peat (11%), glacier ice (3%), and other natural archives. The database contains 1319 records, including 157 from the Southern Hemisphere. The multi-proxy database comprises paleotemperature time series based on ecological assemblages, as well as biophysical and geochemical indicators that reflect mean annual or seasonal temperatures, as encoded in the database. This database can be used to reconstruct the spatiotemporal evolution of Holocene temperature at global to regional scales, and is publicly available in Linked Paleo Data (LiPD) format.

Journal ArticleDOI
TL;DR: This study estimates China's provincial population from 2010 to 2100 by age, sex, and educational level, taking into account the fertility-promoting policies and megacity population ceilings implemented in China in recent years to reduce systematic biases in current studies.
Abstract: In response to a growing demand for subnational and spatially explicit data on China's future population, this study estimates China's provincial population from 2010 to 2100 by age (0-100+), sex (male and female) and educational level (illiterate, primary school, junior-high school, senior-high school, college, bachelor's, and master's and above) under different shared socioeconomic pathways (SSPs). The provincial projection takes into account the fertility-promoting policies and population ceiling restrictions of megacities implemented in China in recent years, reducing systematic biases present in current studies. The predicted provincial population is allocated to spatially explicit population grids for each year at 30 arc-seconds resolution based on representative concentration pathway (RCP) urban grids and historical population grids. The provincial projection data were validated using population data in 2017 from China's Provincial Statistical Yearbook, and the accuracy of the population grids in 2015 was evaluated. These data have numerous potential uses and can serve as inputs in climate policy research with requirements for precise administrative or spatial population data in China.

Journal ArticleDOI
TL;DR: AiiDA is an open-source high-throughput infrastructure addressing the challenges arising from the needs of automated workflow management and data provenance recording, supporting throughputs of tens of thousands of processes per hour.
Abstract: The ever-growing availability of computing power and the sustained development of advanced computational methods have contributed much to recent scientific progress. These developments present new challenges driven by the sheer amount of calculations and data to manage. Next-generation exascale supercomputers will intensify these challenges, such that automated and scalable solutions become crucial. In recent years, we have been developing AiiDA (aiida.net), a robust open-source high-throughput infrastructure addressing the challenges arising from the needs of automated workflow management and data provenance recording. Here, we introduce developments and capabilities required to reach sustained performance, with AiiDA supporting throughputs of tens of thousands of processes per hour, while automatically preserving and storing the full data provenance in a relational database, making it queryable and traversable, thus enabling high-performance data analytics. AiiDA's workflow language provides advanced automation, error handling features and a flexible plugin model to allow interfacing with external simulation software. The associated plugin registry enables seamless sharing of extensions, empowering a vibrant user community dedicated to making simulations more robust, user-friendly and reproducible.
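
A minimal flavour of AiiDA's workflow language: a `calcfunction` whose inputs, outputs and execution are recorded automatically in the provenance graph. This toy example is illustrative and assumes an installed, configured AiiDA profile; it is not code from the paper.

```python
from aiida import load_profile
from aiida.engine import calcfunction
from aiida.orm import Int

load_profile()  # connect to a configured AiiDA profile

@calcfunction
def add(x, y):
    # The call, its inputs, and its output node all become provenance-graph nodes.
    return Int(x.value + y.value)

result = add(Int(2), Int(3))
print(result)   # Int(5); its full provenance is queryable from the database
```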

Journal ArticleDOI
TL;DR: A specific hierarchical coding scheme for NPIs is developed and a comprehensive structured dataset of government interventions and their respective timelines of implementation is generated; information sources are shared via an open library to improve transparency and to motivate a collaborative validation process.
Abstract: In response to the COVID-19 pandemic, governments have implemented a wide range of non-pharmaceutical interventions (NPIs). Monitoring and documenting government strategies during the COVID-19 crisis is crucial to understand the progression of the epidemic. Following a content analysis strategy of existing public information sources, we developed a specific hierarchical coding scheme for NPIs. We generated a comprehensive structured dataset of government interventions and their respective timelines of implementation. To improve transparency and to motivate a collaborative validation process, information sources are shared via an open library. We also provide codes that enable users to visualise the dataset. The standardization and structure of the dataset facilitate inter-country comparison and the assessment of the impacts of different NPI categories on the epidemic parameters, population health indicators, the economy, and human rights, among others. This dataset provides an in-depth insight into government strategies and can be a valuable tool for developing relevant preparedness plans for future pandemics. We intend to further develop and update this dataset until the end of December 2020.

Journal ArticleDOI
TL;DR: The scRNA-seq data of 23,366 high-quality cells from the kidneys of three human donors provide a reliable reference for studies on renal cell biology and kidney disease.
Abstract: A comprehensive cellular anatomy of the normal human kidney is crucial to address the cellular origins of renal disease and renal cancer. Some kidney diseases may be cell type-specific, particularly those affecting renal tubular cells. To investigate the classification and transcriptomic information of the human kidney, we rapidly obtained a single-cell suspension of the kidney and conducted single-cell RNA sequencing (scRNA-seq). Here, we present the scRNA-seq data of 23,366 high-quality cells from the kidneys of three human donors. In this dataset, we show 10 clusters of normal human renal cells. Due to the high quality of single-cell transcriptomic information, proximal tubule (PT) cells were classified into three subtypes and collecting duct cells into two subtypes. Collectively, our data provide a reliable reference for studies on renal cell biology and kidney disease.
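
A standard single-cell clustering pipeline (here with scanpy) reproduces the kind of analysis described; the input path is hypothetical and the parameters are common defaults, not the authors' exact settings.

```python
import scanpy as sc

adata = sc.read_10x_mtx("kidney_counts/")            # hypothetical count matrix
sc.pp.filter_cells(adata, min_genes=200)             # drop low-quality cells
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.leiden(adata)   # clusters analogous to the reported renal cell types
```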

Journal ArticleDOI
TL;DR: InvaCost is the most up-to-date, comprehensive, harmonised and robust compilation and description of economic cost estimates associated with biological invasions worldwide, providing an essential basis for worldwide research, management efforts and, ultimately, data-driven and evidence-based policymaking.
Abstract: Biological invasions are responsible for tremendous impacts globally, including huge economic losses and management expenditures. Efficiently mitigating this major driver of global change requires the improvement of public awareness and policy regarding its substantial impacts on our socio-ecosystems. One option to contribute to this overall objective is to inform people on the economic costs linked to these impacts; however, until now, a reliable synthesis of invasion costs has never been produced at a global scale. Here, we introduce InvaCost as the most up-to-date, comprehensive, harmonised and robust compilation and description of economic cost estimates associated with biological invasions worldwide. We have developed a systematic, standardised methodology to collect information from peer-reviewed articles and grey literature, while ensuring data validity and method repeatability for further transparent inputs. Our manuscript presents the methodology and tools used to build and populate this living and publicly available database. InvaCost provides an essential basis (2419 cost estimates currently compiled) for worldwide research, management efforts and, ultimately, for data-driven and evidence-based policymaking.

Journal ArticleDOI
TL;DR: As information and communication technology has become pervasive in our society, we are increasingly dependent on both digital data and the repositories that provide access to and enable the use of such resources.
Abstract: As information and communication technology has become pervasive in our society, we are increasingly dependent on both digital data and repositories that provide access to and enable the use of such resources. Repositories must earn the trust of the communities they intend to serve and demonstrate that they are reliable and capable of appropriately managing the data they hold.

Journal ArticleDOI
TL;DR: Deep-coverage HiFi datasets are presented for five complex samples: the inbred model genomes Mus musculus and Zea mays, the octoploid Fragaria × ananassa, the diploid anuran Rana muscosa, and a mock metagenome community.
Abstract: The PacBio® HiFi sequencing method yields highly accurate long-read sequencing datasets with read lengths averaging 10-25 kb and accuracies greater than 99.5%. These accurate long reads can be used to improve results for complex applications such as single nucleotide and structural variant detection, genome assembly, assembly of difficult polyploid or highly repetitive genomes, and assembly of metagenomes. Currently, there is a need for sample data sets both to evaluate the benefits of these long accurate reads and to develop bioinformatic tools including genome assemblers, variant callers, and haplotyping algorithms. We present deep-coverage HiFi datasets for five complex samples, including the two inbred model genomes Mus musculus and Zea mays, as well as two complex genomes, octoploid Fragaria × ananassa and the diploid anuran Rana muscosa. Additionally, we release sequence data from a mock metagenome community. The datasets reported here can be used without restriction to develop new algorithms and explore complex genome structure and evolution. Data were generated on the PacBio Sequel II System.

Journal ArticleDOI
TL;DR: This work developed an approach for harmonizing vegetation-specific maps of both above- and belowground biomass into a single, comprehensive representation of each, using ancillary maps of percent tree cover and landcover and a rule-based decision schema.
Abstract: Remotely sensed biomass carbon density maps are widely used for myriad scientific and policy applications, but all remain limited in scope. They often represent only a single vegetation type and rarely account for carbon stocks in belowground biomass. To date, no global product integrates these disparate estimates into an all-encompassing map at a scale appropriate for many modelling or decision-making applications. We developed an approach for harmonizing vegetation-specific maps of both above- and belowground biomass into a single, comprehensive representation of each. We overlaid input maps and allocated their estimates in proportion to the relative spatial extent of each vegetation type, using ancillary maps of percent tree cover and landcover and a rule-based decision schema. The resulting maps consistently and seamlessly report biomass carbon density estimates across a wide range of vegetation types in 2010 with quantified uncertainty. They do so for the globe at an unprecedented 300-meter spatial resolution and can be used to more holistically account for diverse vegetation carbon stocks in global analyses and greenhouse gas inventories.
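
Per pixel, the allocation rule reduces to a cover-weighted sum of the vegetation-specific densities; a toy example with invented values:

```python
density = {"forest": 120.0, "grass": 8.0, "cropland": 5.0}  # MgC/ha per input map
fraction = {"forest": 0.6, "grass": 0.3, "cropland": 0.1}   # fractional cover

pixel_density = sum(density[v] * fraction[v] for v in density)
print(f"{pixel_density:.1f} MgC/ha")   # 74.9 MgC/ha for this pixel
```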

Journal ArticleDOI
TL;DR: The ANI-1x and ANI-1ccx ML-based general-purpose potentials for organic molecules were developed through active learning, an automated data diversification process, and their underlying data sets are provided to aid research and development of ML models for chemistry.
Abstract: Maximum diversification of data is a central theme in building generalized and accurate machine learning (ML) models. In chemistry, ML has been used to develop models for predicting molecular properties, for example quantum mechanics (QM) calculated potential energy surfaces and atomic charge models. The ANI-1x and ANI-1ccx ML-based general-purpose potentials for organic molecules were developed through active learning, an automated data diversification process. Here, we describe the ANI-1x and ANI-1ccx data sets. To demonstrate data diversity, we visualize it with a dimensionality reduction scheme and contrast it against existing data sets. The ANI-1x data set contains multiple QM properties from 5M density functional theory calculations, while the ANI-1ccx data set contains 500k data points obtained with an accurate CCSD(T)/CBS extrapolation. Approximately 14 million CPU core-hours were expended to generate this data. Multiple QM-calculated properties for the chemical elements C, H, N, and O are provided: energies, atomic forces, multipole moments, atomic charges, etc. We provide this data to the community to aid research and development of ML models for chemistry. Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.12046440
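
The released data are HDF5 files; a loading sketch follows. The per-molecule group layout and key names (e.g. `wb97x_dz.energy`) are assumptions for illustration; inspect the keys of the real file before use.

```python
import h5py
import numpy as np

with h5py.File("ani1x-release.h5", "r") as f:
    for name, grp in f.items():
        coords = np.asarray(grp["coordinates"])       # (n_conformers, n_atoms, 3)
        energy = np.asarray(grp["wb97x_dz.energy"])   # DFT energy per conformer
        print(name, coords.shape, energy.shape)
        break   # just peek at the first molecule group
```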