
Showing papers in "Scientific Data in 2019"


Journal ArticleDOI
TL;DR: A large dataset of 227,835 imaging studies for 65,379 patients presenting to the Beth Israel Deaconess Medical Center Emergency Department between 2011 and 2016 is described and made freely available to facilitate and encourage a wide range of research in computer vision, natural language processing, and clinical data mining.
Abstract: Chest radiography is an extremely powerful imaging modality, allowing for a detailed inspection of a patient's chest, but requires specialized training for proper interpretation. With the advent of high performance general purpose computer vision algorithms, the accurate automated analysis of chest radiographs is becoming increasingly of interest to researchers. Here we describe MIMIC-CXR, a large dataset of 227,835 imaging studies for 65,379 patients presenting to the Beth Israel Deaconess Medical Center Emergency Department between 2011-2016. Each imaging study can contain one or more images, usually a frontal view and a lateral view. A total of 377,110 images are available in the dataset. Studies are made available with a semi-structured free-text radiology report that describes the radiological findings of the images, written by a practicing radiologist contemporaneously during routine clinical care. All images and reports have been de-identified to protect patient privacy. The dataset is made freely available to facilitate and encourage a wide range of research in computer vision, natural language processing, and clinical data mining.

504 citations


Journal ArticleDOI
TL;DR: This work proposes four clinical prediction benchmarks using data derived from the publicly available Medical Information Mart for Intensive Care (MIMIC-III) database, covering a range of clinical problems including modeling risk of mortality, forecasting length of stay, detecting physiologic decline, and phenotype classification.
Abstract: Health care is one of the most exciting frontiers in data mining and machine learning. Successful adoption of electronic health records (EHRs) created an explosion in digital clinical data available for analysis, but progress in machine learning for healthcare research has been difficult to measure because of the absence of publicly available benchmark data sets. To address this problem, we propose four clinical prediction benchmarks using data derived from the publicly available Medical Information Mart for Intensive Care (MIMIC-III) database. These tasks cover a range of clinical problems including modeling risk of mortality, forecasting length of stay, detecting physiologic decline, and phenotype classification. We propose strong linear and neural baselines for all four tasks and evaluate the effect of deep supervision, multitask training and data-specific architectural modifications on the performance of neural models.

504 citations


Journal ArticleDOI
TL;DR: This is the first set of consistently dated marine sediment cores enabling paleoclimate scientists to evaluate leads/lags between circulation and climate changes over vast regions of the Atlantic Ocean.
Abstract: Rapid changes in ocean circulation and climate have been observed in marine-sediment and ice cores over the last glacial period and deglaciation, highlighting the non-linear character of the climate system and underlining the possibility of rapid climate shifts in response to anthropogenic greenhouse gas forcing. To date, these rapid changes in climate and ocean circulation are still not fully explained. One obstacle hindering progress in our understanding of the interactions between past ocean circulation and climate changes is the difficulty of accurately dating marine cores. Here, we present a set of 92 marine sediment cores from the Atlantic Ocean for which we have established age-depth models that are consistent with the Greenland GICC05 ice core chronology, and computed the associated dating uncertainties, using a new deposition modeling technique. This is the first set of consistently dated marine sediment cores enabling paleoclimate scientists to evaluate leads/lags between circulation and climate changes over vast regions of the Atlantic Ocean. Moreover, this data set is of direct use in paleoclimate modeling studies.

399 citations


Journal ArticleDOI
TL;DR: This work presents BioWordVec: an open set of biomedical word vectors/embeddings that combines subword information from unlabeled biomedical text with a widely-used biomedical controlled vocabulary called Medical Subject Headings (MeSH).
Abstract: Distributed word representations have become an essential foundation for biomedical natural language processing (BioNLP), text mining and information retrieval. Word embeddings are traditionally computed at the word level from a large corpus of unlabeled text, ignoring the information present in the internal structure of words or any information available in domain specific structured resources such as ontologies. However, such information holds potential for greatly improving the quality of the word representation, as suggested in some recent studies in the general domain. Here we present BioWordVec: an open set of biomedical word vectors/embeddings that combines subword information from unlabeled biomedical text with a widely-used biomedical controlled vocabulary called Medical Subject Headings (MeSH). We assess both the validity and utility of our generated word embeddings over multiple NLP tasks in the biomedical domain. Our benchmarking results demonstrate that our word embeddings can result in significantly improved performance over the previous state of the art in those challenging tasks.

329 citations


Journal ArticleDOI
TL;DR: In this article, the authors use machine learning to merge energy flux measurements from FLUXNET eddy covariance towers with remote sensing and meteorological data to estimate global gridded net radiation, latent and sensible heat and their uncertainties.
Abstract: Although a key driver of Earth’s climate system, global land-atmosphere energy fluxes are poorly constrained. Here we use machine learning to merge energy flux measurements from FLUXNET eddy covariance towers with remote sensing and meteorological data to estimate global gridded net radiation, latent and sensible heat and their uncertainties. The resulting FLUXCOM database comprises 147 products in two setups: (1) 0.0833° resolution using MODIS remote sensing data (RS) and (2) 0.5° resolution using remote sensing and meteorological data (RS + METEO). Within each setup we use a full factorial design across machine learning methods, forcing datasets and energy balance closure corrections. For RS and RS + METEO setups respectively, we estimate 2001–2013 global (±1 s.d.) net radiation as 75.49 ± 1.39 W m⁻² and 77.52 ± 2.43 W m⁻², sensible heat as 32.39 ± 4.17 W m⁻² and 35.58 ± 4.75 W m⁻², and latent heat flux as 39.14 ± 6.60 W m⁻² and 39.49 ± 4.51 W m⁻² (as evapotranspiration, 75.6 ± 9.8 × 10³ km³ yr⁻¹ and 76 ± 6.8 × 10³ km³ yr⁻¹). FLUXCOM products are suitable to quantify global land-atmosphere interactions and benchmark land surface model simulations.

319 citations


Journal ArticleDOI
TL;DR: The resolution of V2-V3 and V3-V4 16S rRNA regions is compared for the purposes of estimating microbial community diversity using paired-end Illumina MiSeq reads, and it is shown that the V2-V3 fragment has higher resolution for lower-rank taxa (genera and species).
Abstract: In this work, we compare the resolution of V2-V3 and V3-V4 16S rRNA regions for the purposes of estimating microbial community diversity using paired-end Illumina MiSeq reads, and show that the fragment including the V2 and V3 regions has higher resolution for lower-rank taxa (genera and species). It allows for a more precise distance-based clustering of reads into species-level OTUs. Statistically convergent estimates of the diversity of major species (defined as those that together are covered by 95% of reads) can be achieved at sample sizes of 10,000 to 15,000 reads. The relative error of the Shannon index estimate for this condition is lower than 4%.
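As an illustration of the quantities mentioned in this abstract, the following is a minimal sketch of how a Shannon index and its relative error under read subsampling might be computed. The abstract does not state the authors' procedure, so the natural logarithm, the random subsampling without replacement, the helper names, and the synthetic community below are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def shannon_index(otu_labels):
    """Shannon index H = -sum(p_i * ln p_i) over OTU relative abundances."""
    counts = Counter(otu_labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def relative_error(estimate, reference):
    """Absolute deviation of an estimate, relative to the reference value."""
    return abs(estimate - reference) / reference


def subsample_error(otu_labels, n_reads, seed=0):
    """Relative error of the Shannon index computed from n_reads random reads.

    Assumption: reads are drawn without replacement from the full read set,
    and the full-set Shannon index is used as the reference.
    """
    rng = random.Random(seed)
    subsample = rng.sample(otu_labels, n_reads)
    return relative_error(shannon_index(subsample), shannon_index(otu_labels))


if __name__ == "__main__":
    # Synthetic community (hypothetical): a few dominant OTUs and a long tail.
    reads = ["otu_a"] * 40000 + ["otu_b"] * 25000 + ["otu_c"] * 10000
    reads += [f"rare_{i}" for i in range(5000)]
    for n in (1000, 10000, 15000):
        print(n, round(subsample_error(reads, n), 4))
```

Each read is represented here by the label of the OTU it was assigned to; the clustering step that produces those labels is outside the scope of the sketch.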

235 citations


Journal ArticleDOI
TL;DR: The FAIR Data Principles as discussed by the authors are a set of data reuse principles that focus on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals.
Abstract: There is an urgent need to improve the infrastructure supporting the reuse of scholarly data. A diverse set of stakeholders, representing academia, industry, funding agencies, and scholarly publishers, have come together to design and jointly endorse a concise and measurable set of principles that we refer to as the FAIR Data Principles. The intent is that these may act as a guideline for those wishing to enhance the reusability of their data holdings. Distinct from peer initiatives that focus on the human scholar, the FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals. This Comment is the first formal publication of the FAIR Principles, and includes the rationale behind them, and some exemplar implementations in the community.

220 citations


Journal ArticleDOI
TL;DR: The HydroATLAS database provides a standardized compendium of descriptive hydro-environmental information for all watersheds and rivers of the world at high spatial resolution and is fully compatible with other products of the overarching HydroSHEDS project enabling versatile hydro-ecological assessments for a broad user community.
Abstract: The HydroATLAS database provides a standardized compendium of descriptive hydro-environmental information for all watersheds and rivers of the world at high spatial resolution. Version 1.0 of HydroATLAS offers data for 56 variables, partitioned into 281 individual attributes and organized in six categories: hydrology; physiography; climate; land cover & use; soils & geology; and anthropogenic influences. HydroATLAS derives the hydro-environmental characteristics by aggregating and reformatting original data from well-established global digital maps, and by accumulating them along the drainage network from headwaters to ocean outlets. The attributes are linked to hierarchically nested sub-basins at multiple scales, as well as to individual river reaches, both extracted from the global HydroSHEDS database at 15 arc-second (~500 m) resolution. The sub-basin and river reach information is offered in two companion datasets: BasinATLAS and RiverATLAS. The standardized format of HydroATLAS ensures easy applicability while the inherent topological information supports basic network functionality such as identifying up- and downstream connections. HydroATLAS is fully compatible with other products of the overarching HydroSHEDS project enabling versatile hydro-ecological assessments for a broad user community.

210 citations


Journal ArticleDOI
TL;DR: Several AMPs in clinical trials are described, including their properties, indications and clinicaltrials.gov identifiers, and applications of DRAMP in the development of AMPs are presented.
Abstract: Data Repository of Antimicrobial Peptides (DRAMP, http://dramp.cpu-bioinfor.org/ ) is an open-access comprehensive database containing general, patent and clinical antimicrobial peptides (AMPs). DRAMP has now been updated to version 2.0, which contains a total of 19,899 entries (2,550 newly added), including 5,084 general entries, 14,739 patent entries, and 76 clinical entries. The update covers new entries, structures, annotations, classifications and downloads. Compared with APD and CAMP, DRAMP contains 14,040 non-overlapping sequences (70.56% of DRAMP). To help users trace original references, the PubMed IDs of references are included in the activity information. DRAMP data can be downloaded by dataset and activity, and the website source code is also available on a dedicated download webpage. Although thousands of AMPs have been reported, only a few have entered the clinical stage. In the paper, we describe several AMPs in clinical trials, including their properties, indications and clinicaltrials.gov identifiers. Finally, we present applications of DRAMP in the development of AMPs.

193 citations


Journal ArticleDOI
TL;DR: A climate data record of global sea surface temperature (SST) spanning 1981–2016 has been developed from 4 × 10¹² satellite measurements of thermal infra-red radiance, and target applications include: climate and ocean model evaluation; quantification of marine change and variability; climate and ocean-atmosphere processes; and specific applications in ocean ecology, oceanography and geophysics.
Abstract: A climate data record of global sea surface temperature (SST) spanning 1981-2016 has been developed from 4 × 10¹² satellite measurements of thermal infra-red radiance. The spatial area represented by pixel SST estimates is between 1 km² and 45 km². The mean density of good-quality observations is 13 km⁻² yr⁻¹. SST uncertainty is evaluated per datum, the median uncertainty for pixel SSTs being 0.18 K. Multi-annual observational stability relative to drifting buoy measurements is within 0.003 K yr⁻¹ of zero with high confidence, despite maximal independence from in situ SSTs over the latter two decades of the record. Data are provided at native resolution, gridded at 0.05° latitude-longitude resolution (individual sensors), and aggregated and gap-filled on a daily 0.05° grid. Skin SSTs, depth-adjusted SSTs de-aliased with respect to the diurnal cycle, and SST anomalies are provided. Target applications of the dataset include: climate and ocean model evaluation; quantification of marine change and variability (including marine heatwaves); climate and ocean-atmosphere processes; and specific applications in ocean ecology, oceanography and geophysics.

192 citations


Journal ArticleDOI
TL;DR: In Table 3 of this Data Descriptor the units of Mean_N2O and Mean_CH4 are incorrectly stated as "Nanomolar (μM)"; this should instead read "Nanomolar (nM)".
Abstract: In Table 3 of this Data Descriptor the units of Mean_N2O and Mean_CH4 are incorrectly stated as "Nanomolar (μM)". This should instead read "Nanomolar (nM)".

Journal ArticleDOI
TL;DR: This globally calibrated and cross-validated dataset provides a single point of storage for all altimeter missions in a consistent format, and quantile-quantile comparisons between altimeter and buoy data as well as between altimeters are undertaken to test consistency of probability distributions and extreme value performance.
Abstract: This dataset consists of 33 years (1985 to 2018) of global significant wave height and wind speed obtained from 13 altimeters, namely: GEOSAT, ERS-1, TOPEX, ERS-2, GFO, JASON-1, ENVISAT, JASON-2, CRYOSAT-2, HY-2A, SARAL, JASON-3 and SENTINEL-3A. The altimeter data have been calibrated and validated against National Oceanographic Data Center (NODC) buoy data. Differences between altimeter and buoy data as a function of time are investigated for long-term stability. A cross validation between altimeters is also carried out in order to check the stability and consistency of the calibrations developed. Quantile-quantile comparisons between altimeter and buoy data as well as between altimeters are undertaken to test consistency of probability distributions and extreme value performance. The data were binned into 1° by 1° bins globally, to provide convenient access for users to download only the regions of interest. All data are quality controlled. This globally calibrated and cross-validated dataset provides a single point of storage for all altimeter missions in a consistent format.
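The 1° by 1° binning mentioned in this abstract can be pictured with a short sketch. This is not the dataset's actual file layout or tooling, which the abstract does not describe; the function names, the south-west-corner key convention, and the record format are hypothetical choices made for illustration.

```python
import math
from collections import defaultdict


def bin_key(lat, lon):
    """Return the south-west corner (integer degrees) of the 1x1 degree cell.

    Assumptions: longitudes are wrapped into [-180, 180), and a latitude of
    exactly 90 degrees is placed in the northernmost cell.
    """
    lon = ((lon + 180.0) % 360.0) - 180.0
    lat_cell = min(math.floor(lat), 89)
    return (lat_cell, math.floor(lon))


def group_by_bin(observations):
    """Group (lat, lon, value) records by their 1x1 degree cell."""
    bins = defaultdict(list)
    for lat, lon, value in observations:
        bins[bin_key(lat, lon)].append(value)
    return dict(bins)


if __name__ == "__main__":
    # Hypothetical significant wave height samples: (lat, lon, metres).
    samples = [(-33.9, 151.2, 2.1), (-33.2, 151.8, 1.7), (10.0, 359.5, 0.9)]
    for cell, values in sorted(group_by_bin(samples).items()):
        print(cell, values)
```

Keying each cell by its corner lets a user select a region of interest by iterating over a small range of integer latitude and longitude pairs.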

Journal ArticleDOI
TL;DR: An ultra-high resolution MRI dataset of an ex vivo human brain specimen donated by a 58-year-old woman who had no history of neurological disease and died of non-neurological causes is presented.
Abstract: We present an ultra-high resolution MRI dataset of an ex vivo human brain specimen. The brain specimen was donated by a 58-year-old woman who had no history of neurological disease and died of non-neurological causes. After fixation in 10% formalin, the specimen was imaged on a 7 Tesla MRI scanner at 100 µm isotropic resolution using a custom-built 31-channel receive array coil. Single-echo multi-flip Fast Low-Angle SHot (FLASH) data were acquired over 100 hours of scan time (25 hours per flip angle), allowing derivation of synthesized FLASH volumes. This dataset provides an unprecedented view of the three-dimensional neuroanatomy of the human brain. To optimize the utility of this resource, we warped the dataset into standard stereotactic space. We now distribute the dataset in both native space and stereotactic space to the academic community via multiple platforms. We envision that this dataset will have a broad range of investigational, educational, and clinical applications that will advance understanding of human brain anatomy in health and disease. Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.9958688

Journal ArticleDOI
Anahit Babayan, Miray Erbey, Deniz Kumral, Janis Reinelt, Andrea M. F. Reiter, Josefin Röbbig, H. Lina Schaare, Marie Uhlig, Alfred Anwander, Pierre-Louis Bazin, Annette Horstmann, Leonie Lampe, Vadim V. Nikulin, Hadas Okon-Singer, Sven Preusser, André Pampel, Christiane Rohr, Julia Sacher, Angelika Thöne-Otto, Sabrina Trapp, Till Nierhaus, Denise Altmann, Katrin Arélin, Maria Blöchl, Edith Bongartz, Patric Breig, Elena Cesnaite, Sufang Chen, Roberto Cozatl, Saskia Czerwonatis, Gabriele Dambrauskaite, Maria Dreyer, Jessica Enders, Melina Engelhardt, Marie Michele Fischer, Norman Forschack, Johannes Golchert, Laura Golz, C. Alexandrina Guran, Susanna Hedrich, Nicole Hentschel, Daria I. Hoffmann, Julia M. Huntenburg, Rebecca Jost, Anna Kosatschek, Stella Kunzendorf, Hannah Lammers, Mark E. Lauckner, Keyvan Mahjoory, Ahmad S. Kanaan, Natacha Mendes, Ramona Menger, Enzo Morino, Karina Näthe, Jennifer Neubauer, Handan Noyan, Sabine Oligschläger, Patricia Panczyszyn-Trzewik, Dorothee Poehlchen, Nadine Putzke, Sabrina Roski, Marie-Catherine Schaller, Anja Schieferbein, Benito Schlaak, Robert Schmidt, Krzysztof J. Gorgolewski, Hanna Maria Schmidt, Anne Schrimpf, Sylvia Stasch, Maria Voss, Annett Wiedemann, Daniel S. Margulies, Michael Gaebler, Arno Villringer
TL;DR: A publicly available dataset of 227 healthy participants comprising a young and elderly group acquired cross-sectionally in Leipzig, Germany, between 2013 and 2015 to study mind-body-emotion interactions is presented.
Abstract: We present a publicly available dataset of 227 healthy participants comprising a young (N=153, 25.1±3.1 years, range 20-35 years, 45 female) and an elderly group (N=74, 67.6±4.7 years, range 59-77 years, 37 female) acquired cross-sectionally in Leipzig, Germany, between 2013 and 2015 to study mind-body-emotion interactions. During a two-day assessment, participants completed MRI at 3 Tesla (resting-state fMRI, quantitative T1 (MP2RAGE), T2-weighted, FLAIR, SWI/QSM, DWI) and a 62-channel EEG experiment at rest. During task-free resting-state fMRI, cardiovascular measures (blood pressure, heart rate, pulse, respiration) were continuously acquired. Anthropometrics, blood samples, and urine drug tests were obtained. Psychiatric symptoms were identified with Standardized Clinical Interview for DSM IV (SCID-I), Hamilton Depression Scale, and Borderline Symptoms List. Psychological assessment comprised 6 cognitive tests as well as 21 questionnaires related to emotional behavior, personality traits and tendencies, eating behavior, and addictive behavior. We provide information on study design, methods, and details of the data. This dataset is part of the larger MPI Leipzig Mind-Brain-Body database.

Journal ArticleDOI
TL;DR: An extension to BIDS for electroencephalography (EEG) data, EEG-BIDS, is presented, along with tools and references to a series of public EEG datasets organized using this new standard.
Abstract: The Brain Imaging Data Structure (BIDS) project is a rapidly evolving effort in the human brain imaging research community to create standards allowing researchers to readily organize and share study data within and between laboratories. Here we present an extension to BIDS for electroencephalography (EEG) data, EEG-BIDS, along with tools and references to a series of public EEG datasets organized using this new standard.

Journal ArticleDOI
TL;DR: An outline of the hosted datasets and features available on the CHRS Data Portal, an examination of the necessity of easily accessible public data, a comprehensive overview of the PERSIANN algorithms and datasets, and a walk-through of the procedure to access and obtain the data are presented.
Abstract: The Center for Hydrometeorology and Remote Sensing (CHRS) has created the CHRS Data Portal to facilitate easy access to the three open data licensed satellite-based precipitation datasets generated by our Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks (PERSIANN) system: PERSIANN, PERSIANN-Cloud Classification System (CCS), and PERSIANN-Climate Data Record (CDR). These datasets have the potential for widespread use by various researchers, professionals including engineers, city planners, and so forth, as well as the community at large. Researchers at CHRS created the CHRS Data Portal with an emphasis on simplicity and the intention of fostering synergistic relationships with scientists and experts from around the world. The following paper presents an outline of the hosted datasets and features available on the CHRS Data Portal, an examination of the necessity of easily accessible public data, a comprehensive overview of the PERSIANN algorithms and datasets, and a walk-through of the procedure to access and obtain the data.

Journal ArticleDOI
TL;DR: A new open repository for chemical reactions on catalytic surfaces, available at https://www.catalysis-hub.org, that seeks to accelerate the discovery of catalytic materials for sustainable energy applications by enabling researchers to efficiently use the data as a basis for new calculations and model generation.
Abstract: We present a new open repository for chemical reactions on catalytic surfaces, available at https://www.catalysis-hub.org . The featured database for surface reactions contains more than 100,000 chemisorption and reaction energies obtained from electronic structure calculations, and is continuously being updated with new datasets. In addition to providing quantum-mechanical results for a broad range of reactions and surfaces from different publications, the database features a systematic, large-scale study of chemical adsorption and hydrogenation on bimetallic alloy surfaces. The database contains reaction specific information, such as the surface composition and reaction energy for each reaction, as well as the surface geometries and calculational parameters, essential for data reproducibility. By providing direct access via the web-interface as well as a Python API, we seek to accelerate the discovery of catalytic materials for sustainable energy applications by enabling researchers to efficiently use the data as a basis for new calculations and model generation.

Journal ArticleDOI
TL;DR: In this paper, a large dataset of 2D materials, with more than 6,000 monolayer structures, obtained from both top-down and bottom-up discovery procedures, is presented.
Abstract: Two-dimensional (2D) materials have been a hot research topic in the last decade, due to novel fundamental physics in the reduced dimension and appealing applications. Systematic discovery of functional 2D materials has been the focus of many studies. Here, we present a large dataset of 2D materials, with more than 6,000 monolayer structures, obtained from both top-down and bottom-up discovery procedures. First, we screened all bulk materials in the database of Materials Project for layered structures by a topology-based algorithm and theoretically exfoliated them into monolayers. Then, we generated new 2D materials by chemical substitution of elements in known 2D materials by others from the same group in the periodic table. The structural, electronic and energetic properties of these 2D materials are consistently calculated, to provide a starting point for further material screening, data mining, data analysis and artificial intelligence applications. We present the details of computational methodology, data record and technical validation of our publicly available data ( http://www.2dmatpedia.org/ ).

Journal ArticleDOI
TL;DR: PEST-CHEMGRIDS is introduced, a comprehensive database of the 20 most used pesticide active ingredients on 6 dominant crops and 4 aggregated crop classes at 5 arc-min resolution projected from 2015 to 2025, used in global environmental modelling, assessment of agrichemical contamination, and risk analysis.
Abstract: Available georeferenced environmental layers are facilitating new insights into global environmental assets and their vulnerability to anthropogenic inputs. Geographically gridded data of agricultural pesticides are crucial to assess human and ecosystem exposure to potential and recognised toxicants. However, pesticide inventories are often sparse over time and by region, mostly report aggregated classes of active ingredients, and are generally fragmented across local or government authorities, thus hampering an integrated global analysis of pesticide risk. Here, we introduce PEST-CHEMGRIDS, a comprehensive database of the 20 most used pesticide active ingredients on 6 dominant crops and 4 aggregated crop classes at 5 arc-min resolution (about 10 km at the equator) projected from 2015 to 2025. To estimate the global application rates of specific active ingredients we use spatial statistical methods to re-analyse the USGS/PNSP and FAOSTAT pesticide databases along with other public inventories including global gridded data of soil physical properties, hydroclimatic variables, agricultural quantities, and socio-economic indices. PEST-CHEMGRIDS can be used in global environmental modelling, assessment of agrichemical contamination, and risk analysis.
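The "about 10 km at the equator" figure for a 5 arc-min grid can be checked with a small calculation. The sketch below assumes a spherical Earth with the WGS84 equatorial radius and measures only the east-west width of a cell; the function name is hypothetical and not taken from the paper.

```python
import math

# WGS84 equatorial radius in kilometres (spherical-Earth simplification).
EARTH_EQUATORIAL_RADIUS_KM = 6378.137


def cell_width_km(resolution_deg, latitude_deg=0.0):
    """East-west width of a grid cell of the given angular resolution.

    The width shrinks with the cosine of latitude because meridians converge
    towards the poles.
    """
    km_per_degree = 2.0 * math.pi * EARTH_EQUATORIAL_RADIUS_KM / 360.0
    return resolution_deg * km_per_degree * math.cos(math.radians(latitude_deg))


if __name__ == "__main__":
    five_arc_min = 5.0 / 60.0  # 5 arc-minutes expressed in degrees
    print(round(cell_width_km(five_arc_min), 2))        # at the equator
    print(round(cell_width_km(five_arc_min, 60.0), 2))  # at 60 degrees north
```

Under these assumptions a 5 arc-min cell is roughly 9.3 km wide at the equator, consistent with the rounded value in the abstract, and about half that at 60° latitude.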

Journal ArticleDOI
TL;DR: A dataset of “codified recipes” for solid-state synthesis, automatically extracted from scientific publications using text mining and natural language processing approaches, is presented to support data-driven prediction of inorganic materials synthesis.
Abstract: Materials discovery has become significantly facilitated and accelerated by high-throughput ab-initio computations. This ability to rapidly design interesting novel compounds has displaced the materials innovation bottleneck to the development of synthesis routes for the desired material. As there is no fundamental theory for materials synthesis, one might attempt a data-driven approach for predicting inorganic materials synthesis, but this is impeded by the lack of a comprehensive database containing synthesis processes. To overcome this limitation, we have generated a dataset of "codified recipes" for solid-state synthesis automatically extracted from scientific publications. The dataset consists of 19,488 synthesis entries retrieved from 53,538 solid-state synthesis paragraphs by using text mining and natural language processing approaches. Every entry contains information about target material, starting compounds, operations used and their conditions, as well as the balanced chemical equation of the synthesis reaction. The dataset is publicly available and can be used for data mining of various aspects of inorganic materials synthesis.

Journal ArticleDOI
TL;DR: A multi-cohort genomics study of postmortem brains from controls, individuals with schizophrenia and bipolar disorder, and a public resource of functional genomic data from the dorsolateral prefrontal cortex is presented.
Abstract: Schizophrenia and bipolar disorder are serious mental illnesses that affect more than 2% of adults. While large-scale genetics studies have identified genomic regions associated with disease risk, less is known about the molecular mechanisms by which risk alleles with small effects lead to schizophrenia and bipolar disorder. In order to fill this gap between genetics and disease phenotype, we have undertaken a multi-cohort genomics study of postmortem brains from controls, individuals with schizophrenia and bipolar disorder. Here we present a public resource of functional genomic data from the dorsolateral prefrontal cortex (DLPFC; Brodmann areas 9 and 46) of 986 individuals from 4 separate brain banks, including 353 diagnosed with schizophrenia and 120 with bipolar disorder. The genomic data include RNA-seq and SNP genotypes on 980 individuals, and ATAC-seq on 269 individuals, of which 264 are a subset of individuals with RNA-seq. We have performed extensive preprocessing and quality control on these data so that the research community can take advantage of this public resource available on the Synapse platform at http://CommonMind.org .

Journal ArticleDOI
TL;DR: This work describes the multi-layer temporal network which connects a population of more than 700 university students over a period of four weeks, and expects that reuse of this dataset will allow researchers to make progress on the analysis and modeling of human social networks.
Abstract: We describe the multi-layer temporal network which connects a population of more than 700 university students over a period of four weeks. The dataset was collected via smartphones as part of the Copenhagen Networks Study. We include the network of physical proximity among the participants (estimated via Bluetooth signal strength), the network of phone calls (start time, duration, no content), the network of text messages (time of message, no content), and information about Facebook friendships. Thus, we provide multiple types of communication networks expressed in a single, large population with high temporal resolution, and over a period of multiple weeks, a fact which makes the dataset shared here unique. We expect that reuse of this dataset will allow researchers to make progress on the analysis and modeling of human social networks.

Journal ArticleDOI
TL;DR: The nature of team sports like soccer, halfway between the abstraction of a game and the reality of complex social systems, combined with the unique size and composition of this dataset, provide the ideal ground for tackling a wide range of data science problems.
Abstract: Soccer analytics is attracting increasing interest in academia and industry, thanks to the availability of sensing technologies that provide high-fidelity data streams for every match. Unfortunately, these detailed data are owned by specialized companies and hence are rarely publicly available for scientific research. To fill this gap, this paper describes the largest open collection of soccer-logs ever released, containing all the spatio-temporal events (passes, shots, fouls, etc.) that occurred during each match for an entire season of seven prominent soccer competitions. Each match event contains information about its position, time, outcome, player and characteristics. The nature of team sports like soccer, halfway between the abstraction of a game and the reality of complex social systems, combined with the unique size and composition of this dataset, provide an ideal ground for tackling a wide range of data science problems, including the measurement and evaluation of performance, both at individual and at collective level, and the determinants of success and failure.

Journal ArticleDOI
TL;DR: This work has assembled national master health facility lists from a variety of government and non-government sources from 50 countries and islands in sub-Saharan Africa and used multiple geocoding methods to provide a comprehensive spatial inventory of 98,745 public health facilities.
Abstract: Health facilities form a central component of health systems, providing curative and preventative services, and are structured to allow referral through a pyramid of increasingly complex service provision. Access to health care is a complex and multidimensional concept; in its most narrow sense, however, it refers to geographic availability. Linking health facilities to populations has been a traditional per capita index of health care coverage. With locations of health facilities and higher resolution population data, however, Geographic Information Systems allow for a more refined metric of health access, define geographic inequalities in service provision and inform planning. Maximizing the value of spatial health access requires a complete census of providers and their locations. To date there has not been a single, geo-referenced and comprehensive public health facility database for sub-Saharan Africa. We have assembled national master health facility lists from a variety of government and non-government sources from 50 countries and islands in sub-Saharan Africa and used multiple geocoding methods to provide a comprehensive spatial inventory of 98,745 public health facilities.

Journal ArticleDOI
TL;DR: This work presents and tests a data mining workflow to create a global database of single fires that allows for the characterization of fire types and fire regimes worldwide.
Abstract: Global fire monitoring systems are crucial to study fire behaviour, fire regimes and their impact at the global scale. Although global fire products based on the use of Earth Observation satellites exist, most remote sensing products only partially cover the requirements for these analyses. These data do not provide information like fire size, fire spread speed, how fires may evolve and join into a single event, or the number of fire events for a given area. This high level of abstraction is very valuable; it makes it possible to characterize fires by type (size, spread, behaviour, etc.). Here, we present and test a data mining workflow to create a global database of single fires that allows for the characterization of fire types and fire regimes worldwide. This work describes the data produced by a data mining process using the MODIS burnt area product Collection 6 (MCD64A1). The entire product has been computed until the present and is available under the umbrella of the Global Wildfire Information System (GWIS). Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.10284101

Journal ArticleDOI
TL;DR: For the period 1990–2017, this database contains, for 1625 regions within 161 countries, the national and subnational values of the Subnational Human Development Index (SHDI) and of the three dimension indices on the basis of which the SHDI is constructed: education, health and standard of living.
Abstract: In this paper we describe the Subnational Human Development Database. For the period 1990–2017, this database contains, for 1625 regions within 161 countries, the national and subnational values of the Subnational Human Development Index (SHDI), of the three dimension indices on the basis of which the SHDI is constructed (education, health and standard of living), and of the four indicators needed to create the dimension indices (expected years of schooling, mean years of schooling, life expectancy and gross national income per capita). The subnational values of the four indicators were computed using data from statistical offices and from the Area Database of the Global Data Lab, which contains indicators aggregated from household surveys and census datasets. Values for missing years were estimated by interpolation and extrapolation from real data. By normalizing the population-weighted averages of the indicators to their national levels in the UNDP-HDI database, values of the SHDI and its dimension indices were obtained that, at the national level, equal the official UNDP versions. Machine-accessible metadata file describing the reported data (ISA-Tab format)
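The normalization step mentioned in this abstract can be illustrated with a small sketch. It assumes a simple multiplicative rescaling, so that the population-weighted average of the regional values matches a given national value; the abstract does not spell out the exact procedure, so the function `rescale_to_national`, its parameters, and the toy numbers are all hypothetical.

```python
def rescale_to_national(values, populations, national_value):
    """Scale regional values so their population-weighted mean equals
    national_value (assumed multiplicative rescaling, for illustration)."""
    if len(values) != len(populations):
        raise ValueError("values and populations must have the same length")
    total_pop = sum(populations)
    if total_pop <= 0:
        raise ValueError("total population must be positive")
    weighted_mean = sum(v * p for v, p in zip(values, populations)) / total_pop
    if weighted_mean == 0:
        raise ValueError("population-weighted mean must be non-zero")
    factor = national_value / weighted_mean
    return [v * factor for v in values]


def weighted_mean(values, populations):
    """Population-weighted mean, used here to check the result."""
    return sum(v * p for v, p in zip(values, populations)) / sum(populations)


# Toy example: two regions with mean years of schooling 10 and 12,
# populations 1 and 3 (in arbitrary units), and a national value of 11.
regional = [10.0, 12.0]
population = [1.0, 3.0]
scaled = rescale_to_national(regional, population, national_value=11.0)
check = weighted_mean(scaled, population)
```

After rescaling, the ratios between regions are preserved while the weighted mean agrees with the national figure, which is the property the abstract describes.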

Journal ArticleDOI
TL;DR: A scalable, automatable framework to evaluate digital resources that encompasses measurable indicators, open source tools, and participation guidelines, which come together to accommodate domain relevant community-defined FAIR assessments is proposed.
Abstract: Transparent evaluations of FAIRness are increasingly required by a wide range of stakeholders, from scientists to publishers, funding agencies and policy makers. We propose a scalable, automatable framework to evaluate digital resources that encompasses measurable indicators, open source tools, and participation guidelines, which come together to accommodate domain relevant community-defined FAIR assessments. The components of the framework are: (1) Maturity Indicators - community-authored specifications that delimit a specific automatically-measurable FAIR behavior; (2) Compliance Tests - small Web apps that test digital resources against individual Maturity Indicators; and (3) the Evaluator, a Web application that registers, assembles, and applies community-relevant sets of Compliance Tests against a digital resource, and provides a detailed report about what a machine "sees" when it visits that resource. We discuss the technical and social considerations of FAIR assessments, and how this translates to our community-driven infrastructure. We then illustrate how the output of the Evaluator tool can serve as a roadmap to assist data stewards to incrementally and realistically improve the FAIRness of their resources.

Journal ArticleDOI
TL;DR: This dataset can be used to validate satellite data products, to evaluate predictions of land surface models, to interpret the seasonality of ecosystem-scale CO2 and H2O flux data, and to study climate change impacts on the terrestrial biosphere.
Abstract: Monitoring vegetation phenology is critical for quantifying climate change impacts on ecosystems. We present an extensive dataset of 1783 site-years of phenological data derived from PhenoCam network imagery from 393 digital cameras, situated from tropics to tundra across a wide range of plant functional types, biomes, and climates. Most cameras are located in North America. Every half hour, cameras upload images to the PhenoCam server. Images are displayed in near-real time and provisional data products, including time series of the Green Chromatic Coordinate (Gcc), are made publicly available through the project web page (https://phenocam.sr.unh.edu/webcam/gallery/). Processing is conducted separately for each plant functional type in the camera field of view. The PhenoCam Dataset v2.0, described here, has been fully processed and curated, including outlier detection and expert inspection, to ensure high quality data. This dataset can be used to validate satellite data products, to evaluate predictions of land surface models, to interpret the seasonality of ecosystem-scale CO2 and H2O flux data, and to study climate change impacts on the terrestrial biosphere. Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.9913694
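The Green Chromatic Coordinate is commonly defined as the green share of the total signal, Gcc = G / (R + G + B), computed from the mean red, green and blue digital numbers of a region of interest. The sketch below shows that calculation only; the function name and the sample values are invented for illustration, and it does not reproduce the dataset's processing, outlier detection or curation steps.

```python
def green_chromatic_coordinate(red, green, blue):
    """Return Gcc = G / (R + G + B) for the mean digital numbers of a region."""
    total = red + green + blue
    if total == 0:
        raise ValueError("R + G + B must be non-zero")
    return green / total


# Hypothetical mean digital numbers for a canopy region of interest.
gcc_canopy = green_chromatic_coordinate(red=90.0, green=120.0, blue=90.0)

# A neutral grey region has equal channels, so Gcc is one third.
gcc_grey = green_chromatic_coordinate(red=100.0, green=100.0, blue=100.0)
```

Because Gcc is a ratio, it is less sensitive to overall changes in scene brightness than the raw green channel, which is one reason it suits time series tracking of canopy greenness.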

Journal ArticleDOI
TL;DR: A survey tool was developed to assess the availability of digital research artifacts published alongside peer-reviewed journal articles and the reproducibility of article results; it also identified key bottlenecks to making work more reproducible.
Abstract: There is broad interest in improving the reproducibility of published research. We developed a survey tool to assess the availability of digital research artifacts published alongside peer-reviewed journal articles (e.g. data, models, code, directions for use) and reproducibility of article results. We used the tool to assess 360 of the 1,989 articles published by six hydrology and water resources journals in 2017. As in studies from other fields, we reproduced results for only a small fraction of articles (1.6% of tested articles) using their available artifacts. We estimated, with 95% confidence, that results might be reproduced for only 0.6% to 6.8% of all 1,989 articles. Unlike prior studies, the survey tool identified key bottlenecks to making work more reproducible. Bottlenecks include: only some digital artifacts available (44% of articles), no directions (89%), or all artifacts available but results not reproducible (5%). The tool (or extensions) can help authors, journals, funders, and institutions to self-assess manuscripts, provide feedback to improve reproducibility, and recognize and reward reproducible articles as examples for others.

Journal ArticleDOI
TL;DR: The dataset presented in this descriptor contains EEG recordings from human neonates, the visual interpretation of the EEG by human experts, supporting clinical data and code to assist access; it can be used for the development of automated methods of seizure detection and other EEG analyses.
Abstract: Neonatal seizures are a common emergency in the neonatal intensive care unit (NICU). There are many questions yet to be answered regarding the temporal/spatial characteristics of seizures from different pathologies, response to medication, effects on neurodevelopment and optimal detection. The dataset presented in this descriptor contains EEG recordings from human neonates, the visual interpretation of the EEG by human experts, supporting clinical data and code to assist access. Multi-channel EEG was recorded from 79 term neonates admitted to the NICU at the Helsinki University Hospital. The median recording duration was 74 min (IQR: 64 to 96 min). The presence of seizures in the EEGs was annotated independently by three experts. An average of 460 seizures were annotated per expert in the dataset; 39 neonates had seizures and 22 were seizure free, by consensus. The dataset can be used as a reference set of neonatal seizures, in studies of inter-observer agreement and for the development of automated methods of seizure detection and other EEG analyses. Machine-accessible metadata file describing the reported data (ISA-Tab format)