Showing papers in "Scientific Data in 2016"


Journal ArticleDOI
TL;DR: The FAIR Data Principles are a set of data reuse principles that focus on enhancing the ability of machines to automatically find and use data, in addition to supporting its reuse by individuals.
Abstract: There is an urgent need to improve the infrastructure supporting the reuse of scholarly data. A diverse set of stakeholders, representing academia, industry, funding agencies, and scholarly publishers, have come together to design and jointly endorse a concise and measurable set of principles that we refer to as the FAIR Data Principles. The intent is that these may act as a guideline for those wishing to enhance the reusability of their data holdings. Distinct from peer initiatives that focus on the human scholar, the FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals. This Comment is the first formal publication of the FAIR Principles, and includes the rationale behind them, and some exemplar implementations in the community.

7,602 citations


Journal ArticleDOI
TL;DR: The Medical Information Mart for Intensive Care (MIMIC-III) is a large, single-center database comprising information relating to patients admitted to critical care units at a large tertiary care hospital.
Abstract: MIMIC-III ('Medical Information Mart for Intensive Care') is a large, single-center database comprising information relating to patients admitted to critical care units at a large tertiary care hospital. Data includes vital signs, medications, laboratory measurements, observations and notes charted by care providers, fluid balance, procedure codes, diagnostic codes, imaging reports, hospital length of stay, survival data, and more. The database supports applications including academic and industrial research, quality improvement initiatives, and higher education coursework.

4,056 citations


Journal ArticleDOI
TL;DR: The Brain Imaging Data Structure (BIDS) is developed, a standard for organizing and describing MRI datasets that uses file formats compatible with existing software, unifies the majority of practices already common in the field, and captures the metadata necessary for most common data processing operations.
Abstract: The development of magnetic resonance imaging (MRI) techniques has defined modern neuroimaging. Since its inception, tens of thousands of studies using techniques such as functional MRI and diffusion weighted imaging have allowed for the non-invasive study of the brain. Despite the fact that MRI is routinely used to obtain data for neuroscience research, there has been no widely adopted standard for organizing and describing the data collected in an imaging experiment. This renders sharing and reusing data (within or between labs) difficult if not impossible and unnecessarily complicates the application of automatic pipelines and quality assurance protocols. To solve this problem, we have developed the Brain Imaging Data Structure (BIDS), a standard for organizing and describing MRI datasets. The BIDS standard uses file formats compatible with existing software, unifies the majority of practices already common in the field, and captures the metadata necessary for most common data processing operations.

1,037 citations
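
The BIDS abstract above describes a standard for organizing and naming MRI files. As a rough illustration only, the sketch below composes a file path in the key-value naming style used by BIDS (`sub-<label>`, `task-<label>`, `run-<index>`). The function name `bids_func_path`, its parameters, and the two-digit run padding are assumptions made for this example, not part of the standard or of any BIDS tool; the specification itself is the authority on which entities are allowed and in what order.

```python
from pathlib import PurePosixPath


def bids_func_path(subject, task, run=None, session=None,
                   suffix="bold", ext=".nii.gz"):
    """Build a BIDS-style relative path for a functional MRI file.

    Hypothetical helper for illustration; it covers only a few entities.
    """
    entities = [f"sub-{subject}"]
    if session is not None:
        entities.append(f"ses-{session}")
    entities.append(f"task-{task}")
    if run is not None:
        # Zero-padding the run index is a choice made here, not a requirement.
        entities.append(f"run-{run:02d}")

    filename = "_".join(entities + [suffix]) + ext

    # Files sit in a per-subject (and optionally per-session) folder,
    # grouped by data type ("func" for functional scans).
    folder = PurePosixPath(f"sub-{subject}")
    if session is not None:
        folder = folder / f"ses-{session}"
    return str(folder / "func" / filename)


example = bids_func_path("01", "rest", run=1)
print(example)
```

Because every piece of metadata needed to locate a scan is encoded in predictable key-value pairs, automated pipelines can find files by pattern rather than by lab-specific convention.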


Journal ArticleDOI
TL;DR: This work created a dataset of solar PV arrays to initiate and develop the process of automatically identifying solar PV locations using remote sensing imagery, and contains the geospatial coordinates and border vertices for over 19,000 solar panels across 601 high-resolution images from four cities in California.
Abstract: Earth-observing remote sensing data, including aerial photography and satellite imagery, offer a snapshot of the world from which we can learn about the state of natural resources and the built environment. The components of energy systems that are visible from above can be automatically assessed with these remote sensing data when processed with machine learning methods. Here, we focus on the information gap in distributed solar photovoltaic (PV) arrays, of which there is limited public data on solar PV deployments at small geographic scales. We created a dataset of solar PV arrays to initiate and develop the process of automatically identifying solar PV locations using remote sensing imagery. This dataset contains the geospatial coordinates and border vertices for over 19,000 solar panels across 601 high-resolution images from four cities in California. Dataset applications include training object detection and other machine learning algorithms that use remote sensing imagery, developing specific algorithms for predictive detection of distributed PV systems, estimating installed PV capacity, and analysis of the socioeconomic correlates of PV deployment.

633 citations


Journal ArticleDOI
TL;DR: A large, diverse set of sequencing data is described for seven human genomes, five of which are current or candidate NIST Reference Materials; the set includes two Personal Genome Project trios, one of Ashkenazim Jewish ancestry and one of Chinese ancestry.
Abstract: The Genome in a Bottle Consortium, hosted by the National Institute of Standards and Technology (NIST) is creating reference materials and data for human genome sequencing, as well as methods for genome comparison and benchmarking. Here, we describe a large, diverse set of sequencing data for seven human genomes; five are current or candidate NIST Reference Materials. The pilot genome, NA12878, has been released as NIST RM 8398. We also describe data from two Personal Genome Project trios, one of Ashkenazim Jewish ancestry and one of Chinese ancestry. The data come from 12 technologies: BioNano Genomics, Complete Genomics paired-end and LFR, Ion Proton exome, Oxford Nanopore, Pacific Biosciences, SOLiD, 10X Genomics GemCode WGS, and Illumina exome and WGS paired-end, mate-pair, and synthetic long reads. Cell lines, DNA, and data from these individuals are publicly available. Therefore, we expect these data to be useful for revealing novel information about the human genome and improving sequencing technologies, SNP, indel, and structural variant calling, and de novo assembly.

581 citations


Journal ArticleDOI
TL;DR: This work presents the largest database of calculated surface energies for elemental crystals to date, which contains the surface energies of more than 100 polymorphs of about 70 elements, up to a maximum Miller index of two and three for non-cubic and cubic crystals, respectively.
Abstract: The surface energy is a fundamental property of the different facets of a crystal that is crucial to the understanding of various phenomena like surface segregation, roughening, catalytic activity, and the crystal's equilibrium shape. Such surface phenomena are especially important at the nanoscale, where the large surface area to volume ratios lead to properties that are significantly different from the bulk. In this work, we present the largest database of calculated surface energies for elemental crystals to date. This database contains the surface energies of more than 100 polymorphs of about 70 elements, up to a maximum Miller index of two and three for non-cubic and cubic crystals, respectively. Well-known reconstruction schemes are also accounted for. The database is systematically improvable and has been rigorously validated against previous experimental and computational data where available. We will describe the methodology used in constructing the database, and how it can be accessed for further studies and design of materials.

537 citations


Journal ArticleDOI
TL;DR: The updated maps should provide an increased understanding of the human pressures that drive macro-ecological patterns, as well as for tracking environmental change and informing conservation science and application.
Abstract: Remotely-sensed and bottom-up survey information were compiled on eight variables measuring the direct and indirect human pressures on the environment globally in 1993 and 2009. This represents not only the most current information of its type, but also the first temporally-consistent set of Human Footprint maps. Data on human pressures were acquired or developed for: 1) built environments, 2) population density, 3) electric infrastructure, 4) crop lands, 5) pasture lands, 6) roads, 7) railways, and 8) navigable waterways. Pressures were then overlaid to create the standardized Human Footprint maps for all non-Antarctic land areas. A validation analysis using scored pressures from 3114 × 1 km² random sample plots revealed strong agreement with the Human Footprint maps. We anticipate that the Human Footprint maps will find a range of uses as proxies for human disturbance of natural systems. The updated maps should provide an increased understanding of the human pressures that drive macro-ecological patterns, as well as for tracking environmental change and informing conservation science and application.

431 citations


Journal ArticleDOI
TL;DR: The study interrogated aspects of this movement disorder through surveys and frequent sensor-based recordings from participants with and without Parkinson disease, and hopes that releasing data contributed by engaged research participants will seed a new community of analysts working collaboratively on understanding mobile health data to advance human health.
Abstract: Current measures of health and disease are often insensitive, episodic, and subjective. Further, these measures generally are not designed to provide meaningful feedback to individuals. The impact of high-resolution activity data collected from mobile phones is only beginning to be explored. Here we present data from mPower, a clinical observational study about Parkinson disease conducted purely through an iPhone app interface. The study interrogated aspects of this movement disorder through surveys and frequent sensor-based recordings from participants with and without Parkinson disease. Benefitting from large enrollment and repeated measurements on many individuals, these data may help establish baseline variability of real-world activity measurement collected via mobile phones, and ultimately may lead to quantification of the ebbs-and-flows of Parkinson symptoms. App source code for these data collection modules is available through an open source license for use in studies of other conditions. We hope that releasing data contributed by engaged research participants will seed a new community of analysts working collaboratively on understanding mobile health data to advance human health.

429 citations


Journal ArticleDOI
TL;DR: This work has analyzed gene expression levels in the brain tissue of subjects with AD and related diseases and expects that these datasets will enable investigators to explore and identify transcriptional mechanisms contributing to neurodegenerative diseases.
Abstract: Previous genome-wide association studies (GWAS), conducted by our group and others, have identified loci that harbor risk variants for neurodegenerative diseases, including Alzheimer's disease (AD). Human disease variants are enriched for polymorphisms that affect gene expression, including some that are known to associate with expression changes in the brain. Postulating that many variants confer risk to neurodegenerative disease via transcriptional regulatory mechanisms, we have analyzed gene expression levels in the brain tissue of subjects with AD and related diseases. Herein, we describe our collective datasets comprised of GWAS data from 2,099 subjects; microarray gene expression data from 773 brain samples, 186 of which also have RNAseq; and an independent cohort of 556 brain samples with RNAseq. We expect that these datasets, which are available to all qualified researchers, will enable investigators to explore and identify transcriptional mechanisms contributing to neurodegenerative diseases.

311 citations


Journal ArticleDOI
TL;DR: This data descriptor outlines a shared neuroimaging dataset from the UCLA Consortium for Neuropsychiatric Phenomics, which focused on understanding the dimensional structure of memory and cognitive control (response inhibition) functions in both healthy individuals and individuals with neuropsychiatric disorders.
Abstract: This data descriptor outlines a shared neuroimaging dataset from the UCLA Consortium for Neuropsychiatric Phenomics, which focused on understanding the dimensional structure of memory and cognitive control (response inhibition) functions in both healthy individuals (130 subjects) and individuals with neuropsychiatric disorders including schizophrenia (50 subjects), bipolar disorder (49 subjects), and attention deficit/hyperactivity disorder (43 subjects). The dataset includes an extensive set of task-based fMRI assessments, resting fMRI, structural MRI, and high angular resolution diffusion MRI. The dataset is shared through the OpenfMRI project, and is formatted according to the Brain Imaging Data Structure (BIDS) standard.

254 citations


Journal ArticleDOI
TL;DR: The Almanac of Minutely Power dataset Version 2 (AMPds2) has been released to help computational sustainability researchers, power and energy engineers, building scientists and technologists, utility companies, and eco-feedback researchers test their models, systems, algorithms, or prototypes on real house data.
Abstract: With the cost of consuming resources increasing (both economically and ecologically), homeowners need to find ways to curb consumption. The Almanac of Minutely Power dataset Version 2 (AMPds2) has been released to help computational sustainability researchers, power and energy engineers, building scientists and technologists, utility companies, and eco-feedback researchers test their models, systems, algorithms, or prototypes on real house data. In the vast majority of cases, real-world datasets lead to more accurate models and algorithms. AMPds2 is the first dataset to capture all three main types of consumption (electricity, water, and natural gas) over a long period of time (2 years) and provide 11 measurement characteristics for electricity. No other such datasets from Canada exist. Each meter has 730 days of captured data. We also include environmental and utility billing data for cost analysis. AMPds2 data has been pre-cleaned to provide for consistent and comparable accuracy results amongst different researchers and machine learning algorithms.

Journal ArticleDOI
TL;DR: The overall goal is for the Coral Trait Database to become an open-source, community-led data clearinghouse that accelerates coral reef research.
Abstract: Trait-based approaches advance ecological and evolutionary research because traits provide a strong link to an organism’s function and fitness. Trait-based research might lead to a deeper understanding of the functions of, and services provided by, ecosystems, thereby improving management, which is vital in the current era of rapid environmental change. Coral reef scientists have long collected trait data for corals; however, these are difficult to access and often under-utilized in addressing large-scale questions. We present the Coral Trait Database initiative that aims to bring together physiological, morphological, ecological, phylogenetic and biogeographic trait information into a single repository. The database houses species- and individual-level data from published field and experimental studies alongside contextual data that provide important framing for analyses. In this data descriptor, we release data for 56 traits for 1547 species, and present a collaborative platform on which other trait data are being actively federated. Our overall goal is for the Coral Trait Database to become an open-source, community-led data clearinghouse that accelerates coral reef research.

Journal ArticleDOI
TL;DR: A dataset of 1,073 polymers and related materials is developed. It is initially designed to include the optimized structures, atomization energies, band gaps, and dielectric constants, and will be progressively expanded by accumulating new materials and including additional properties calculated for the optimized structures provided.
Abstract: Emerging computation- and data-driven approaches are particularly useful for rationally designing materials with targeted properties. Generally, these approaches rely on identifying structure-property relationships by learning from a dataset of sufficiently large number of relevant materials. The learned information can then be used to predict the properties of materials not already in the dataset, thus accelerating the materials design. Herein, we develop a dataset of 1,073 polymers and related materials and make it available at http://khazana.uconn.edu/. This dataset is uniformly prepared using first-principles calculations with structures obtained either from other sources or by using structure search methods. Because the immediate target of this work is to assist the design of high dielectric constant polymers, it is initially designed to include the optimized structures, atomization energies, band gaps, and dielectric constants. It will be progressively expanded by accumulating new materials and including additional properties calculated for the optimized structures provided.

Journal ArticleDOI
TL;DR: The archived dataset detailed here includes the monthly subaerial profiles, available bathymetry for each survey transect extending seawards to 20 m water depth, and time-series of ocean astronomical tide and inshore wave forcing at 10 m water depths, the latter corresponding to the location of individual survey transects.
Abstract: Long-term observational datasets that record and quantify variability, changes and trends in beach morphology at sandy coastlines together with the accompanying wave climate are rare. A monthly beach profile survey program commenced in April 1976 at Narrabeen located on Sydney’s Northern Beaches in southeast Australia is one of just a handful of sites worldwide where on-going and uninterrupted beach monitoring now spans multiple decades. With the Narrabeen survey program reaching its 40-year milestone in April 2016, it is timely that free and unrestricted use of these data be facilitated to support the next advances in beach erosion-recovery modelling. The archived dataset detailed here includes the monthly subaerial profiles, available bathymetry for each survey transect extending seawards to 20 m water depth, and time-series of ocean astronomical tide and inshore wave forcing at 10 m water depths, the latter corresponding to the location of individual survey transects. In addition, on-going access to the results of the continuing monthly survey program is described.

Journal ArticleDOI
TL;DR: This work provides a curated and standardized version of FAERS removing duplicate case records, applying standardized vocabularies with drug names mapped to RxNorm concepts and outcomes mapped to SNOMED-CT concepts, and pre-computed summary statistics about drug-outcome relationships for general consumption.
Abstract: Identification of adverse drug reactions (ADRs) during the post-marketing phase is one of the most important goals of drug safety surveillance. Spontaneous reporting systems (SRS) data, which are the mainstay of traditional drug safety surveillance, are used for hypothesis generation and to validate the newer approaches. The publicly available US Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS) data require substantial curation before they can be used appropriately, and applying different strategies for data cleaning and normalization can have material impact on analysis results. We provide a curated and standardized version of FAERS removing duplicate case records, applying standardized vocabularies with drug names mapped to RxNorm concepts and outcomes mapped to SNOMED-CT concepts, and pre-computed summary statistics about drug-outcome relationships for general consumption. This publicly available resource, along with the source code, will accelerate drug safety research by reducing the amount of time spent performing data management on the source FAERS reports, improving the quality of the underlying data, and enabling standardized analyses using common vocabularies.

Journal ArticleDOI
TL;DR: High-resolution (250 m) irrigated area maps showed satisfactory accuracy (R² = 0.95) and can be used to understand interannual variability in irrigated area at various spatial scales.
Abstract: India is among the countries that use a significant fraction of available water for irrigation. Irrigated area in India has increased substantially after the Green Revolution and both surface and groundwater have been extensively used. Under warming climate projections, irrigation frequency may increase leading to increased irrigation water demands. Water resources planning and management in agriculture need spatially-explicit irrigated area information for different crops and different crop growing seasons. However, annual, high-resolution irrigated area maps for India for an extended historical record that can be used for water resources planning and management are unavailable. Using 250 m normalized difference vegetation index (NDVI) data from Moderate Resolution Imaging Spectroradiometer (MODIS) and 56 m land use/land cover data, high-resolution irrigated area maps are developed for all the agroecological zones in India for the period of 2000–2015. The irrigated area maps were evaluated using the agricultural statistics data from ground surveys and were compared with the previously developed irrigation maps. High-resolution (250 m) irrigated area maps showed satisfactory accuracy (R² = 0.95) and can be used to understand interannual variability in irrigated area at various spatial scales.

Journal ArticleDOI
TL;DR: The statistical model and considerations for temporally comparable maps are described, along with the resulting datasets, which are unique in terms of granularity and extent, providing fine-scale patterns of population distribution for mainland China.
Abstract: According to UN forecasts, global population will increase to over 8 billion by 2025, with much of this anticipated population growth expected in urban areas. In China, the scale of urbanization has been, and continues to be, unprecedented in terms of magnitude and rate of change. Since the late 1970s, the percentage of Chinese living in urban areas increased from ~18% to over 50%. To quantify these patterns spatially we use time-invariant or temporally-explicit data, including census data for 1990, 2000, and 2010 in an ensemble prediction model. Resulting multi-temporal, gridded population datasets are unique in terms of granularity and extent, providing fine-scale (~100 m) patterns of population distribution for mainland China. For consistency purposes, the Tibet Autonomous Region, Taiwan, and the islands in the South China Sea were excluded. The statistical model and considerations for temporally comparable maps are described, along with the resulting datasets. Final, mainland China population maps for 1990, 2000, and 2010 are freely available as products from the WorldPop Project website and the WorldPop Dataverse Repository.

Journal ArticleDOI
TL;DR: Whole genome sequencing data are provided for wild populations of the house mouse (Mus musculus), which represent the raw genetic material for the classical inbred strains in biomedical research and are a major model system for evolutionary biology.
Abstract: Wild populations of the house mouse (Mus musculus) represent the raw genetic material for the classical inbred strains in biomedical research and are a major model system for evolutionary biology. We provide whole genome sequencing data of individuals representing natural populations of M. m. domesticus (24 individuals from 3 populations), M. m. helgolandicus (3 individuals), M. m. musculus (22 individuals from 3 populations) and M. spretus (8 individuals from one population). We use a single pipeline to map and call variants for these individuals and also include 10 additional individuals of M. m. castaneus for which genomic data are publicly available. In addition, RNAseq data were obtained from 10 tissues of up to eight adult individuals from each of the three M. m. domesticus populations for which genomic data were collected. Data and analyses are presented via tracks viewable in the UCSC or IGV genome browsers. We also provide information on available outbred stocks and instructions on how to keep them in the laboratory.

Journal ArticleDOI
TL;DR: This paper presents a comprehensive and freely available data set of lakes’ status over the Tibetan Plateau (TP) region dating back to the 1960s, comprising three time series derived from ground survey (the 1960s) and from high-spatial-resolution satellite images from the China-Brazil Earth Resources Satellite (CBERS) (2005) and China’s GaoFen-1 (GF-1) satellite (2014).
Abstract: Long-term datasets of number and size of lakes over the Tibetan Plateau (TP) are among the most critical components for better understanding the interactions among the cryosphere, hydrosphere, and atmosphere at regional and global scales. Due to the harsh environment and the scarcity of data over the TP, data accumulation and sharing become more valuable for scientists worldwide to make new discoveries in this region. This paper, for the first time, presents a comprehensive and freely available data set of lakes’ status (name, location, shape, area, perimeter, etc.) over the TP region dating back to the 1960s, including three time series, i.e., the 1960s, 2005, and 2014, derived from ground survey (the 1960s) or high-spatial-resolution satellite images from the China-Brazil Earth Resources Satellite (CBERS) (2005) and China’s newly launched GaoFen-1 (GF-1, which means high-resolution images in Chinese) satellite (2014). The data set could provide scientists with useful information for revealing environmental changes and mechanisms over the TP region.

Journal ArticleDOI
TL;DR: The spatial footprint and temporal clustering of extreme sea level and skew surge events around the UK coast over the last 100 years are analyzed to distinguish four broad categories of spatial footprints of events and the distinct storm tracks that generated them.
Abstract: In this paper we analyse the spatial footprint and temporal clustering of extreme sea level and skew surge events around the UK coast over the last 100 years (1915-2014). The vast majority of the extreme sea level events are generated by moderate, rather than extreme skew surges, combined with spring astronomical high tides. We distinguish four broad categories of spatial footprints of events and the distinct storm tracks that generated them. There have been rare events when extreme levels have occurred along two unconnected coastal regions during the same storm. The events that occur in closest succession (< 4 days) typically impact different stretches of coastline. The spring/neap tidal cycle prevents successive extreme sea level events from happening within 4-8 days. Finally, the 2013/14 season was highly unusual in the context of the last 100 years from an extreme sea level perspective.

Journal ArticleDOI
TL;DR: The STRESSFLEA consortium generated a comprehensive RNA-Seq data set by exposing two inbred genotypes of D. magna and a recombinant cross of these genotypes to a range of environmental perturbations to investigate links between genes and the environment.
Abstract: The full exploration of gene-environment interactions requires model organisms with well-characterized ecological interactions in their natural environment, manipulability in the laboratory and genomic tools. The waterflea Daphnia magna is an established ecological and toxicological model species, central to the food webs of freshwater lentic habitats and sentinel for water quality. Its tractability and cyclic parthenogenetic life-cycle are ideal to investigate links between genes and the environment. Capitalizing on this unique model system, the STRESSFLEA consortium generated a comprehensive RNA-Seq data set by exposing two inbred genotypes of D. magna and a recombinant cross of these genotypes to a range of environmental perturbations. Gene models were constructed from the transcriptome data and mapped onto the draft genome of D. magna using EvidentialGene. The transcriptome data generated here, together with the available draft genome sequence of D. magna and a high-density genetic map will be a key asset for future investigations in environmental genomics.

Journal ArticleDOI
TL;DR: A Geographic Information Systems (GIS) protocol to transfer polyline data into a workable network format in the form of a node layer, an edge layer, and a list of nodes/edges with relevant geographic information (e.g., length) is offered.
Abstract: The study of geographical systems as graphs and networks has gained significant momentum in the academic literature as these systems possess measurable and relevant network properties. Crowd-based sources of data such as OpenStreetMap (OSM) have created a wealth of worldwide geographic information including on transportation systems (e.g., road networks). In this work, we offer a Geographic Information Systems (GIS) protocol to transfer polyline data into a workable network format in the form of a node layer, an edge layer, and a list of nodes/edges with relevant geographic information (e.g., length). Moreover, we have developed an ArcGIS tool to perform this protocol on OSM data, which we have applied to 80 urban areas in the world and made the results freely available. The tool accounts for crossover roads such as ramps and bridges. A separate tool is also made available for planar data and can be applied to any line features in ArcGIS.
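
The abstract above describes turning polyline data into a node layer and an edge layer with lengths attached. The sketch below shows one simple way to do that kind of conversion in plain Python. It is not the authors' ArcGIS tool: the function name `polylines_to_network` is hypothetical, coordinates are assumed to be planar, nodes are created only at polyline endpoints, and crossovers such as ramps and bridges are not handled.

```python
from math import hypot


def polylines_to_network(polylines):
    """Convert polylines (lists of (x, y) points) into node and edge tables.

    Illustrative only: assumes planar coordinates and treats each
    polyline as a single edge between its two endpoints.
    """
    node_ids = {}  # maps an endpoint coordinate to its node id
    edges = []     # rows of (edge id, start node, end node, length)

    def node_id(point):
        # Reuse the same id when two polylines share an endpoint.
        if point not in node_ids:
            node_ids[point] = len(node_ids)
        return node_ids[point]

    for line in polylines:
        # Edge length is the sum of the straight segments along the polyline.
        length = sum(hypot(x2 - x1, y2 - y1)
                     for (x1, y1), (x2, y2) in zip(line, line[1:]))
        edges.append((len(edges), node_id(line[0]), node_id(line[-1]), length))

    nodes = [(nid, x, y) for (x, y), nid in node_ids.items()]
    return nodes, edges


# Two made-up road segments that meet at the point (3, 4).
roads = [
    [(0.0, 0.0), (3.0, 4.0)],
    [(3.0, 4.0), (3.0, 6.0), (6.0, 6.0)],
]
nodes, edges = polylines_to_network(roads)
print(nodes)
print(edges)
```

Keying nodes by coordinate is what lets separate polylines connect into one graph; real road data would usually need coordinate snapping and splitting at intersections before this step.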

Journal ArticleDOI
TL;DR: The HOPV15 dataset is a collation of experimental photovoltaic data from the literature, and corresponding quantum-chemical calculations performed over a range of conformers, each with quantum chemical results using a variety of density functionals and basis sets.
Abstract: The Harvard Organic Photovoltaic Dataset (HOPV15) presented in this work is a collation of experimental photovoltaic data from the literature, and corresponding quantum-chemical calculations performed over a range of conformers, each with quantum chemical results using a variety of density functionals and basis sets. It is anticipated that this dataset will be of use in both relating electronic structure calculations to experimental observations through the generation of calibration schemes, as well as for the creation of new semi-empirical methods and the benchmarking of current and future model chemistries for organic electronic applications.

Journal ArticleDOI
TL;DR: An update of the studyforrest dataset is presented that complements the previously released functional magnetic resonance imaging data for natural language processing with a new two-hour 3 Tesla fMRI acquisition while 15 of the original participants were shown an audio-visual version of the stimulus motion picture.
Abstract: Here we present an update of the studyforrest (http://studyforrest.org) dataset that complements the previously released functional magnetic resonance imaging (fMRI) data for natural language processing with a new two-hour 3 Tesla fMRI acquisition while 15 of the original participants were shown an audio-visual version of the stimulus motion picture. We demonstrate with two validation analyses that these new data support modeling specific properties of the complex natural stimulus, as well as a substantial within-subject BOLD response congruency in brain areas related to the processing of auditory inputs, speech, and narrative when compared to the existing fMRI data for audio-only stimulation. In addition, we provide participants' eye gaze location as recorded simultaneously with fMRI, and an additional sample of 15 control participants whose eye gaze trajectories for the entire movie were recorded in a lab setting, to enable studies on attentional processes and comparative investigations on the potential impact of the stimulation setting on these processes.

Journal ArticleDOI
TL;DR: In this paper, the authors demonstrate automated generation of diffusion databases from high-throughput density functional theory (DFT) calculations, including more than 230 dilute solute diffusion systems in Mg, Al, Cu, Ni, Pd, and Pt host lattices.
Abstract: We demonstrate automated generation of diffusion databases from high-throughput density functional theory (DFT) calculations. A total of more than 230 dilute solute diffusion systems in Mg, Al, Cu, Ni, Pd, and Pt host lattices have been determined using multi-frequency diffusion models. We apply a correction method for solute diffusion in alloys using experimental and simulated values of host self-diffusivity. We find good agreement with experimental solute diffusion data, obtaining a weighted activation barrier RMS error of 0.176 eV when excluding magnetic solutes in non-magnetic alloys. The compiled database is the largest collection of consistently calculated ab-initio solute diffusion data in the world.

Journal ArticleDOI
TL;DR: The next generation metagenomic sequence data of a defined mock community (Mock Bacteria ARchaea Community; MBARC-26), composed of 23 bacterial and 3 archaeal strains with finished genomes, is reported, enabling extensive benchmarking and comparative evaluation of bioinformatics tools without the need to simulate data.
Abstract: Generating sequence data of a defined community composed of organisms with complete reference genomes is indispensable for the benchmarking of new genome sequence analysis methods, including assembly and binning tools. Moreover, the validation of new sequencing library protocols and platforms to assess critical components such as sequencing errors and biases relies on such datasets. We here report the next generation metagenomic sequence data of a defined mock community (Mock Bacteria ARchaea Community; MBARC-26), composed of 23 bacterial and 3 archaeal strains with finished genomes. These strains span 10 phyla and 14 classes and a range of GC contents, genome sizes, and repeat content, and encompass a diverse abundance profile. Short-read Illumina and long-read PacBio SMRT sequences of this mock community are described. These data represent a valuable resource for the scientific community, enabling extensive benchmarking and comparative evaluation of bioinformatics tools without the need to simulate data. As such, these data can aid in improving our current sequence data analysis toolkit and spur interest in the development of new tools.

Journal ArticleDOI
TL;DR: MycoDB as mentioned in this paper is a database of 4,010 studies from 438 unique publications to aid in multi-factor meta-analyses elucidating the ecological and evolutionary context in which mycorrhizal fungi alter plant productivity.
Abstract: Plants form belowground associations with mycorrhizal fungi in one of the most common symbioses on Earth. However, few large-scale generalizations exist for the structure and function of mycorrhizal symbioses, as the nature of this relationship varies from mutualistic to parasitic and is largely context-dependent. We announce the public release of MycoDB, a database of 4,010 studies (from 438 unique publications) to aid in multi-factor meta-analyses elucidating the ecological and evolutionary context in which mycorrhizal fungi alter plant productivity. Over 10 years with nearly 80 collaborators, we compiled data on the response of plant biomass to mycorrhizal fungal inoculation, including meta-analysis metrics and 24 additional explanatory variables that describe the biotic and abiotic context of each study. We also include phylogenetic trees for all plants and fungi in the database. To our knowledge, MycoDB is the largest ecological meta-analysis database. We aim to share these data to highlight significant gaps in mycorrhizal research and encourage synthesis to explore the ecological and evolutionary generalities that govern mycorrhizal functioning in ecosystems.

Journal ArticleDOI
TL;DR: This work presents a historical medium-resolution DEM and orthophotographs that consistently cover the entire surroundings and margins of the Greenland Ice Sheet 1978–1987 and proved successful for topographical mapping and geodetic mass balance.
Abstract: Digital Elevation Models (DEMs) play a prominent role in glaciological studies for the mass balance of glaciers and ice sheets. By providing a time snapshot of glacier geometry, DEMs are crucial for most glacier evolution modelling studies, but are also important for cryospheric modelling in general. We present a historical medium-resolution DEM and orthophotographs that consistently cover the entire surroundings and margins of the Greenland Ice Sheet 1978–1987. About 3,500 aerial photographs of Greenland are combined with field surveyed geodetic ground control to produce a 25 m gridded DEM and a 2 m black-and-white digital orthophotograph. Supporting data consist of a reliability mask and a photo footprint coverage with recording dates. Through one internal and two external validation tests, this DEM shows an accuracy better than 10 m horizontally and 6 m vertically while the precision is better than 4 m. This dataset proved successful for topographical mapping and geodetic mass balance. Other uses include control and calibration of remotely sensed data such as imagery or InSAR velocity maps.

Journal ArticleDOI
TL;DR: The dataset provides the first spatially explicit archive of the location and size of urban populations over the last 6,000 years and can contribute to an improved understanding of contemporary and historical urbanization trends.
Abstract: How were cities distributed globally in the past? How many people lived in these cities? How did cities influence their local and regional environments? In order to understand the current era of urbanization, we must understand long-term historical urbanization trends and patterns. However, to date there is no comprehensive record of spatially explicit, historic, city-level population data at the global scale. Here, we developed the first spatially explicit dataset of urban settlements from 3700 BC to AD 2000, by digitizing, transcribing, and geocoding historical, archaeological, and census-based urban population data previously published in tabular form by Chandler and Modelski. The dataset creation process also required data cleaning and harmonization procedures to make the data internally consistent. Additionally, we created a reliability ranking for each geocoded location to assess the geographic uncertainty of each data point. The dataset provides the first spatially explicit archive of the location and size of urban populations over the last 6,000 years and can contribute to an improved understanding of contemporary and historical urbanization trends.

Journal ArticleDOI
TL;DR: This Data Descriptor describes a single unified and universally accessible data file, merging across 255 separate files and stitching data across 4 surveys, encompassing 41,474 individuals and 1,191 variables.
Abstract: The National Health and Nutrition Examination Survey (NHANES) is a population survey implemented by the Centers for Disease Control and Prevention (CDC) to monitor the health of the United States; its data are publicly available in hundreds of files. This Data Descriptor describes a single unified and universally accessible data file, merging across 255 separate files and stitching data across 4 surveys, encompassing 41,474 individuals and 1,191 variables. The variables consist of phenotype and environmental exposure information on each individual, specifically (1) demographic information, (2) physical exam results (e.g., height, body mass index), (3) laboratory results (e.g., cholesterol, glucose, and environmental exposures), and (4) questionnaire items. Second, the Data Descriptor describes a dictionary to enable analysts to find variables by category and human-readable description. The datasets are available on DataDryad and a hands-on analytics tutorial is available on GitHub. Through a new big data platform, BD2K Patient Centered Information Commons (http://pic-sure.org), we provide a new way to browse the dataset via a web browser (https://nhanes.hms.harvard.edu) and provide an application programming interface for programmatic access.
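
The "merging across files and stitching across surveys" step described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the survey labels, the toy values, and the use of `SEQN` as the shared respondent key are all assumptions made for illustration only.

```python
from collections import defaultdict


def merge_files(tables, key="SEQN"):
    """Outer-join several per-file tables (lists of dict rows) on a shared ID.

    `key` is assumed to identify one respondent within a survey.
    """
    merged = defaultdict(dict)
    for rows in tables:
        for row in rows:
            merged[row[key]].update(row)
    return dict(merged)


def stitch_surveys(surveys, key="SEQN"):
    """Stack the merged records of every survey into one flat table."""
    combined = []
    for label, tables in surveys.items():
        for record in merge_files(tables, key).values():
            combined.append({"survey": label, **record})
    # Give every row the same columns; variables a survey lacks become None.
    columns = sorted({col for rec in combined for col in rec})
    return [{col: rec.get(col) for col in columns} for rec in combined]


# Hypothetical toy input: two surveys, each split across two files.
surveys = {
    "survey_A": [
        [{"SEQN": 1, "age": 34}, {"SEQN": 2, "age": 51}],
        [{"SEQN": 1, "bmi": 24.1}],
    ],
    "survey_B": [
        [{"SEQN": 3, "age": 45}],
        [{"SEQN": 3, "glucose": 5.4}],
    ],
}
unified = stitch_surveys(surveys)
```

Filling missing variables with `None` keeps one row per individual and one column per variable, which is the shape a single unified file implies; a real workflow would more likely use a dataframe library for the same join-then-stack pattern.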