Showing papers in "Scientific Data in 2014"


Journal ArticleDOI
TL;DR: This data set provides quantum chemical properties for a relevant, consistent, and comprehensive chemical space of small organic molecules that may serve the benchmarking of existing methods, development of new methods, such as hybrid quantum mechanics/machine learning, and systematic identification of structure-property relationships.
Abstract: Computational de novo design of new drugs and materials requires rigorous and unbiased exploration of chemical compound space. However, large uncharted territories persist due to its size scaling combinatorially with molecular size. We report computed geometric, energetic, electronic, and thermodynamic properties for 134k stable small organic molecules made up of CHONF. These molecules correspond to the subset of all 133,885 species with up to nine heavy atoms (CONF) out of the GDB-17 chemical universe of 166 billion organic molecules. We report geometries minimal in energy, corresponding harmonic frequencies, dipole moments, polarizabilities, along with energies, enthalpies, and free energies of atomization. All properties were calculated at the B3LYP/6-31G(2df,p) level of quantum chemistry. Furthermore, for the predominant stoichiometry, C7H10O2, there are 6,095 constitutional isomers among the 134k molecules. We report energies, enthalpies, and free energies of atomization at the more accurate G4MP2 level of theory for all of them. As such, this data set provides quantum chemical properties for a relevant, consistent, and comprehensive chemical space of small organic molecules. This database may serve the benchmarking of existing methods, development of new methods, such as hybrid quantum mechanics/machine learning, and systematic identification of structure-property relationships.

1,272 citations


Journal ArticleDOI
TL;DR: This work aims to close this gap by allowing worldwide research groups to develop and test movement recognition and force control algorithms on a benchmark scientific database, with the final goal of developing non-invasive, naturally controlled, robotic hand prostheses.
Abstract: Recent advances in rehabilitation robotics suggest that it may be possible for hand-amputated subjects to recover at least a significant part of the lost hand functionality. The control of robotic prosthetic hands using non-invasive techniques is still a challenge in real life: myoelectric prostheses give limited control capabilities, and the control is often unnatural and must be learned through long training times. Meanwhile, scientific literature results are promising but they are still far from fulfilling real-life needs. This work aims to close this gap by allowing worldwide research groups to develop and test movement recognition and force control algorithms on a benchmark scientific database. The database is targeted at studying the relationship between surface electromyography, hand kinematics and hand forces, with the final goal of developing non-invasive, naturally controlled, robotic hand prostheses. The validation section verifies that the data are similar to data acquired in real-life conditions, and that recognition of different hand tasks by applying state-of-the-art signal features and machine-learning algorithms is possible. A machine-accessible metadata file describing the reported data is provided in ISA-Tab format.

558 citations


Journal ArticleDOI
TL;DR: The Consortium for Reliability and Reproducibility (CoRR) has aggregated 1,629 typical individuals’ resting state fMRI data from 18 international sites, and is openly sharing them via the International Data-sharing Neuroimaging Initiative (INDI).
Abstract: Efforts to identify meaningful functional imaging-based biomarkers are limited by the ability to reliably characterize inter-individual differences in human brain function. Although a growing number of connectomics-based measures are reported to have moderate to high test-retest reliability, the variability in data acquisition, experimental designs, and analytic methods precludes the ability to generalize results. The Consortium for Reliability and Reproducibility (CoRR) is working to address this challenge and establish test-retest reliability as a minimum standard for methods development in functional connectomics. Specifically, CoRR has aggregated 1,629 typical individuals’ resting state fMRI (rfMRI) data (5,093 rfMRI scans) from 18 international sites, and is openly sharing them via the International Data-sharing Neuroimaging Initiative (INDI). To allow researchers to generate various estimates of reliability and reproducibility, a variety of data acquisition procedures and experimental designs are included. Similarly, to enable users to assess the impact of commonly encountered artifacts (for example, motion) on characterizations of inter-individual variation, datasets of varying quality are included.

406 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present a compendium of highly specific assays covering more than 10,000 human proteins and enabling their targeted analysis in SWATH-MS datasets acquired from research or clinical specimens.
Abstract: Mass spectrometry is the method of choice for deep and reliable exploration of the (human) proteome. Targeted mass spectrometry reliably detects and quantifies pre-determined sets of proteins in a complex biological matrix and is used in studies that rely on the quantitatively accurate and reproducible measurement of proteins across multiple samples. It requires the one-time, a priori generation of a specific measurement assay for each targeted protein. SWATH-MS is a mass spectrometric method that combines data-independent acquisition (DIA) and targeted data analysis and vastly extends the throughput of proteins that can be targeted in a sample compared to selected reaction monitoring (SRM). Here we present a compendium of highly specific assays covering more than 10,000 human proteins and enabling their targeted analysis in SWATH-MS datasets acquired from research or clinical specimens. This resource supports the confident detection and quantification of 50.9% of all human proteins annotated by UniProtKB/Swiss-Prot and is therefore expected to find wide application in basic and clinical research. Data are available via ProteomeXchange (PXD000953-954) and SWATHAtlas (SAL00016-35).

378 citations


Journal ArticleDOI
TL;DR: This dataset facilitates the linkage of genetic dependencies with specific cellular contexts (e.g., gene mutations or cell lineage); the authors also developed and provided a bioinformatics tool to identify linear and nonlinear correlations between these features.
Abstract: Using a genome-scale, lentivirally delivered shRNA library, we performed massively parallel pooled shRNA screens in 216 cancer cell lines to identify genes that are required for cell proliferation and/or viability. Cell line dependencies on 11,000 genes were interrogated by 5 shRNAs per gene. The proliferation effect of each shRNA in each cell line was assessed by transducing a population of 11M cells with one shRNA-virus per cell and determining the relative enrichment or depletion of each of the 54,000 shRNAs after 16 population doublings using Next Generation Sequencing. All the cell lines were screened using standardized conditions to best assess differential genetic dependencies across cell lines. When combined with genomic characterization of these cell lines, this dataset facilitates the linkage of genetic dependencies with specific cellular contexts (e.g., gene mutations or cell lineage). To enable such comparisons, we developed and provided a bioinformatics tool to identify linear and nonlinear correlations between these features.

372 citations


Journal ArticleDOI
TL;DR: The results indicate that GIDMaPS data sets reliably captured several major droughts from across the globe, which would be instrumental in reducing drought impacts especially in developing countries.
Abstract: Drought is by far the most costly natural disaster that can lead to widespread impacts, including water and food crises. Here we present data sets available from the Global Integrated Drought Monitoring and Prediction System (GIDMaPS), which provides drought information based on multiple drought indicators. The system provides meteorological and agricultural drought information based on multiple satellite- and model-based precipitation and soil moisture data sets. GIDMaPS includes a near real-time monitoring component and a seasonal probabilistic prediction module. The data sets include historical drought severity data from the monitoring component, and probabilistic seasonal forecasts from the prediction module. The probabilistic forecasts provide essential information for early warning, taking preventive measures, and planning mitigation strategies. GIDMaPS data sets are a significant extension to current capabilities and data sets for global drought assessment and early warning. The presented data sets would be instrumental in reducing drought impacts, especially in developing countries. Our results indicate that GIDMaPS data sets reliably captured several major droughts from across the globe.

348 citations


Journal ArticleDOI
TL;DR: The original version of this Data Descriptor contained a typographical error in the spelling of the author Terence C. Wong, which was incorrectly given as Terrence C. Wong.
Abstract: Scientific Data 1:140035 doi: 10.1038/sdata.2014.35 (2014); Published 30 September 2014; Updated 11 November 2014 The original version of this Data Descriptor contained a typographical error in the spelling of the author Terence C. Wong, which was incorrectly given as Terrence C. Wong. This has now been corrected in the PDF and HTML versions of the Data Descriptor.

244 citations


Journal ArticleDOI
TL;DR: A database developed to facilitate access to sexual system and sex chromosome information is described, with data on sexual systems from 11,038 plant, 705 fish, 173 amphibian, 593 non-avian reptilian, 195 avian, 479 mammalian, and 11,556 invertebrate species.
Abstract: The vast majority of eukaryotic organisms reproduce sexually, yet the nature of the sexual system and the mechanism of sex determination often vary remarkably, even among closely related species. Some species of animals and plants change sex across their lifespan, some contain hermaphrodites as well as males and females, some determine sex with highly differentiated chromosomes, while others determine sex according to their environment. Testing evolutionary hypotheses regarding the causes and consequences of this diversity requires interspecific data placed in a phylogenetic context. Such comparative studies have been hampered by the lack of accessible data listing sexual systems and sex determination mechanisms across the eukaryotic tree of life. Here, we describe a database developed to facilitate access to sexual system and sex chromosome information, with data on sexual systems from 11,038 plant, 705 fish, 173 amphibian, 593 non-avian reptilian, 195 avian, 479 mammalian, and 11,556 invertebrate species.

197 citations


Journal ArticleDOI
TL;DR: A high-resolution functional magnetic resonance (fMRI) dataset – 20 participants recorded at high field strength (7 Tesla) during prolonged stimulation with an auditory feature film (“Forrest Gump”) is presented.
Abstract: Here we present a high-resolution functional magnetic resonance (fMRI) dataset – 20 participants recorded at high field strength (7 Tesla) during prolonged stimulation with an auditory feature film (“Forrest Gump”). In addition, a comprehensive set of auxiliary data (T1w, T2w, DTI, susceptibility-weighted image, angiography) as well as measurements to assess technical and physiological noise components have been acquired. An initial analysis confirms that these data can be used to study common and idiosyncratic brain response patterns to complex auditory stimulation. Among the potential uses of this dataset are the study of auditory attention and cognition, language and music perception, and social perception. The auxiliary measurements enable a large variety of additional analysis strategies that relate functional response patterns to structural properties of the brain. Alongside the acquired data, we provide source code and detailed information on all employed procedures – from stimulus creation to data analysis. In order to facilitate replicative and derived works, only free and open-source software was utilized.

164 citations


Journal ArticleDOI
TL;DR: A dataset of gridded hourly estimates of typical microclimatic conditions (air temperature, wind speed, relative humidity, solar radiation, sky radiation and substrate temperatures from the surface to 1 m depth) at high resolution for the globe is presented.
Abstract: The mechanistic links between climate and the environmental sensitivities of organisms occur through the microclimatic conditions that organisms experience. Here we present a dataset of gridded hourly estimates of typical microclimatic conditions (air temperature, wind speed, relative humidity, solar radiation, sky radiation and substrate temperatures from the surface to 1 m depth) at high resolution (~15 km) for the globe. The estimates are for the middle day of each month, based on long-term average macroclimates, and include six shade levels and three generic substrates (soil, rock and sand) per pixel. These data are suitable for deriving biophysical estimates of the heat, water and activity budgets of terrestrial organisms.

159 citations


Journal ArticleDOI
TL;DR: The Reef Life Survey (RLS) reef fish dataset is described, which contains 134,759 abundance records, of 2,367 fish taxa, from 1,879 sites in coral and rocky reefs distributed worldwide, offering new opportunities to assess broad-scale spatial patterns in community structure.
Abstract: The assessment of patterns in macroecology, including those most relevant to global biodiversity conservation, has been hampered by a lack of quantitative data collected in a consistent manner over the global scale. Global analyses of species’ abundance data typically rely on records aggregated from multiple studies where different sampling methods and varying levels of taxonomic and spatial resolution have been applied. Here we describe the Reef Life Survey (RLS) reef fish dataset, which contains 134,759 abundance records, of 2,367 fish taxa, from 1,879 sites in coral and rocky reefs distributed worldwide. Data were systematically collected using standardized methods, offering new opportunities to assess broad-scale spatial patterns in community structure. The development of such a large dataset was made possible through contributions of investigators associated with science and conservation agencies worldwide, and the assistance of a team of over 100 recreational SCUBA divers, who undertook training in scientific techniques for underwater surveys and voluntarily contributed skills, expertise and their time to data collection.

Journal ArticleDOI
TL;DR: Eight high-coverage SMRT sequence datasets from five organisms that have been publicly released to the general scientific community are described and used to generate whole-genome assemblies, test new algorithms, investigate genome structure and evolution, and identify base modifications in some of the most widely-studied model systems in biological research.
Abstract: Single molecule, real-time (SMRT) sequencing from Pacific Biosciences is increasingly used in many areas of biological research including de novo genome assembly, structural-variant identification, haplotype phasing, mRNA isoform discovery, and base-modification analyses. High-quality, public datasets of SMRT sequences can spur development of analytic tools that can accommodate unique characteristics of SMRT data (long read lengths, lack of GC or amplification bias, and a random error profile leading to high consensus accuracy). In this paper, we describe eight high-coverage SMRT sequence datasets from five organisms (Escherichia coli, Saccharomyces cerevisiae, Neurospora crassa, Arabidopsis thaliana, and Drosophila melanogaster) that have been publicly released to the general scientific community (NCBI Sequence Read Archive ID SRP040522). Data were generated using two sequencing chemistries (P4C2 and P5C3) on the PacBio RS II instrument. The datasets reported here can be used without restriction by the research community to generate whole-genome assemblies, test new algorithms, investigate genome structure and evolution, and identify base modifications in some of the most widely-studied model systems in biological research.

Journal ArticleDOI
TL;DR: A unique dataset is established to assess the repeatability of brain segmentation and analysis methods and it is shown that ventricle volume in the subjects varied between days.
Abstract: Evaluation of neurodegenerative disease progression may be assisted by quantification of the volume of structures in the human brain using magnetic resonance imaging (MRI). Automated segmentation software has improved the feasibility of this approach, but often the reliability of measurements is uncertain. We have established a unique dataset to assess the repeatability of brain segmentation and analysis methods. We acquired 120 T1-weighted volumes from 3 subjects (40 volumes/subject) in 20 sessions spanning 31 days, using the protocol recommended by the Alzheimer's Disease Neuroimaging Initiative (ADNI). Each subject was scanned twice within each session, with repositioning between the two scans, allowing determination of test-retest reliability both within a single session (intra-session) and from day to day (inter-session). To demonstrate the application of the dataset, all 3D volumes were processed using FreeSurfer v5.1. The coefficient of variation of volumetric measurements was between 1.6% (caudate) and 6.1% (thalamus). Inter-session variability exceeded intra-session variability for lateral ventricle volume (P<0.0001), indicating that ventricle volume in the subjects varied between days.
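The coefficient of variation quoted above can be illustrated with a minimal Python sketch. It assumes the common definition (sample standard deviation divided by the mean, expressed as a percentage); the abstract does not state which variant the authors used or how sessions were pooled, and the function name and example volumes below are hypothetical.

```python
import statistics


def coefficient_of_variation(volumes):
    """Return the coefficient of variation of repeated measurements, in percent.

    Assumption: CV = sample standard deviation / mean * 100. The abstract does
    not say whether the authors used the sample or population form.
    """
    mean = statistics.mean(volumes)
    sd = statistics.stdev(volumes)  # sample standard deviation (n - 1)
    return 100.0 * sd / mean


# Hypothetical repeated volume estimates (arbitrary units) for one structure.
repeated_volumes = [9.0, 10.0, 11.0]
cv_percent = coefficient_of_variation(repeated_volumes)
print(f"CV = {cv_percent:.1f}%")
```

A lower value indicates that repeated scans of the same subject give more consistent volume estimates for that structure.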

Journal ArticleDOI
TL;DR: This landmark dataset of around 75,000 temperature and salinity profiles from 20–140 °E, concentrated on the sector between the Kerguelen Islands and Prydz Bay, continues to grow through the coordinated efforts of French and Australian marine research teams.
Abstract: The instrumentation of southern elephant seals with satellite-linked CTD tags has offered unique temporal and spatial coverage of the Southern Indian Ocean since 2004. This includes extensive data from the Antarctic continental slope and shelf regions during the winter months, which is outside the conventional areas of Argo autonomous floats and ship-based studies. This landmark dataset of around 75,000 temperature and salinity profiles from 20-140 °E, concentrated on the sector between the Kerguelen Islands and Prydz Bay, continues to grow through the coordinated efforts of French and Australian marine research teams. The seal data are quality controlled and calibrated using delayed-mode techniques involving comparisons with other existing profiles as well as cross-comparisons similar to established protocols within the Argo community, with a resulting accuracy of ±0.03 °C in temperature and ±0.05 in salinity or better. The data offer invaluable new insights into water masses and oceanographic processes, and provide a vital tool for oceanographers seeking to advance our understanding of this key component of the global ocean climate.

Journal ArticleDOI
TL;DR: Comparisons with the instrumental record of temperature suggest that the extended and revised database of proxy temperature records recently used to reconstruct Arctic temperatures for the past 2,000 years best records annual temperature variability across the Arctic.
Abstract: Robust climate reconstructions of the most recent centuries and millennia are invaluable for placing modern warming in the context of natural variability. Here we present an extended and revised database (version 1.1) of proxy temperature records recently used to reconstruct Arctic temperatures for the past 2,000 years. The datasets are presented in a machine-readable format, and have been extended with the geochronologic data and consistently generated time-uncertain ensembles, which will be useful in future analyses of the influence of geochronologic uncertainty. A standardized description of the seasonality of the temperature response for each record, as reported by the original authors, is also included to motivate a more nuanced approach to integrating records with variable seasonal sensitivities. Despite the predominance of seasonal, rather than annual, temperature responders in the database, comparisons with the instrumental record of temperature suggest that, as a whole, the datasets best record annual temperature variability across the Arctic, especially in northeast Canada and Greenland, where the density of records is highest.

Journal ArticleDOI
TL;DR: This dataset represents a systematic evaluation of the reproducibility of a multi-batch DIMS metabolomics study of cardiac tissue extracts, designed to test the efficacy of a batch-correction algorithm and will enable others to evaluate novel data processing algorithms.
Abstract: Direct-infusion mass spectrometry (DIMS) metabolomics is an important approach for characterising molecular responses of organisms to disease, drugs and the environment. Increasingly large-scale metabolomics studies are being conducted, necessitating improvements in both bioanalytical and computational workflows to maintain data quality. This dataset represents a systematic evaluation of the reproducibility of a multi-batch DIMS metabolomics study of cardiac tissue extracts. It comprises twenty biological samples (cow vs. sheep) that were analysed repeatedly, in 8 batches across 7 days, together with a concurrent set of quality control (QC) samples. Data are presented from each step of the workflow and are available in MetaboLights. The strength of the dataset is that intra- and inter-batch variation can be corrected using QC spectra and the quality of this correction assessed independently using the repeatedly-measured biological samples. Originally designed to test the efficacy of a batch-correction algorithm, it will enable others to evaluate novel data processing algorithms. Furthermore, this dataset serves as a benchmark for DIMS metabolomics, derived using best-practice workflows and rigorous quality assessment.

Journal ArticleDOI
TL;DR: A global dataset consisting of 100,605 total measurements of particulate organic carbon, nitrogen, or phosphorus analyzed as part of 70 cruises or time-series is presented to assist in a wide range of future studies of ocean elemental ratios.
Abstract: Knowledge of concentrations and elemental ratios of suspended particles is important for understanding many biogeochemical processes in the ocean. These include patterns of phytoplankton nutrient limitation as well as linkages between the cycles of carbon and nitrogen or phosphorus. To further enable studies of ocean biogeochemistry, we here present a global dataset consisting of 100,605 total measurements of particulate organic carbon, nitrogen, or phosphorus analyzed as part of 70 cruises or time-series. The data are globally distributed and represent all major ocean regions as well as different depths in the water column. The global median C:P, N:P, and C:N ratios are 163, 22, and 6.6, respectively, but the data also include extensive variation between samples from different regions. Thus, this compilation will hopefully assist in a wide range of future studies of ocean elemental ratios.

Journal ArticleDOI
TL;DR: This is the most comprehensive database of confirmed human dengue infection to date, consisting of 8,309 geo-positioned occurrences in total, and describes all data collection processes in full.
Abstract: A global geographic database of human dengue virus occurrence was produced to generate a global risk map and associated burden estimates (1). Herein we present the database, which comprises occurrence data linked to point or polygon locations, derived from peer-reviewed literature and case reports as well as informal online sources. Entries date from 1960 to 2012. We describe all data collection processes in full, as well as geo-positioning, database management and quality-control procedures. This is the most comprehensive database of confirmed human dengue infection to date, consisting of 8,309 geo-positioned occurrences in total.

Journal ArticleDOI
TL;DR: The marine cyanobacterium Prochlorococcus is the numerically dominant photosynthetic organism in the oligotrophic oceans, and a model system in marine microbial ecology.
Abstract: The marine cyanobacterium Prochlorococcus is the numerically dominant photosynthetic organism in the oligotrophic oceans, and a model system in marine microbial ecology. Here we report 27 new whole genome sequences (2 complete and closed; 25 of draft quality) of cultured isolates, representing five major phylogenetic clades of Prochlorococcus. The sequenced strains were isolated from diverse regions of the oceans, facilitating studies of the drivers of microbial diversity—both in the lab and in the field. To improve the utility of these genomes for comparative genomics, we also define pre-computed clusters of orthologous groups of proteins (COGs), indicating how genes are distributed among these and other publicly available Prochlorococcus genomes. These data represent a significant expansion of Prochlorococcus reference genomes that are useful for numerous applications in microbial ecology, evolution and oceanography.

Journal ArticleDOI
TL;DR: WAY-EEG-GAL is a dataset designed to allow critical tests of techniques to decode sensation, intention, and action from scalp EEG recordings in humans who perform a grasp-and-lift task.
Abstract: WAY-EEG-GAL is a dataset designed to allow critical tests of techniques to decode sensation, intention, and action from scalp EEG recordings in humans who perform a grasp-and-lift task. Twelve part ...

Journal ArticleDOI
TL;DR: The co-occurrence matrix quantifies the relatedness among medical concepts which can serve as the basis for many statistical tests, and can be used to directly compute Bayesian conditional probabilities, association rules, as well as a range of test statistics such as relative risks and odds ratios.
Abstract: Electronic health records (EHR) represent a rich and relatively untapped resource for characterizing the true nature of clinical practice and for quantifying the degree of inter-relatedness of medical entities such as drugs, diseases, procedures and devices. We provide a unique set of co-occurrence matrices, quantifying the pairwise mentions of 3 million terms mapped onto 1 million clinical concepts, calculated from the raw text of 20 million clinical notes spanning 19 years of data. Co-frequencies were computed by means of a parallelized annotation, hashing, and counting pipeline that was applied over clinical notes from Stanford Hospitals and Clinics. The co-occurrence matrix quantifies the relatedness among medical concepts which can serve as the basis for many statistical tests, and can be used to directly compute Bayesian conditional probabilities, association rules, as well as a range of test statistics such as relative risks and odds ratios. This dataset can be leveraged to quantitatively assess comorbidity, drug-drug, and drug-disease patterns for a range of clinical, epidemiological, and financial applications.
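As a rough illustration of how co-occurrence counts translate into the statistics named above, the following Python sketch derives a conditional probability, a relative risk and an odds ratio from a 2x2 contingency table. It is a sketch under stated assumptions, not the authors' pipeline: the function name, the argument names and the idea that counts are per clinical note are hypothetical, and the example numbers are invented.

```python
def cooccurrence_stats(n_ab, n_a, n_b, n_total):
    """Derive association statistics for concepts A and B from counts.

    Hypothetical inputs (assumed to be counts of notes):
      n_ab    -- notes mentioning both A and B
      n_a     -- notes mentioning A
      n_b     -- notes mentioning B
      n_total -- all notes considered
    """
    # Cells of the 2x2 contingency table.
    both = n_ab                            # A and B
    a_only = n_a - n_ab                    # A without B
    b_only = n_b - n_ab                    # B without A
    neither = n_total - n_a - n_b + n_ab   # neither A nor B

    p_b_given_a = both / n_a                    # P(B | A)
    p_b_given_not_a = b_only / (n_total - n_a)  # P(B | not A)

    return {
        "p_b_given_a": p_b_given_a,
        "relative_risk": p_b_given_a / p_b_given_not_a,
        "odds_ratio": (both * neither) / (a_only * b_only),
    }


# Invented example counts, for illustration only.
stats = cooccurrence_stats(n_ab=20, n_a=100, n_b=110, n_total=1000)
print(stats)
```

The sketch does not guard against zero cells; in practice a small continuity correction is often added before computing the ratios.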

Journal ArticleDOI
TL;DR: This paper investigates the human skeletal muscle transcriptome responses to differentiated exercise and non-exercise control intervention and provides a simple-to-use spreadsheet containing transcriptome data allowing other investigators to easily see how mRNA of their gene(s) of interest behaves in skeletal muscle following exercise.
Abstract: Few studies have investigated exercise-induced global gene expression responses in human skeletal muscle, and these have typically focused on one specific mode of exercise and not implemented non-exercise control models. However, interpreting the effects of differentiated exercise necessitates direct comparison between essentially different modes of exercise, and identifying the true exercise effect necessitates the inclusion of independent non-exercise control subjects. Furthermore, muscle transcriptome data made available through previous exercise studies can be difficult to extract and interpret for individuals who are inexperienced with bioinformatics procedures. In a comparative study, we therefore (1) investigated the human skeletal muscle transcriptome responses to differentiated exercise and non-exercise control intervention, and (2) set out to develop a straightforward search tool to allow for easy access and interpretation of our data. We provide a simple-to-use spreadsheet containing transcriptome data, allowing other investigators to easily see how mRNA of their gene(s) of interest behaves in skeletal muscle following endurance exercise, resistance exercise and non-exercise control, to better aid hypothesis-driven questions in this field of research.

Journal ArticleDOI
TL;DR: The ATAG data set includes whole-brain and reduced field-of-view MP2RAGE and T2*-weighted scans of the subcortex and brainstem with ultra-high resolution at a sub-millimeter scale that can be used to develop new algorithms that help build high-resolution atlases relevant for both the basic and clinical neurosciences.
Abstract: Structural brain data are key for the understanding of brain function and networks, i.e., connectomics. Here we present data sets available from the ‘atlasing of the basal ganglia (ATAG)’ project, which provides ultra-high resolution 7 Tesla (T) magnetic resonance imaging (MRI) scans from young, middle-aged, and elderly participants. The ATAG data set includes whole-brain and reduced field-of-view MP2RAGE and T2*-weighted scans of the subcortex and brainstem with ultra-high resolution at a sub-millimeter scale. The data can be used to develop new algorithms that help build high-resolution atlases relevant for both the basic and clinical neurosciences. Importantly, the present data repository may also be used to inform the exact positioning of electrodes used for deep-brain-stimulation in patients with Parkinson’s disease and neuropsychiatric diseases.

Journal ArticleDOI
TL;DR: Verified data for 3,627 individuals of 27 taxa in the form of a life history table containing summarized species values for variables relating to ancestry, reproduction, longevity, and body mass are presented.
Abstract: Since its establishment in 1966, the Duke Lemur Center (DLC) has accumulated detailed records for nearly 4,200 individuals from over 40 strepsirrhine primate taxa—the lemurs, lorises, and galagos. Here we present verified data for 3,627 individuals of 27 taxa in the form of a life history table containing summarized species values for variables relating to ancestry, reproduction, longevity, and body mass, as well as the two raw data files containing direct and calculated variables from which this summary table is built. Large sample sizes, longitudinal data that in many cases span an animal’s entire life, exact dates of events, and large numbers of individuals from closely related yet biologically diverse primate taxa make these datasets unique. This single source for verified raw data and systematically compiled species values, particularly in combination with the availability of associated biological samples and the current live colony for research, will support future studies from an enormous spectrum of disciplines.

Journal ArticleDOI
TL;DR: The work presented herein represents a comprehensive dataset derived from normal samples profiled in a single study, with miRNA profiles of multiple organs, as well as precise information on experimental procedures and organ-specific miRNAs identified.
Abstract: MicroRNAs (miRNAs) are small (~22 nucleotide) noncoding RNAs that play pivotal roles in regulation of gene expression. The value of miRNAs as circulating biomarkers is now broadly recognized; such tissue-specific biomarkers can be used to monitor tissue injury and several pathophysiological conditions in organs. In addition, miRNA profiles of normal organs and tissues are important for obtaining a better understanding of the source of modulated miRNAs in blood and how those modulations reflect various physiological and toxicological conditions. This work was aimed at creating an miRNA atlas in rats, as part of a collaborative effort with the Toxicogenomics Informatics Project in Japan (TGP2). We analyzed genome-wide miRNA profiles of 55 different organs and tissues obtained from normal male rats using miRNA arrays. The work presented herein represents a comprehensive dataset derived from normal samples profiled in a single study. Here we present the whole dataset with miRNA profiles of multiple organs, as well as precise information on experimental procedures and organ-specific miRNAs identified in this dataset.

Journal ArticleDOI
TL;DR: This database collates the existing knowledge of all known human outbreaks of Ebola for the first time by extracting details of their suspected zoonotic origin and subsequent human-to-human spread from a range of published and non-published sources.
Abstract: Ebola is a zoonotic filovirus that has the potential to cause outbreaks of variable magnitude in human populations. This database collates our existing knowledge of all known human outbreaks of Ebola for the first time by extracting details of their suspected zoonotic origin and subsequent human-to-human spread from a range of published and non-published sources. In total, 22 unique Ebola outbreaks were identified, composed of 117 unique geographic transmission clusters. Details of the index case and geographic spread of secondary and imported cases were recorded as well as summaries of patient numbers and case fatality rates. A brief text summary describing suspected routes and means of spread for each outbreak was also included. While we cannot yet include the ongoing Guinea and DRC outbreaks until they are over, these data and compiled maps can be used to gain an improved understanding of the initial spread of past Ebola outbreaks and help evaluate surveillance and control guidelines for limiting the spread of future epidemics.

Journal ArticleDOI
TL;DR: The present manuscript describes in detail the experimental settings, generation, processing and quality control analysis of the multi-layer omics dataset accessible in public repositories for further intra- and inter-species translation studies.
Abstract: The biological response to external cues such as drugs, chemicals, viruses and hormones is an essential question in biomedicine and in the field of toxicology, and cannot be easily studied in humans. Thus, biomedical research has continuously relied on animal models for studying the impact of these compounds and attempted to ‘translate’ the results to humans. In this context, the SBV IMPROVER (Systems Biology Verification for Industrial Methodology for PROcess VErification in Research) collaborative initiative, which uses crowd-sourcing techniques to address fundamental questions in systems biology, invited scientists to deploy their own computational methodologies to make predictions on species translatability. A multi-layer systems biology dataset was generated, comprising phosphoproteomics, transcriptomics and cytokine data derived from normal human (NHBE) and rat (NRBE) bronchial epithelial cells exposed in parallel to more than 50 different stimuli under identical conditions. The present manuscript describes in detail the experimental settings, generation, processing and quality control analysis of the multi-layer omics dataset accessible in public repositories for further intra- and inter-species translation studies.

Journal ArticleDOI
TL;DR: By comparing data from mutant versus wild-type virus and host strains, RNA versus protein differential expression, and infection with genetically similar strains, these data can be used to further investigate genetic and physiological determinants of host responses to viral infection.
Abstract: The Systems Biology for Infectious Diseases Research program was established by the U.S. National Institute of Allergy and Infectious Diseases to investigate host-pathogen interactions at a systems level. This program generated 47 transcriptomic and proteomic datasets from 30 studies that investigate in vivo and in vitro host responses to viral infections. Human pathogens in the Orthomyxoviridae and Coronaviridae families, especially pandemic H1N1 and avian H5N1 influenza A viruses and severe acute respiratory syndrome coronavirus (SARS-CoV), were investigated. Study validation was demonstrated via experimental quality control measures and meta-analysis of independent experiments performed under similar conditions. Primary assay results are archived at the GEO and PeptideAtlas public repositories, while processed statistical results together with standardized metadata are publicly available at the Influenza Research Database (www.fludb.org) and the Virus Pathogen Resource (www.viprbrc.org). By comparing data from mutant versus wild-type virus and host strains, RNA versus protein differential expression, and infection with genetically similar strains, these data can be used to further investigate genetic and physiological determinants of host responses to viral infection.

Journal ArticleDOI
TL;DR: This database represents an attempt to collate reported leishmaniasis occurrences from 1960 to 2012; it comprises a total of 12,563 spatially and temporally unique occurrences of both cutaneous and visceral leishmaniasis, ranging in geographic scale from villages to states.
Abstract: The leishmaniases are neglected tropical diseases of significant public health importance. However, information on their global occurrence is disparate and sparse. This database represents an attempt to collate reported leishmaniasis occurrences from 1960 to 2012. Methodology for the collection of data from the literature, abstraction of case locations and data processing procedures are described here. In addition, strain archives and online data resources were accessed. A total of 12,563 spatially and temporally unique occurrences of both cutaneous and visceral leishmaniasis comprise the database, ranging in geographic scale from villages to states. These data can be used for a variety of mapping and spatial analyses covering multiple resolutions.

Journal ArticleDOI
TL;DR: This dataset comprises genomic DNA sequences assembled de novo into contigs, encompassing over 10,000 annotated putative open reading frames and predicted protein products, and it is demonstrated that T. grayi is more closely related to Trypanosoma cruzi than it is to the African trypanosomes T. brucei, T. congolense and T. vivax.
Abstract: The availability of genome sequence data has greatly enhanced our understanding of the adaptations of trypanosomatid parasites to their respective host environments. However, these studies remain somewhat restricted by modest taxon sampling, generally due to a focus on the most important pathogens of humans. To address this problem, at least in part, we are releasing a draft genome sequence for the African crocodilian trypanosome, Trypanosoma grayi ANR4. This dataset comprises genomic DNA sequences assembled de novo into contigs, encompassing over 10,000 annotated putative open reading frames and predicted protein products. Using phylogenomic approaches we demonstrate that T. grayi is more closely related to Trypanosoma cruzi than it is to the African trypanosomes T. brucei, T. congolense and T. vivax, despite the fact that T. grayi and the African trypanosomes are each transmitted by tsetse flies. The data are deposited in publicly accessible repositories where we hope they will prove useful to the community in evolutionary studies of the trypanosomatids.