Abstract:
The amount of omics data in the public domain is increasing every year. Public availability of datasets is growing in all disciplines, because it is considered good scientific practice (e.g. to enable reproducibility) and/or it is mandated by funding agencies and scientific journals. Science is now a data-intensive discipline and therefore, new and innovative ways for data management, data sharing, and for discovering novel datasets are increasingly required. In 2016, we released the first version of the Omics Discovery Index (www.omicsdi.org) as a lightweight system to aggregate datasets across multiple public omics data resources. OmicsDI integrates genomics, transcriptomics, proteomics, metabolomics, and multi-omics datasets, as well as computational models of biological processes. Here, we propose a set of novel metrics to quantify the impact of biomedical datasets. A complete framework (now integrated into OmicsDI) has been implemented in order to provide and evaluate those metrics. Finally, we propose a set of recommendations for authors, journals, and data resources to promote an optimal quantification of the impact of datasets.
TL;DR: This work discusses recent approaches, existing tools, and potential caveats in the integration of omics datasets for development of standardized analytical pipelines that could be adopted by the global omics research community.
TL;DR: The ProteomeXchange (PX) consortium of proteomics resources (http://www.proteomexchange.org) was originally set up to standardize data submission and dissemination of public MS proteomics data.
TL;DR: The diverse applications of mass spectrometry-based proteomics in innate immunity to define communication patterns of the innate immune cells during health and disease are explored, and the emerging role of proteomics in immune-based drug discovery is presented.
TL;DR: A large set of open-access MD trajectories of phosphatidylcholine (PC) lipid bilayers are used to benchmark the conformational dynamics in several contemporary MD models (force fields) against nuclear magnetic resonance (NMR) data available in the literature: effective correlation times and spin-lattice relaxation rates.
TL;DR: A novel deep learning approach to feature selection that addresses both challenges simultaneously and discovers relevant features that provide superior prediction performance compared to the state-of-the-art benchmarks in practical scenarios where there is often limited labeled data and high correlations among features.
TL;DR: The Gene Expression Omnibus (GEO) project was initiated in response to the growing demand for a public repository for high-throughput gene expression data and provides a flexible and open design that facilitates submission, storage and retrieval of heterogeneous data sets from high-throughput gene expression and genomic hybridization experiments.
TL;DR: The FAIR Data Principles are a set of data reuse principles that focus on enhancing the ability of machines to automatically find and use data, in addition to supporting its reuse by individuals.
TL;DR: The Swiss-Prot, TrEMBL and PIR protein database activities have united to form the Universal Protein Knowledgebase (UniProt), whose aim is to provide a comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase, with extensive cross-references and query interfaces.
TL;DR: The Reactome Knowledgebase provides molecular details of signal transduction, transport, DNA replication, metabolism and other cellular processes as an ordered network of molecular transformations—an extended version of a classic metabolic map, in a single consistent data model.
TL;DR: The developments in PRIDE resources and related tools are summarized, and a brief update is given on the resources under development, 'PRIDE Cluster' and 'PRIDE Proteomes', which provide a complementary view and quality-scored information on the peptide and protein identification data available in PRIDE Archive.
Frequently Asked Questions (11)
Q1. What are the contributions mentioned in the paper "Quantifying the impact of public omics data" ?
Here, the authors propose a set of novel metrics to quantify the attention and impact of biomedical datasets. They also propose a set of recommendations for authors, journals and data resources to promote an optimal quantification of the impact of datasets.
Q2. What is the method to shrink the original values of a distribution to a range?
The MinMaxScaler is a robust method to shrink the original values of a distribution into a fixed range, so that each value falls between 0 and 1.
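As a minimal sketch of the idea, min-max scaling maps each value x to (x - min) / (max - min); the paper presumably relies on an existing implementation such as scikit-learn's MinMaxScaler, but the transformation itself is simple:

```python
def min_max_scale(values):
    """Linearly rescale a list of numbers into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # degenerate case: all values identical
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Example: hypothetical citation counts for four datasets
citations = [0, 4, 16, 40]
print(min_max_scale(citations))  # -> [0.0, 0.1, 0.4, 1.0]
```

Because the scaled value depends on the observed minimum and maximum, metrics from very different scales (e.g. downloads vs. reanalyses) become directly comparable after scaling.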
Q3. How many datasets are stored in OmicsDI?
At the time of writing (March 2019), OmicsDI stores just over 454,200 datasets from 16 different public data resources (https://www.omicsdi.org/database).
Q4. What is the purpose of the new OmicsDI system?
The newly implemented OmicsDI dataset claiming system enables authors, research groups, scientific consortia and research institutions to organise datasets under a unique OmicsDI profile, and for datasets to be added to their own ORCID profiles as well.
Q5. How many datasets contain connections to knowledge-based resources?
More than 53% of the datasets contain biological connections that can be traced to knowledge-based resources, such as Ensembl [15], UniProt [16] or IntAct [17].
Q6. What can be the way to assess the impact of a dataset?
The correct tracking of datasets in a database by other data resources can help to assess its impact, since it demonstrates that the data they store is actively re-used by (and thus is relevant to) the community.
Q7. What is the importance of reporting scientific impact?
Reporting scientific impact is increasingly relevant for individuals, and reporting aggregated information has become essential for research groups, scientific consortia, institutions and public data resources, among others, in order to assess the importance, excellence and relevance of their work.
Q8. What are the main reasons for the reanalysis of datasets?
The appropriate and accurate referencing of the original datasets in other resources facilitates the reproducibility and traceability of the results, as well as recognition for the authors who generated the original dataset [32].
Q9. What is the standard deviation for the citation rate for proteomics datasets?
The standard deviation indicates that in transcriptomics some datasets get significantly more attention from the community than others (STD = 16), whereas for proteomics datasets the citation rate is much more homogeneous (STD = 1.7).
Q10. What are the five metrics that can be used to estimate the impact of datasets?
The authors have formulated five metrics that can be used to estimate the impact of datasets (Fig. 5). The first is the number of reanalyses (reanalyses): a reanalysis can be generally defined as the complete or partial re-use of an original dataset (A) using a different analysis protocol, stored either in the same or in another public data resource (B) (Fig. 5).
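As a purely hypothetical illustration (the paper's exact scoring formula is not reproduced here, and the metric names below are assumed), the five raw counts for each dataset could be min-max scaled across the whole collection and then averaged into a single comparable score:

```python
# Hypothetical sketch: combine five impact metrics (names assumed for
# illustration) into one score by scaling each metric across all datasets
# to [0, 1] and averaging. This is NOT the paper's exact formula.
METRICS = ["reanalyses", "citations", "connections", "views", "downloads"]

def impact_scores(datasets):
    """datasets: list of dicts mapping metric name -> raw count."""
    scaled = []
    for name in METRICS:
        vals = [d[name] for d in datasets]
        lo, hi = min(vals), max(vals)
        span = (hi - lo) or 1  # avoid division by zero when all values match
        scaled.append([(v - lo) / span for v in vals])
    # average the five scaled metrics per dataset
    return [sum(col) / len(METRICS) for col in zip(*scaled)]

a = {"reanalyses": 2, "citations": 10, "connections": 5, "views": 100, "downloads": 50}
b = {"reanalyses": 0, "citations": 0, "connections": 0, "views": 0, "downloads": 0}
print(impact_scores([a, b]))  # -> [1.0, 0.0]
```

Scaling each metric before averaging prevents high-volume metrics such as views from dominating rarer but arguably stronger signals such as reanalyses.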
Q11. How can researchers create their own profile in OmicsDI?
Analogously to services such as Google Scholar and ResearchGate for publications, the authors have implemented a mechanism that enables researchers to create their own profile in OmicsDI by claiming their own datasets.