
Quantifying the impact of public omics data

Abstract
The amount of omics data in the public domain is increasing every year. Public availability of datasets is growing in all disciplines, because it is considered to be a good scientific practice (e.g. to enable reproducibility), and/or it is mandated by funding agencies and scientific journals. Science is now a data-intensive discipline and therefore new and innovative ways for data management, data sharing, and for discovering novel datasets are increasingly required. In 2016, we released the first version of the Omics Discovery Index (www.omicsdi.org) as a light-weight system to aggregate datasets across multiple public omics data resources. OmicsDI integrates genomics, transcriptomics, proteomics, metabolomics, and multi-omics datasets, as well as computational models of biological processes. Here, we propose a set of novel metrics to quantify the impact of biomedical datasets. A complete framework (now integrated into OmicsDI) has been implemented in order to provide and evaluate those metrics. Finally, we propose a set of recommendations for authors, journals, and data resources to promote an optimal quantification of the impact of datasets.

Yasset Perez-Riverol¹, Andrey Zorin¹, Gaurhari Dass¹, Manh-Tu Vu¹, Pan Xu², Mihai Glont¹, Juan Antonio Vizcaíno¹, Andrew F. Jarnuczak¹, Robert Petryszak¹, Peipei Ping³,⁴ & Henning Hermjakob¹,²
The amount of omics data in the public domain is increasing every year. Modern science has become a data-intensive discipline. Innovative solutions for data management, data sharing, and for discovering novel datasets are therefore increasingly required. In 2016, we released the first version of the Omics Discovery Index (OmicsDI) as a light-weight system to aggregate datasets across multiple public omics data resources. OmicsDI aggregates genomics, transcriptomics, proteomics, metabolomics and multiomics datasets, as well as computational models of biological processes. Here, we propose a set of novel metrics to quantify the attention and impact of biomedical datasets. A complete framework (now integrated into OmicsDI) has been implemented in order to provide and evaluate those metrics. Finally, we propose a set of recommendations for authors, journals and data resources to promote an optimal quantification of the impact of datasets.
https://doi.org/10.1038/s41467-019-11461-w
OPEN
¹ European Molecular Biology Laboratory, EMBL-European Bioinformatics Institute (EMBL-EBI), Cambridge CB10 1SD, UK.
² State Key Laboratory of Proteomics, Beijing Proteome Research Center, Beijing Institute of Lifeomics, National Center for Protein Sciences (The PHOENIX Center, Beijing), 102206 Beijing, China.
³ Department of Physiology, Division of Cardiology, David Geffen School of Medicine at UCLA, University of California, Los Angeles 90095 CA, USA.
⁴ Department of Medicine, Division of Cardiology, David Geffen School of Medicine at UCLA, University of California, Los Angeles 90095 CA, USA.
Correspondence and requests for materials should be addressed to Y.P.-R. (email: yperez@ebi.ac.uk)
NATURE COMMUNICATIONS | (2019) 10:3512 | https://doi.org/10.1038/s41467-019-11461-w | www.nature.com/naturecommunications 1

Public availability of datasets is growing in all disciplines, because it is considered to be a good scientific practice (e.g. to enable reproducibility) and/or it is mandated by funding agencies and scientific journals¹,². Science is now a data-intensive discipline and therefore new and innovative ways for data management, data sharing and for discovering novel datasets are increasingly required³,⁴. However, as data volumes grow, quantifying data impact becomes more and more important. In this context, the Findable, Accessible, Interoperable, Reusable (FAIR) principles have been developed to promote good scientific practices for scientific data and data resources⁵. In fact, recently, several resources¹,²,⁶ have been created to facilitate the Findability (F) and Accessibility (A) of biomedical datasets. These principles put a specific emphasis on enhancing the ability of both individuals and software to discover and re-use digital objects in an automated fashion throughout their entire life cycle⁵. While data resources typically assign an equal relevance to all datasets (e.g. as results of a query), the usage patterns of the data can vary enormously, similarly to other research products such as publications. How do we know which datasets are getting more attention? More generally, how can we quantify the scientific impact of datasets?
Recently, several authors⁷⁻⁹ and resources¹⁰ pointed out the importance of evaluating the impact of each research product, including datasets. Reporting scientific impact is increasingly relevant for individuals, and reporting aggregated information has also become essential for research groups, scientific consortia, institutions and public data resources, among others, in order to assess the level of importance, excellence and relevance of their work. This is a key piece of information for funding agencies, which use it routinely to prioritise the projects and resources they fund. However, most current efforts focus on the evaluation and quantification of the impact of publications as the main artefact. For instance, in 2013, the altmetrics team proposed a set of alternative metrics to trace research products, with a special focus on publications¹⁰. Specific tools and services have since been built to aggregate altmetrics, including for instance counts of mentions of a given publication in blog posts, tweets and articles in mainstream media. The altmetrics attention score is widely used by the research community nowadays (e.g. by multiple scientific journals) as a measure of the scientific influence of manuscripts. However, adequate tracking and recognition of datasets has been limited so far for multiple reasons: (i) the relatively low number of publications citing datasets instead of their corresponding publications; (ii) the lack of services that store and index datasets from heterogeneous origins; and (iii) the absence of widely used metrics that enable the quantification of their impact. Some attempts have been made to improve the situation by introducing digital object identifiers (DOIs) directly associated with datasets¹¹.
In 2016, we released the first version of the Omics Discovery Index (OmicsDI, https://www.omicsdi.org) as a light-weight system to aggregate datasets across multiple public omics data resources. OmicsDI aggregates genomics, transcriptomics, proteomics, metabolomics and multiomics datasets, as well as computational models of biological processes¹. The OmicsDI web interface and Application Programming Interface (API) provide different views and search capabilities on the indexed datasets. Datasets can be searched and filtered based on different types of technical and biological annotations (e.g. species, tissues, diseases), year of publication and the original data repository where they are stored, among others. At the time of writing (March 2019), OmicsDI stores just over 454,200 datasets from 16 different public data resources (https://www.omicsdi.org/database). The split per omics technology is as follows: transcriptomics (125,891 datasets), genomics (309,961), proteomics (12,362), metabolomics (2411), multiomics (6578) and biological models (8651). Here, we propose a set of novel metrics to quantify the impact of biomedical datasets. A complete framework (now integrated in OmicsDI) has been implemented in order to provide and evaluate those metrics. Finally, we propose a set of recommendations for authors, journals and data resources to promote an optimal quantification of the impact of datasets.
Results
Omics data reanalysis and citations. By March 2019, the number of datasets with at least one reanalysis, one citation, one download, one view and that contained connections in knowledgebases was 12,162, 58,054, 66,418, 163,431 and 469,015, respectively (Table 1). The reanalysis metric quantifies how many times one dataset has been re-used (re-analysed) and the result deposited in the same or in another resource. We classify reanalyses into two categories: (i) reanalyses performed by independent groups (Independent Lab Reanalyses) and (ii) reanalyses performed systematically by resources such as PeptideAtlas or Expression Atlas (Resource Reanalyses). On average, each reanalysed dataset is reanalysed 2.3 times. However, each omics type has a different pattern: proteomics (5.90), transcriptomics (1.31), multiomics (2.07), genomics (1.26) and models (30.08).
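The per-type averages quoted here can be recovered from the reanalysis columns of Table 1 by dividing the number of reanalyses by the number of reanalysed datasets. The following is a minimal sketch of that arithmetic; the values are copied from Table 1, while the variable and function names are chosen for illustration and are not part of OmicsDI.

```python
# Reanalysis counts copied from Table 1:
# (number of reanalyses, number of reanalysed datasets).
# Metabolomics is omitted because Table 1 reports no reanalyses for it.
REANALYSES = {
    "genomics": (1103, 872),
    "models": (7190, 239),
    "multiomics": (5013, 2422),
    "proteomics": (3344, 567),
    "transcriptomics": (10527, 8062),
}


def mean_reanalyses(counts):
    """Average number of reanalyses per reanalysed dataset, per omics type."""
    return {omics: total / datasets for omics, (total, datasets) in counts.items()}


per_type = mean_reanalyses(REANALYSES)
total_reanalyses = sum(total for total, _ in REANALYSES.values())
total_datasets = sum(datasets for _, datasets in REANALYSES.values())

for omics, mean in sorted(per_type.items()):
    print(f"{omics}: {mean:.2f}")
print(
    f"all types: {total_reanalyses} reanalyses over {total_datasets} datasets "
    f"= {total_reanalyses / total_datasets:.2f}"
)
```

The per-type ratios round to the values given in the text, and the dataset total matches the 12,162 reanalysed datasets reported above. The pooled ratio from these columns is about 2.2, close to the overall average quoted in the text.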
Frequently, dataset re-use is a hierarchical process, where one dataset is subsequently reanalysed multiple times. Figure 1a presents a reanalysis network for the model BIOMD0000000055, from 2006 (release year) to 2015. A different pattern is illustrated in Fig. 1b, where BIOMD0000000286 is derived from multiple source models. For each deposited model, BioModels curates and annotates the corresponding model from which it is derived (if applicable). Figure 1c shows the reanalysis network of the PRIDE dataset PXD000561 (https://www.omicsdi.org/dataset/pride/PXD000561) (75), one of the drafts of the human proteome. This dataset and PXD000865 have supported the annotation of millions of peptide and protein evidences, enabling the large-scale annotation of the human proteome¹², and have been reanalysed by multiple databases including the proteomics resources PeptideAtlas and GPMDB¹³.
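A reanalysis network of this kind can be represented as a directed graph in which an edge points from a source dataset to each dataset derived from it. The sketch below is illustrative only: the accessions are placeholders rather than real OmicsDI records, and the function names are hypothetical. It separates direct reanalyses (the one-to-many pattern) from reanalyses reached through longer chains (the hierarchical pattern).

```python
from collections import deque

# Hypothetical reanalysis edges: source dataset -> datasets derived from it.
# The accessions are placeholders, not real OmicsDI records.
REANALYSED_BY = {
    "DS-A": ["DS-B", "DS-C"],
    "DS-B": ["DS-D"],
    "DS-C": ["DS-D", "DS-E"],
}


def direct_reanalyses(graph, dataset):
    """Datasets that reanalyse `dataset` directly."""
    return set(graph.get(dataset, []))


def all_reanalyses(graph, dataset):
    """Datasets reachable from `dataset` through any chain of reanalyses."""
    seen = set()
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for child in graph.get(current, []):
            if child not in seen:  # also guards against revisiting shared descendants
                seen.add(child)
                queue.append(child)
    return seen


print(sorted(direct_reanalyses(REANALYSED_BY, "DS-A")))
print(sorted(all_reanalyses(REANALYSED_BY, "DS-A")))
```

A dataset derived from several sources, as described for Fig. 1b, simply appears as the child of more than one node; here the placeholder DS-D plays that role.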
Interestingly, the distribution of the elapsed time between the year of publication of the original datasets and publication of the reanalyses shows that most of the datasets are reanalysed within the first 5 years after publication (Fig. 2a). Even 10 years after publication, datasets are still often reused in public databases such as Expression Atlas. The proteomics community (PRIDE datasets), in contrast to transcriptomics, tends to reanalyse the data within 3 years of its publication. Typically, the number of reanalyses in OmicsDI grows within the first 5 years, making this a metric better suited to measuring immediate impact.
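The elapsed-time distribution described here amounts to a histogram of the difference between two publication years. A minimal sketch under stated assumptions follows: the year pairs are invented for illustration and the function names are hypothetical, not taken from the OmicsDI code base.

```python
from collections import Counter

# Hypothetical (original publication year, reanalysis publication year) pairs.
PAIRS = [(2010, 2011), (2010, 2013), (2012, 2012), (2008, 2019), (2014, 2016), (2015, 2016)]


def elapsed_year_histogram(pairs):
    """Count reanalyses by the number of years since the original dataset was published."""
    return Counter(reanalysis - original for original, reanalysis in pairs)


def fraction_within(histogram, years):
    """Fraction of reanalyses published within `years` years of the original dataset."""
    total = sum(histogram.values())
    within = sum(count for gap, count in histogram.items() if gap <= years)
    return within / total if total else 0.0


hist = elapsed_year_histogram(PAIRS)
print(sorted(hist.items()))
print(f"within 5 years: {fraction_within(hist, 5):.2f}")
```

Computing the same histogram separately per archive would give the per-database panels of the kind shown in Fig. 2a.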
The second metric is the number of direct citations in publications for each dataset, as previously suggested¹⁴. The number of datasets with at least one citation in EuropePMC is 58,054 (Table 1). Figure 2b shows the distribution of dataset direct citations by omics type. Transcriptomics datasets are the most cited ones, followed by genomics and multiomics datasets. Interestingly, the standard deviation indicates that in transcriptomics some datasets get significantly more attention from the community than others (STD = 16), whereas for proteomics datasets the citation rate is much more homogeneous (STD = 1.7). The current workflow searches EuropePMC using all the identifiers associated with a given dataset (e.g. a given dataset can be cited in a publication using the ArrayExpress, GEO or BioProject identifiers). For example, the dataset E-GEOD-2034 (https://www.omicsdi.org/dataset/arrayexpress-repository/E-GEOD-2034) is cited 312 and 28 times, using the ArrayExpress (E-GEOD-2034) and GEO (GSE2034) identifiers, respectively.
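Because one dataset can be cited under several identifiers, per-identifier counts and the number of distinct citing publications are different quantities. The sketch below illustrates the distinction with invented data; the identifiers and publication IDs are placeholders, and whether the OmicsDI workflow de-duplicates publications across identifiers is not stated in the text, so that step is an assumption of this example.

```python
# Hypothetical publication IDs found when searching the literature for each
# identifier of a single dataset; in practice these would come from a
# literature search service. All values are placeholders.
HITS_BY_IDENTIFIER = {
    "E-GEOD-0000": {"PMC1", "PMC2", "PMC3"},
    "GSE0000": {"PMC3", "PMC4"},
}


def citation_summary(hits_by_identifier):
    """Return per-identifier counts and the number of distinct citing publications."""
    per_identifier = {ident: len(pubs) for ident, pubs in hits_by_identifier.items()}
    # Assumption: a publication citing the dataset under two identifiers counts once.
    distinct = set().union(*hits_by_identifier.values())
    return per_identifier, len(distinct)


per_identifier, distinct_total = citation_summary(HITS_BY_IDENTIFIER)
print(per_identifier)
print(distinct_total)
```

In this toy case the per-identifier counts sum to five, while only four distinct publications cite the dataset.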

Biological entity connections. We analysed the number of biological entities reported in each omics dataset (e.g. UniProt proteins) that are stored in other knowledgebases (e.g. UniProt) (Table 1). More than 53% of the datasets contain biological connections that can be traced to knowledge-based resources, such as Ensembl¹⁵, UniProt¹⁶ or IntAct¹⁷. The number of connections can differ significantly across omics types. For example, dataset E-MTAB-599 (RNA-seq of mouse DBA/2JxC57BL/6J heart, hippocampus, liver, lung, spleen and thymus), associated with this publication¹⁸, has 1,710,979 connections, including 1,689,177 genome variants, 21,572 gene values and 230 other connections, ranging from sample annotations to nucleotide sequences. The second most connected dataset in the MetaboLights database (MTBLS392, https://www.omicsdi.org/dataset/metabolights_dataset/MTBLS392), associated with this publication¹⁹, only contains 345 metabolites reported in the ChEBI database²⁰. To overcome these differences, we have implemented a normalisation method that creates a connectivity score, which boosts datasets according to how much they contribute to a specific knowledgebase and also boosts datasets that are included in more knowledgebases (Supplementary Note 1).
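The scale difference that motivates the normalisation can be checked directly from the counts quoted above. This is a small arithmetic sketch only: the numbers are taken from the text, the variable names are illustrative, and it does not implement the connectivity score itself, whose definition is given in Supplementary Note 1 and is not reproduced here.

```python
# Connection counts for E-MTAB-599 as quoted in the text, grouped by entity type.
E_MTAB_599 = {"genome variants": 1_689_177, "gene values": 21_572, "other": 230}
# Metabolites in ChEBI reported for MTBLS392, as quoted in the text.
MTBLS392_CONNECTIONS = 345

total = sum(E_MTAB_599.values())
ratio = total / MTBLS392_CONNECTIONS

print(total)
print(f"E-MTAB-599 has about {ratio:.0f} times as many connections as MTBLS392")
```

The three categories add up to the 1,710,979 connections reported for E-MTAB-599, which is several thousand times the count for MTBLS392, so raw counts are not comparable across omics types.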
We have studied the correlation between all the metrics for the different omics types (Fig. 3). The number of reanalyses and citations are highly correlated for proteomics datasets (R = 0.7) but are not correlated for other omics fields, such as transcriptomics, genomics and multiomics (R = 0.018, 0.02 and 0.12, respectively). The highest global correlation (R = 0.5) is observed for the combination of number of connections and downloads. Generally, the five metrics are not correlated for any of the omics fields (Fig. 3) and can be seen as orthogonal variables that together give a broader representation of the impact of omics datasets.
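As an illustration of how a pairwise correlation between two metrics could be computed, the sketch below implements the Pearson coefficient in plain Python. It assumes that R in the text denotes a Pearson correlation, which the text does not state explicitly, and the per-dataset values are invented.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one metric is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-dataset metric values for one omics type.
reanalyses = [0, 1, 2, 5, 9]
citations = [1, 3, 2, 8, 12]
print(f"R = {pearson_r(reanalyses, citations):.2f}")
```

Running the function on every pair of the five metrics, separately for each omics type, would give a correlation matrix of the kind summarised in Fig. 3.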
Discussion
One of the obstacles to achieving a systematic deposition of datasets in public repositories is the lack of a broad scientific reward system that considers other research products in addition to scientific publications⁷. Different studies have demonstrated the need for metrics and frameworks to quantify the impact of datasets deposited in the public domain. Such a system would not only encourage authors to make their data public, but would also help funding agencies, biological resources and the scientific community as a whole to focus on the most impactful datasets. In OmicsDI we have implemented a novel platform to quantify the impact of public datasets systematically, by using data from biological data resources (reanalyses), literature (citations), knowledge bases (connections), views and downloads. Every metric is updated on a weekly basis and made available through the OmicsDI web interface and API.
One of the primary findings is that in systems biology (for which the BioModels database²¹ is the representative resource), the deposition of data has enabled the systematic generation of new knowledge (biological models) based on previous datasets. For example, the model 'Genome-scale metabolic modelling of hepatocytes reveals serine deficiency in patients with non-alcoholic fatty liver disease' (MODEL1402200003)²² has been used to build more than 6000 models available in BioModels. We noticed different complex graph patterns of reanalysis in the BioModels database. For example, Fig. 1a shows the reanalysis network of model BIOMD0000000055, where the original model published in 2006 has been reused to build new models until 2015. Models in BioModels can be built from multiple models and can in turn originate new models (Fig. 1b). During the submission process, the BioModels database provides a mechanism to annotate whether a model reuses parts of previously published models, enabling OmicsDI to build and trace reanalysis patterns. In contrast to biological models, the proteomics (Fig. 1c) and transcriptomics fields are still working to define a proper mechanism to report the multiple reanalyses of datasets in a hierarchical manner¹¹. For this reason, the reanalysis patterns detected in proteomics are one-to-many networks, where one dataset has been reanalysed by multiple datasets (e.g. PXD000561).

Table 1 The number of citations, reanalyses, downloads, views and connections (April 2019)
Omics type | Number of citations | Number of cited datasets | Number of reanalyses | Number of reanalysed datasets | Number of downloads | Number of downloaded datasets | Number of views | Number of viewed datasets | Number of connections | Number of datasets with connections
Genomics | 8152 | 3389 | 1103 | 872 | 1,210,799 | 54,336 | 1,233,388 | 13,441 | 1,041,407,105 | 313,549
Metabolomics | 827 | 117 | – | – | 49,907 | 321 | 253,428 | 2726 | 340,483 | 1340
Models | 3 | 3 | 7190 | 239 | – | – | 435,859 | 7262 | 12,880,012 | 7200
Multiomics | 9111 | 2053 | 5013 | 2422 | 179,669 | 2694 | 860,092 | 7848 | 16,453,633 | 7849
Proteomics | 4624 | 1793 | 3344 | 567 | 153,548 | 5392 | 1,417,107 | 13,015 | 51,857,985 | 20,577
Transcriptomics | 665,022 | 50,699 | 10,527 | 8062 | 208,383 | 3675 | 14,793,937 | 119,139 | 27,696,366 | 118,500

Moreover, the results showed that the reanalysis metric is crucial to highlight relevant datasets early after the dataset release (Fig. 2c). Overall, 8000 datasets (>5% of OmicsDI content) have been reanalysed by resources such as PeptideAtlas, GPMDB or Expression Atlas, among others. However, it should be noted that the reanalysis metric measures only the impact of datasets in the same or in other data resources contributing their metadata to OmicsDI, which constitutes a fraction of the total re-use by the scientific community.
To complement the reanalysis metric, we counted direct citations of datasets in scientific publications. Different studies have estimated that the proportion of the total citation count contributed by data depositions is around 6–20%¹⁰,¹⁴. Most of the reanalyses tracked in OmicsDI have been performed using GEO datasets, which might have biased the results towards a specific resource. However, our findings show the same patterns in the literature: almost 9000 datasets have been cited in publications at least once. It is important to highlight that counting direct database citations in the whole text of manuscripts is only possible for open access publications. Where the corresponding publications are not open access, dataset identifiers would need to be included in the PubMed abstract to be included in this metric. The coverage of direct citations in publications is therefore limited by this systemic issue. We have found that the transcriptomics community (individual researchers) tends to cite the same datasets more often, with an average of four citations per dataset. The most cited dataset is 'Transcription profiling of human breast cancer samples – relapse free survival' (E-GEOD-2034), totalling 312 citations. Both metrics, reanalyses and citations, should be used in combination for a better understanding of the dataset impact. Our results show that both metrics are uncorrelated and should not be aggregated. For example, dataset E-MTAB-513 is among the 10 most cited datasets in the literature: it has been cited 155 times and reanalysed 4 times. In addition to the normalised values, we have decided to provide the raw metrics to the community, which will enable them to be combined into more complex models²³. However, we have shown that these metrics can be used independently to generate models for clustering and classification (Supplementary Note 2).

[Figure 1: panels a–c, reanalysis network graphs whose node labels are dataset accessions (BIOMD…, PXD000561, PAe…); panel a is titled "BIOMD0000000055 Complex network" and carries a time axis marked 2006, 2010 and 2015]
Fig. 1 Examples of the reanalysis network for different OmicsDI datasets: a BioModels model BIOMD0000000055. BioModels are reused over time (e.g. 2006–2015) to build new models; in the BioModels database each new model contains references to the original source model of information. b Twelve different BioModels models are connected through a reanalysis network. The BioModels database traces the origin of each model and the relations between them, making it possible to trace complex reanalysis relations in which models can originate from multiple models and be used by other models. c Proteomics reanalysis network for the draft of the human proteome project (PRIDE accession PXD000561). In proteomics, the predominant reanalysis pattern is one to many, where original deposited submissions are reanalysed in multiple datasets by multiple authors

In 2011, Mons et al. introduced the idea of nano-publications, through which authors could get credit not only for the actual publication but also for all the knowledge associated with it⁷. In our view, the value of a dataset should not only be associated with the raw data or the claims in the publication, but should also be assessed considering all the biological entities supported in knowledgebases. We have developed the connections metric, which can be used to estimate the impact of a dataset for knowledgebases by counting how many biological entities are supported by it.
Importantly, OmicsDI is monitoring not only the web interface views but also the interaction through the OmicsDI API. On average, every dataset in OmicsDI has been accessed at least 30 times since 2016 (Table 1). By March 2019, we had captured the number of direct downloads for six different databases at the European Bioinformatics Institute. These two metrics (views and downloads) are not publicly available in any of these resources and at present are infeasible to retrieve. In fact, the first coordinated efforts to gather them in a standard manner are currently taking place in the context of the ELIXIR framework for European biological data resources²⁴. With this first implementation, we are encouraging resources to systematically release this information to the public domain.
The newly implemented OmicsDI dataset claiming system enables authors, research groups, scientific consortia and research institutions to organise datasets under a unique OmicsDI profile, and for datasets to be added to their own ORCID profiles as well. At the time of writing (March 2019), 968 datasets have been claimed into ORCID profiles through OmicsDI. In our view, following the same system for monitoring the impact of individual datasets, these metrics could also be used to measure at least some aspects of the impact of public omics data resources²⁵,²⁶. A common problem of impact evaluation is to compare different fields or topics with the same metrics. Figure 4 shows the average distribution of metrics (raw and normalised) for each omics type.
[Figure 2: panel a, histograms of the number of datasets against years elapsed (0–15) for the databases ArrayExpress, GEO and PRIDE; panel b, histograms of the number of datasets against number of citations (0–20) for the omics types genomics, metabolomics, multiomics, proteomics and transcriptomics]
Fig. 2 a Elapsed time between the original publication of a dataset and the publication of all its reanalyses for three omics data archives (PRIDE – Proteomics, GEO – Transcriptomics, ArrayExpress – Transcriptomics). Transcriptomics datasets tend to be reanalysed over time until datasets are 12 years old, while proteomics datasets (PRIDE) are less reused after 3 years from their publication. b Distribution of the number of citations per dataset grouped by OmicsDI omics type. Transcriptomics datasets are highly cited with more than 30,000 datasets with 11 citations; while in genomics, proteomics and metabolomics most datasets are only cited once

References
Journal ArticleDOI

Gene Expression Omnibus: NCBI gene expression and hybridization array data repository

TL;DR: The Gene Expression Omnibus (GEO) project was initiated in response to the growing demand for a public repository for high-throughput gene expression data and provides a flexible and open design that facilitates submission, storage and retrieval of heterogeneous data sets from high-power gene expression and genomic hybridization experiments.
Journal ArticleDOI

The FAIR Guiding Principles for scientific data management and stewardship

TL;DR: The FAIR Data Principles as mentioned in this paper are a set of data reuse principles that focus on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals.
Journal ArticleDOI

UniProt: the Universal Protein knowledgebase

TL;DR: The Swiss-Prot, TrEMBL and PIR protein database activities have united to form the Universal Protein Knowledgebase (UniProt), which is to provide a comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase, with extensive cross-references and query interfaces.
Journal ArticleDOI

2016 update of the PRIDE database and its related tools

TL;DR: The developments in PRIDE resources and related tools are summarized and a brief update on the resources under development 'PRIDE Cluster' and 'PRide Proteomes', which provide a complementary view and quality-scored information of the peptide and protein identification data available inPRIDE Archive are given.
Frequently Asked Questions (11)
Q1. What are the contributions mentioned in the paper "Quantifying the impact of public omics data" ?

Here, the authors propose a set of novel metrics to quantify the attention and impact of biomedical datasets. Finally, the authors propose a set of recommendations for authors, journals and data resources to promote an optimal quantification of the impact of datasets. 

The MinMaxScaler is a robust method to shrink original values of a distribution to a range such that it becomes a value between 0 and 1. 
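To make the rescaling concrete, the sketch below applies min-max scaling, x' = (x − min) / (max − min), to a list of raw metric values in plain Python. The raw values and the function name are illustrative assumptions; scikit-learn's MinMaxScaler performs the equivalent transformation column-wise with its default range of 0 to 1, and how OmicsDI applies it to each metric is not described in this excerpt.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map them to 0.0 by convention (an assumption of this sketch).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw citation counts for five datasets.
raw = [0, 4, 12, 155, 312]
scaled = min_max_scale(raw)
print([round(s, 3) for s in scaled])
```

Because the scaling is anchored on the observed minimum and maximum, a single extreme value compresses all other datasets towards 0, which is worth keeping in mind when comparing normalised metrics across omics types.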

At the time of writing (March 2019), OmicsDI stores just over 454,200 datasets from 16 different public data resources (https://www.omicsdi.org/database). 

The newly implemented OmicsDI dataset claiming system enables authors, research groups, scientific consortia and research institutions to organise datasets under a unique OmicsDI profile, and for datasets to be added to their own ORCID profiles as well. 

More than 53% of the datasets contain biological connections that can be traced to knowledge-based resources, such as Ensembl15, UniProt16 or IntAct17. 

The correct tracking of datasets in a database by other data resources can help to assess its impact, since it demonstrates that the data they store is actively re-used by (and thus is relevant to) the community. 

Reporting scientific impact is indeed increasingly relevant for individuals, but also reporting aggregated information has become essential for research groups, scientific consortia, institutions or for public data resources among others, in order to assess the level of importance, excellence and relevance of their work. 

The appropriate and accurate reference to the original datasets in other resources facilitates the reproducibility and traceability of the results and the recognition for the authors that generated the original dataset32. 

the standard deviation indicates that in transcriptomics some datasets get significantly more attention from the community than others (STD= 16), whereas for proteomics datasets the citation rate is much more homogenous (STD= 1.7). 

The authors have formulated five metrics that can be used to estimate the impact of datasets (Fig. 5):1. Number of reanalyses (reanalyses): A reanalysis can be generally defined as the complete or partial re-use of an original dataset (A) using a different analysis protocol and stored either in the same or in another public data resource (B) (Fig. 5). 

Analogously to services such as Google Scholar and ResearchGate for publications, the authors have implemented a mechanism that enables researchers to create their own profile in OmicsDI, by claiming their own datasets.