D362–D368 Nucleic Acids Research, 2017, Vol. 45, Database issue Published online 18 October 2016
doi: 10.1093/nar/gkw937
The STRING database in 2017: quality-controlled protein–protein association networks, made broadly accessible
Damian Szklarczyk¹, John H Morris², Helen Cook³, Michael Kuhn⁴, Stefan Wyder¹, Milan Simonovic¹, Alberto Santos³, Nadezhda T Doncheva³, Alexander Roth¹, Peer Bork⁴,⁵,⁶,⁷,*, Lars J. Jensen³,* and Christian von Mering¹,*
¹Institute of Molecular Life Sciences and Swiss Institute of Bioinformatics, University of Zurich, 8057 Zurich, Switzerland
²Resource on Biocomputing, Visualization, and Informatics, University of California, San Francisco, CA 94158-2517, USA
³Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, 2200 Copenhagen N, Denmark
⁴Structural and Computational Biology Unit, European Molecular Biology Laboratory, 69117 Heidelberg, Germany
⁵Molecular Medicine Partnership Unit, University of Heidelberg and European Molecular Biology Laboratory, 69117 Heidelberg, Germany
⁶Max Delbrück Centre for Molecular Medicine, 13125 Berlin, Germany
⁷Department of Bioinformatics, Biocenter, University of Würzburg, 97074 Würzburg, Germany
Received September 15, 2016; Editorial Decision October 04, 2016; Accepted October 06, 2016
ABSTRACT
A system-wide understanding of cellular function requires knowledge of all functional interactions between the expressed proteins. The STRING database aims to collect and integrate this information, by consolidating known and predicted protein–protein association data for a large number of organisms. The associations in STRING include direct (physical) interactions, as well as indirect (functional) interactions, as long as both are specific and biologically meaningful. Apart from collecting and reassessing available experimental data on protein–protein interactions, and importing known pathways and protein complexes from curated databases, interaction predictions are derived from the following sources: (i) systematic co-expression analysis, (ii) detection of shared selective signals across genomes, (iii) automated text-mining of the scientific literature and (iv) computational transfer of interaction knowledge between organisms based on gene orthology. In the latest version 10.5 of STRING, the biggest changes are concerned with data dissemination: the web frontend has been completely redesigned to reduce dependency on outdated browser technologies, and the database can now also be queried from inside the popular Cytoscape software framework. Further improvements include automated background analysis of user inputs for functional enrichments, and streamlined download options. The STRING resource is available online, at http://string-db.org/.
INTRODUCTION
The flow of information and energy through the cell proceeds along specific and evolved interfaces: across and between nucleotides, proteins, lipids, metabolites and other small molecules. Among these interfaces, those between proteins are arguably among the most important, being biochemically diverse and information-rich, and showing exquisite specificity (1–3). Apart from direct physical binding, proteins also have many other, indirect ways of cooperation and mutual regulation: they can influence each other’s production and half-life transcriptionally and post-transcriptionally, exchange reaction products, participate in signal relay mechanisms, or jointly contribute toward specific organismal functions. Together, these direct and indirect interactions constitute ‘functional association’, a useful operational umbrella term for specific and functionally productive interactions of any type (4–9).
Assembling all known and predicted protein functional associations for a given organism results in a protein network of genome-wide functional connectivity. These networks represent a crucial, intermediate level of information aggregation: they are placed between pathway databases at one extreme (which provide mechanistic detail but often have low coverage), and high-throughput experimental interaction discovery and ad hoc predictions at the other extreme (which have high coverage but usually also high levels of false positives). As such, protein networks are ideally suited to serve as scaffolds or filters for further data integration, for visualization and for molecular discovery. They are essential for modern life sciences: protein networks are used to increase discovery power for noisy data sets by ‘network smoothing’ (10,11), help define drug efficiency by network-based ‘drug-disease proximity measures’ (12), help to interpret the results of genome-wide association screens (13–17) and enable the discovery of new molecular players through the ‘guilt by association’ concept (18,19).
*To whom correspondence should be addressed. Tel: +41 44 6353147; Fax: +41 44 6356864; Email: mering@imls.uzh.ch
Correspondence may also be addressed to Peer Bork. Tel: +49 6221 3878526; Fax: +49 6221 387517; Email: bork@embl.de
Correspondence may also be addressed to Lars J. Jensen. Tel: +45 353 25025; Fax: +45 353 25001; Email: lars.juhl.jensen@cpr.ku.dk
© The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
Downloaded from https://academic.oup.com/nar/article/45/D1/D362/2290901 by guest on 21 August 2022
A number of databases and online resources are dedicated to protein networks, at various levels of abstraction and each with a somewhat different focus/scope. First, individual well-supported protein–protein interactions are curated manually from the published literature, through dedicated efforts by members of the IMEx consortium (20,21), but also as part of more general annotation workflows such as within the UniProt consortium (22). Second, a number of databases assemble larger, genome-wide protein networks that are nevertheless still restricted to experimentally observed interactions only; examples include BioGRID (23), HINT (24), iRefWeb (25) and APID (26). Lastly, resources such as STRING include indirect and predicted interactions on top, aiming for inclusiveness in scope and for maximal coverage. Apart from STRING, this latter group includes GeneMANIA (27), Integrated Multi-species Prediction (28), Integrated Interactions Database (29), HumanNet (17), FunCoup (30) and others. For this group of data resources, it is particularly important to provide interaction weights (such as quality scores or confidence estimates), to allow the users to prune down these inclusive networks, as needed.
Within the spectrum of the above resources, STRING aims to set itself apart in three ways: (i) comprehensiveness, as it covers the largest number of organisms and uses the widest breadth of input sources, including automated text-mining and computational predictions; (ii) usability, in terms of an intuitive web interface, Cytoscape integration and programmatic access options; and (iii) quality control and traceability, as each interaction is annotated with benchmarked confidence scores, separately per evidence type, and the underlying evidence can be tracked to its source. STRING has been maintained continuously since the year 2000, and has already been described in several publications (31–34). Below, we provide a brief overview of the main features, and describe recent technical developments.
DATABASE CONTENT
For each protein–protein association stored in STRING, a score is provided. These scores (i.e., the ‘edge weights’ in each network) represent confidence scores, and are scaled between zero and one. They indicate the estimated likelihood that a given interaction is biologically meaningful, specific and reproducible, given the supporting evidence. For each interaction, the supporting evidence is divided into one or more ‘evidence channels’, depending on the origin and type of the evidence. There are seven channels, and they are assembled, scored and benchmarked separately. In the network visualization on the web frontend, the evidence channels are usually delineated by edges of different color, and each of the channels can be disabled individually by the user, in case some types of evidence might not be considered suitable for a particular question that is being studied. Based on the seven channels, a combined and final confidence score is computed for each interaction, and it is this ‘combined score’ that is typically used as the final measure when building networks or when sorting and filtering interactions. For a given interaction, it is generally a good sign of support when not only the combined score is high, but when there is also more than one evidence channel contributing to the score. Furthermore, it is important to note that the interactions in STRING have gene-locus resolution only: we do not discriminate between different splice isoforms or post-translationally modified forms. Hence, the interacting units in STRING are actually the protein-coding gene loci (represented by their main, canonical protein isoform).
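The text above does not spell out how the per-channel scores are merged into the combined score. As a minimal sketch only, the following assumes that the channels are independent and that their scores can be combined as probabilities (a ‘noisy-OR’); the function names are hypothetical, and the actual STRING procedure may differ, for example by correcting for a prior probability of interaction.

```python
def combine_scores(channel_scores):
    """Noisy-OR combination: 1 - product(1 - s_i).

    ASSUMPTION: channels are independent; this is an illustration,
    not the documented STRING scoring procedure.
    """
    remaining = 1.0  # probability that no channel supports the interaction
    for score in channel_scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range [0, 1]: {score}")
        remaining *= 1.0 - score
    return 1.0 - remaining


def supporting_channels(channel_scores):
    """Count the channels that contribute any evidence at all."""
    return sum(1 for score in channel_scores if score > 0.0)


if __name__ == "__main__":
    # Hypothetical scores for one interaction, one entry per channel.
    scores = [0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
    print(combine_scores(scores), supporting_channels(scores))
```

Under this assumption, two moderately supported channels yield a higher combined score than either alone, which matches the remark that agreement between several channels is a good sign of support.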
Figure 1. Network and Enrichment Analysis. Combined screenshots from the STRING website, showing results obtained upon entering a set of 31 proteins suspected to be involved in Amyotrophic Lateral Sclerosis (55). The insets are showing (from top to bottom): the accessory information available for a single protein, a reported enrichment of functional connections among the set of proteins, and statistical enrichments detected in functional subsystems. In the bottom inset, one enriched function has been selected, and the corresponding protein nodes in the network are automatically highlighted in color.
Briefly, the seven evidence channels in STRING are (i) The experiments channel: Here, evidence comes from actual experiments in the lab (including biochemical, biophysical, as well as genetic experiments). This channel is populated mainly from the primary interaction databases organized in the IMEx consortium, plus BioGRID. (ii) The database channel: In this channel, STRING collects evidence that has been asserted by a human expert curator; this information is imported from pathway databases. (iii) The textmining channel: Here, STRING searches for mentions of protein names in all PubMed abstracts, in an in-house collection of more than three million full-text articles, and in other text collections (35,36). Pairs of proteins are given an association score when they are frequently mentioned together in the same paper, abstract or even sentence (relative to how often they are mentioned separately). This score is raised further when it has been possible to parse one or more sentences through Natural Language Processing, and a concept connecting the two proteins was encountered (such as ‘binding’ or ‘phosphorylation by’). (iv) The coexpression channel: For this channel, gene expression data originating from a variety of expression experiments are normalized, pruned and then correlated (34). Pairs of proteins that are consistently similar in their expression patterns, under a variety of conditions, will receive a high association score. In addition to large-scale microarray data, in version 10.5 of STRING, RNAseq expression data are now also processed; this results in the inclusion of 16 previously non-covered organisms into this channel. (v) The neighborhood channel: This channel, and the next two, are genome-based prediction channels, whose functionality is generally most relevant for Bacteria and Archaea. In the neighborhood channel, genes are given an association score where they are consistently observed in each other’s genome neighborhood (such as in the case of conserved, co-transcribed ‘operons’). (vi) The fusion channel: Pairs of proteins are given an association score when there is at least one organism where their respective orthologs have fused into a single, protein-coding gene. Finally, (vii) The co-occurrence channel: In this channel, STRING evaluates the phylogenetic distribution of orthologs of all proteins in a given organism. If two proteins show a high similarity in this distribution, i.e. if their orthologs tend to be observed as ‘present’ or ‘absent’ in the same subsets of organisms, then an association score is assigned. For this channel, the details of the STRING implementation have recently been described separately (37).
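The idea behind the co-occurrence channel can be illustrated with a toy comparison of presence/absence profiles. This is a minimal sketch, assuming each protein is represented by the set of organisms in which an ortholog is present and that similarity is measured with a Jaccard index; both the representation and the measure are illustrative assumptions, not the STRING implementation described in (37). Note that the Jaccard index rewards shared presence only and ignores shared absence.

```python
def profile_similarity(profile_a, profile_b):
    """Jaccard index of two phylogenetic profiles.

    Each profile is the set of organisms in which an ortholog of the
    protein is present. ASSUMPTION: Jaccard similarity is used purely
    for illustration.
    """
    a, b = set(profile_a), set(profile_b)
    union = a | b
    if not union:
        return 0.0  # neither protein has any ortholog: no evidence
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Hypothetical profiles; organism and protein names are made up.
    profiles = {
        "protA": {"org1", "org2", "org5"},
        "protB": {"org1", "org2", "org5"},
        "protC": {"org3", "org4"},
    }
    print(profile_similarity(profiles["protA"], profiles["protB"]))
    print(profile_similarity(profiles["protA"], profiles["protC"]))
```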
Apart from direct evidence collected in the seven evidence channels, another important contribution of interactions in STRING comes from the transfer of evidence from one organism to another. This so-called ‘interolog’ transfer (38,39) is based on the observation that orthologs of interacting proteins in one organism are often also interacting in another organism; this inference is the more confident the better the orthology relationships can be established. STRING relies on hierarchical orthology relations imported from the eggNOG database (40), and conducts an all-against-all transfer of interactions, benchmarked separately for each evidence channel. Transfers between closely related organisms are made more confidently, whereas the existence of paralogs (i.e., implied gene duplications) will lower the transfer score. Overall, the biggest benefit of the transfers can be seen for poorly studied organisms, where the fraction of interactions supported by transfers only can be as high as 99%. In contrast, in well-studied model organisms such as Escherichia coli, the corresponding fraction is below 20%.
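The interolog principle can be sketched as follows. This is a toy illustration under stated assumptions: interactions are given as scored protein pairs in a source organism, orthology is a simple mapping from source proteins to target proteins, and the score is divided by the number of candidate ortholog pairs to mimic the penalty for paralogs. That penalty, like the function name, is a hypothetical choice for illustration; the text does not give the actual STRING transfer formula, which also takes evolutionary distance into account.

```python
def transfer_interactions(source_edges, orthologs):
    """Transfer scored interactions from a source to a target organism.

    source_edges: dict mapping (protein_a, protein_b) -> score in [0, 1]
    orthologs:    dict mapping source protein -> list of target proteins

    ASSUMPTION: the score is divided by the number of candidate
    ortholog pairs, an arbitrary stand-in for a paralog penalty.
    """
    transferred = {}
    for (a, b), score in source_edges.items():
        targets_a = orthologs.get(a, [])
        targets_b = orthologs.get(b, [])
        n_pairs = len(targets_a) * len(targets_b)
        for ta in targets_a:
            for tb in targets_b:
                if ta == tb:
                    continue  # skip self-interactions
                key = tuple(sorted((ta, tb)))
                penalized = score / n_pairs
                # Keep the best-supported transfer for each target pair.
                transferred[key] = max(transferred.get(key, 0.0), penalized)
    return transferred


if __name__ == "__main__":
    edges = {("a1", "a2"): 0.8}
    mapping = {"a1": ["b1"], "a2": ["b2", "b3"]}  # a2 has two paralogous targets
    print(transfer_interactions(edges, mapping))
```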
USER INTERFACE
The protein networks stored in STRING can be accessed in a number of ways. Programmatic access is provided via a REST-API (41), via an R/Bioconductor package (34) and via a mechanism to add additional user-provided interactions, as well as protein-centric information, onto the website (‘data payload’) (32). Studies that require genome-wide networks can refer to the STRING download pages, where the complete interaction scores, as well as accessory information, are available (the downloads are free for academics; commercial users need a license for some of the files). As of version 10.5, the downloads can now be pruned down, prior to receiving the files, by organism (or by groups of organisms), which greatly facilitates subsequent data processing. The most important interface to STRING, however, remains the web frontend (Figure 1). In 2016, it was completely redesigned from the ground up; this was done in order to remove dependencies on deprecated web technologies such as Adobe Flash. The new website allows easier and more intuitive browsing of the networks and the underlying evidence, and it is tightly integrated with the database backend to provide speedy responses. Users can make search results and gene sets persistent by logging in, and stable URLs are provided on each page to facilitate sharing of results.
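Programmatic access through the REST-API can be sketched as below. The URL layout (`/api/<format>/network`) and the parameter names `identifiers`, `species` and `required_score` are assumptions based on the current public STRING API documentation and may not match the interface of version 10.5; the helper names are hypothetical. The sketch only builds the request URL and parses a tab-separated response, leaving the actual HTTP call to the reader.

```python
import csv
import io
from urllib.parse import urlencode

# ASSUMPTION: base URL and method name follow the current public API docs.
STRING_API = "https://string-db.org/api"


def build_network_url(identifiers, species, min_confidence=0.4, output_format="tsv"):
    """Build a request URL for the network of the given proteins.

    min_confidence is on the 0..1 scale used in the text; the API is
    assumed to expect an integer threshold on a 0..1000 scale.
    """
    params = {
        "identifiers": "\r".join(identifiers),  # carriage-return separated
        "species": species,                      # NCBI taxonomy identifier
        "required_score": int(round(min_confidence * 1000)),
    }
    return f"{STRING_API}/{output_format}/network?{urlencode(params)}"


def parse_tsv(text):
    """Parse a tab-separated response into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


if __name__ == "__main__":
    print(build_network_url(["TP53", "MDM2"], 9606))
```

The resulting URL can be fetched with any HTTP client, for example `urllib.request.urlopen`, and the response body passed to `parse_tsv`.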
Importantly, users are now, by default, provided with statistical analysis results for each network. The analysis is done server-side, in the background, so as not to slow down the user experience, and it produces alerts when a network is enriched in certain known functions, or has more interactions (edges) than expected. This is particularly meaningful when users arrive at the website with a set of proteins instead of just a single query protein, as it provides a functional characterization of the set (this feature is increasingly used by STRING users). The enrichment tests are done for a variety of classification systems (Gene Ontology, KEGG, Pfam and InterPro), and employ a Fisher’s exact test followed by a correction for multiple testing (42,43).
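The enrichment test described above can be sketched with the standard library alone. The one-sided Fisher’s exact test for over-representation is equivalent to the upper tail of the hypergeometric distribution. The text does not name the multiple-testing correction, so the Benjamini-Hochberg procedure used here is an assumption, and the function names are hypothetical.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test, P(X >= k).

    k: query proteins carrying the annotation
    n: size of the query set
    K: annotated proteins in the background
    N: size of the background
    """
    total = comb(N, n)
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def benjamini_hochberg(p_values):
    """Adjusted p-values (ASSUMPTION: Benjamini-Hochberg correction)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical numbers: 8 of 31 query proteins carry a term that
    # annotates 200 of 20000 background proteins.
    p = fisher_enrichment_p(8, 31, 200, 20000)
    print(p, benjamini_hochberg([p, 0.04, 0.5]))
```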
CYTOSCAPE APP INTEGRATION
The web interface of STRING is designed primarily for users interested in small- to medium-scale networks, whereas the API, R package and download files are mainly intended for bioinformaticians who want to integrate STRING with other resources or perform large-scale network analyses. To bridge the gap between the two, we have developed a so-called App for the Cytoscape software framework (44,45), which allows users to easily retrieve, visualize and analyze networks of hundreds to thousands of proteins via a GUI.
The App allows users to query STRING in three different ways from within Cytoscape: by protein names, by disease or by PubMed query. The first of these mirrors the ‘Multiple proteins’ query in the STRING web interface and allows users to retrieve a network for a list of up to 2000 protein names or identifiers from, for example, a proteomics or transcriptomics study. The second option is to retrieve a network for a disease of interest; it first retrieves a list of the top-N human proteins associated with the disease from the DISEASES database (46) and subsequently loads the STRING network for these proteins into Cytoscape. The third option, PubMed query, allows users to retrieve a STRING network pertaining to any topic of interest based on text mining of PubMed abstracts. The App fetches the abstracts for a user-specified query via NCBI E-utilities, counts how many of these mention each protein from the organism of interest, ranks the proteins by comparing these counts to precomputed background counts over entire PubMed and retrieves a STRING network for the top-N proteins. The underlying text mining is performed by the software also used for the text-mining channel in STRING.
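The ranking step of the PubMed query can be sketched as follows. The text states only that per-protein counts in the retrieved abstracts are compared with background counts over all of PubMed; the enrichment ratio used below is an assumed, deliberately simple comparison, and the function and parameter names are hypothetical.

```python
def rank_proteins(query_counts, background_counts,
                  n_query_abstracts, n_background_abstracts, top_n=10):
    """Rank proteins by how over-represented they are in the query abstracts.

    ASSUMPTION: the score is the ratio of the mention frequency in the
    query abstracts to the mention frequency in the background.
    """
    scored = []
    for protein, hits in query_counts.items():
        background_hits = background_counts.get(protein, 0)
        if hits == 0 or background_hits == 0:
            continue  # no usable evidence for this protein
        query_freq = hits / n_query_abstracts
        background_freq = background_hits / n_background_abstracts
        scored.append((query_freq / background_freq, protein))
    # Highest ratio first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [protein for _, protein in scored[:top_n]]


if __name__ == "__main__":
    # Hypothetical counts for three proteins.
    query = {"A": 10, "B": 10, "C": 1}
    background = {"A": 100, "B": 10000, "C": 50}
    print(rank_proteins(query, background, 100, 1_000_000, top_n=2))
```

A frequently mentioned but ubiquitous protein (B in the example) ranks below a rarely mentioned one that is specific to the query topic.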
When a network is retrieved by the App, it comes associated with a large number of node attributes for each protein and edge attributes for each interaction, which can subsequently be used within Cytoscape. These include STRING and UniProt accessions to facilitate cross-linking with other resources, a human-readable name for display purposes and the protein sequence. If a protein was retrieved through a protein name query, we also store the exact query term with which the protein was found. This is helpful when querying for proteins identified in a proteomics or transcriptomics study, since it facilitates subsequent import of tabular data from the study (Figure 2). If available for the organism in question, the App also fetches information on the subcellular localization and tissue expression of each protein from the COMPARTMENTS (47) and TISSUES (48) databases as well as drug target information from Pharos (http://pharos.nih.gov/). For each interaction, the edge attributes include the overall confidence score and the subscores from each individual evidence channel.
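Why the stored query term matters can be shown with a small join. This is a sketch outside Cytoscape, assuming nodes and study rows are plain dictionaries; the attribute names `query_term`, `display_name` and `log_ratio` are hypothetical and are not the attribute names used by the App.

```python
def attach_study_values(nodes, study_rows, key="query_term"):
    """Merge study measurements onto network nodes via the stored query term.

    nodes:      list of dicts, each holding the term the user searched with
    study_rows: list of dicts from the study table, keyed by the same term
    """
    rows_by_term = {row[key]: row for row in study_rows}
    merged = []
    for node in nodes:
        row = rows_by_term.get(node.get(key), {})
        extra = {name: value for name, value in row.items() if name != key}
        merged.append({**node, **extra})  # node attributes plus study columns
    return merged


if __name__ == "__main__":
    # Hypothetical node list and study table.
    nodes = [
        {"query_term": "TrkA", "display_name": "NTRK1"},
        {"query_term": "SHC1", "display_name": "SHC1"},
    ]
    study = [{"query_term": "TrkA", "log_ratio": 1.7}]
    print(attach_study_values(nodes, study))
```

Because the study table uses the names the user originally typed, joining on the query term avoids a second round of identifier mapping.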
Cytoscape and its hundreds of apps provide numerous ways for users to interact with, visualize and analyze STRING networks (49), including integrating additional data from public repositories or their own experiments, changing visual styles and applying algorithms for network layout, clustering (50), enrichment analysis (51,52) and network analysis (53). In addition to these, the STRING App allows users to modify an already retrieved network in three different ways. First, the confidence cutoff for the imported evidence channels can be increased or decreased, which in the latter case involves fetching additional interactions from STRING. Second, users can expand the network by a user-specified number of interactors that are most closely associated with all network nodes or a selected subset of them. Third, any number of additional nodes can be queried by name and added to the existing network. Furthermore, the App provides a results panel with links to related databases such as UniProt (22), GeneCards (54), Pharos, COMPARTMENTS, TISSUES and DISEASES.
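Raising the confidence cutoff on an already retrieved network amounts to a simple filter, sketched here in plain Python rather than through the Cytoscape API. The default of 0.4 is the cutoff quoted for the Figure 2 example; the function name and the edge representation are hypothetical. Lowering the cutoff cannot be done locally in this way, since it requires fetching additional interactions.

```python
def prune_network(nodes, edges, cutoff=0.4):
    """Keep edges at or above the cutoff and drop nodes left unconnected.

    nodes: iterable of protein names
    edges: iterable of (protein_a, protein_b, confidence) tuples
    Returns (kept_nodes, kept_edges).
    """
    kept_edges = [(a, b, s) for a, b, s in edges if s >= cutoff]
    connected = {name for a, b, _ in kept_edges for name in (a, b)}
    # Preserve the input order of the surviving nodes.
    kept_nodes = [name for name in nodes if name in connected]
    return kept_nodes, kept_edges


if __name__ == "__main__":
    # Hypothetical five-protein network.
    proteins = ["A", "B", "C", "D", "E"]
    interactions = [("A", "B", 0.9), ("B", "C", 0.3), ("D", "E", 0.5)]
    print(prune_network(proteins, interactions, cutoff=0.4))
```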
Figure 2. STRING network visualization within Cytoscape. Using the Cytoscape STRING app, a network was retrieved for 78 proteins interacting with TrkA (tropomyosin-related kinase A) 10 min after stimulating neuroblastoma cells with NGF (nerve growth factor) (56). With a confidence cutoff of 0.4, the resulting network contains 182 functional associations between 57 of the proteins; the 21 proteins with no associations to other proteins in the network were removed. Nodes are colored according to the protein abundance (log ratio) compared to the cells before NGF treatment. The confidence score of each interaction is mapped to the edge thickness and opacity.
OUTLOOK
The availability of completely sequenced genomes, and of protein–protein interaction data, continues to grow quickly. Hence, the data importing and processing for STRING will be further streamlined in order to accommodate this. The upcoming version 11 of STRING will cover more than 4000 organisms, and will contain pre-computed protein networks for all of them. We are also developing a separate and distinctive interface specifically for the investigation of virus-host protein–protein interactions, which will incorporate many of the evidence channels present in STRING. This specialized database will enable querying for a whole virus or for specific viral proteins and will superimpose the viral interaction network onto that of the host.
Furthermore, we plan to extend the analysis options for user-provided gene set input, addressing a frequently expressed user need. This will include the possibility to report statistical enrichments for ranked gene lists, even genome-wide rankings. Together with the up-to-date network information, this will allow users to extract the maximum functional information from their queries, for any organism of interest.
ACKNOWLEDGEMENTS
The authors are indebted to Yan P. Yuan (EMBL Heidelberg) for IT support, and to Dr. Thomas Rattei (University of Vienna) for producing and sharing systematic, all-against-all protein-protein similarity data.
FUNDING
Core funding for STRING comes from the Swiss Institute of Bioinformatics (Lausanne), the Novo Nordisk Foundation (Copenhagen, NNF14CC0001), and the European Molecular Biology Laboratory (EMBL Heidelberg). J.H.M. has been funded by NIHGMS grant P41-GM103311. Funding for Open Access charges: University of Zurich.
Conflict of interest statement. None declared.
REFERENCES
1. Aloy,P. and Russell,R.B. (2004) Ten thousand interactions for the molecular biologist. Nat. Biotechnol., 22, 1317–1321.
2. Gao,M. and Skolnick,J. (2010) Structural space of protein-protein interfaces is degenerate, close to complete, and highly connected. Proc. Natl. Acad. Sci. U.S.A., 107, 22517–22522.
3. Garma,L., Mukherjee,S., Mitra,P. and Zhang,Y. (2012) How many protein-protein interactions types exist in nature? PLoS One, 7, e38913.
4. Enright,A.J. and Ouzounis,C.A. (2001) Functional associations of proteins in entire genomes by means of exhaustive detection of gene fusions. Genome Biol., 2, RESEARCH0034.
5. Snel,B., Bork,P. and Huynen,M.A. (2002) The identification of functional modules from the genomic association of genes. Proc. Natl. Acad. Sci. U.S.A., 99, 5890–5895.
6. Rives,A.W. and Galitski,T. (2003) Modular organization of cellular networks. Proc. Natl. Acad. Sci. U.S.A., 100, 1128–1133.
7. De Las Rivas,J. and de Luis,A. (2004) Interactome data and databases: different types of protein interaction. Comp. Funct. Genomics, 5, 173–178.
8. Dannenfelser,R., Clark,N.R. and Ma’ayan,A. (2012) Genes2FANs: connecting genes through functional association networks. BMC Bioinformatics, 13, 156–168.
9. Studham,M.E., Tjärnberg,A., Nordling,T.E., Nelander,S. and Sonnhammer,E.L. (2014) Functional association networks as priors for gene regulatory network inference. Bioinformatics, 30, i130–i138.
10. Cun,Y. and Frohlich,H. (2013) Network and data integration for biomarker signature discovery via network smoothed T-statistics. PLoS One, 8, e73074.
11. Hofree,M., Shen,J.P., Carter,H., Gross,A. and Ideker,T. (2013) Network-based stratification of tumor mutations. Nat. Methods, 10, 1108–1115.
12. Guney,E., Menche,J., Vidal,M. and Barabási,A.L. (2016) Network-based in silico drug efficacy screening. Nat. Commun., 7, 10331–10343.
13. Hillenmeyer,S., Davis,L.K., Gamazon,E.R., Cook,E.H., Cox,N.J. and Altman,R.B. (2016) STAMS: STRING-Assisted Module Search

Citations
More filters
Journal ArticleDOI
TL;DR: The latest version of STRING more than doubles the number of organisms it covers, and offers an option to upload entire, genome-wide datasets as input, allowing users to visualize subsets as interaction networks and to perform gene-set enrichment analysis on the entire input.
Abstract: Proteins and their functional interactions form the backbone of the cellular machinery. Their connectivity network needs to be considered for the full understanding of biological phenomena, but the available information on protein-protein associations is incomplete and exhibits varying levels of annotation granularity and reliability. The STRING database aims to collect, score and integrate all publicly available sources of protein-protein interaction information, and to complement these with computational predictions. Its goal is to achieve a comprehensive and objective global network, including direct (physical) as well as indirect (functional) interactions. The latest version of STRING (11.0) more than doubles the number of organisms it covers, to 5090. The most important new feature is an option to upload entire, genome-wide datasets as input, allowing users to visualize subsets as interaction networks and to perform gene-set enrichment analysis on the entire input. For the enrichment analysis, STRING implements well-known classification systems such as Gene Ontology and KEGG, but also offers additional, new classification systems based on high-throughput text-mining as well as on a hierarchical clustering of the association network itself. The STRING resource is available online at https://string-db.org/.

10,584 citations

Journal ArticleDOI
TL;DR: This article highlights some specific advances in the areas of visualization and usability, performance, and extensibility in ChimeraX.
Abstract: UCSF ChimeraX is next-generation software for the visualization and analysis of molecular structures, density maps, 3D microscopy, and associated data. It addresses challenges in the size, scope, and disparate types of data attendant with cutting-edge experimental methods, while providing advanced options for high-quality rendering (interactive ambient occlusion, reliable molecular surface calculations, etc.) and professional approaches to software design and distribution. This article highlights some specific advances in the areas of visualization and usability, performance, and extensibility. ChimeraX is free for noncommercial use and is available from http://www.rbvi.ucsf.edu/chimerax/ for Windows, Mac, and Linux.

2,866 citations

Journal ArticleDOI
TL;DR: eggNOG as discussed by the authors is a public database of orthology relationships, gene evolutionary histories and functional annotations, with a major update of the underlying genome sets, which have been expanded to 4445 representative bacteria and 168 archaea derived from 25 038 genomes.
Abstract: eggNOG is a public database of orthology relationships, gene evolutionary histories and functional annotations. Here, we present version 5.0, featuring a major update of the underlying genome sets, which have been expanded to 4445 representative bacteria and 168 archaea derived from 25 038 genomes, as well as 477 eukaryotic organisms and 2502 viral proteomes that were selected for diversity and filtered by genome quality. In total, 4.4M orthologous groups (OGs) distributed across 379 taxonomic levels were computed together with their associated sequence alignments, phylogenies, HMM models and functional descriptors. Precomputed evolutionary analysis provides fine-grained resolution of duplication/speciation events within each OG. Our benchmarks show that, despite doubling the amount of genomes, the quality of orthology assignments and functional annotations (80% coverage) has persisted without significant changes across this update. Finally, we improved eggNOG online services for fast functional annotation and orthology prediction of custom genomics or metagenomics datasets. All precomputed data are publicly available for downloading or via API queries at http://eggnog.embl.de.

1,971 citations


Cites background or methods from "The STRING database in 2017: qualit..."

  • ...STRING (28)), in which information needs to be propagated across the hierarchy of taxonomic levels....

    [...]

  • ...This high recall pattern is in general preferred by probabilistic prediction methods such as interolog inference in the STRING database (28)....

    [...]

  • ...Solving those cases is particularly important for third-party applications (e.g. STRING (28)), in which information needs to be propagated across the hierarchy of taxonomic levels....

    [...]

Journal ArticleDOI
TL;DR: In its 20th year, the SMART analysis results pages have been streamlined again and its information sources have been updated, and the internal full text search engine has been redesigned and updated, resulting in greatly increased search speed.
Abstract: SMART (Simple Modular Architecture Research Tool) is a web resource (http://smart.embl.de) for the identification and annotation of protein domains and the analysis of protein domain architectures. SMART version 8 contains manually curated models for more than 1300 protein domains, with approximately 100 new models added since our last update article (1). The underlying protein databases were synchronized with UniProt (2), Ensembl (3) and STRING (4), doubling the total number of annotated domains and other protein features to more than 200 million. In its 20th year, the SMART analysis results pages have been streamlined again and its information sources have been updated. SMART's vector based display engine has been extended to all protein schematics in SMART and rewritten to use the latest web technologies. The internal full text search engine has been redesigned and updated, resulting in greatly increased search speed.

1,351 citations


Cites background or methods from "The STRING database in 2017: qualit..."

  • ...With the update of the underlying protein databases, we have also synchronized our protein interaction data with the version 10.5 of the STRING database (4)....

    [...]

  • ...Synchronized with the current STRING version 10.5 (4), it currently contains approximately 9.6 million proteins from 2031 complete genomes (238 Eukaryota, 1678 Bacteria and 115 Archaea)....

    [...]

  • ...The underlying protein databases were synchronized with UniProt (2), Ensembl (3) and STRING (4), doubling the total number of annotated domains and other protein features to more than 200 million....

    [...]

Journal ArticleDOI
TL;DR: A new dedicated aspect of BioGRID annotates genome-wide CRISPR/Cas9-based screens that report gene–phenotype and gene–gene relationships, and captures chemical interaction data, including chemical–protein interactions for human drug targets drawn from the DrugBank database and manually curated bioactive compounds reported in the literature.
Abstract: The Biological General Repository for Interaction Datasets (BioGRID: https://thebiogrid.org) is an open access database dedicated to the curation and archival storage of protein, genetic and chemical interactions for all major model organism species and humans. As of September 2018 (build 3.4.164), BioGRID contains records for 1 598 688 biological interactions manually annotated from 55 809 publications for 71 species, as classified by an updated set of controlled vocabularies for experimental detection methods. BioGRID also houses records for >700 000 post-translational modification sites. BioGRID now captures chemical interaction data, including chemical-protein interactions for human drug targets drawn from the DrugBank database and manually curated bioactive compounds reported in the literature. A new dedicated aspect of BioGRID annotates genome-wide CRISPR/Cas9-based screens that report gene-phenotype and gene-gene relationships. An extension of the BioGRID resource called the Open Repository for CRISPR Screens (ORCS) database (https://orcs.thebiogrid.org) currently contains over 500 genome-wide screens carried out in human or mouse cell lines. All data in BioGRID is made freely available without restriction, is directly downloadable in standard formats and can be readily incorporated into existing applications via our web service platforms. BioGRID data are also freely distributed through partner model organism databases and meta-databases.

1,046 citations


Cites background from "The STRING database in 2017: qualit..."

  • ...Other major meta-database resources that disseminate BioGRID data include STRING (32), Pathway Commons (31), Gene Mania (89), InnateDB (90) and FlyAtlas (91) (see https://wiki.thebiogrid.org/doku.php/partners for full list)....

    [...]

  • ...These statistics do not include the widespread dissemination of BioGRID records by various partner databases, which include the MODs SGD (19), PomBase (25), Candida Genome Database (CGD) (26), WormBase (20), FlyBase (27), the Arabidopsis Information Resource (TAIR) (28), ZFIN (29) and Mouse Genome Database (MGD) (30) and the meta-database resources NCBI (21), UniProt (22), Pathway Commons (31), STRING (32) and others....

    [...]

References
Journal ArticleDOI
TL;DR: In this paper, a different approach to problems of multiple significance testing is presented, which calls for controlling the expected proportion of falsely rejected hypotheses, the false discovery rate, which is equivalent to the FWER when all hypotheses are true but is smaller otherwise.
Abstract: The common approach to the multiplicity problem calls for controlling the familywise error rate (FWER). This approach, though, has faults, and we point out a few. A different approach to problems of multiple significance testing is presented. It calls for controlling the expected proportion of falsely rejected hypotheses, the false discovery rate. This error rate is equivalent to the FWER when all hypotheses are true but is smaller otherwise. Therefore, in problems where the control of the false discovery rate rather than that of the FWER is desired, there is potential for a gain in power. A simple sequential Bonferroni-type procedure is proved to control the false discovery rate for independent test statistics, and a simulation study shows that the gain in power is substantial. The use of the new procedure and the appropriateness of the criterion are illustrated with examples.
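
The sequential procedure described in this abstract is commonly stated as a step-up rule: sort the m p-values in ascending order, find the largest rank k whose p-value is at most (k/m)·q, and reject the k hypotheses with the smallest p-values. The following is a minimal sketch of that rule under this standard formulation; the function name `benjamini_hochberg` and the parameter `q` are illustrative choices, not taken from the paper.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per p-value: True where the hypothesis is rejected.

    Sketch of the step-up rule: reject the k smallest p-values, where k is
    the largest rank with p_(k) <= (k / m) * q.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values in ascending order of p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank whose p-value falls under its threshold.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    # Everything up to and including that rank is rejected.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(pvals, q=0.05))
```

Note the step-up behaviour: a p-value that misses its own threshold is still rejected when a larger p-value further down the sorted list meets its threshold.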

83,420 citations


"The STRING database in 2017: qualit..." refers methods in this paper

  • ...The enrichment tests are done for a variety of classification systems (Gene Ontology, KEGG, Pfam and InterPro), and employ a Fisher’s exact test followed by a correction for multiple testing (42,43)....

    [...]
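
The excerpt above mentions a Fisher's exact test for functional enrichment. As a rough illustration only, the sketch below computes a one-sided (over-representation) p-value from the hypergeometric tail, which corresponds to the one-sided Fisher's exact test on the matching 2×2 table. The choice of a one-sided test, the function name `enrichment_p_value` and all parameter names are assumptions made for this example; the source does not specify how STRING implements the test.

```python
from math import comb


def enrichment_p_value(hits, list_size, annotated, background):
    """One-sided enrichment p-value from the hypergeometric tail.

    hits       -- proteins in the input list that carry the annotation
    list_size  -- number of proteins in the input list
    annotated  -- proteins in the background that carry the annotation
    background -- total number of proteins in the background
    Returns P(X >= hits) when list_size proteins are drawn at random.
    """
    upper = min(list_size, annotated)
    total = comb(background, list_size)
    # Sum the probability of every outcome at least as extreme as observed.
    tail = sum(
        comb(annotated, k) * comb(background - annotated, list_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical numbers: 3 of 3 input proteins share a term that
    # annotates 3 of 10 background proteins.
    print(enrichment_p_value(hits=3, list_size=3, annotated=3, background=10))
```

The resulting p-values, one per tested term, would then be passed to a multiple-testing correction, as the excerpt describes.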

Journal ArticleDOI
TL;DR: Several case studies of Cytoscape plug-ins are surveyed, including a search for interaction pathways correlating with changes in gene expression, a study of protein complexes involved in cellular recovery to DNA damage, inference of a combined physical/functional interaction network for Halobacterium, and an interface to detailed stochastic/kinetic gene regulatory models.
Abstract: Cytoscape is an open source software project for integrating biomolecular interaction networks with high-throughput expression data and other molecular states into a unified conceptual framework. Although applicable to any system of molecular components and interactions, Cytoscape is most powerful when used in conjunction with large databases of protein-protein, protein-DNA, and genetic interactions that are increasingly available for humans and model organisms. Cytoscape's software Core provides basic functionality to layout and query the network; to visually integrate the network with expression profiles, phenotypes, and other molecular states; and to link the network to databases of functional annotations. The Core is extensible through a straightforward plug-in architecture, allowing rapid development of additional computational analyses and features. Several case studies of Cytoscape plug-ins are surveyed, including a search for interaction pathways correlating with changes in gene expression, a study of protein complexes involved in cellular recovery to DNA damage, inference of a combined physical/functional interaction network for Halobacterium, and an interface to detailed stochastic/kinetic gene regulatory models.

32,980 citations


"The STRING database in 2017: qualit..." refers methods in this paper

  • ...To bridge the gap between the two, we have developed a so-called App for the Cytoscape software framework (44,45), which allows users to easily retrieve, visualize and analyze networks of hundreds to thousands of proteins via a GUI....

    [...]

Journal ArticleDOI
TL;DR: Hierarchical and self-consistent orthology annotations are introduced for all interacting proteins, grouping the proteins into families at various levels of phylogenetic resolution in the STRING database.
Abstract: The many functional partnerships and interactions that occur between proteins are at the core of cellular processing and their systematic characterization helps to provide context in molecular systems biology. However, known and predicted interactions are scattered over multiple resources, and the available data exhibit notable differences in terms of quality and completeness. The STRING database (http://string-db.org) aims to provide a critical assessment and integration of protein-protein interactions, including direct (physical) as well as indirect (functional) associations. The new version 10.0 of STRING covers more than 2000 organisms, which has necessitated novel, scalable algorithms for transferring interaction information between organisms. For this purpose, we have introduced hierarchical and self-consistent orthology annotations for all interacting proteins, grouping the proteins into families at various levels of phylogenetic resolution. Further improvements in version 10.0 include a completely redesigned prediction pipeline for inferring protein-protein associations from co-expression data, an API interface for the R computing environment and improved statistical analysis for enrichment tests in user-provided networks.

8,224 citations


"The STRING database in 2017: qualit..." refers background or methods in this paper

  • ...(iv) The coexpression channel: For this channel, gene expression data originating from a variety of expression experiments are normalized, pruned and then correlated (34)....

    [...]

  • ...Programmatic access is provided via a REST-API (41), via an R/Bioconductor package (34) and via a mechanism to add additional user-provided interactions, as well as protein-centric information, onto the website (‘data payload’) (32)....

    [...]

Journal ArticleDOI
TL;DR: ClueGO is an easy to use Cytoscape plug-in that strongly improves biological interpretation of large lists of genes and creates a functionally organized GO/pathway term network.
Abstract: We have developed ClueGO, an easy to use Cytoscape plug-in that strongly improves biological interpretation of large lists of genes. ClueGO integrates Gene Ontology (GO) terms as well as KEGG/BioCarta pathways and creates a functionally organized GO/pathway term network. It can analyze one or compare two lists of genes and comprehensively visualizes functionally grouped terms. A one-click update option allows ClueGO to automatically download the most recent GO/KEGG release at any time. ClueGO provides an intuitive representation of the analysis results and can be optionally used in conjunction with the GOlorize plug-in.

4,768 citations

Journal ArticleDOI
Alex Bateman, Maria Jesus Martin, Claire O'Donovan, Michele Magrane, Rolf Apweiler, Emanuele Alpi, Ricardo Antunes, Joanna Arganiska, Benoit Bely, Mark Bingley, Carlos Bonilla, Ramona Britto, Borisas Bursteinas, Gayatri Chavali, Elena Cibrian-Uhalte, Alan Wilter Sousa da Silva, Maurizio De Giorgi, Tunca Doğan, Francesco Fazzini, Paul Gane, Leyla Jael Garcia Castro, Penelope Garmiri, Emma Hatton-Ellis, Reija Hieta, Rachael P. Huntley, Duncan Legge, W Liu, Jie Luo, Alistair MacDougall, Prudence Mutowo, Andrew Nightingale, Sandra Orchard, Klemens Pichler, Diego Poggioli, Sangya Pundir, Luis Pureza, Guoying Qi, Steven Rosanoff, Rabie Saidi, Tony Sawford, Aleksandra Shypitsyna, Edward Turner, Vladimir Volynkin, Tony Wardell, Xavier Watkins, Hermann Zellner, Andrew Peter Cowley, Luis Figueira, Weizhong Li, Hamish McWilliam, Rodrigo Lopez, Ioannis Xenarios, Lydie Bougueleret, Alan Bridge, Sylvain Poux, Nicole Redaschi, Lucila Aimo, Ghislaine Argoud-Puy, Andrea H. Auchincloss, Kristian B. Axelsen, Parit Bansal, Delphine Baratin, Marie Claude Blatter, Brigitte Boeckmann, Jerven Bolleman, Emmanuel Boutet, Lionel Breuza, Cristina Casal-Casas, Edouard de Castro, Elisabeth Coudert, Béatrice A. Cuche, M Doche, Dolnide Dornevil, Séverine Duvaud, Anne Estreicher, L Famiglietti, Marc Feuermann, Elisabeth Gasteiger, Sebastien Gehant, Vivienne Baillie Gerritsen, Arnaud Gos, Nadine Gruaz-Gumowski, Ursula Hinz, Chantal Hulo, Florence Jungo, Guillaume Keller, Vicente Lara, P Lemercier, Damien Lieberherr, Thierry Lombardot, Xavier D. Martin, Patrick Masson, Anne Morgat, Teresa Batista Neto, Nevila Nouspikel, Salvo Paesano, Ivo Pedruzzi, Sandrine Pilbout, Monica Pozzato, Manuela Pruess, Catherine Rivoire, Bernd Roechert, Michel Schneider, Christian J. A. Sigrist, K Sonesson, S Staehli, Andre Stutz, Shyamala Sundaram, Michael Tognolli, Laure Verbregue, Anne Lise Veuthey, Cathy H. Wu, Cecilia N. Arighi, Leslie Arminski, Chuming Chen, Yongxing Chen, John S. Garavelli, Hongzhan Huang, Kati Laiho, Peter B. McGarvey, Darren A. Natale, Baris E. Suzek, C. R. Vinayaka, Qinghua Wang, Yuqi Wang, Lai-Su L. Yeh, Meher Shruti Yerramalla, Jian Zhang
TL;DR: An annotation score for all entries in UniProt is introduced to represent the relative amount of knowledge known about each protein to help identify which proteins are the best characterized and most informative for comparative analysis.
Abstract: UniProt is an important collection of protein sequences and their annotations, which has doubled in size to 80 million sequences during the past year. This growth in sequences has prompted an extension of UniProt accession number space from 6 to 10 characters. An increasing fraction of new sequences are identical to a sequence that already exists in the database with the majority of sequences coming from genome sequencing projects. We have created a new proteome identifier that uniquely identifies a particular assembly of a species and strain or subspecies to help users track the provenance of sequences. We present a new website that has been designed using a user-experience design process. We have introduced an annotation score for all entries in UniProt to represent the relative amount of knowledge known about each protein. These scores will be helpful in identifying which proteins are the best characterized and most informative for comparative analysis. All UniProt data is provided freely and is available on the web at http://www.uniprot.org/.

4,050 citations

Trending Questions (1)
What is STRING in a protein-protein analysis?

STRING is a database that collects and integrates protein-protein association data, including direct and indirect interactions, for a large number of organisms. It uses various sources such as experimental data, co-expression analysis, text-mining, and computational predictions.