scispace - formally typeset
Search or ask a question

Showing papers in "Nucleic Acids Research in 2019"


Journal ArticleDOI
TL;DR: The latest version of STRING more than doubles the number of organisms it covers, and offers an option to upload entire, genome-wide datasets as input, allowing users to visualize subsets as interaction networks and to perform gene-set enrichment analysis on the entire input.
Abstract: Proteins and their functional interactions form the backbone of the cellular machinery. Their connectivity network needs to be considered for the full understanding of biological phenomena, but the available information on protein-protein associations is incomplete and exhibits varying levels of annotation granularity and reliability. The STRING database aims to collect, score and integrate all publicly available sources of protein-protein interaction information, and to complement these with computational predictions. Its goal is to achieve a comprehensive and objective global network, including direct (physical) as well as indirect (functional) interactions. The latest version of STRING (11.0) more than doubles the number of organisms it covers, to 5090. The most important new feature is an option to upload entire, genome-wide datasets as input, allowing users to visualize subsets as interaction networks and to perform gene-set enrichment analysis on the entire input. For the enrichment analysis, STRING implements well-known classification systems such as Gene Ontology and KEGG, but also offers additional, new classification systems based on high-throughput text-mining as well as on a hierarchical clustering of the association network itself. The STRING resource is available online at https://string-db.org/.

10,584 citations


Journal ArticleDOI
TL;DR: Key statistics on the current data contents and volume of downloads are outlined, and how PRIDE data are starting to be disseminated to added-value resources including Ensembl, UniProt and Expression Atlas are outlined.
Abstract: The PRoteomics IDEntifications (PRIDE) database (https://www.ebi.ac.uk/pride/) is the world’s largest data repository of mass spectrometry-based proteomics data, and is one of the founding members of the global ProteomeXchange (PX) consortium. In this manuscript, we summarize the developments in PRIDE resources and related tools since the previous update manuscript was published in Nucleic Acids Research in 2016. In the last 3 years, public data sharing through PRIDE (as part of PX) has definitely become the norm in the field. In parallel, data re-use of public proteomics data has increased enormously, with multiple applications. We first describe the new architecture of PRIDE Archive, the archival component of PRIDE. PRIDE Archive and the related data submission framework have been further developed to support the increase in submitted data volumes and additional data types. A new scalable and fault tolerant storage backend, Application Programming Interface and web interface have been implemented, as a part of an ongoing process. Additionally, we emphasize the improved support for quantitative proteomics data through the mzTab format. At last, we outline key statistics on the current data contents and volume of downloads, and how PRIDE data are starting to be disseminated to added-value resources including Ensembl, UniProt and Expression Atlas.

5,735 citations



Journal ArticleDOI
TL;DR: The current version of iTOL v4 introduces four new dataset types, together with numerous new features, and is the first tool which supports direct visualization of Qiime 2 trees and associated annotations.
Abstract: The Interactive Tree Of Life (https://itol.embl.de) is an online tool for the display, manipulation and annotation of phylogenetic and other trees. It is freely available and open to everyone. The current version introduces four new dataset types, together with numerous new features. Annotation options have been expanded and new control options added for many display elements. An interactive spreadsheet-like editor has been implemented, providing dataset creation and editing directly in the web interface. Font support has been rewritten with full support for UTF-8 character encoding throughout the user interface. Google Web Fonts are now fully supported in the tree text labels. iTOL v4 is the first tool which supports direct visualization of Qiime 2 trees and associated annotations. The user account system has been streamlined and expanded with new navigation options, and currently handles >700 000 trees from more than 40 000 individual users. Full batch access has been implemented allowing programmatic upload and export of trees and annotations.

4,233 citations


Journal ArticleDOI
TL;DR: A significant comparison to the structural classification database that led to the creation of 825 new families based on their set of uncharacterized families (EUFs) was carried out and Pfam entries were connected to the Sequence Ontology (SO) through mapping of the Pfam type definitions to SO terms.
Abstract: The last few years have witnessed significant changes in Pfam (https://pfam.xfam.org). The number of families has grown substantially to a total of 17,929 in release 32.0. New additions have been coupled with efforts to improve existing families, including refinement of domain boundaries, their classification into Pfam clans, as well as their functional annotation. We recently began to collaborate with the RepeatsDB resource to improve the definition of tandem repeat families within Pfam. We carried out a significant comparison to the structural classification database, namely the Evolutionary Classification of Protein Domains (ECOD) that led to the creation of 825 new families based on their set of uncharacterized families (EUFs). Furthermore, we also connected Pfam entries to the Sequence Ontology (SO) through mapping of the Pfam type definitions to SO terms. Since Pfam has many community contributors, we recently enabled the linking between authorship of all Pfam entries with the corresponding authors' ORCID identifiers. This effectively permits authors to claim credit for their Pfam curation and link them to their ORCID record.

3,617 citations


Journal ArticleDOI
TL;DR: The latest improvements made to the frameworks which enhance the interconnectivity between public EMBL-EBI resources and ultimately enhance biological data discoverability, accessibility, interoperability and reusability are described.
Abstract: The EMBL-EBI provides free access to popular bioinformatics sequence analysis applications as well as to a full-featured text search engine with powerful cross-referencing and data retrieval capabilities. Access to these services is provided via user-friendly web interfaces and via established RESTful and SOAP Web Services APIs (https://www.ebi.ac.uk/seqdb/confluence/display/JDSAT/EMBL-EBI+Web+Services+APIs+-+Data+Retrieval). Both systems have been developed with the same core principles that allow them to integrate an ever-increasing volume of biological data, making them an integral part of many popular data resources provided at the EMBL-EBI. Here, we describe the latest improvements made to the frameworks which enhance the interconnectivity between public EMBL-EBI resources and ultimately enhance biological data discoverability, accessibility, interoperability and reusability.

3,529 citations


Journal ArticleDOI
TL;DR: G:Profiler is now capable of analysing data from any organism, including vertebrates, plants, fungi, insects and parasites, and the 2019 update introduces an extensive technical rewrite making the services faster and more flexible.
Abstract: Biological data analysis often deals with lists of genes arising from various studies. The g:Profiler toolset is widely used for finding biological categories enriched in gene lists, conversions between gene identifiers and mappings to their orthologs. The mission of g:Profiler is to provide a reliable service based on up-to-date high quality data in a convenient manner across many evidence types, identifier spaces and organisms. g:Profiler relies on Ensembl as a primary data source and follows their quarterly release cycle while updating the other data sources simultaneously. The current update provides a better user experience due to a modern responsive web interface, standardised API and libraries. The results are delivered through an interactive and configurable web design. Results can be downloaded as publication ready visualisations or delimited text files. In the current update we have extended the support to 467 species and strains, including vertebrates, plants, fungi, insects and parasites. By supporting user uploaded custom GMT files, g:Profiler is now capable of analysing data from any organism. All past releases are maintained for reproducibility and transparency. The 2019 update introduces an extensive technical rewrite making the services faster and more flexible. g:Profiler is freely available at https://biit.cs.ut.ee/gprofiler.

2,959 citations


Journal ArticleDOI
TL;DR: Improved data access is improved with the release of a new RESTful API to support high-throughput programmatic access, an improved web interface and a new summary statistics database.
Abstract: The GWAS Catalog delivers a high-quality curated collection of all published genome-wide association studies enabling investigations to identify causal variants, understand disease mechanisms, and establish targets for novel therapies. The scope of the Catalog has also expanded to targeted and exome arrays with 1000 new associations added for these technologies. As of September 2018, the Catalog contains 5687 GWAS comprising 71673 variant-trait associations from 3567 publications. New content includes 284 full P-value summary statistics datasets for genome-wide and new targeted array studies, representing 6 × 109 individual variant-trait statistics. In the last 12 months, the Catalog's user interface was accessed by ∼90000 unique users who viewed >1 million pages. We have improved data access with the release of a new RESTful API to support high-throughput programmatic access, an improved web interface and a new summary statistics database. Summary statistics provision is supported by a new format proposed as a community standard for summary statistics data representation. This format was derived from our experience in standardizing heterogeneous submissions, mapping formats and in harmonizing content. Availability: https://www.ebi.ac.uk/gwas/.

2,878 citations


Journal ArticleDOI
TL;DR: Improvements to the public website and data-download systems and new functionality in COSMIC-3D allows exploration of mutations within three-dimensional protein structures, their protein structural and functional impacts, and implications for druggability.
Abstract: COSMIC, the Catalogue Of Somatic Mutations In Cancer (https://cancer.sanger.ac.uk) is the most detailed and comprehensive resource for exploring the effect of somatic mutations in human cancer. The latest release, COSMIC v86 (August 2018), includes almost 6 million coding mutations across 1.4 million tumour samples, curated from over 26 000 publications. In addition to coding mutations, COSMIC covers all the genetic mechanisms by which somatic mutations promote cancer, including non-coding mutations, gene fusions, copy-number variants and drug-resistance mutations. COSMIC is primarily hand-curated, ensuring quality, accuracy and descriptive data capture. Building on our manual curation processes, we are introducing new initiatives that allow us to prioritize key genes and diseases, and to react more quickly and comprehensively to new findings in the literature. Alongside improvements to the public website and data-download systems, new functionality in COSMIC-3D allows exploration of mutations within three-dimensional protein structures, their protein structural and functional impacts, and implications for druggability. In parallel with COSMIC's deep and broad variant coverage, the Cancer Gene Census (CGC) describes a curated catalogue of genes driving every form of human cancer. Currently describing 719 genes, the CGC has recently introduced functional descriptions of how each gene drives disease, summarized into the 10 cancer Hallmarks.

2,626 citations


Journal ArticleDOI
TL;DR: Improvements to the database and website are described to provide more information about the quality of micro RNA gene annotations, and the cellular functions of their products, and to improve the availability of microRNA functional information.
Abstract: miRBase catalogs, names and distributes microRNA gene sequences. The latest release of miRBase (v22) contains microRNA sequences from 271 organisms: 38 589 hairpin precursors and 48 860 mature microRNAs. We describe improvements to the database and website to provide more information about the quality of microRNA gene annotations, and the cellular functions of their products. We have collected 1493 small RNA deep sequencing datasets and mapped a total of 5.5 billion reads to microRNA sequences. The read mapping patterns provide strong support for the validity of between 20% and 65% of microRNA annotations in different well-studied animal genomes, and evidence for the removal of >200 sequences from the database. To improve the availability of microRNA functional information, we are disseminating Gene Ontology terms annotated against miRBase sequences. We have also used a text-mining approach to search for microRNA gene names in the full-text of open access articles. Over 500 000 sentences from 18 542 papers contain microRNA names. We score these sentences for functional information and link them with 12 519 microRNA entries. The sentences themselves, and word clouds built from them, provide effective summaries of the functional information about specific microRNAs. miRBase is publicly and freely available at http://mirbase.org/.

2,508 citations


Journal ArticleDOI
TL;DR: Protein Analysis Through Evolutionary Relationships is a resource for the evolutionary and functional classification of genes from organisms across the tree of life, and an entirely new PANTHER GO-slim is developed, containing over four times as many Gene Ontology terms as the previous GO- slim.
Abstract: PANTHER (Protein Analysis Through Evolutionary Relationships, http://pantherdb.org) is a resource for the evolutionary and functional classification of genes from organisms across the tree of life. We report the improvements we have made to the resource during the past two years. For evolutionary classifications, we have added more prokaryotic and plant genomes to the phylogenetic gene trees, expanding the representation of gene evolution in these lineages. We have refined many protein family boundaries, and have aligned PANTHER with the MEROPS resource for protease and protease inhibitor families. For functional classifications, we have developed an entirely new PANTHER GO-slim, containing over four times as many Gene Ontology terms as our previous GO-slim, as well as curated associations of genes to these terms. Lastly, we have made substantial improvements to the enrichment analysis tools available on the PANTHER website: users can now analyze over 900 different genomes, using updated statistical tests with false discovery rate corrections for multiple testing. The overrepresentation test is also available as a web service, for easy addition to third-party sites.

Journal ArticleDOI
Seth Carbon1, Eric Douglass1, Nathan Dunn1, Benjamin M. Good1  +189 moreInstitutions (19)
TL;DR: GO-CAM, a new framework for representing gene function that is more expressive than standard GO annotations, has been released, and users can now explore the growing repository of these models.
Abstract: The Gene Ontology resource (GO; http://geneontology.org) provides structured, computable knowledge regarding the functions of genes and gene products. Founded in 1998, GO has become widely adopted in the life sciences, and its contents are under continual improvement, both in quantity and in quality. Here, we report the major developments of the GO resource during the past two years. Each monthly release of the GO resource is now packaged and given a unique identifier (DOI), enabling GO-based analyses on a specific release to be reproduced in the future. The molecular function ontology has been refactored to better represent the overall activities of gene products, with a focus on transcription regulator activities. Quality assurance efforts have been ramped up to address potentially out-of-date or inaccurate annotations. New evidence codes for high-throughput experiments now enable users to filter out annotations obtained from these sources. GO-CAM, a new framework for representing gene function that is more expressive than standard GO annotations, has been released, and users can now explore the growing repository of these models. We also provide the ‘GO ribbon’ widget for visualizing GO annotations to a gene; the widget can be easily embedded in any web page.

Journal ArticleDOI
TL;DR: This work generates primary data, creates bioinformatics tools and provides analysis to support the work of expert manual gene annotators and automated gene annotation pipelines to identify and characterise gene loci to the highest standard.
Abstract: The accurate identification and description of the genes in the human and mouse genomes is a fundamental requirement for high quality analysis of data informing both genome biology and clinical genomics. Over the last 15 years, the GENCODE consortium has been producing reference quality gene annotations to provide this foundational resource. The GENCODE consortium includes both experimental and computational biology groups who work together to improve and extend the GENCODE gene annotation. Specifically, we generate primary data, create bioinformatics tools and provide analysis to support the work of expert manual gene annotators and automated gene annotation pipelines. In addition, manual and computational annotation workflows use any and all publicly available data and analysis, along with the research literature to identify and characterise gene loci to the highest standard. GENCODE gene annotations are accessible via the Ensembl and UCSC Genome Browsers, the Ensembl FTP site, Ensembl Biomart, Ensembl Perl and REST APIs as well as https://www.gencodegenes.org.

Journal ArticleDOI
TL;DR: The latest updates to CADD are reviewed, including the most recent version, 1.4, which supports the human genome build GRCh38, and also present updates to the website that include simplified variant lookup, extended documentation, an Application Program Interface and improved mechanisms for integrating CADD scores into other tools or applications.
Abstract: Combined Annotation-Dependent Depletion (CADD) is a widely used measure of variant deleteriousness that can effectively prioritize causal variants in genetic analyses, particularly highly penetrant contributors to severe Mendelian disorders. CADD is an integrative annotation built from more than 60 genomic features, and can score human single nucleotide variants and short insertion and deletions anywhere in the reference assembly. CADD uses a machine learning model trained on a binary distinction between simulated de novo variants and variants that have arisen and become fixed in human populations since the split between humans and chimpanzees; the former are free of selective pressure and may thus include both neutral and deleterious alleles, while the latter are overwhelmingly neutral (or, at most, weakly deleterious) by virtue of having survived millions of years of purifying selection. Here we review the latest updates to CADD, including the most recent version, 1.4, which supports the human genome build GRCh38. We also present updates to our website that include simplified variant lookup, extended documentation, an Application Program Interface and improved mechanisms for integrating CADD scores into other tools or applications. CADD scores, software and documentation are available at https://cadd.gs.washington.edu.

Journal ArticleDOI
TL;DR: AntiSMASH 5 adds detection rules for clusters encoding the biosynthesis of acyl-amino acids, β-lactones, fungal RiPPs, RaS-Ri PPs, polybrominated diphenyl ethers, C-nucleosides, PPY-like ketones and lipolanthines and provides more detailed predictions for type II polyketide synthase-encoding gene clusters.
Abstract: Secondary metabolites produced by bacteria and fungi are an important source of antimicrobials and other bioactive compounds. In recent years, genome mining has seen broad applications in identifying and characterizing new compounds as well as in metabolic engineering. Since 2011, the 'antibiotics and secondary metabolite analysis shell-antiSMASH' (https://antismash.secondarymetabolites.org) has assisted researchers in this, both as a web server and a standalone tool. It has established itself as the most widely used tool for identifying and analysing biosynthetic gene clusters (BGCs) in bacterial and fungal genome sequences. Here, we present an entirely redesigned and extended version 5 of antiSMASH. antiSMASH 5 adds detection rules for clusters encoding the biosynthesis of acyl-amino acids, β-lactones, fungal RiPPs, RaS-RiPPs, polybrominated diphenyl ethers, C-nucleosides, PPY-like ketones and lipolanthines. For type II polyketide synthase-encoding gene clusters, antiSMASH 5 now offers more detailed predictions. The HTML output visualization has been redesigned to improve the navigation and visual representation of annotations. We have again improved the runtime of analysis steps, making it possible to deliver comprehensive annotations for bacterial genomes within a few minutes. A new output file in the standard JavaScript object notation (JSON) format is aimed at downstream tools that process antiSMASH results programmatically.

Journal ArticleDOI
TL;DR: This paper describes the new developments in PubChem, a key chemical information resource for the biomedical research community, which released new web interfaces, such as PubChem Target View page, Sources page, Bioactivity dyad pages and Patent View page.
Abstract: PubChem (https://pubchem.ncbi.nlm.nih.gov) is a key chemical information resource for the biomedical research community. Substantial improvements were made in the past few years. New data content was added, including spectral information, scientific articles mentioning chemicals, and information for food and agricultural chemicals. PubChem released new web interfaces, such as PubChem Target View page, Sources page, Bioactivity dyad pages and Patent View page. PubChem also released a major update to PubChem Widgets and introduced a new programmatic access interface, called PUG-View. This paper describes these new developments in PubChem.

Journal ArticleDOI
Zefang Tang1, Boxi Kang1, Chenwei Li1, Tianxiang Chen1, Zemin Zhang1 
TL;DR: GEPIA2 has adopted new analysis techniques of gene signature quantification inspired by single-cell sequencing studies, and provides customized analysis where users can upload their own RNA-seq data and compare them with TCGA and GTEx samples.
Abstract: Introduced in 2017, the GEPIA (Gene Expression Profiling Interactive Analysis) web server has been a valuable and highly cited resource for gene expression analysis based on tumor and normal samples from the TCGA and the GTEx databases. Here, we present GEPIA2, an updated and enhanced version to provide insights with higher resolution and more functionalities. Featuring 198 619 isoforms and 84 cancer subtypes, GEPIA2 has extended gene expression quantification from the gene level to the transcript level, and supports analysis of a specific cancer subtype, and comparison between subtypes. In addition, GEPIA2 has adopted new analysis techniques of gene signature quantification inspired by single-cell sequencing studies, and provides customized analysis where users can upload their own RNA-seq data and compare them with TCGA and GTEx samples. We also offer an API for batch process and easy retrieval of the analysis results. The updated web server is publicly accessible at http://gepia2.cancer-pku.cn/.

Journal ArticleDOI
TL;DR: eggNOG as discussed by the authors is a public database of orthology relationships, gene evolutionary histories and functional annotations, with a major update of the underlying genome sets, which have been expanded to 4445 representative bacteria and 168 archaea derived from 25 038 genomes.
Abstract: eggNOG is a public database of orthology relationships, gene evolutionary histories and functional annotations. Here, we present version 5.0, featuring a major update of the underlying genome sets, which have been expanded to 4445 representative bacteria and 168 archaea derived from 25 038 genomes, as well as 477 eukaryotic organisms and 2502 viral proteomes that were selected for diversity and filtered by genome quality. In total, 4.4M orthologous groups (OGs) distributed across 379 taxonomic levels were computed together with their associated sequence alignments, phylogenies, HMM models and functional descriptors. Precomputed evolutionary analysis provides fine-grained resolution of duplication/speciation events within each OG. Our benchmarks show that, despite doubling the amount of genomes, the quality of orthology assignments and functional annotations (80% coverage) has persisted without significant changes across this update. Finally, we improved eggNOG online services for fast functional annotation and orthology prediction of custom genomics or metagenomics datasets. All precomputed data are publicly available for downloading or via API queries at http://eggnog.embl.de.

Journal ArticleDOI
Yuxing Liao1, Jing Wang1, Eric J. Jaehnig1, Zhiao Shi1, Bing Zhang1 
TL;DR: In the 2019 update, WebGestalt supports 12 organisms, 342 gene identifiers and 155 175 functional categories, as well as user-uploaded functional databases and has completely redesigned result visualizations and user interfaces to improve user-friendliness and to provide multiple types of interactive and publication-ready figures.
Abstract: WebGestalt is a popular tool for the interpretation of gene lists derived from large scale -omics studies. In the 2019 update, WebGestalt supports 12 organisms, 342 gene identifiers and 155 175 functional categories, as well as user-uploaded functional databases. To address the growing and unique need for phosphoproteomics data interpretation, we have implemented phosphosite set analysis to identify important kinases from phosphoproteomics data. We have completely redesigned result visualizations and user interfaces to improve user-friendliness and to provide multiple types of interactive and publication-ready figures. To facilitate comprehension of the enrichment results, we have implemented two methods to reduce redundancy between enriched gene sets. We introduced a web API for other applications to get data programmatically from the WebGestalt server or pass data to WebGestalt for analysis. We also wrapped the core computation into an R package called WebGestaltR for users to perform analysis locally or in third party workflows. WebGestalt can be freely accessed at http://www.webgestalt.org.

Journal ArticleDOI
TL;DR: UNITE is a web-based database and sequence management environment for the molecular identification of fungi that targets the formal fungal barcode—the nuclear ribosomal internal transcribed spacer region—and offers all public fungal ITS sequences for reference.
Abstract: Alfred P. Sloan Foundation [G-2015-14062]; Swedish Research Council of Environment, Agricultural Sciences, and Spatial Planning [FORMAS, 215-2011-498]; European Regional Development Fund (Centre of Excellence EcolChange) [TK131]; Estonian Research Council [IUT20-30]. Funding for open access charge: Swedish Research Council of Environment, Agricultural Sciences and Spatial Planning.

Journal ArticleDOI
TL;DR: A new Resistomes & Variants module provides analysis and statistical summary of in silico predicted resistance variants from 82 pathogens and over 100 000 genomes, able to summarize predicted resistance using the information included in CARD, identify trends in AMR mobility and determine previously undescribed and novel resistance variants.
Abstract: The Comprehensive Antibiotic Resistance Database (CARD; https://card.mcmaster.ca) is a curated resource providing reference DNA and protein sequences, detection models and bioinformatics tools on the molecular basis of bacterial antimicrobial resistance (AMR). CARD focuses on providing high-quality reference data and molecular sequences within a controlled vocabulary, the Antibiotic Resistance Ontology (ARO), designed by the CARD biocuration team to integrate with software development efforts for resistome analysis and prediction, such as CARD's Resistance Gene Identifier (RGI) software. Since 2017, CARD has expanded through extensive curation of reference sequences, revision of the ontological structure, curation of over 500 new AMR detection models, development of a new classification paradigm and expansion of analytical tools. Most notably, a new Resistomes & Variants module provides analysis and statistical summary of in silico predicted resistance variants from 82 pathogens and over 100 000 genomes. By adding these resistance variants to CARD, we are able to summarize predicted resistance using the information included in CARD, identify trends in AMR mobility and determine previously undescribed and novel resistance variants. Here, we describe updates and recent expansions to CARD and its biocuration process, including new resources for community biocuration of AMR molecular reference data.

Journal ArticleDOI
TL;DR: Rsubread is presented, a Bioconductor software package that provides high-performance alignment and read counting functions for RNA-seq reads that integrates read mapping and quantification in a single package and has no software dependencies other than R itself.
Abstract: We present Rsubread, a Bioconductor software package that provides high-performance alignment and read counting functions for RNA-seq reads. Rsubread is based on the successful Subread suite with the added ease-of-use of the R programming environment, creating a matrix of read counts directly as an R object ready for downstream analysis. It integrates read mapping and quantification in a single package and has no software dependencies other than R itself. We demonstrate Rsubread's ability to detect exon-exon junctions de novo and to quantify expression at the level of either genes, exons or exon junctions. The resulting read counts can be input directly into a wide range of downstream statistical analyses using other Bioconductor packages. Using SEQC data and simulations, we compare Rsubread to TopHat2, STAR and HTSeq as well as to counting functions in the Bioconductor infrastructure packages. We consider the performance of these tools on the combined quantification task starting from raw sequence reads through to summary counts, and in particular evaluate the performance of different combinations of alignment and counting algorithms. We show that Rsubread is faster and uses less memory than competitor tools and produces read count summaries that more accurately correlate with true values.

Journal ArticleDOI
Minoru Kanehisa1, Yoko Sato2, Miho Furumichi1, Kanae Morishima1, Mao Tanabe1 
TL;DR: This work has introduced a new approach where human gene variants are explicitly incorporated into what it calls ‘network variants’ in the recently released KEGG NETWORK database, which allows accumulation of knowledge about disease-related perturbed molecular networks caused not only by gene variants, but also by viruses and other pathogens, environmental factors and drugs.
Abstract: KEGG (Kyoto Encyclopedia of Genes and Genomes; https://www.kegg.jp/ or https://www.genome.jp/kegg/) is a reference knowledge base for biological interpretation of genome sequences and other high-throughput data. It is an integrated database consisting of three generic categories of systems information, genomic information and chemical information, and an additional human-specific category of health information. KEGG pathway maps, BRITE hierarchies and KEGG modules have been developed as generic molecular networks with KEGG Orthology nodes of functional orthologs so that KEGG pathway mapping and other procedures can be applied to any cellular organism. Unfortunately, however, this generic approach was inadequate for knowledge representation in the health information category, where variations of human genomes, especially disease-related variations, had to be considered. Thus, we have introduced a new approach where human gene variants are explicitly incorporated into what we call 'network variants' in the recently released KEGG NETWORK database. This allows accumulation of knowledge about disease-related perturbed molecular networks caused not only by gene variants, but also by viruses and other pathogens, environmental factors and drugs. We expect that KEGG NETWORK will become another reference knowledge base for the basic understanding of disease mechanisms and practical use in clinical sequencing and drug development.

Journal ArticleDOI
TL;DR: The 2019 version of SwissTargetPrediction is described, which represents a major update in terms of underlying data, backend and web interface, and high levels of predictive performance were maintained despite more extended biological and chemical spaces to be explored.
Abstract: SwissTargetPrediction is a web tool, on-line since 2014, that aims to predict the most probable protein targets of small molecules. Predictions are based on the similarity principle, through reverse screening. Here, we describe the 2019 version, which represents a major update in terms of underlying data, backend and web interface. The bioactivity data were updated, the model retrained and similarity thresholds redefined. In the new version, the predictions are performed by searching for similar molecules, in 2D and 3D, within a larger collection of 376 342 compounds known to be experimentally active on an extended set of 3068 macromolecular targets. An efficient backend implementation allows to speed up the process that returns results for a druglike molecule on human proteins in 15-20 s. The refreshed web interface enhances user experience with new features for easy input and improved analysis. Interoperability capacity enables straightforward submission of any input or output molecule to other on-line computer-aided drug design tools, developed by the SIB Swiss Institute of Bioinformatics. High levels of predictive performance were maintained despite more extended biological and chemical spaces to be explored, e.g. achieving at least one correct human target in the top 15 predictions for >70% of external compounds. The new SwissTargetPrediction is available free of charge (www.swisstargetprediction.ch).

Journal ArticleDOI
TL;DR: In this 8th release of JASPAR, the CORE collection has been expanded with 245 new PFMs, and 156 PFMs were updated, and the genomic tracks, inference tool, and TF-binding profile similarity clusters were updated.
Abstract: JASPAR (http://jaspar.genereg.net) is an open-access database of curated, non-redundant transcription factor (TF)-binding profiles stored as position frequency matrices (PFMs) for TFs across multiple species in six taxonomic groups. In this 8th release of JASPAR, the CORE collection has been expanded with 245 new PFMs (169 for vertebrates, 42 for plants, 17 for nematodes, 10 for insects, and 7 for fungi), and 156 PFMs were updated (125 for vertebrates, 28 for plants and 3 for insects). These new profiles represent an 18% expansion compared to the previous release. JASPAR 2020 comes with a novel collection of unvalidated TF-binding profiles for which our curators did not find orthogonal supporting evidence in the literature. This collection has a dedicated web form to engage the community in the curation of unvalidated TF-binding profiles. Moreover, we created a Q&A forum to ease the communication between the user community and JASPAR curators. Finally, we updated the genomic tracks, inference tool, and TF-binding profile similarity clusters. All the data is available through the JASPAR website, its associated RESTful API, and through the JASPAR2020 R/Bioconductor package.

Journal ArticleDOI
TL;DR: The DisGeNET platform, a knowledge management platform integrating and standardizing data about disease associated genes and variants from multiple sources, is an interoperable resource supporting a variety of applications in genomic medicine and drug R&D.
Abstract: One of the most pressing challenges in genomic medicine is to understand the role played by genetic variation in health and disease. Thanks to the exploration of genomic variants at large scale, hundreds of thousands of disease-associated loci have been uncovered. However, the identification of variants of clinical relevance is a significant challenge that requires comprehensive interrogation of previous knowledge and linkage to new experimental results. To assist in this complex task, we created DisGeNET (http://www.disgenet.org/), a knowledge management platform integrating and standardizing data about disease associated genes and variants from multiple sources, including the scientific literature. DisGeNET covers the full spectrum of human diseases as well as normal and abnormal traits. The current release covers more than 24 000 diseases and traits, 17 000 genes and 117 000 genomic variants. The latest developments of DisGeNET include new sources of data, novel data attributes and prioritization metrics, a redesigned web interface and recently launched APIs. Thanks to the data standardization, the combination of expert curated information with data automatically mined from the scientific literature, and a suite of tools for accessing its publicly available data, DisGeNET is an interoperable resource supporting a variety of applications in genomic medicine and drug R&D.

Journal ArticleDOI
TL;DR: Recent developments with InterPro (version 70.0) and its associated software are reported, including an 18% growth in the size of the database in terms on new InterPro entries, updates to content, the inclusion of an additional entry type, refined modelling of discontinuous domains, and the development of a new programmatic interface and website.
Abstract: The InterPro database (http://www.ebi.ac.uk/interpro/) classifies protein sequences into families and predicts the presence of functionally important domains and sites. Here, we report recent developments with InterPro (version 70.0) and its associated software, including an 18% growth in the size of the database in terms on new InterPro entries, updates to content, the inclusion of an additional entry type, refined modelling of discontinuous domains, and the development of a new programmatic interface and website. These developments extend and enrich the information provided by InterPro, and provide greater flexibility in terms of data access. We also show that InterPro's sequence coverage has kept pace with the growth of UniProtKB, and discuss how our evaluation of residue coverage may help guide future curation activities.

Journal ArticleDOI
TL;DR: The recent focus of the IEDB has been improved query and reporting functionality to meet the needs of users to access and summarize data that continues to grow in quantity and complexity.
Abstract: The Immune Epitope Database (IEDB, iedb.org) captures experimental data confined in figures, text and tables of the scientific literature, making it freely available and easily searchable to the public. The scope of the IEDB extends across immune epitope data related to all species studied and includes antibody, T cell, and MHC binding contexts associated with infectious, allergic, autoimmune, and transplant related diseases. Having been publicly accessible for >10 years, the recent focus of the IEDB has been improved query and reporting functionality to meet the needs of our users to access and summarize data that continues to grow in quantity and complexity. Here we present an update on our current efforts and future goals.

Journal ArticleDOI
TL;DR: A new dedicated aspect of BioGRID annotates genome-wide CRISPR/Cas9-based screens that report gene–phenotype and gene–gene relationships, and captures chemical interaction data, including chemical–protein interactions for human drug targets drawn from the DrugBank database and manually curated bioactive compounds reported in the literature.
Abstract: The Biological General Repository for Interaction Datasets (BioGRID: https://thebiogrid.org) is an open access database dedicated to the curation and archival storage of protein, genetic and chemical interactions for all major model organism species and humans. As of September 2018 (build 3.4.164), BioGRID contains records for 1 598 688 biological interactions manually annotated from 55 809 publications for 71 species, as classified by an updated set of controlled vocabularies for experimental detection methods. BioGRID also houses records for >700 000 post-translational modification sites. BioGRID now captures chemical interaction data, including chemical-protein interactions for human drug targets drawn from the DrugBank database and manually curated bioactive compounds reported in the literature. A new dedicated aspect of BioGRID annotates genome-wide CRISPR/Cas9-based screens that report gene-phenotype and gene-gene relationships. An extension of the BioGRID resource called the Open Repository for CRISPR Screens (ORCS) database (https://orcs.thebiogrid.org) currently contains over 500 genome-wide screens carried out in human or mouse cell lines. All data in BioGRID is made freely available without restriction, is directly downloadable in standard formats and can be readily incorporated into existing applications via our web service platforms. BioGRID data are also freely distributed through partner model organism databases and meta-databases.

Journal ArticleDOI
TL;DR: The challenge for the IPD-IMGT/HLA Database is to continue to provide a highly curated database of sequence variants, while supporting the increased number of submissions and complexity of sequences.
Abstract: The IPD-IMGT/HLA Database, http://www.ebi.ac.uk/ipd/imgt/hla/, currently contains over 25 000 allele sequence for 45 genes, which are located within the Major Histocompatibility Complex (MHC) of the human genome. This region is the most polymorphic region of the human genome, and the levels of polymorphism seen exceed most other genes. Some of the genes have several thousand variants and are now termed hyperpolymorphic, rather than just simply polymorphic. The IPD-IMGT/HLA Database has provided a stable, highly accessible, user-friendly repository for this information, providing the scientific and medical community access to the many variant sequences of this gene system, that are critical for the successful outcome of transplantation. The number of currently known variants, and dramatic increase in the number of new variants being identified has necessitated a dedicated resource with custom tools for curation and publication. The challenge for the database is to continue to provide a highly curated database of sequence variants, while supporting the increased number of submissions and complexity of sequences. In order to do this, traditional methods of accessing and presenting data will be challenged, and new methods will need to be utilized to keep pace with new discoveries.