Showing papers in "Nucleic Acids Research in 2021"

Journal Article•DOI•

UniProt: the universal protein knowledgebase in 2021

Alex Bateman, Maria Jesus Martin, Sandra Orchard, Michele Magrane, Rahat Agivetova, Shadab Ahmad, Emanuele Alpi, Emily H Bowler-Barnett, Ramona Britto, Borisas Bursteinas, Hema Bye-A-Jee, Ray Coetzee, Austra Cukura, Alan Wilter Sousa da Silva, Paul Denny, Tunca Doğan, ThankGod Ebenezer, Jun Fan, Leyla Jael Garcia Castro, Penelope Garmiri, George Georghiou, Leonardo Gonzales, Emma Hatton-Ellis, Abdulrahman Hussein, Alexandr Ignatchenko, Giuseppe Insana, Rizwan Ishtiaq, Petteri Jokinen, Vishal Joshi, Dushyanth Jyothi, Antonia Lock, Rodrigo Lopez, Aurelien Luciani, Jie Luo, Yvonne Lussi, Alistair MacDougall, Fábio Madeira, Mahdi Mahmoudy, Manuela Menchi, Alok Mishra, Katie Moulang, Andrew Nightingale, Carla Susana Oliveira, Sangya Pundir, Guoying Qi, Shriya Raj, Daniel Rice, Milagros Rodriguez Lopez, Rabie Saidi, Joseph Sampson, Tony Sawford, Elena Speretta, Edward Turner, Nidhi Tyagi, Preethi Vasudev, Vladimir Volynkin, Kate Warner, Xavier Watkins, Rossana Zaru, Hermann Zellner, Alan Bridge, Sylvain Poux, Nicole Redaschi, Lucila Aimo, Ghislaine Argoud-Puy, Andrea H. Auchincloss, Kristian B. Axelsen, Parit Bansal, Delphine Baratin, Marie-Claude Blatter, Jerven Bolleman, Emmanuel Boutet, Lionel Breuza, Cristina Casals-Casas, Edouard de Castro, Kamal Chikh Echioukh, Elisabeth Coudert, Béatrice A. Cuche, M Doche, Dolnide Dornevil, Anne Estreicher, Maria Livia Famiglietti, Marc Feuermann, Elisabeth Gasteiger, Sebastien Gehant, Vivienne Baillie Gerritsen, Arnaud Gos, Nadine Gruaz-Gumowski, Ursula Hinz, Chantal Hulo, Nevila Hyka-Nouspikel, Florence Jungo, Guillaume Keller, Arnaud Kerhornou, Vicente Lara, Philippe Le Mercier, Damien Lieberherr, Thierry Lombardot, Xavier D. Martin, Patrick Masson, Anne Morgat, Teresa Batista Neto, Salvo Paesano, Ivo Pedruzzi, Sandrine Pilbout, Lucille Pourcel, Monica Pozzato, Manuela Pruess, Catherine Rivoire, Christian J. A. Sigrist, K Sonesson, Andre Stutz, Shyamala Sundaram, Michael Tognolli, Laure Verbregue, Cathy H. Wu, Cecilia N. Arighi, Leslie Arminski, Chuming Chen, Yongxing Chen, John S. Garavelli, Hongzhan Huang, Kati Laiho, Peter B. McGarvey, Darren A. Natale, Karen E. Ross, C. R. Vinayaka, Qinghua Wang, Yuqi Wang, Lai-Su L. Yeh, Jian Zhang, Patrick Ruch, Douglas Teodoro

08 Jan 2021-Nucleic Acids Research

TL;DR: UniProtKB responded to the COVID-19 pandemic through expert curation of relevant entries that were rapidly made available to the research community through a dedicated portal, and a credit-based publication submission interface was developed.

Abstract: The aim of the UniProt Knowledgebase is to provide users with a comprehensive, high-quality and freely accessible set of protein sequences annotated with functional information. In this article, we describe significant updates that we have made over the last two years to the resource. The number of sequences in UniProtKB has risen to approximately 190 million, despite continued work to reduce sequence redundancy at the proteome level. We have adopted new methods of assessing proteome completeness and quality. We continue to extract detailed annotations from the literature to add to reviewed entries and supplement these in unreviewed entries with annotations provided by automated systems such as the newly implemented Association-Rule-Based Annotator (ARBA). We have developed a credit-based publication submission interface to allow the community to contribute publications and annotations to UniProt entries. We describe how UniProtKB responded to the COVID-19 pandemic through expert curation of relevant entries that were rapidly made available to the research community through a dedicated portal. UniProt resources are available under a CC-BY (4.0) license via the web at https://www.uniprot.org/.

4,001 citations

Journal Article•DOI•

The STRING database in 2021: customizable protein-protein networks, and functional characterization of user-uploaded gene/measurement sets.

Damian Szklarczyk¹, Annika L. Gable¹, Katerina C. Nastou², David Lyon¹, Rebecca Kirsch², Sampo Pyysalo³, Nadezhda Tsankova Doncheva², Marc Legeay², Tao Fang¹, Peer Bork, Lars Juhl Jensen², Christian von Mering¹

Swiss Institute of Bioinformatics¹, University of Copenhagen², University of Turku³

08 Jan 2021-Nucleic Acids Research

TL;DR: Changes to the text-mining system, a new scoring-mode for physical interactions, as well as extensive user interface features for customizing, extending and sharing protein networks are described.

Abstract: Cellular life depends on a complex web of functional associations between biomolecules. Among these associations, protein-protein interactions are particularly important due to their versatility, specificity and adaptability. The STRING database aims to integrate all known and predicted associations between proteins, including both physical interactions as well as functional associations. To achieve this, STRING collects and scores evidence from a number of sources: (i) automated text mining of the scientific literature, (ii) databases of interaction experiments and annotated complexes/pathways, (iii) computational interaction predictions from co-expression and from conserved genomic context and (iv) systematic transfers of interaction evidence from one organism to another. STRING aims for wide coverage; the upcoming version 11.5 of the resource will contain more than 14 000 organisms. In this update paper, we describe changes to the text-mining system, a new scoring-mode for physical interactions, as well as extensive user interface features for customizing, extending and sharing protein networks. In addition, we describe how to query STRING with genome-wide, experimental data, including the automated detection of enriched functionalities and potential biases in the user's query data. The STRING resource is available online, at https://string-db.org/.

3,253 citations

Journal Article•DOI•

Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation.

Ivica Letunic, Peer Bork¹,²,³

European Bioinformatics Institute¹, Yonsei University², University of Würzburg³

02 Jul 2021-Nucleic Acids Research

TL;DR: The Interactive Tree Of Life (iTOL) is an online tool for the display, manipulation and annotation of phylogenetic and other trees; version 5 allows users to draw shapes, labels and other features directly onto the trees.

Abstract: The Interactive Tree Of Life (https://itol.embl.de) is an online tool for the display, manipulation and annotation of phylogenetic and other trees. It is freely available and open to everyone. iTOL version 5 introduces a completely new tree display engine, together with numerous new features. For example, a new dataset type has been added (MEME motifs), while annotation options have been expanded for several existing ones. Node metadata display options have been extended and now also support non-numerical categorical values, as well as multiple values per node. Direct manual annotation is now available, providing a set of basic drawing and labeling tools, allowing users to draw shapes, labels and other features by hand directly onto the trees. Support for tree and dataset scales has been extended, providing fine control over line and label styles. Unrooted tree displays can now use the equal-daylight algorithm, providing much greater display clarity. The user account system has been streamlined and expanded with new navigation options and currently handles >1 million trees from >70 000 individual users.

2,856 citations

Journal Article•DOI•

Pfam: The protein families database in 2021.

Jaina Mistry¹, Sara Chuguransky¹, Lowri Williams¹, Matloob Qureshi¹, Gustavo A. Salazar¹, Erik L. L. Sonnhammer², Silvio C. E. Tosatto³, Lisanna Paladin³, Shriya Raj¹, Lorna Richardson¹, Robert D. Finn¹, Alex Bateman¹

European Bioinformatics Institute¹, Science for Life Laboratory², University of Padua³

08 Jan 2021-Nucleic Acids Research

TL;DR: The Pfam database is a widely used resource for classifying protein sequences into families and domains. The reintroduced Pfam-B provides an automatically generated supplement to Pfam and contains 136 730 novel clusters of sequences that are not yet matched by a Pfam family.

Abstract: The Pfam database is a widely used resource for classifying protein sequences into families and domains. Since Pfam was last described in this journal, over 350 new families have been added in Pfam 33.1 and numerous improvements have been made to existing entries. To facilitate research on COVID-19, we have revised the Pfam entries that cover the SARS-CoV-2 proteome, and built new entries for regions that were not covered by Pfam. We have reintroduced Pfam-B which provides an automatically generated supplement to Pfam and contains 136 730 novel clusters of sequences that are not yet matched by a Pfam family. The new Pfam-B is based on a clustering by the MMseqs2 software. We have compared all of the regions in the RepeatsDB to those in Pfam and have started to use the results to build and refine Pfam repeat families. Pfam is freely available for browsing and download at http://pfam.xfam.org/.

2,296 citations

Journal Article•DOI•

KEGG: integrating viruses and cellular organisms.

Minoru Kanehisa¹, Miho Furumichi¹, Yoko Sato², Mari Ishiguro-Watanabe³, Mao Tanabe¹

Kyoto University¹, Fujitsu², University of Tokyo³

08 Jan 2021-Nucleic Acids Research

TL;DR: The KEGG pathway maps are now integrated with network variation maps in the NETWORK database, as well as with conserved functional units of KEGG modules and reaction modules in the MODULE database, and the KO database for functional orthologs continues to be improved.

Abstract: KEGG (https://www.kegg.jp/) is a manually curated resource integrating eighteen databases categorized into systems, genomic, chemical and health information. It also provides KEGG mapping tools, which enable understanding of cellular and organism-level functions from genome sequences and other molecular datasets. KEGG mapping is a predictive method of reconstructing molecular network systems from molecular building blocks based on the concept of functional orthologs. Since the introduction of the KEGG NETWORK database, various diseases have been associated with network variants, which are perturbed molecular networks caused by human gene variants, viruses, other pathogens and environmental factors. The network variation maps are created as aligned sets of related networks showing, for example, how different viruses inhibit or activate specific cellular signaling pathways. The KEGG pathway maps are now integrated with network variation maps in the NETWORK database, as well as with conserved functional units of KEGG modules and reaction modules in the MODULE database. The KO database for functional orthologs continues to be improved and virus KOs are being expanded for better understanding of virus-cell interactions and for enabling prediction of viral perturbations.

2,087 citations

Journal Article•DOI•

AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models.

Mihaly Varadi¹, Stephen Anyango¹, Mandar Deshpande¹, Sreenath Nair¹, Cindy Natassia¹, Galabina Yordanova¹, David Yu Yuan¹, Oana Stroe¹, Gemma Wood¹, Agata Laydon, Augustin Žídek, Tim Green, Kathryn Tunyasuvunakool, Stig Petersen, John M. Jumper, Ellen Clancy, Richard E. Green, Ankur Vora, Mira Lutfi, Michael Figurnov, Andrew Cowie, Nicole Hobbs, Pushmeet Kohli, Gerard J. Kleywegt¹, Ewan Birney¹, Demis Hassabis, Sameer Velankar¹

European Bioinformatics Institute¹

17 Nov 2021-Nucleic Acids Research

TL;DR: The AlphaFold Protein Structure Database (AlphaFold DB, https://alphafold.ebi.ac.uk) is an openly accessible, extensive database of high-accuracy protein-structure predictions.

Abstract: The AlphaFold Protein Structure Database (AlphaFold DB, https://alphafold.ebi.ac.uk) is an openly accessible, extensive database of high-accuracy protein-structure predictions. Powered by AlphaFold v2.0 of DeepMind, it has enabled an unprecedented expansion of the structural coverage of the known protein-sequence space. AlphaFold DB provides programmatic access to and interactive visualization of predicted atomic coordinates, per-residue and pairwise model-confidence estimates and predicted aligned errors. The initial release of AlphaFold DB contains over 360,000 predicted structures across 21 model-organism proteomes, which will soon be expanded to cover most of the (over 100 million) representative sequences from the UniRef90 data set.
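
The per-residue confidence estimates mentioned in the abstract can be illustrated with a minimal sketch. It assumes, as is the convention for AlphaFold DB coordinate files, that the per-residue score (pLDDT, 0 to 100) is stored in the B-factor column of each PDB-format ATOM record. The helper names (`atom_line`, `plddt_by_residue`, `confidence_band`) and the three sample residues are hypothetical and not taken from the paper; the band thresholds follow the commonly used 90/70/50 cut-offs.

```python
def atom_line(serial, name, res_name, chain, res_seq, x, y, z, b_factor):
    """Format one synthetic ATOM record in fixed-column PDB layout (hypothetical helper)."""
    return (
        f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{b_factor:6.2f}"
    )


def plddt_by_residue(pdb_text):
    """Map residue number -> pLDDT, read from the B-factor field of C-alpha atoms."""
    scores = {}
    for line in pdb_text.splitlines():
        # Atom name sits in columns 13-16, residue number in 23-26, B-factor in 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def confidence_band(plddt):
    """Bucket a pLDDT value using the commonly quoted 90/70/50 thresholds."""
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "very low"


# Synthetic three-residue sample; the coordinates and scores are made up.
SAMPLE = "\n".join([
    atom_line(1, "N", "MET", "A", 1, -9.123, 10.456, 3.789, 45.2),
    atom_line(2, "CA", "MET", "A", 1, -8.000, 11.000, 4.500, 45.2),
    atom_line(3, "CA", "LYS", "A", 2, -6.000, 13.000, 5.500, 72.5),
    atom_line(4, "CA", "GLY", "A", 3, -5.000, 14.000, 6.000, 93.1),
])

scores = plddt_by_residue(SAMPLE)
for res_seq, plddt in scores.items():
    print(res_seq, plddt, confidence_band(plddt))
```

In practice the same parser would be pointed at a coordinate file downloaded from the database rather than at synthetic records.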

2,008 citations

Journal Article•DOI•

The Gene Ontology resource: enriching a GOld mine

Seth Carbon, Eric Douglass, Benjamin M. Good, Deepak Unni +176 more

08 Jan 2021-Nucleic Acids Research

TL;DR: A historical archive covering the past 15 years of GO data, with a consistent format and file structure for both the ontology and annotations, is made available to support traceability and reproducibility.

Abstract: The Gene Ontology Consortium (GOC) provides the most comprehensive resource currently available for computable knowledge regarding the functions of genes and gene products. Here, we report the advances of the consortium over the past two years. The new GO-CAM annotation framework was notably improved, and we formalized the model with a computational schema to check and validate the rapidly increasing repository of 2838 GO-CAMs. In addition, we describe the impacts of several collaborations to refine GO and report a 10% increase in the number of GO annotations, a 25% increase in annotated gene products, and over 9,400 new scientific articles annotated. As the project matures, we continue our efforts to review older annotations in light of newer findings, and, to maintain consistency with other ontologies. As a result, 20 000 annotations derived from experimental data were reviewed, corresponding to 2.5% of experimental GO annotations. The website (http://geneontology.org) was redesigned for quick access to documentation, downloads and tools. To maintain an accurate resource and support traceability and reproducibility, we have made available a historical archive covering the past 15 years of GO data with a consistent format and file structure for both the ontology and annotations.

1,988 citations

Journal Article•DOI•

PubChem in 2021: new data content and improved web interfaces.

Sunghwan Kim¹, Jie Chen¹, Tiejun Cheng¹, Asta Gindulyte¹, Jia He¹, Siqian He¹, Qingliang Li¹, Benjamin A. Shoemaker¹, Paul A. Thiessen¹, Bo Yu¹, Leonid Zaslavsky¹, Jian Zhang¹, Evan E Bolton¹

National Institutes of Health¹

08 Jan 2021-Nucleic Acids Research

TL;DR: In the past two years, PubChem added data from more than 100 new sources, updated its homepage and record pages (with a data model change that also affects programmatic users), and introduced several new services.

Abstract: PubChem (https://pubchem.ncbi.nlm.nih.gov) is a popular chemical information resource that serves the scientific community as well as the general public, with millions of unique users per month. In the past two years, PubChem made substantial improvements. Data from more than 100 new data sources were added to PubChem, including chemical-literature links from Thieme Chemistry, chemical and physical property links from SpringerMaterials, and patent links from the World Intellectual Property Organization (WIPO). PubChem's homepage and individual record pages were updated to help users find desired information faster. This update involved a data model change for the data objects used by these pages as well as by programmatic users. Several new services were introduced, including the PubChem Periodic Table and Element pages, Pathway pages, and Knowledge panels. Additionally, in response to the coronavirus disease 2019 (COVID-19) outbreak, PubChem created a special data collection that contains PubChem data related to COVID-19 and the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).

1,791 citations

Journal Article•DOI•

MetaboAnalyst 5.0: narrowing the gap between raw spectra and functional insights.

Zhiqiang Pang¹, Jasmine Chong¹, Guangyan Zhou¹, David Anderson de Lima Morais², Le Chang¹, Michel Barrette², Carol Gauthier², Pierre-Étienne Jacques², Shuzhao Li, Jianguo Xia¹

McGill University¹, Université de Sherbrooke²

02 Jul 2021-Nucleic Acids Research

TL;DR: MetaboAnalyst 5.0 is the latest version of the web-based platform for comprehensive metabolomics data analysis and interpretation, aiming to narrow the gap from raw data to functional insights for global metabolomics based on high-resolution mass spectrometry (HRMS).

Abstract: Since its first release over a decade ago, the MetaboAnalyst web-based platform has become widely used for comprehensive metabolomics data analysis and interpretation. Here we introduce MetaboAnalyst version 5.0, aiming to narrow the gap from raw data to functional insights for global metabolomics based on high-resolution mass spectrometry (HRMS). Three modules have been developed to help achieve this goal, including: (i) a LC-MS Spectra Processing module which offers an easy-to-use pipeline that can perform automated parameter optimization and resumable analysis to significantly lower the barriers to LC-MS1 spectra processing; (ii) a Functional Analysis module which expands the previous MS Peaks to Pathways module to allow users to intuitively select any peak groups of interest and evaluate their enrichment of potential functions as defined by metabolic pathways and metabolite sets; (iii) a Functional Meta-Analysis module to combine multiple global metabolomics datasets obtained under complementary conditions or from similar studies to arrive at comprehensive functional insights. There are many other new functions including weighted joint-pathway analysis, data-driven network analysis, batch effect correction, merging technical replicates, improved compound name matching, etc. The web interface, graphics and underlying codebase have also been refactored to improve performance and user experience. At the end of an analysis session, users can now easily switch to other compatible modules for a more streamlined data analysis. MetaboAnalyst 5.0 is freely available at https://www.metaboanalyst.ca.

1,530 citations

Journal Article•DOI•

The PRIDE database resources in 2022: a hub for mass spectrometry-based proteomics evidences.

Yasset Perez-Riverol¹, Jingwen Bai¹, Chakradhar Bandla¹, David García-Seisdedos¹, Suresh Hewapathirana¹, Selvakumar Kamatchinathan¹, Deepti J. Kundu¹, Ananth Prakash¹, Anika Frericks-Zipper², Martin Eisenacher², Mathias Walzer¹, Shengbo Wang¹, Alvis Brazma¹, Juan Antonio Vizcaíno¹

European Bioinformatics Institute¹, Ruhr University Bochum²

01 Nov 2021-Nucleic Acids Research

TL;DR: The PRIDE database is the world's largest data repository of mass spectrometry-based proteomics data and one of the founding members of the global ProteomeXchange (PX) consortium.

Abstract: The PRoteomics IDEntifications (PRIDE) database (https://www.ebi.ac.uk/pride/) is the world's largest data repository of mass spectrometry-based proteomics data. PRIDE is one of the founding members of the global ProteomeXchange (PX) consortium and an ELIXIR core data resource. In this manuscript, we summarize the developments in PRIDE resources and related tools since the previous update manuscript was published in Nucleic Acids Research in 2019. The number of submitted datasets to PRIDE Archive (the archival component of PRIDE) has reached on average around 500 datasets per month during 2021. In addition to continuous improvements in PRIDE Archive data pipelines and infrastructure, the PRIDE Spectra Archive has been developed to provide direct access to the submitted mass spectra using Universal Spectrum Identifiers. As a key point, the file format MAGE-TAB for proteomics has been developed to enable the improvement of sample metadata annotation. Additionally, the resource PRIDE Peptidome provides access to aggregated peptide/protein evidences across PRIDE Archive. Furthermore, we will describe how PRIDE has increased its efforts to reuse and disseminate high-quality proteomics data into other added-value resources such as UniProt, Ensembl and Expression Atlas.

1,491 citations

Journal Article•DOI•

The InterPro protein families and domains database: 20 years on.

Matthias Blum¹, Hsin-Yu Chang¹, Sara Chuguransky¹, Tiago Grego¹, Swaathi Kandasaamy¹, Alex L. Mitchell¹, Gift Nuka¹, Typhaine Paysan-Lafosse¹, Matloob Qureshi¹, Shriya Raj¹, Lorna Richardson¹, Gustavo A. Salazar¹, Lowri Williams¹, Peer Bork, Alan Bridge², Julian Gough³, Daniel H. Haft⁴, Ivica Letunic, Aron Marchler-Bauer⁴, Huaiyu Mi⁵, Darren A. Natale⁶, Marco Necci⁷, Christine A. Orengo⁸, Arun Prasad Pandurangan³, Catherine Rivoire², Christian J. A. Sigrist², Ian Sillitoe⁸, Narmada Thanki⁴, Paul Thomas⁵, Silvio C. E. Tosatto⁷, Cathy H. Wu⁶, Alex Bateman¹, Robert D. Finn¹

European Bioinformatics Institute¹, Swiss Institute of Bioinformatics², Laboratory of Molecular Biology³, National Institutes of Health⁴, University of Southern California⁵, Georgetown University Medical Center⁶, University of Padua⁷, University College London⁸

08 Jan 2021-Nucleic Acids Research

TL;DR: The status of InterPro (version 81.0) in its 20th year of operation, and its associated software, is reported, including updates to database content, the release of a new website and REST API, and performance improvements in InterProScan.

Abstract: The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. InterProScan is the underlying software that allows protein and nucleic acid sequences to be searched against InterPro's signatures. Signatures are predictive models which describe protein families, domains or sites, and are provided by multiple databases. InterPro combines signatures representing equivalent families, domains or sites, and provides additional information such as descriptions, literature references and Gene Ontology (GO) terms, to produce a comprehensive resource for protein classification. Founded in 1999, InterPro has become one of the most widely used resources for protein family annotation. Here, we report the status of InterPro (version 81.0) in its 20th year of operation, and its associated software, including updates to database content, the release of a new website and REST API, and performance improvements in InterProScan.

Journal Article•DOI•

antiSMASH 6.0: improving cluster detection and comparison capabilities.

Kai Blin¹, Simon Shaw¹, Alexander M. Kloosterman², Zach Charlop-Powers, Gilles P. van Wezel², Marnix H. Medema²,³, Tilmann Weber¹

Technical University of Denmark¹, Leiden University², Wageningen University and Research Centre³

05 Dec 2021-Nucleic Acids Research

TL;DR: antiSMASH is currently the most widely used tool for detecting and characterising biosynthetic gene clusters (BGCs) in bacteria and fungi; this paper presents the updated version 6.

Abstract: Many microorganisms produce natural products that form the basis of antimicrobials, antivirals, and other drugs. Genome mining is routinely used to complement screening-based workflows to discover novel natural products. Since 2011, the "antibiotics and secondary metabolite analysis shell-antiSMASH" (https://antismash.secondarymetabolites.org/) has supported researchers in their microbial genome mining tasks, both as a free-to-use web server and as a standalone tool under an OSI-approved open-source license. It is currently the most widely used tool for detecting and characterising biosynthetic gene clusters (BGCs) in bacteria and fungi. Here, we present the updated version 6 of antiSMASH. antiSMASH 6 increases the number of supported cluster types from 58 to 71, displays the modular structure of multi-modular BGCs, adds a new BGC comparison algorithm, allows for the integration of results from other prediction tools, and more effectively detects tailoring enzymes in RiPP clusters.

Journal Article•DOI•

SMART: recent updates, new developments and status in 2020.

Ivica Letunic, Supriya Khedkar¹, Peer Bork¹,²,³

European Bioinformatics Institute¹, Max Delbrück Center for Molecular Medicine², University of Würzburg³

08 Jan 2021-Nucleic Acids Research

TL;DR: SMART version 9 contains manually curated models for more than 1300 protein domains, with a topical set of 68 new models added since the last update article, greatly increasing the total number of annotated domains and other protein features available in architecture analysis mode.

Abstract: SMART (Simple Modular Architecture Research Tool) is a web resource (https://smart.embl.de) for the identification and annotation of protein domains and the analysis of protein domain architectures. SMART version 9 contains manually curated models for more than 1300 protein domains, with a topical set of 68 new models added since our last update article (1). All the new models are for diverse recombinase families and subfamilies and as a set they provide a comprehensive overview of mobile element recombinases namely transposase, integrase, relaxase, resolvase, cas1 casposase and Xer like cellular recombinase. Further updates include the synchronization of the underlying protein databases with UniProt (2), Ensembl (3) and STRING (4), greatly increasing the total number of annotated domains and other protein features available in architecture analysis mode. Furthermore, SMART's vector-based protein display engine has been extended and updated to use the latest web technologies and the domain architecture analysis components have been optimized to handle the increased number of protein features available.

Journal Article•DOI•

PANTHER version 16: a revised family classification, tree-based classification tool, enhancer regions and extensive API.

Huaiyu Mi¹, Dustin Ebert¹, Anushya Muruganujan¹, Caitlin Mills¹, Laurent-Philippe Albou¹, Tremayne Mushayamaha¹, Paul Thomas¹

University of Southern California¹

08 Jan 2021-Nucleic Acids Research

TL;DR: This work analyzes the current coverage of genes from genomes in different taxonomic groups, so that users can better understand what to expect when analyzing a gene list using PANTHER tools.

Abstract: PANTHER (Protein Analysis Through Evolutionary Relationships, http://www.pantherdb.org) is a resource for the evolutionary and functional classification of protein-coding genes from all domains of life. The evolutionary classification is based on a library of over 15,000 phylogenetic trees, and the functional classifications include Gene Ontology terms and pathways. Here, we analyze the current coverage of genes from genomes in different taxonomic groups, so that users can better understand what to expect when analyzing a gene list using PANTHER tools. We also describe extensive improvements to PANTHER made in the past two years. The PANTHER Protein Class ontology has been completely refactored, and 6101 PANTHER families have been manually assigned to a Protein Class, providing a high level classification of protein families and their genes. Users can access the TreeGrafter tool to add their own protein sequences to the reference phylogenetic trees in PANTHER, to infer evolutionary context as well as fine-grained annotations. We have added human enhancer-gene links that associate non-coding regions with the annotated human genes in PANTHER. We have also expanded the available services for programmatic access to PANTHER tools and data via application programming interfaces (APIs). Other improvements include additional plant genomes and an updated PANTHER GO-slim.

Journal Article•DOI•

RCSB Protein Data Bank: powerful new tools for exploring 3D structures of biological macromolecules for basic and applied research and education in fundamental biology, biomedicine, biotechnology, bioengineering and energy sciences.

Stephen K. Burley, Charmi Bhikadiya¹, Chunxiao Bi², Sebastian Bittrich², Li Chen¹, Gregg V. Crichlow¹, Cole Christie², Kenneth Dalenberg¹, Luigi Di Costanzo¹, Jose M. Duarte², Shuchismita Dutta¹, Zukang Feng¹, Sai J. Ganesan³, David S. Goodsell¹,⁴, Sutapa Ghosh¹, Rachel Kramer Green¹, Vladimir Guranovic¹, Dmytro Guzenko², Brian P. Hudson¹, Catherine L. Lawson¹, Yu-He Liang¹, Robert Lowe¹, Harry Namkoong¹, Ezra Peisach¹, Irina Persikova¹, Christopher Randle², Alexander S. Rose², Yana Rose², Andrej Sali³, Joan Segura², Monica Sekharan¹, Chenghua Shao¹, Yi-Ping Tao¹, Maria Voigt¹, John D. Westbrook¹, Jasmine Young¹, Christine Zardecki¹, Marina Zhuravleva¹

Rutgers University¹, San Diego Supercomputer Center², University of California, San Francisco³, Scripps Research Institute⁴

08 Jan 2021-Nucleic Acids Research

TL;DR: New features and resources of the RCSB PDB have been described in detail using examples that showcase recently released structures of SARS-CoV-2 proteins and host cell proteins relevant to understanding and addressing the COVID-19 global pandemic.

Abstract: The Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB), the US data center for the global PDB archive and a founding member of the Worldwide Protein Data Bank partnership, serves tens of thousands of data depositors in the Americas and Oceania and makes 3D macromolecular structure data available at no charge and without restrictions to millions of RCSB.org users around the world, including >660 000 educators, students and members of the curious public using PDB101.RCSB.org. PDB data depositors include structural biologists using macromolecular crystallography, nuclear magnetic resonance spectroscopy, 3D electron microscopy and micro-electron diffraction. PDB data consumers accessing our web portals include researchers, educators and students studying fundamental biology, biomedicine, biotechnology, bioengineering and energy sciences. During the past 2 years, the research-focused RCSB PDB web portal (RCSB.org) has undergone a complete redesign, enabling improved searching with full Boolean operator logic and more facile access to PDB data integrated with >40 external biodata resources. New features and resources are described in detail using examples that showcase recently released structures of SARS-CoV-2 proteins and host cell proteins relevant to understanding and addressing the COVID-19 global pandemic.

Journal Article•DOI•

MitoCarta3.0: an updated mitochondrial proteome now with sub-organelle localization and pathway annotations.

Sneha Rath¹,², Rohit Sharma¹,², Rahul Gupta¹,², Tslil Ast¹,², Connie Chan¹,², Timothy J. Durham¹,², Russell P. Goodman¹,², Zenon Grabarek¹,², Mary E. Haas¹,², Wendy H. W. Hung¹,², Pallavi R. Joshi¹,², Alexis A. Jourdain¹,², Sharon H. Kim¹,², Anna V. Kotrys¹,², Stephanie S Lam¹,², Jason G. McCoy¹,², Joshua D. Meisel¹,², Maria Miranda¹,², Apekshya Panda¹,², Anupam Patgiri¹,², Robert S. Rogers¹,², Shayan Sadre¹,², Hardik Shah¹,², Owen S. Skinner¹,², Tsz-Leung To¹,², Melissa A. Walker¹,², Hong Wang¹,², Patrick S. Ward¹,², Jordan Wengrod¹,², Chen-Ching Yuan¹,², Sarah E. Calvo¹,², Vamsi K. Mootha¹,²•Institutions (2)

Harvard University¹, Broad Institute²

08 Jan 2021-Nucleic Acids Research

TL;DR: MitoCarta3.0, a catalogue of over 1000 genes encoding the mammalian mitochondrial proteome, is introduced and includes manually curated annotations of sub-mitochondrial localization and MitoPathway annotations, spanning seven broad functional categories relevant to mitochondria.

Abstract: The mammalian mitochondrial proteome is under dual genomic control, with 99% of proteins encoded by the nuclear genome and 13 originating from the mitochondrial DNA (mtDNA). We previously developed MitoCarta, a catalogue of over 1000 genes encoding the mammalian mitochondrial proteome. This catalogue was compiled using a Bayesian integration of multiple sequence features and experimental datasets, notably protein mass spectrometry of mitochondria isolated from fourteen murine tissues. Here, we introduce MitoCarta3.0. Beginning with the MitoCarta2.0 inventory, we performed manual review to remove 100 genes and introduce 78 additional genes, arriving at an updated inventory of 1136 human genes. We now include manually curated annotations of sub-mitochondrial localization (matrix, inner membrane, intermembrane space, outer membrane) as well as assignment to 149 hierarchical 'MitoPathways' spanning seven broad functional categories relevant to mitochondria. MitoCarta3.0, including sub-mitochondrial localization and MitoPathway annotations, is freely available at http://www.broadinstitute.org/mitocarta and should serve as a continued community resource for mitochondrial biology and medicine.

Journal Article•DOI•

ADMETlab 2.0: an integrated online platform for accurate and comprehensive predictions of ADMET properties

Guo-Li Xiong¹, Zhenxing Wu², Jiacai Yi³, Li Fu¹, Zhi-Jiang Yang¹, Chang-Yu Hsieh⁴, Ming-Zhu Yin¹, Xiangxiang Zeng⁵, Chengkun Wu³, Aiping Lu⁶, Chen Xiang¹, Tingjun Hou², Dong-Sheng Cao¹,⁶•Institutions (6)

Central South University¹, Zhejiang University², National University of Defense Technology³, Tencent⁴, Hunan University⁵, Hong Kong Baptist University⁶

02 Jul 2021-Nucleic Acids Research

TL;DR: ADMETlab 2.0 is a completely redesigned version of the widely used ADMETlab web server for predicting the pharmacokinetic and toxicity properties of chemicals. It supports approximately twice as many ADMET-related endpoints as the previous version, including 17 physicochemical properties, 13 medicinal chemistry properties, 23 ADME properties, 27 toxicity endpoints and 8 toxicophore rules.

Abstract: Because undesirable pharmacokinetics and toxicity of candidate compounds are the main reasons for the failure of drug development, it has been widely recognized that absorption, distribution, metabolism, excretion and toxicity (ADMET) should be evaluated as early as possible. In silico ADMET evaluation models have been developed as an additional tool to assist medicinal chemists in the design and optimization of leads. Here, we announce the release of ADMETlab 2.0, a completely redesigned version of the widely used ADMETlab web server for the prediction of pharmacokinetics and toxicity properties of chemicals, of which the supported ADMET-related endpoints are approximately twice the number of the endpoints in the previous version, including 17 physicochemical properties, 13 medicinal chemistry properties, 23 ADME properties, 27 toxicity endpoints and 8 toxicophore rules (751 substructures). A multi-task graph attention framework was employed to develop the robust and accurate models in ADMETlab 2.0. The batch computation module was provided in response to numerous requests from users, and the representation of the results was further optimized. The ADMETlab 2.0 server is freely available, without registration, at https://admetmesh.scbdd.com/.

Journal Article•DOI•

Comparative Toxicogenomics Database (CTD): update 2021.

Allan Peter Davis¹, Cynthia J. Grondin¹, Robin J. Johnson¹, Daniela Sciaky¹, Jolene Wiegers¹, Thomas C. Wiegers¹, Carolyn J. Mattingly¹•Institutions (1)

North Carolina State University¹

08 Jan 2021-Nucleic Acids Research

TL;DR: This biennial update of the public Comparative Toxicogenomics Database (CTD) reports a 20% increase in CTD curated content and provides 45 million toxicogenomic relationships and introduces new CTD Anatomy pages that allow users to uniquely explore and analyze chemical–phenotype interactions from an anatomical perspective.

Abstract: The public Comparative Toxicogenomics Database (CTD; http://ctdbase.org/) is an innovative digital ecosystem that relates toxicological information for chemicals, genes, phenotypes, diseases, and exposures to advance understanding about human health. Literature-based, manually curated interactions are integrated to create a knowledgebase that harmonizes cross-species heterogeneous data for chemical exposures and their biological repercussions. In this biennial update, we report a 20% increase in CTD curated content and now provide 45 million toxicogenomic relationships for over 16 300 chemicals, 51 300 genes, 5500 phenotypes, 7200 diseases and 163 000 exposure events, from 600 comparative species. Furthermore, we increase the functionality of chemical-phenotype content with new data-tabs on CTD Disease pages (to help fill in knowledge gaps for environmental health) and new phenotype search parameters (for Batch Query and Venn analysis tools). As well, we introduce new CTD Anatomy pages that allow users to uniquely explore and analyze chemical-phenotype interactions from an anatomical perspective. Finally, we have enhanced CTD Chemical pages with new literature-based chemical synonyms (to improve querying) and added 1600 amino acid-based compounds (to increase chemical landscape). Together, these updates continue to augment CTD as a powerful resource for generating testable hypotheses about the etiologies and molecular mechanisms underlying environmentally influenced diseases.

Journal Article•DOI•

The Human Phenotype Ontology in 2021

Sebastian Köhler, Michael A. Gargano, Nicolas Matentzoglu¹, Leigh C. Carmody, David Lewis-Smith²,³, Nicole Vasilevsky⁴, Daniel Danis⁵,⁶, Ganna Balagura⁵, Gareth Baynam⁷,⁸, Amy Brower⁹, Tiffany J. Callahan¹⁰, Christopher G. Chute¹¹, Johanna L. Est¹², Peter D. Galer¹³, Shiva Ganesan¹³, Matthias Griese¹², Matthias Haimel¹⁴, Julia Pazmandi¹⁴,¹⁵, Marc Hanauer¹⁶, Nomi L. Harris¹⁷, Michael Hartnett⁹, Maximilian Hastreiter¹², Fabian Hauck¹², Yongqun He¹⁸, Tim Jeske¹², Hugh Kearney, Gerhard Kindle¹⁹, Christoph Klein¹², Katrin Knoflach¹², Roland Krause²⁰, David Lagorce¹⁶, Julie A. McMurry²¹, Jillian A. Miller⁹, Monica Munoz-Torres²¹, Rebecca L. Peters⁹, Christina K Rapp¹², Ana Rath¹⁶, Shahmir A. Rind⁷, Avi Z. Rosenberg¹¹, Michael M. Segal²², Markus G. Seidel²³, Damian Smedley²⁴, Tomer Talmy²⁵, Yarlalu Thomas, Samuel A. Wiafe, Julie Xian¹³, Zafer Yüksel, Ingo Helbig¹³,²⁶, Christopher J. Mungall¹⁷, Melissa A. Haendel⁴,²¹, Peter N. Robinson¹⁵•Institutions (26)

European Bioinformatics Institute¹, Newcastle University², Newcastle upon Tyne Hospitals NHS Foundation Trust³, Oregon Health & Science University⁴, University of Genoa⁵, Istituto Giannina Gaslini⁶, University of Western Australia⁷, King Edward Memorial Hospital⁸, American College of Medical Genetics⁹, Anschutz Medical Campus¹⁰, Johns Hopkins University¹¹, Ludwig Maximilian University of Munich¹², Children's Hospital of Philadelphia¹³, Austrian Academy of Sciences¹⁴, University of Connecticut¹⁵, French Institute of Health and Medical Research¹⁶, Lawrence Berkeley National Laboratory¹⁷, University of Michigan¹⁸, University of Freiburg¹⁹, University of Luxembourg²⁰, Oregon State University²¹, Chestnut Hill College²², Medical University of Graz²³, Queen Mary University of London²⁴, Hebrew University of Jerusalem²⁵, University of Pennsylvania²⁶

08 Jan 2021-Nucleic Acids Research

TL;DR: Recent major extensions of the Human Phenotype Ontology for neurology, nephrology, immunology, pulmonology, newborn screening, and other areas are presented, along with new efforts to harmonize computational definitions of phenotypic abnormalities across the HPO and multiple phenotype ontologies used for animal models of disease.

Abstract: The Human Phenotype Ontology (HPO, https://hpo.jax.org) was launched in 2008 to provide a comprehensive logical standard to describe and computationally analyze phenotypic abnormalities found in human disease. The HPO is now a worldwide standard for phenotype exchange. The HPO has grown steadily since its inception due to considerable contributions from clinical experts and researchers from a diverse range of disciplines. Here, we present recent major extensions of the HPO for neurology, nephrology, immunology, pulmonology, newborn screening, and other areas. For example, the seizure subontology now reflects the International League Against Epilepsy (ILAE) guidelines and these enhancements have already shown clinical validity. We present new efforts to harmonize computational definitions of phenotypic abnormalities across the HPO and multiple phenotype ontologies used for animal models of disease. These efforts will benefit software such as Exomiser by improving the accuracy and scope of cross-species phenotype matching. The computational modeling strategy used by the HPO to define disease entities and phenotypic features and distinguish between them is explained in detail. We also report on recent efforts to translate the HPO into indigenous languages. Finally, we summarize recent advances in the use of HPO in electronic health record systems.

Journal Article•DOI•

KOBAS-i: intelligent prioritization and exploratory visualization of biological functions for gene enrichment analysis.

Dechao Bu, Haitao Luo¹, Peipei Huo², Zhihao Wang², Shan Zhang², Zihao He³, Yang Wu, Lianhe Zhao, Jingjia Liu², Jin-Cheng Guo³, Shuangsang Fang³, Wanchen Cao³, Lan Yi, Yi Zhao, Lei Kong⁴•Institutions (4)

Jinan University¹, Chinese Academy of Sciences², Beijing University of Chinese Medicine³, Peking University⁴

02 Jul 2021-Nucleic Acids Research

TL;DR: KOBAS-i introduces CGPS, a novel machine learning-based method that incorporates seven FCS tools and two PT tools into a single ensemble score and intelligently prioritizes the relevant biological pathways.

Abstract: Gene set enrichment (GSE) analysis plays an essential role in extracting biological insight from genome-scale experiments. ORA (overrepresentation analysis), FCS (functional class scoring), and PT (pathway topology) approaches are three generations of GSE methods along the timeline of development. Previous versions of KOBAS provided services based on just the ORA method. Here we present version 3.0 of KOBAS, which is named KOBAS-i (short for KOBAS intelligent version). It introduces a novel machine learning-based method we published earlier, CGPS, which incorporates seven FCS tools and two PT tools into a single ensemble score and intelligently prioritizes the relevant biological pathways. In addition, KOBAS has expanded the downstream exploratory visualization for selecting and understanding the enriched results. The tool constructs a novel view, cirFunMap, which presents different enriched terms and their correlations in a landscape. Finally, based on the previous version's framework, KOBAS increased the number of supported species from 1327 to 5944. For an easier local run, it also provides a prebuilt Docker image that requires no installation, as a supplement to the source code version. KOBAS can be freely accessed at http://kobas.cbi.pku.edu.cn, and a mirror site is available at http://bioinfo.org/kobas.

Journal Article•DOI•

JASPAR 2022: the 9th release of the open-access database of transcription factor binding profiles.

Jaime A. Castro-Mondragon¹, Rafael Riudavets-Puig¹, Ieva Rauluseviciute¹, Roza Berhanu Lemma¹, Laura Turchi², Romain Blanc-Mathieu², Jérémy Lucas², Paul Boddie¹, Aziz Khan³, Nicolás Manosalva Pérez⁴, Oriol Fornes⁵, Tiffany Y. Leung⁵, Alejandro Aguirre⁵, Fayrouz Hammal⁶, Daniel Schmelter⁷, Damir Baranasic⁸, Benoit Ballester⁶, Albin Sandelin⁹, Boris Lenhard⁸, Klaas Vandepoele⁴, Wyeth W. Wasserman⁵, François Parcy², Anthony Mathelier¹,¹⁰•Institutions (10)

University of Oslo¹, University of Grenoble², Stanford University³, Ghent University⁴, University of British Columbia⁵, Aix-Marseille University⁶, University of California, Santa Cruz⁷, Imperial College London⁸, University of Copenhagen⁹, Oslo University Hospital¹⁰

30 Nov 2021-Nucleic Acids Research

TL;DR: JASPAR (http://jaspar.genereg.net/) is an open-access database containing manually curated, non-redundant transcription factor (TF) binding profiles for TFs across six taxonomic groups.

Abstract: JASPAR (http://jaspar.genereg.net/) is an open-access database containing manually curated, non-redundant transcription factor (TF) binding profiles for TFs across six taxonomic groups. In this 9th release, we expanded the CORE collection with 341 new profiles (148 for plants, 101 for vertebrates, 85 for urochordates, and 7 for insects), which corresponds to a 19% expansion over the previous release. We added 298 new profiles to the Unvalidated collection when no orthogonal evidence was found in the literature. All the profiles were clustered to provide familial binding profiles for each taxonomic group. Moreover, we revised the structural classification of DNA binding domains to consider plant-specific TFs. This release introduces word clouds to represent the scientific knowledge associated with each TF. We updated the genome tracks of TFBSs predicted with JASPAR profiles in eight organisms; the human and mouse TFBS predictions can be visualized as native tracks in the UCSC Genome Browser. Finally, we provide a new tool to perform JASPAR TFBS enrichment analysis in user-provided genomic regions. All the data is accessible through the JASPAR website, its associated RESTful API, the R/Bioconductor data package, and a new Python package, pyJASPAR, that facilitates serverless access to the data.

Journal Article•DOI•

The Reactome pathway knowledgebase 2022.

Marc Gillespie¹,², Bijay Jassal², Ralf Stephan², Marija Milacic², Karen Rothfels², Andrea Senff-Ribeiro²,³, Johannes Griss⁴,⁵, Cristoffer Sevilla⁵, Lisa Matthews, Chuqiao Gong⁵, Chuan Deng⁶,⁷, Thawfeek M. Varusai⁵, Eliot Ragueneau⁵, Yusra Haider⁵, Bruce May², Veronica Shamovsky, Joel Weiser², Timothy Brunson⁸, Nasim Sanati⁸, Liam Beckman⁸, Xiang Shao⁸, Antonio Fabregat⁵, Konstantinos Sidiropoulos⁵, Julieth Murillo, Guilherme Viteri⁵, Justin Cook², Solomon Shorser², Gary D. Bader⁹, Emek Demir⁸, Chris Sander¹⁰, Robin Haw², Guanming Wu⁸, Lincoln Stein²,⁹, Henning Hermjakob⁵,⁷, Peter D'Eustachio•Institutions (10)

St. John's University¹, Ontario Institute for Cancer Research², Federal University of Paraná³, Medical University of Vienna⁴, European Bioinformatics Institute⁵, Chongqing University of Posts and Telecommunications⁶, Protein Sciences⁷, Oregon Health & Science University⁸, University of Toronto⁹, Harvard University¹⁰

12 Nov 2021-Nucleic Acids Research

TL;DR: The Reactome Knowledgebase provides manually curated molecular details across a broad range of physiological and pathological biological processes in humans, including both hereditary and acquired disease processes, annotated as an ordered network of molecular transformations in a single consistent data model.

Abstract: The Reactome Knowledgebase (https://reactome.org), an Elixir core resource, provides manually curated molecular details across a broad range of physiological and pathological biological processes in humans, including both hereditary and acquired disease processes. The processes are annotated as an ordered network of molecular transformations in a single consistent data model. Reactome thus functions both as a digital archive of manually curated human biological processes and as a tool for discovering functional relationships in data such as gene expression profiles or somatic mutation catalogs from tumor cells. Recent curation work has expanded our annotations of normal and disease-associated signaling processes and of the drugs that target them, in particular infections caused by the SARS-CoV-1 and SARS-CoV-2 coronaviruses and the host response to infection. New tools support better simultaneous analysis of high-throughput data from multiple sources and the placement of understudied ('dark') proteins from analyzed datasets in the context of Reactome's manually curated pathways.

Journal Article•DOI•

The carbohydrate-active enzyme database: functions and literature.

Elodie Drula¹, Marie-Line Garron¹, Suzan Dogan¹, Vincent Lombard¹, Bernard Henrissat, Nicolas Terrapon¹•Institutions (1)

Aix-Marseille University¹

29 Nov 2021-Nucleic Acids Research

TL;DR: The CAZy database is a classification of carbohydrate-active enzymes into sequence-based families, freely available for browsing and download at www.cazy.org.

Abstract: Thirty years have elapsed since the emergence of the classification of carbohydrate-active enzymes in sequence-based families that became the CAZy database over 20 years ago, freely available for browsing and download at www.cazy.org. In the era of large-scale sequencing and high-throughput biology, it is important to examine the position of this specialist database that is deeply rooted in human curation. The three primary tasks of the CAZy curators are (i) to maintain and update the family classification of this class of enzymes, (ii) to classify sequences newly released by GenBank and the Protein Data Bank and (iii) to capture and present functional information for each family. The CAZy website is updated once a month. Here we briefly summarize the increase in novel families and the annotations conducted during the last 8 years. We present several important changes that facilitate taxonomic navigation and allow users to download the entirety of the annotations. Most importantly, we highlight the considerable amount of work that accompanies the analysis and reporting of biochemical data from the literature.

Journal Article•DOI•

TYGS and LPSN: a database tandem for fast and reliable genome-based classification and nomenclature of prokaryotes.

Jan P. Meier-Kolthoff¹, Joaquim Sardà Carbasse¹, Rosa L Peinado-Olarte¹, Markus Göker¹•Institutions (1)

Leibniz Association¹

11 Oct 2021-Nucleic Acids Research

TL;DR: The Type (Strain) Genome Server (TYGS) is a high-throughput platform for accurate genome-based taxonomy and is available at https://tygs.dsmz.de.

Abstract: Microbial systematics is heavily influenced by genome-based methods and challenged by an ever increasing number of taxon names and associated sequences in public data repositories. This poses a challenge for database systems, particularly since it is obviously advantageous if such data are based on a globally recognized approach to manage names, such as the International Code of Nomenclature of Prokaryotes. The amount of data can only be handled if accurate and reliable high-throughput platforms are available that are able to both comply with this demand and to keep track of all changes in an efficient and flexible way. The List of Prokaryotic names with Standing in Nomenclature (LPSN) is an expert-curated authoritative resource for prokaryotic nomenclature and is available at https://lpsn.dsmz.de. The Type (Strain) Genome Server (TYGS) is a high-throughput platform for accurate genome-based taxonomy and is available at https://tygs.dsmz.de. We here present important updates of these two previously introduced, heavily interconnected platforms for taxonomic nomenclature and classification, including new high-level facilities providing access to bioinformatic algorithms, a considerable expansion of the database content, and new ways to easily access the data.

Journal Article•DOI•

PLIP 2021: expanding the scope of the protein-ligand interaction profiler to DNA and RNA.

Melissa F. Adasme¹, Katja L Linnemann¹, Sarah Naomi Bolz¹, Florian Kaiser, Sebastian Salentin¹, V. Joachim Haupt, Michael Schroeder¹•Institutions (1)

Dresden University of Technology¹

02 Jul 2021-Nucleic Acids Research

TL;DR: PLIP, the protein-ligand interaction profiler, detects and visualises interactions between ligands and their target molecules and provides data in formats suitable for further processing; it has been extended to cover interactions with DNA and RNA.

Abstract: With the growth of protein structure data, the analysis of molecular interactions between ligands and their target molecules is gaining importance. PLIP, the protein-ligand interaction profiler, detects and visualises these interactions and provides data in formats suitable for further processing. PLIP has proven very successful in applications ranging from the characterisation of docking experiments to the assessment of novel ligand-protein complexes. Besides ligand-protein interactions, interactions with DNA and RNA play a vital role in many applications, such as drugs targeting DNA or RNA-binding proteins. To date, over 7% of all 3D structures in the Protein Data Bank include DNA or RNA. Therefore, we extended PLIP to encompass these important molecules. We demonstrate the power of this extension with examples of a cancer drug binding to a DNA target, and an RNA-protein complex central to a neurological disease. PLIP is available online at https://plip-tool.biotec.tu-dresden.de and as open source code. So far, the engine has served over a million queries and the source code has been downloaded several thousand times.

Journal Article•DOI•

WikiPathways: connecting communities.

Marvin Martens¹, Ammar Ammar¹, Anders Riutta², Andra Waagmeester, Denise Slenter¹, Kristina Hanspers², Ryan A. Miller¹, Daniela Digles³, Elisson Nogueira Lopes⁴, Friederike Ehrhart¹, Lauren J. Dupuis¹, Laurent A. Winckers¹, Susan L. Coort¹, Egon Willighagen¹, Chris T. Evelo¹, Alexander R. Pico², Martina Kutmon¹•Institutions (4)

Maastricht University¹, Gladstone Institutes², University of Vienna³, Universidade Federal de Minas Gerais⁴

08 Jan 2021-Nucleic Acids Research

TL;DR: The growth of WikiPathways over the last three years is shown, the new communities and collaborations of pathway authors and curators are highlighted, and various technologies to connect to external resources and initiatives are described.

Abstract: WikiPathways (https://www.wikipathways.org) is a biological pathway database known for its collaborative nature and open science approaches. With the core idea of the scientific community developing and curating biological knowledge in pathway models, WikiPathways lowers all barriers for accessing and using its content. Increasingly more content creators, initiatives, projects and tools have started using WikiPathways. Central in this growth and increased use of WikiPathways are the various communities that focus on particular subsets of molecular pathways such as for rare diseases and lipid metabolism. Knowledge from published pathway figures helps prioritize pathway development, using optical character and named entity recognition. We show the growth of WikiPathways over the last three years, highlight the new communities and collaborations of pathway authors and curators, and describe various technologies to connect to external resources and initiatives. The road toward a sustainable, community-driven pathway database goes through integration with other resources such as Wikidata and allowing more use, curation and redistribution of WikiPathways content.

Journal Article•DOI•

Mol* Viewer: modern web app for 3D visualization and analysis of large biomolecular structures.

David Sehnal¹,²,³, Sebastian Bittrich⁴, Mandar Deshpande¹, Radka Svobodová²,³, Karel Berka, Václav Bazgier, Sameer Velankar¹, Stephen K. Burley⁵,⁶, Jaroslav Koča²,³, Alexander S. Rose⁴•Institutions (6)

European Bioinformatics Institute¹, Central European Institute of Technology², Masaryk University³, University of California, San Diego⁴, University of Montana⁵, Rutgers University⁶

02 Jul 2021-Nucleic Acids Research

TL;DR: Mol* is a web-native 3D visualization and streaming tool for macromolecular coordinate and experimental data, with capabilities for displaying structure quality, functional, or biological context annotations.

Abstract: Large biomolecular structures are being determined experimentally on a daily basis using established techniques such as crystallography and electron microscopy. In addition, emerging integrative or hybrid methods (I/HM) are producing structural models of huge macromolecular machines and assemblies, sometimes containing 100s of millions of non-hydrogen atoms. The performance requirements for visualization and analysis tools delivering these data are increasing rapidly. Significant progress in developing online, web-native three-dimensional (3D) visualization tools was previously accomplished with the introduction of the LiteMol suite and NGL Viewers. Thereafter, Mol* development was jointly initiated by PDBe and RCSB PDB to combine and build on the strengths of LiteMol (developed by PDBe) and NGL (developed by RCSB PDB). The web-native Mol* Viewer enables 3D visualization and streaming of macromolecular coordinate and experimental data, together with capabilities for displaying structure quality, functional, or biological context annotations. High-performance graphics and data management allows users to simultaneously visualise up to hundreds of (superimposed) protein structures, stream molecular dynamics simulation trajectories, render cell-level models, or display huge I/HM structures. It is the primary 3D structure viewer used by PDBe and RCSB PDB. It can be easily integrated into third-party services. Mol* Viewer is open source and freely available at https://molstar.org/.

Journal Article•DOI•

RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation

Wenjun Li¹, Kathleen R O’Neill¹, Daniel H. Haft¹, Michael DiCuccio¹, Vyacheslav Chetvernin¹, Azat Badretdin¹, George Coulouris¹, Farideh Chitsaz¹, Myra K. Derbyshire¹, A Scott Durkin¹, Noreen R. Gonzales¹, Marc Gwadz¹, Christopher J. Lanczycki¹, James S. Song¹, Narmada Thanki¹, Jiyao Wang¹, Roxanne A. Yamashita¹, Mingzhang Yang¹, Chanjuan Zheng¹, Aron Marchler-Bauer¹, Françoise Thibaud-Nissen¹•Institutions (1)

National Institutes of Health¹

08 Jan 2021-Nucleic Acids Research

TL;DR: The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains nearly 200 000 bacterial and archaeal genomes and 150 million proteins with up-to-date annotation.

Abstract: The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains nearly 200 000 bacterial and archaeal genomes and 150 million proteins with up-to-date annotation. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP) since 2018 have resulted in a substantial reduction in spurious annotation. The hierarchical collection of protein family models (PFMs) used by PGAP as evidence for structural and functional annotation was expanded to over 35 000 protein profile hidden Markov models (HMMs), 12 300 BlastRules and 36 000 curated CDD architectures. As a result, >122 million or 79% of RefSeq proteins are now named based on a match to a curated PFM. Gene symbols, Enzyme Commission numbers or supporting publication attributes are available on over 40% of the PFMs and are inherited by the proteins and features they name, facilitating multi-genome analyses and connections to the literature. In adherence with the principles of FAIR (findable, accessible, interoperable, reusable), the PFMs are available in the Protein Family Models Entrez database to any user. Finally, the reference and representative genome set, a taxonomically diverse subset of RefSeq prokaryotic genomes, is now recalculated regularly and available for download and homology searches with BLAST. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.

Journal Article•DOI•

Rfam 14: expanded coverage of metagenomic, viral and microRNA families.

Ioanna Kalvari¹, Eric P. Nawrocki², Nancy Ontiveros-Palacios¹, Joanna Argasinska¹, Kevin Lamkiewicz³, Manja Marz³, Sam Griffiths-Jones⁴, Claire Toffano-Nioche⁵, Daniel Gautheret⁵, Zasha Weinberg⁶, Elena Rivas⁷, Sean R. Eddy⁷,⁸, Robert D. Finn¹, Alex Bateman¹, Anton I. Petrov¹•Institutions (8)

European Bioinformatics Institute¹, National Institutes of Health², University of Jena³, University of Manchester⁴, Université Paris-Saclay⁵, Leipzig University⁶, Harvard University⁷, Howard Hughes Medical Institute⁸

08 Jan 2021-Nucleic Acids Research

TL;DR: The first phase of synchronising microRNA families in Rfam and miRBase is completed, creating 356 new Rfam families and updating 40, and a procedure for comprehensive annotation of viral RNA families starting with Flavivirus and Coronaviridae RNAs is established.

Abstract: Rfam is a database of RNA families where each of the 3444 families is represented by a multiple sequence alignment of known RNA sequences and a covariance model that can be used to search for additional members of the family. Recent developments have involved expert collaborations to improve the quality and coverage of Rfam data, focusing on microRNAs, viral and bacterial RNAs. We have completed the first phase of synchronising microRNA families in Rfam and miRBase, creating 356 new Rfam families and updating 40. We established a procedure for comprehensive annotation of viral RNA families starting with Flavivirus and Coronaviridae RNAs. We have also increased the coverage of bacterial and metagenome-based RNA families from the ZWD database. These developments have enabled a significant growth of the database, with the addition of 759 new families in Rfam 14. To facilitate further community contribution to Rfam, expert users are now able to build and submit new families using the newly developed Rfam Cloud family curation system. New Rfam website features include a new sequence similarity search powered by RNAcentral, as well as search and visualisation of families with pseudoknots. Rfam is freely available at https://rfam.org.

Journal Article•DOI•

GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy.

Donovan H. Parks¹, Maria Chuvochina¹, Christian Rinke¹, Aaron J. Mussig¹, Pierre-Alain Chaumeil¹, Philip Hugenholtz¹•Institutions (1)

University of Queensland¹

14 Sep 2021-Nucleic Acids Research

TL;DR: The Genome Taxonomy Database (GTDB) provides a phylogenetically consistent and rank normalized genome-based taxonomy for prokaryotic genomes sourced from the NCBI Assembly database.

Abstract: The Genome Taxonomy Database (GTDB; https://gtdb.ecogenomic.org) provides a phylogenetically consistent and rank normalized genome-based taxonomy for prokaryotic genomes sourced from the NCBI Assembly database. GTDB R06-RS202 spans 254 090 bacterial and 4316 archaeal genomes, a 270% increase since the introduction of the GTDB in November, 2017. These genomes are organized into 45 555 bacterial and 2339 archaeal species clusters which is a 200% increase since the integration of species clusters into the GTDB in June, 2019. Here, we explore prokaryotic diversity from the perspective of the GTDB and highlight the importance of metagenome-assembled genomes in expanding available genomic representation. We also discuss improvements to the GTDB website which allow tracking of taxonomic changes, easy assessment of genome assembly quality, and identification of genomes assembled from type material or used as species representatives. Methodological updates and policy changes made since the inception of the GTDB are then described along with the procedure used to update species clusters in the GTDB. We conclude with a discussion on the use of average nucleotide identities as a pragmatic approach for delineating prokaryotic species.
