Showing papers in "Nucleic Acids Research in 2022"


Journal ArticleDOI
TL;DR: The DAVID Gene system was rebuilt to cover more organisms, increasing taxonomy coverage from 17 399 to 55 464, and a species parameter was added for uploading lists of gene symbols, which reduces ambiguity between species and makes list upload more efficient.
Abstract: DAVID is a popular bioinformatics resource system including a web server and web service for functional annotation and enrichment analyses of gene lists. It consists of a comprehensive knowledgebase and a set of functional analysis tools. Here, we report all updates made in 2021. The DAVID Gene system was rebuilt to gain coverage of more organisms, which increased the taxonomy coverage from 17 399 to 55 464. All existing annotation types have been updated, if available, based on the new DAVID Gene system. Compared with the last version, the number of gene-term records for most annotation types within the updated Knowledgebase have significantly increased. Moreover, we have incorporated new annotations in the Knowledgebase including small molecule-gene interactions from PubChem, drug-gene interactions from DrugBank, tissue expression information from the Human Protein Atlas, disease information from DisGeNET, and pathways from WikiPathways and PathBank. Eight of ten subgroups split from Uniprot Keyword annotation were assigned to specific types. Finally, we added a species parameter for uploading a list of gene symbols to minimize the ambiguity between species, which increases the efficiency of the list upload and eliminates confusion for users. These current updates have significantly expanded the Knowledgebase and enhanced the discovery power of DAVID.

860 citations


Journal ArticleDOI
TL;DR: Recent improvements to the EBI Search and Job Dispatcher tools frameworks are described, along with updates made to accommodate the increasing data requirements during the COVID-19 pandemic.
Abstract: The EMBL-EBI search and sequence analysis tools frameworks provide integrated access to EMBL-EBI’s data resources and core bioinformatics analytical tools. EBI Search (https://www.ebi.ac.uk/ebisearch) provides a full-text search engine across nearly 5 billion entries, while the Job Dispatcher tools framework (https://www.ebi.ac.uk/services) enables the scientific community to perform a diverse range of sequence analysis using popular bioinformatics applications. Both allow users to interact through user-friendly web applications, as well as via RESTful and SOAP-based APIs. Here, we describe recent improvements to these services and updates made to accommodate the increasing data requirements during the COVID-19 pandemic.

540 citations


Journal ArticleDOI
TL;DR: An increasing number of eukaryotic genomes have been included in KEGG for better representation of organisms in the taxonomic tree, and the Brite hierarchy viewer is used for taxonomy mapping.
Abstract: KEGG (https://www.kegg.jp) is a manually curated database resource integrating various biological objects categorized into systems, genomic, chemical and health information. Each object (database entry) is identified by the KEGG identifier (kid), which generally takes the form of a prefix followed by a five-digit number, and can be retrieved by appending /entry/kid in the URL. The KEGG pathway map viewer, the Brite hierarchy viewer and the newly released KEGG genome browser can be launched by appending /pathway/kid, /brite/kid and /genome/kid, respectively, in the URL. Together with an improved annotation procedure for KO (KEGG Orthology) assignment, an increasing number of eukaryotic genomes have been included in KEGG for better representation of organisms in the taxonomic tree. Multiple taxonomy files are generated for classification of KEGG organisms and viruses, and the Brite hierarchy viewer is used for taxonomy mapping, a variant of Brite mapping in the new KEGG Mapper suite. The taxonomy mapping enables analysis of, for example, how functional links of genes in the pathway and physical links of genes on the chromosome are conserved among organism groups.

520 citations
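
The URL scheme described in the KEGG abstract above (appending /entry/kid, /pathway/kid, /brite/kid or /genome/kid to the site address) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `kegg_url` and `looks_like_typical_kid` are hypothetical, and the regular expression encodes just the "prefix followed by a five-digit number" shape that the abstract says identifiers generally take, so it is a heuristic rather than a validator.

```python
import re

# Base address as given in the abstract.
KEGG_BASE = "https://www.kegg.jp"

# The four URL suffixes named in the abstract.
VIEWS = ("entry", "pathway", "brite", "genome")

# Heuristic only: the abstract says a kid "generally" is a prefix
# followed by a five-digit number, so other shapes may also exist.
KID_PATTERN = re.compile(r"^[A-Za-z]+\d{5}$")


def kegg_url(view: str, kid: str) -> str:
    """Build a KEGG URL of the form <base>/<view>/<kid>."""
    if view not in VIEWS:
        raise ValueError(f"unknown view {view!r}; expected one of {VIEWS}")
    return f"{KEGG_BASE}/{view}/{kid}"


def looks_like_typical_kid(kid: str) -> bool:
    """Return True if kid has the typical prefix-plus-five-digits shape."""
    return bool(KID_PATTERN.match(kid))


if __name__ == "__main__":
    for view, kid in [("entry", "K00001"), ("pathway", "map00010")]:
        print(kegg_url(view, kid), looks_like_typical_kid(kid))
```

The shape check is kept separate from URL construction so that identifiers that do not follow the typical pattern can still be passed through unchanged.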


Journal ArticleDOI
Alex Bateman, Maria Jesus Martin, Sandra Orchard, Michele Magrane, Shadab Ahmad, Emanuele Alpi, Emily H Bowler-Barnett, Ramona Britto, Hema Bye-a-Jee, Austra Cukura, P. Denny, Tunca Doğan, ThankGod Ebenezer, Jun Fan, Penelope Garmiri, Leonardo Jose da Costa Gonzales, Emma Hatton-Ellis, Abdulrahman Hussein, Alexandr Ignatchenko, Giuseppe Insana, Rizwan Ishtiaq, Vishal Joshi, Dushyanth Jyothi, Swaathi Kandasaamy, Antonia Lock, Aurelien Luciani, Marija Lugarić, Jie Luo, Y. Lussi, Alistair MacDougall, Fábio Madeira, Mahdi Mahmoudy, Alok Mishra, Katie Moulang, Andrew Nightingale, Sangya Pundir, Guoying Qi, Shri K. Raman Raj, Pedro Duarte da Silva Fonseca Gândara Raposo, Daniel Rice, Rabie Saidi, Rafael Santos, Elena Speretta, James Stephenson, Prabhat Totoo, Edward Turner, N. Tyagi, Preethi Vasudev, Kate Warner, Xavier Watkins, Rossana Zaru, Hermann Zellner, Alan Bridge, Lucila Aimo, Ghislaine Argoud-Puy, Andrea H. Auchincloss, Kristian B. Axelsen, Parit Bansal, Delphine Baratin, Teresa M Batista Neto, Marie-Claude Blatter, Jerven Bolleman, Emmanuel Boutet, Lionel Breuza, B. Gil, C. Casals-Casas, Kamal Chikh Echioukh, Elisabeth Coudert, Béatrice A. Cuche, Edouard de Castro, Anne Estreicher, Maria Livia Famiglietti, Marc Feuermann, Elisabeth Gasteiger, Pascale Gaudet, Sebastien Gehant, Vivienne Baillie Gerritsen, Arnaud Gos, Nadine M. Gruaz, Chantal Hulo, Nevila Hyka-Nouspikel, Florence Jungo, Arnaud Kerhornou, Philippe Le Mercier, Damien Lieberherr, Patrick Masson, Anne Morgat, Venkatesh Muthukrishnan, Salvo Paesano, Ivo Pedruzzi, Sandrine Pilbout, Lucille Pourcel, Sylvain Poux, Monica Pozzato, Manuela Pruess, Nicole Redaschi, Catherine Rivoire, Christian J. A. Sigrist, K Sonesson, Shyamala Sundaram, Cathy H. Wu, Cecilia N. Arighi, Leslie Arminski, Chuming Chen, Yongxing Chen, Hongzhan Huang, Kati Laiho, Peter B. McGarvey, Darren A. Natale, Karen F. Ross, C. R. Vinayaka, Qinghua Wang, Yuqi Wang, Jian Zhang 
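TL;DR: Enhancements to the UniProt data processing pipeline and website are described; UniProtKB has grown to over 227 million sequences, and the new website includes access to AlphaFold structures for more than 85% of all entries.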
Abstract: The aim of the UniProt Knowledgebase is to provide users with a comprehensive, high-quality and freely accessible set of protein sequences annotated with functional information. In this publication we describe enhancements made to our data processing pipeline and to our website to adapt to an ever-increasing information content. The number of sequences in UniProtKB has risen to over 227 million and we are working towards including a reference proteome for each taxonomic group. We continue to extract detailed annotations from the literature to update or create reviewed entries, while unreviewed entries are supplemented with annotations provided by automated systems using a variety of machine-learning techniques. In addition, the scientific community continues their contributions of publications and annotations to UniProt entries of their interest. Finally, we describe our new website (https://www.uniprot.org/), designed to enhance our users’ experience and make our data easily accessible to the research community. This interface includes access to AlphaFold structures for more than 85% of all entries as well as improved visualisations for subcellular localisation of proteins.

332 citations


Journal ArticleDOI
TL;DR: Two most recent upgrades to the Dali server for 3D protein structure comparison are reported: the foldomes of key organisms in the AlphaFold Database (version 1) are searchable by Dali, and structural alignments are annotated with protein families.
Abstract: Protein structure is key to understanding biological function. Structure comparison deciphers deep phylogenies, providing insight into functional conservation and functional shifts during evolution. Until recently, structural coverage of the protein universe was limited by the cost and labour involved in experimental structure determination. Recent breakthroughs in deep learning revolutionized structural bioinformatics by providing accurate structural models of numerous protein families for which no structural information existed. The Dali server for 3D protein structure comparison is widely used by crystallographers to relate new structures to pre-existing ones. Here, we report two most recent upgrades to the web server: (i) the foldomes of key organisms in the AlphaFold Database (version 1) are searchable by Dali, (ii) structural alignments are annotated with protein families. Using these new features, we discovered a novel functionally diverse subgroup within the WRKY/GCM1 clan. This was accomplished by linking the structurally characterized SWI/SNF and NAM families as well as the structural models of the CG-1 family and uncharacterized proteins to the structure of Gti1/Pac2, a previously known member of the WRKY/GCM1 clan. The Dali server is available at http://ekhidna2.biocenter.helsinki.fi/dali. This website is free and open to all users and there is no login requirement.

209 citations


Journal ArticleDOI
TL;DR: Recent developments in InterPro (version 90.0) and its associated software are reported, including updates to data content and to the website, and the addition of Pfam website features to the InterPro website ahead of the Pfam website's retirement.
Abstract: The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. Here, we report recent developments with InterPro (version 90.0) and its associated software, including updates to data content and to the website. These developments extend and enrich the information provided by InterPro, and provide a more user friendly access to the data. Additionally, we have worked on adding Pfam website features to the InterPro website, as the Pfam website will be retired in late 2022. We also show that InterPro's sequence coverage has kept pace with the growth of UniProtKB. Moreover, we report the development of a card game as a method of engaging the non-scientific community. Finally, we discuss the benefits and challenges brought by the use of artificial intelligence for protein structure prediction.

172 citations


Journal ArticleDOI
TL;DR: The IPD-IMGT/HLA Database provides a stable and user-friendly repository of highly curated HLA sequences, now containing over 35 000 alleles of the human Major Histocompatibility Complex (MHC), with new programmatic access through the IPD-API.
Abstract: It is 24 years since the IPD-IMGT/HLA Database, http://www.ebi.ac.uk/ipd/imgt/hla/, was first released, providing the HLA community with a searchable repository of highly curated HLA sequences. The database now contains over 35 000 alleles of the human Major Histocompatibility Complex (MHC) named by the WHO Nomenclature Committee for Factors of the HLA System. This complex contains the most polymorphic genes in the human genome and is now considered hyperpolymorphic. The IPD-IMGT/HLA Database provides a stable and user-friendly repository for this information. Uptake of Next Generation Sequencing technology in recent years has driven an increase in the number of alleles and the length of sequences submitted. As the size of the database has grown, the traditional methods of accessing and presenting this data have been challenged. In response, we have developed a suite of tools providing an enhanced user experience to our traditional web-based users while creating new programmatic access for our bioinformatics user base. This suite of tools is powered by the IPD-API, an Application Programming Interface (API), providing scalable and flexible access to the database. The IPD-API provides a stable platform for our future development allowing us to meet the future challenges of the HLA field and needs of the community.

154 citations


Journal ArticleDOI
TL;DR: STRING collects and integrates protein-protein interactions, both physical interactions and functional associations, from a number of sources: automated text mining of the scientific literature, computational interaction predictions from co-expression, conserved genomic context, databases of interaction experiments and known complexes/pathways from curated sources.
Abstract: Much of the complexity within cells arises from functional and regulatory interactions among proteins. The core of these interactions is increasingly known, but novel interactions continue to be discovered, and the information remains scattered across different database resources, experimental modalities and levels of mechanistic detail. The STRING database (https://string-db.org/) systematically collects and integrates protein–protein interactions, both physical interactions as well as functional associations. The data originate from a number of sources: automated text mining of the scientific literature, computational interaction predictions from co-expression, conserved genomic context, databases of interaction experiments and known complexes/pathways from curated sources. All of these interactions are critically assessed, scored, and subsequently automatically transferred to less well-studied organisms using hierarchical orthology information. The data can be accessed via the website, but also programmatically and via bulk downloads. The most recent developments in STRING (version 12.0) are: (i) it is now possible to create, browse and analyze a full interaction network for any novel genome of interest, by submitting its complement of encoded proteins, (ii) the co-expression channel now uses variational auto-encoders to predict interactions, and it covers two new sources, single-cell RNA-seq and experimental proteomics data and (iii) the confidence in each experimentally derived interaction is now estimated based on the detection method used, and communicated to the user in the web-interface. Furthermore, STRING continues to enhance its facilities for functional enrichment analysis, which are now fully available also for user-submitted genomes.

Journal ArticleDOI
Enis Afgan, Anton Nekrutenko, Björn Grüning, Daniel Blankenberg, Jeremy Goecks, Michael C. Schatz, Alexander E. Ostrovsky, Alexandru Mahmoud, Andrew Lonie, Anna Syme, Anne Fouilloux, Anthony Bretaudeau, Anup Kumar, Arthur C. Eschenlauer, Assunta D. Desanto, Aysam Guerler, Beatriz Serrano-Solano, Bérénice Batut, Bradley W. Langhorst, Bridget Carr, Bryan Raubenolt, Cameron J. Hyde, Catherine J. Bromhead, Christopher B. Barnett, Coline Royaux, Cristóbal L. García Gallardo, Daniel Fornika, Dannon Baker, Dave Bouvier, Dave Clements, David A. de Lima Morais, David Lopez Tabernero, Delphine Larivière, E. Nasr, Federico Zambelli, Florian Heyl, Fotis Psomopoulos, Frederik Coppens, Gareth Price, Gianmauro Cuccuru, Gildas Le Corguillé, Gregory Von Kuster, Gulsum Gudukbay, Helena Rasche, Hans-Rudolf Hotz, Ignacio Eguinoa, Igor V. Makunin, Isuru Ranawaka, James Taylor, Jayadev Joshi, Jennifer Hillman-Jackson, John Chilton, Kaivan Kamali, Keith Suderman, Krzysztof Poterlowicz, Yvan Le Bras, Lucille Lopez-Delisle, Luke Sargent, Madeline E. Bassetti, M. A. Tangaro, Marius Van Den Beek, Martin Čech, Matthias Bernt, Matthias Fahrner, Mehmet Tekman, Melanie Föll, Michael R. Crusoe, Miguel Angel Roncoroni, N. K. Kucher, Nathaniel Coraor, Nicholas Stoler, Nick Rhodes, Nicola Soranzo, Niko Pinter, Nuwan Goonasekera, Pablo Moreno, Pavankumar Videm, Petera Melanie, Pietro Mandreoli, Pratik D. Jagtap, Qiang Gu, Ralf J. M. Weber, Ross Lazarus, Ruben H.P. Vorderman, Saskia Hiltemann, Sergey Golitsynskiy, Shilpa Garg, Simon Bray, Simon Gladman, Simone Leo, Subina Mehta, Timothy J. Griffin, Vahid Jalili, Yves Vandenbrouck, Vi-Kwei Wen, Vijaykrishna Nagampalli, W. Bacon, W. L. De Koning, Wolf-Martin Maier, P. J. Briggs 
TL;DR: Key Galaxy technical developments include an improved user interface for launching large-scale analyses with many files, interactive tools for exploratory data analysis, and a complete suite of machine learning tools.
Abstract: Galaxy is a mature, browser accessible workbench for scientific computing. It enables scientists to share, analyze and visualize their own data, with minimal technical impediments. A thriving global community continues to use, maintain and contribute to the project, with support from multiple national infrastructure providers that enable freely accessible analysis and training services. The Galaxy Training Network supports free, self-directed, virtual training with >230 integrated tutorials. Project engagement metrics have continued to grow over the last 2 years, including source code contributions, publications, software packages wrapped as tools, registered users and their daily analysis jobs, and new independent specialized servers. Key Galaxy technical developments include an improved user interface for launching large-scale analyses with many files, interactive tools for exploratory data analysis, and a complete suite of machine learning tools. Important scientific developments enabled by Galaxy include Vertebrate Genome Project (VGP) assembly workflows and global SARS-CoV-2 collaborations.

Journal ArticleDOI
TL;DR: An overview of changes made to PubChem in the past two years is provided, including the integration of Google Patents data, which greatly expanded the coverage of the PubChem Patent data collection, and the creation of the Cell Line and Taxonomy data collections, which provide quick and easy access to chemical information for a given cell line and taxon, respectively.
Abstract: PubChem (https://pubchem.ncbi.nlm.nih.gov) is a popular chemical information resource that serves a wide range of use cases. In the past two years, a number of changes were made to PubChem. Data from more than 120 data sources was added to PubChem. Some major highlights include: the integration of Google Patents data into PubChem, which greatly expanded the coverage of the PubChem Patent data collection; the creation of the Cell Line and Taxonomy data collections, which provide quick and easy access to chemical information for a given cell line and taxon, respectively; and the update of the bioassay data model. In addition, new functionalities were added to the PubChem programmatic access protocols, PUG-REST and PUG-View, including support for target-centric data download for a given protein, gene, pathway, cell line, and taxon and the addition of the 'standardize' option to PUG-REST, which returns the standardized form of an input chemical structure. A significant update was also made to PubChemRDF. The present paper provides an overview of these changes.

Journal ArticleDOI
TL;DR: This updated docking server, named CB-Dock2, reconfigured the input and output web interfaces, together with a highly automatic docking pipeline, making it a particularly efficient and easy-to-use tool for the bioinformatics and cheminformatics communities.
Abstract: Protein-ligand blind docking is a powerful method for exploring the binding sites of receptors and the corresponding binding poses of ligands. It has seen wide application in pharmaceutical and biological research. Previously, we proposed a blind docking server, CB-Dock, which has been under heavy use (over 200 submissions per day) by researchers worldwide since 2019. Here, we substantially improved the docking method by combining CB-Dock with our template-based docking engine to enhance the accuracy in binding site identification and binding pose prediction. In the benchmark tests, it yielded a success rate of ∼85% for binding pose prediction (RMSD < 2.0 Å), which outperformed the original CB-Dock and most popular blind docking tools. This updated docking server, named CB-Dock2, reconfigured the input and output web interfaces, together with a highly automatic docking pipeline, making it a particularly efficient and easy-to-use tool for the bioinformatics and cheminformatics communities. The web server is freely available at https://cadd.labshare.cn/cb-dock2/.
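
The success criterion quoted in the CB-Dock2 abstract above (a pose counts as a success when its RMSD to the reference is below 2.0 Å) can be illustrated with a minimal Python sketch. This is not the authors' benchmark code. It assumes that predicted and reference ligand atoms are already matched in the same order, that both poses are in the same receptor coordinate frame (no superposition is applied), and that ligand symmetry is ignored. The function names `rmsd` and `success_rate` and the toy coordinates are hypothetical.

```python
import math


def rmsd(pred, ref):
    """Root-mean-square deviation between two matched lists of (x, y, z) atoms."""
    if not pred or len(pred) != len(ref):
        raise ValueError("poses must be non-empty and have the same atom count")
    # Sum squared distances over matched atom pairs, then average and take the root.
    squared = sum(
        (p[0] - r[0]) ** 2 + (p[1] - r[1]) ** 2 + (p[2] - r[2]) ** 2
        for p, r in zip(pred, ref)
    )
    return math.sqrt(squared / len(pred))


def success_rate(pairs, cutoff=2.0):
    """Fraction of (predicted, reference) pose pairs with RMSD below the cutoff (in Å)."""
    if not pairs:
        raise ValueError("at least one pose pair is required")
    hits = sum(1 for pred, ref in pairs if rmsd(pred, ref) < cutoff)
    return hits / len(pairs)


if __name__ == "__main__":
    # Toy two-atom ligand; coordinates are invented for illustration only.
    reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    close_pose = [(0.5, 0.0, 0.0), (1.5, 0.0, 0.0)]  # shifted by 0.5 Å
    far_pose = [(3.0, 0.0, 0.0), (4.0, 0.0, 0.0)]    # shifted by 3.0 Å
    print(success_rate([(close_pose, reference), (far_pose, reference)]))
```

In this toy case one of the two poses falls under the 2.0 Å cutoff, so the success rate is 0.5.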

Journal ArticleDOI
TL;DR: The NHGRI-EBI GWAS Catalog (www.ebi.ac.uk/gwas) is a FAIR knowledgebase providing detailed, structured, standardised and interoperable genome-wide association study (GWAS) data to >200 000 users per year from academic research, healthcare and industry.
Abstract: The NHGRI-EBI GWAS Catalog (www.ebi.ac.uk/gwas) is a FAIR knowledgebase providing detailed, structured, standardised and interoperable genome-wide association study (GWAS) data to >200 000 users per year from academic research, healthcare and industry. The Catalog contains variant-trait associations and supporting metadata for >45 000 published GWAS across >5000 human traits, and >40 000 full P-value summary statistics datasets. Content is curated from publications or acquired via author submission of prepublication summary statistics through a new submission portal and validation tool. GWAS data volume has vastly increased in recent years. We have updated our software to meet this scaling challenge and to enable rapid release of submitted summary statistics. The scope of the repository has expanded to include additional data types of high interest to the community, including sequencing-based GWAS, gene-based analyses and copy number variation analyses. Community outreach has increased the number of shared datasets from under-represented traits, e.g. cancer, and we continue to contribute to awareness of the lack of population diversity in GWAS. Interoperability of the Catalog has been enhanced through links to other resources including the Polygenic Score Catalog and the International Mouse Phenotyping Consortium, refinements to GWAS trait annotation, and the development of a standard format for GWAS data.

Journal ArticleDOI
TL;DR: SynergyFinder as discussed by the authors is a free web-application for interactive analysis and visualization of multi-drug combination response data, which has become a popular tool for multi-dose combination data analytics, partly because the development of its functionality and graphical interface has been driven by a diverse user community.
Abstract: SynergyFinder (https://synergyfinder.fimm.fi) is a free web-application for interactive analysis and visualization of multi-drug combination response data. Since its first release in 2017, SynergyFinder has become a popular tool for multi-dose combination data analytics, partly because the development of its functionality and graphical interface has been driven by a diverse user community, including both chemical biologists and computational scientists. Here, we describe the latest upgrade of this community-effort, SynergyFinder release 3.0, introducing a number of novel features that support interactive multi-sample analysis of combination synergy, a novel consensus synergy score that combines multiple synergy scoring models, and an improved outlier detection functionality that eliminates false positive results, along with many other post-analysis options such as weighting of synergy by drug concentrations and distinguishing between different modes of synergy (potency and efficacy). Based on user requests, several additional improvements were also implemented, including new data visualizations and export options for multi-drug combinations. With these improvements, SynergyFinder 3.0 supports robust identification of consistent combinatorial synergies for multi-drug combinatorial discovery and clinical translation.

Journal ArticleDOI
TL;DR: An update to the popular tool DeepLoc with multi-localization prediction and improvements in both performance and interpretability is proposed, and it is found that the attention output correlates well with the position of sorting signals.
Abstract: The prediction of protein subcellular localization is of great relevance for proteomics research. Here, we propose an update to the popular tool DeepLoc with multi-localization prediction and improvements in both performance and interpretability. For training and validation, we curate eukaryotic and human multi-location protein datasets with stringent homology partitioning and enriched with sorting signal information compiled from the literature. We achieve state-of-the-art performance in DeepLoc 2.0 by using a pre-trained protein language model. It has the further advantage that it uses sequence input rather than relying on slower protein profiles. We provide two means of better interpretability: an attention output along the sequence and highly accurate prediction of nine different types of protein sorting signals. We find that the attention output correlates well with the position of sorting signals. The webserver is available at services.healthtech.dtu.dk/service.php?DeepLoc-2.0.

Journal ArticleDOI
TL;DR: DeepLoc-2.0 as discussed by the authors proposes an update to the popular tool DeepLoc with multi-localization prediction and improvements in both performance and interpretability, achieving state-of-the-art performance in DeepLoc 2.0.
Abstract: The prediction of protein subcellular localization is of great relevance for proteomics research. Here, we propose an update to the popular tool DeepLoc with multi-localization prediction and improvements in both performance and interpretability. For training and validation, we curate eukaryotic and human multi-location protein datasets with stringent homology partitioning and enriched with sorting signal information compiled from the literature. We achieve state-of-the-art performance in DeepLoc 2.0 by using a pre-trained protein language model. It has the further advantage that it uses sequence input rather than relying on slower protein profiles. We provide two means of better interpretability: an attention output along the sequence and highly accurate prediction of nine different types of protein sorting signals. We find that the attention output correlates well with the position of sorting signals. The webserver is available at services.healthtech.dtu.dk/service.php?DeepLoc-2.0.

Journal ArticleDOI
TL;DR: The BV-BRC as discussed by the authors merged the PAThosystems Resource Integration Center (PATRIC), the Influenza Research Database (IRD) and the Virus Pathogen Database and Analysis Resource (ViPR) BRCs to form the Bacterial and Viral Bioinformatics Resource Center.
Abstract: The National Institute of Allergy and Infectious Diseases (NIAID) established the Bioinformatics Resource Center (BRC) program to assist researchers with analyzing the growing body of genome sequence and other omics-related data. In this report, we describe the merger of the PAThosystems Resource Integration Center (PATRIC), the Influenza Research Database (IRD) and the Virus Pathogen Database and Analysis Resource (ViPR) BRCs to form the Bacterial and Viral Bioinformatics Resource Center (BV-BRC) https://www.bv-brc.org/. The combined BV-BRC leverages the functionality of the bacterial and viral resources to provide a unified data model, enhanced web-based visualization and analysis tools, bioinformatics services, and a powerful suite of command line tools that benefit the bacterial and viral research communities.

Journal ArticleDOI
TL;DR: This update adds all the ATAC-seq and whole-genome bisulfite-seq data for six model organisms with the latest genome assemblies, so that ChIP-Atlas provides a panoramic view of the whole epigenomic landscape.
Abstract: ChIP-Atlas (https://chip-atlas.org) is a web service providing both GUI- and API-based data-mining tools to reveal the architecture of the transcription regulatory landscape. ChIP-Atlas is powered by comprehensively integrating all data sets from high-throughput ChIP-seq and DNase-seq, a method for profiling chromatin regions accessible to DNase. In this update, we further collected all the ATAC-seq and whole-genome bisulfite-seq data for six model organisms (human, mouse, rat, fruit fly, nematode, and budding yeast) with the latest genome assemblies. These together with ChIP-seq data can be visualized with the Peak Browser tool and a genome browser to explore the epigenomic landscape of a query genomic locus, such as its chromatin accessibility, DNA methylation status, and protein–genome interactions. This epigenomic landscape can also be characterized for multiple genes and genomic loci by querying with the Enrichment Analysis tool, which, for example, revealed that inflammatory bowel disease-associated SNPs are the most significantly hypo-methylated in neutrophils. Therefore, ChIP-Atlas provides a panoramic view of the whole epigenomic landscape. All datasets are free to download via either a simple button on the web page or an API.

Journal ArticleDOI
TL;DR: ChIP-Atlas as discussed by the authors is a web service providing both GUI-and API-based data-mining tools to reveal the architecture of the transcription regulatory landscape, including chromatin accessibility, DNA methylation status, and protein-genome interactions.
Abstract: ChIP-Atlas (https://chip-atlas.org) is a web service providing both GUI- and API-based data-mining tools to reveal the architecture of the transcription regulatory landscape. ChIP-Atlas is powered by comprehensively integrating all data sets from high-throughput ChIP-seq and DNase-seq, a method for profiling chromatin regions accessible to DNase. In this update, we further collected all the ATAC-seq and whole-genome bisulfite-seq data for six model organisms (human, mouse, rat, fruit fly, nematode, and budding yeast) with the latest genome assemblies. These together with ChIP-seq data can be visualized with the Peak Browser tool and a genome browser to explore the epigenomic landscape of a query genomic locus, such as its chromatin accessibility, DNA methylation status, and protein-genome interactions. This epigenomic landscape can also be characterized for multiple genes and genomic loci by querying with the Enrichment Analysis tool, which, for example, revealed that inflammatory bowel disease-associated SNPs are the most significantly hypo-methylated in neutrophils. Therefore, ChIP-Atlas provides a panoramic view of the whole epigenomic landscape. All datasets are free to download via either a simple button on the web page or an API.

Journal ArticleDOI
TL;DR: There is a 20% increase in overall CTD content and a novel tool that computationally generates four-unit information blocks connecting a chemical, gene, phenotype, and disease to construct potential molecular mechanistic pathways is presented.
Abstract: The Comparative Toxicogenomics Database (CTD; http://ctdbase.org/) harmonizes cross-species heterogeneous data for chemical exposures and their biological repercussions by manually curating and interrelating chemical, gene, phenotype, anatomy, disease, taxa, and exposure content from the published literature. This curated information is integrated to generate inferences, providing potential molecular mediators to develop testable hypotheses and fill in knowledge gaps for environmental health. This dual nature, acting as both a knowledgebase and a discoverybase, makes CTD a unique resource for the scientific community. Here, we report a 20% increase in overall CTD content for 17 100 chemicals, 54 300 genes, 6100 phenotypes, 7270 diseases and 202 000 exposure statements. We also present CTD Tetramers, a novel tool that computationally generates four-unit information blocks connecting a chemical, gene, phenotype, and disease to construct potential molecular mechanistic pathways. Finally, we integrate terms for human biological media used in the CTD Exposure module to corresponding CTD Anatomy pages, allowing users to survey the chemical profiles for any tissue-of-interest and see how these environmental biomarkers are related to phenotypes for any anatomical site. These, and other webpage visual enhancements, continue to promote CTD as a practical, user-friendly, and innovative resource for finding information and generating testable hypotheses about environmental health.

Journal ArticleDOI
TL;DR: The latest upgrade of this community-effort SynergyFinder release 3.0 is described, introducing a number of novel features that support interactive multi-sample analysis of combination synergy, a novel consensus synergy score that combines multiple synergy scoring models, and an improved outlier detection functionality that eliminates false positive results.
Abstract: SynergyFinder (https://synergyfinder.fimm.fi) is a free web-application for interactive analysis and visualization of multi-drug combination response data. Since its first release in 2017, SynergyFinder has become a popular tool for multi-dose combination data analytics, partly because the development of its functionality and graphical interface has been driven by a diverse user community, including both chemical biologists and computational scientists. Here, we describe the latest upgrade of this community-effort, SynergyFinder release 3.0, introducing a number of novel features that support interactive multi-sample analysis of combination synergy, a novel consensus synergy score that combines multiple synergy scoring models, and an improved outlier detection functionality that eliminates false positive results, along with many other post-analysis options such as weighting of synergy by drug concentrations and distinguishing between different modes of synergy (potency and efficacy). Based on user requests, several additional improvements were also implemented, including new data visualizations and export options for multi-drug combinations. With these improvements, SynergyFinder 3.0 supports robust identification of consistent combinatorial synergies for multi-drug combinatorial discovery and clinical translation.

Journal ArticleDOI
TL;DR: This NetSurfP update exploits recent advances in pre-trained protein language models to drastically improve the runtime of its predecessor by two orders of magnitude, while displaying similar prediction performance.
Abstract: Recent advances in machine learning and natural language processing have made it possible to profoundly advance our ability to accurately predict protein structures and their functions. While such improvements are significantly impacting the fields of biology and biotechnology at large, such methods have the downside of high demands in terms of computing power and runtime, hampering their applicability to large datasets. Here, we present NetSurfP-3.0, a tool for predicting solvent accessibility, secondary structure, structural disorder and backbone dihedral angles for each residue of an amino acid sequence. This NetSurfP update exploits recent advances in pre-trained protein language models to drastically improve the runtime of its predecessor by two orders of magnitude, while displaying similar prediction performance. We assessed the accuracy of NetSurfP-3.0 on several independent test datasets and found it to consistently produce state-of-the-art predictions for each of its output features, with a runtime that is up to 600 times faster than the most commonly available methods performing the same tasks. The tool is freely available as a web server with a user-friendly interface to navigate the results, as well as a standalone downloadable package.

Journal ArticleDOI
TL;DR: RSAT as mentioned in this paper is a suite of 50 tools that enable the detection and analysis of cis-regulatory elements in genomic sequences, including genomic sequences scanning with known motifs, quality assessment, comparisons and clustering, analysis of regulatory variations and comparative genomics.
Abstract: RSAT (Regulatory Sequence Analysis Tools) enables the detection and the analysis of cis-regulatory elements in genomic sequences. This software suite performs (i) de novo motif discovery (including from genome-wide datasets like ChIP-seq/ATAC-seq), (ii) genomic sequence scanning with known motifs, (iii) motif analysis (quality assessment, comparisons and clustering), (iv) analysis of regulatory variations and (v) comparative genomics. RSAT comprises 50 tools. Six public Web servers (including a teaching server) are offered to meet the needs of different biological communities. The RSAT philosophy and originality are: (i) multi-modal access depending on user needs, through web forms, the command line for local installations and programmatic web services, and (ii) support for virtually any genome (animals, bacteria, plants, totalling over 10 000 directly accessible genomes). Since the 2018 NAR Web Software Issue, we have developed a large REST API, extended the support for additional genomes and external motif collections, enhanced some tools and Web forms, and developed a novel tool that builds or refines gene regulatory networks using motif scanning (network-interactions). The RSAT website provides extensive documentation, tutorials and published protocols. RSAT code is under an open-source license and is now hosted on GitHub. RSAT is available at http://www.rsat.eu/.

Journal ArticleDOI
TL;DR: The UCSC Genome Browser (http://genome.ucsc.edu) as discussed by the authors is an omics data consolidator, graphical viewer, and general bioinformatics resource that continues to serve the community as it enters its 23rd year.
Abstract: The UCSC Genome Browser (https://genome.ucsc.edu) is an omics data consolidator, graphical viewer, and general bioinformatics resource that continues to serve the community as it enters its 23rd year. This year has seen an emphasis in clinical data, with new tracks and an expanded Recommended Track Sets feature on hg38 as well as the addition of a single cell track group. SARS-CoV-2 continues to remain a focus, with regular annotation updates to the browser and continued curation of our phylogenetic sequence placing tool, hgPhyloPlace, whose tree has now reached over 12M sequences. Our GenArk resource has also grown, offering over 2500 hubs and a system for users to request any absent assemblies. We have expanded our bigBarChart display type and created new ways to visualize data via bigRmsk and dynseq display. Displaying custom annotations is now easier due to our chromAlias system which eliminates the requirement for renaming sequence names to the UCSC standard. Users involved in data generation may also be interested in our new tools and trackDb settings which facilitate the creation and display of their custom annotations.

Journal ArticleDOI
TL;DR: The IMPC portal delivers a substantial reference dataset that supports the enrichment of various domain-specific projects and databases, as well as the wider research and clinical community, where the IMPC genotype-phenotype knowledge contributes to the molecular diagnosis of patients affected by rare disorders.
Abstract: The International Mouse Phenotyping Consortium (IMPC; https://www.mousephenotype.org/) web portal makes available curated, integrated and analysed knockout mouse phenotyping data generated by the IMPC project consisting of 85M data points and over 95,000 statistically significant phenotype hits mapped to human diseases. The IMPC portal delivers a substantial reference dataset that supports the enrichment of various domain-specific projects and databases, as well as the wider research and clinical community, where the IMPC genotype–phenotype knowledge contributes to the molecular diagnosis of patients affected by rare disorders. Data from 9,000 mouse lines and 750 000 images provides vital resources enabling the interpretation of the ignorome, and advancing our knowledge on mammalian gene function and the mechanisms underlying phenotypes associated with human diseases. The resource is widely integrated and the lines have been used in over 4,600 publications indicating the value of the data and the materials.

Journal ArticleDOI
TL;DR:
Abstract: The Comprehensive Antibiotic Resistance Database (CARD; card.mcmaster.ca) combines the Antibiotic Resistance Ontology (ARO) with curated AMR gene (ARG) sequences and resistance-conferring mutations to provide an informatics framework for annotation and interpretation of resistomes. As of version 3.2.4, CARD encompasses 6627 ontology terms, 5010 reference sequences, 1933 mutations, 3004 publications, and 5057 AMR detection models that can be used by the accompanying Resistance Gene Identifier (RGI) software to annotate genomic or metagenomic sequences. Focused curation enhancements since 2020 include expanded β-lactamase curation, incorporation of likelihood-based AMR mutations for Mycobacterium tuberculosis, addition of disinfectants and antiseptics plus their associated ARGs, and systematic curation of resistance-modifying agents. This expanded curation includes 180 new AMR gene families, 15 new drug classes, 1 new resistance mechanism, and two new ontological relationships: evolutionary_variant_of and is_small_molecule_inhibitor. In silico prediction of resistomes and prevalence statistics of ARGs has been expanded to 377 pathogens, 21,079 chromosomes, 2,662 genomic islands, 41,828 plasmids and 155,606 whole-genome shotgun assemblies, resulting in collation of 322,710 unique ARG allele sequences. New features include the CARD:Live collection of community submitted isolate resistome data and the introduction of standardized 15 character CARD Short Names for ARGs to support machine learning efforts.