
Showing papers by "Daniel H. Haft" published in 2022


Journal Article
TL;DR: The InterPro database provides an integrative classification of protein sequences into families and identifies functionally important domains and conserved sites; version 90.0 updates its data content and website to make the data easier to access.
Abstract: The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. Here, we report recent developments with InterPro (version 90.0) and its associated software, including updates to data content and to the website. These developments extend and enrich the information provided by InterPro, and provide more user-friendly access to the data. Additionally, we have worked on adding Pfam website features to the InterPro website, as the Pfam website will be retired in late 2022. We also show that InterPro's sequence coverage has kept pace with the growth of UniProtKB. Moreover, we report the development of a card game as a method of engaging the non-scientific community. Finally, we discuss the benefits and challenges brought by the use of artificial intelligence for protein structure prediction.

172 citations


Journal Article
TL;DR: AlphaFold 2.0 was used to predict the structure of a representative sequence from every AntiFam 6.0 family; the large majority showed no evidence of globular structure, and the mean structure prediction confidence score pLDDT tended to be higher for shorter sequences.
Abstract: Motivation: The release of AlphaFold 2.0 has revolutionized our ability to determine protein structures from sequences. This tool also inadvertently opens up many unanticipated opportunities. In this article, we investigate the AntiFam resource, which contains 250 protein sequence families that we believe to be spurious protein translations. We would not expect proteins belonging to these families to fold into well-ordered globular structures. To test this hypothesis, we have attempted to computationally determine the structure of a representative sequence from all AntiFam 6.0 families. Results: Although the large majority of families showed no evidence of globular structure, we have identified one example for which a globular structure is predicted. Proteins in this AntiFam entry indeed seem likely to be bona fide proteins, based on additional considerations, and thus AlphaFold provides a useful quality control for the AntiFam database. Conversely, known spurious proteins offer a useful set of quality controls for AlphaFold. We have identified a trend that the mean structure prediction confidence score pLDDT is higher for shorter sequences. Of the 131 AntiFam representative sequences <100 amino acids in length, AlphaFold predicts a mean pLDDT of 80 or greater for six of them. Thus, particular care should be taken when applying AlphaFold to short protein sequences. Availability and implementation: The AlphaFold predictions for representative sequences can be found at the following URL: https://drive.google.com/drive/folders/1u9OocRIAabGQn56GljoG1JTDAxjkY1ro. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
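The short-sequence caveat in this abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline. It assumes only that AlphaFold model files in PDB format carry the per-residue pLDDT in the B-factor column; the function names and the default thresholds (length under 100 residues, mean pLDDT of 80 or greater, taken from the abstract) are illustrative choices.

```python
def residue_plddts(pdb_text: str) -> list[float]:
    """Per-residue pLDDT values, read from the B-factor field of CA atoms.

    Assumption: the input is an AlphaFold model in PDB format, where the
    B-factor column (1-based columns 61-66) holds pLDDT.
    """
    scores = []
    for line in pdb_text.splitlines():
        # Atom name occupies 1-based columns 13-16 of an ATOM record.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    return scores


def mean_plddt(pdb_text: str) -> float:
    """Mean pLDDT over all residues of one model."""
    scores = residue_plddts(pdb_text)
    if not scores:
        raise ValueError("no CA atoms found in model")
    return sum(scores) / len(scores)


def short_but_confident(models: dict[str, str],
                        max_len: int = 100,
                        min_mean: float = 80.0) -> list[str]:
    """Names of models shorter than max_len residues with mean pLDDT >= min_mean.

    `models` maps a hypothetical model name to its PDB text. Models flagged
    here are the ones whose high confidence deserves extra scrutiny.
    """
    flagged = []
    for name, text in models.items():
        scores = residue_plddts(text)
        if scores and len(scores) < max_len:
            if sum(scores) / len(scores) >= min_mean:
                flagged.append(name)
    return sorted(flagged)
```

A flagged model is not necessarily wrong; the sketch only marks predictions where a high mean pLDDT coincides with a short sequence and so warrants a closer look.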

22 citations


Journal Article
TL;DR: This work describes how to curate the genes, point mutations and blast rules, and hidden Markov models used in NCBI’s AMRFinderPlus, along with the quality-control steps taken to ensure database quality, and discusses how the computed analyses generated by those tools can be accessed through a web interface.
Abstract: Antimicrobial resistance (AMR) is a significant public health threat. Low-cost whole-genome sequencing, which is often used in surveillance programmes, provides an opportunity to assess AMR gene content in these genomes using in silico approaches. A variety of bioinformatic tools have been developed to identify these genomic elements. Most of those tools rely on reference databases of nucleotide or protein sequences and collections of models and rules for analysis. While the tools are critical for the identification of AMR genes, the databases themselves also provide significant utility for researchers, for applications ranging from sequence analysis to information about AMR phenotypes. Additionally, these databases can be evaluated by domain experts and others to ensure their accuracy. Here we describe how we curate the genes, point mutations and blast rules, and hidden Markov models used in NCBI’s AMRFinderPlus, along with the quality-control steps we take to ensure database quality. We also describe the web interfaces that display the full structure of the database and their newly developed cross-browser relationships. Then, using the Reference Gene Catalog as an example, we detail how the databases, rules and models are made publicly available, as well as how to access the software. In addition, as part of the Pathogen Detection system, we have analysed over 1 million publicly available genomes using AMRFinderPlus and its databases. We discuss how the computed analyses generated by those tools can be accessed through a web interface. Finally, we conclude with NCBI’s plans to make these databases accessible over the long term.

5 citations


Posted Content
18 Jul 2022 - bioRxiv
TL;DR: A conserved nine-gene locus in anaerobic bacteria is named SAO for the Selenocysteine-Assisted Organometallic (SAO) biochemistry implied by an uptake ABC transporter with apparent metal-binding selenocysteines, the complementary metal efflux pump SaoE, the MerB-like cytosolic enzyme now called SaoL, and comparative genomics signatures suggesting energy metabolism rather than metal resistance.
Abstract: A novel protein family related to mercury resistance protein MerB, which cleaves Hg-C bonds of organomercurial compounds, is a newly recognized selenoprotein, typically seen truncated in sequence databases at CU (cysteine-selenocysteine) dipeptide sites fifty residues before the true C-terminus. Inspection shows this protein occurs in a nine-gene neighborhood conserved in more than fifty bacterial species, taxonomically diverse but exclusively anaerobic, including spirochetes, deltaproteobacteria, and Gram-positive spore-formers Clostridium difficile and C. botulinum. Three included families are novel selenoproteins in most instances, including two ABC transporter subunits, one a substrate-binding protein with another CU motif, the other a permease subunit with selenocysteine at the substrate-gating position. Phylogenetic profiling shows a strong pattern of co-occurrence with Stickland metabolism selenoproteins, but an even closer link to a group of 8Fe-9S cofactor-type double-cubane proteins. These 8Fe-9S enzymes vary in count and in genome location but frequently sit next to the nine-gene locus. We have named the locus SAO, because of the Selenocysteine-Assisted Organometallic (SAO) biochemistry implied by an uptake ABC transporter with apparent metal-binding selenocysteines, complementary metal efflux pump SaoE, the MerB-like cytosolic enzyme now called SaoL, and comparative genomics signatures suggesting energy metabolism rather than metal resistance. Hypothesizing cycles of formation and dismutation of organometallic compounds involved in fermentative metabolism, we examined methylmercury formation proteins, and discovered most HgcA proteins are selenoproteins as well, with a CU motif N-terminal to the previously predicted start. Seeking additional rare and overlooked selenoproteins, tricky because of their rarity, could help reveal more candidate cryptic biochemical processes.
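The truncation pattern described in this abstract, where an annotated protein ends at a cysteine because the following UGA selenocysteine codon was read as a stop, can be sketched in Python as a simple screen. This is an illustrative heuristic only, not the method used in the preprint: the function name and return shape are hypothetical, and the sketch ignores SECIS elements, homology evidence, and the CU motifs found upstream of annotated starts.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
CYS_CODONS = {"TGT", "TGC"}


def candidate_cu_readthrough(dna: str, start: int = 0):
    """Screen one reading frame for a possible Cys-Sec (CU) truncation.

    Returns (sec_codon_index, extra_residues) when the first in-frame stop
    is TGA, the codon before it encodes cysteine, and a later in-frame stop
    exists; otherwise returns None. `extra_residues` counts the codons
    between the candidate selenocysteine and the next stop.
    """
    dna = dna.upper()
    codons = [dna[i:i + 3] for i in range(start, len(dna) - 2, 3)]

    # Locate the first in-frame stop, i.e. the annotated end of the protein.
    first_stop = next((i for i, c in enumerate(codons) if c in STOP_CODONS), None)
    if first_stop is None or first_stop == 0:
        return None

    # Only TGA can encode selenocysteine, and a CU site needs Cys just before it.
    if codons[first_stop] != "TGA" or codons[first_stop - 1] not in CYS_CODONS:
        return None

    # Read through the TGA to the next in-frame stop.
    next_stop = next((i for i in range(first_stop + 1, len(codons))
                      if codons[i] in STOP_CODONS), None)
    if next_stop is None:
        return None
    return first_stop, next_stop - first_stop - 1
```

A hit from this screen is only a candidate. The abstract places the true C-terminus about fifty residues past the CU site, so an extension of roughly that length would be one further, still circumstantial, point in favor.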