Journal Article - DOI: 10.1016/J.JMB.2021.166915

The DBSAV Database: Predicting Deleteriousness of Single Amino Acid Variations in the Human Proteome.

04 Mar 2021-Journal of Molecular Biology (Academic Press)-Vol. 433, Iss: 11, pp 166915-166915
Abstract: Deleterious single amino acid variation (SAV) is one of the leading causes of human diseases. Evaluating the functional impact of SAVs is crucial for diagnosis of genetic disorders. We previously developed a deep convolutional neural network predictor, DeepSAV, to evaluate the deleterious effects of SAVs on protein function based on various sequence, structural, and functional properties. DeepSAV scores of rare SAVs observed in the human population are aggregated into a gene-level score called GTS (Gene Tolerance of rare SAVs) that reflects a gene's tolerance to deleterious missense mutations and serves as a useful tool to study gene-disease associations. In this study, we aim to enhance the performance of DeepSAV by using expanded datasets of pathogenic and benign variants, more features, and neural network optimization. We found that multiple sequence alignments built from vertebrate-level orthologs yield better prediction results compared to those built from mammalian-level orthologs. For multiple sequence alignments built from BLAST searches, optimal performance was achieved with a sequence identity cutoff of 50% to remove distant homologs. The new version of DeepSAV exhibits the best performance among standalone predictors of deleterious effects of SAVs. We developed the DBSAV database that reports GTS scores of human genes and DeepSAV scores of SAVs in the human proteome, including pathogenic and benign SAVs, population-level SAVs, and all possible SAVs by single nucleotide variations. This database serves as a useful resource for research of human SAVs and their relationships with protein functions and human diseases.

Topics: Population (51%)

5 results found

Open access - Journal Article - DOI: 10.1126/SCIENCE.ABJ8754
Minkyung Baek1, Frank DiMaio1, Ivan Anishchenko1, Justas Dauparas1 +30 more, Institutions (13)
20 Aug 2021-Science
Abstract: DeepMind presented notably accurate predictions at the recent 14th Critical Assessment of Structure Prediction (CASP14) conference. We explored network architectures that incorporate related ideas and obtained the best performance with a three-track network in which information at the one-dimensional (1D) sequence level, the 2D distance map level, and the 3D coordinate level is successively transformed and integrated. The three-track network produces structure predictions with accuracies approaching those of DeepMind in CASP14, enables the rapid solution of challenging x-ray crystallography and cryo-electron microscopy structure modeling problems, and provides insights into the functions of proteins of currently unknown structure. The network also enables rapid generation of accurate protein-protein complex models from sequence information alone, short-circuiting traditional approaches that require modeling of individual subunits followed by docking. We make the method available to the scientific community to speed biological research.

190 Citations

Open access - Posted Content - DOI: 10.1101/2021.06.14.448402
Minkyung Baek1, Frank DiMaio1, Ivan Anishchenko1, Justas Dauparas1 +30 more, Institutions (12)
15 Jun 2021-bioRxiv
Abstract: DeepMind presented remarkably accurate protein structure predictions at the CASP14 conference. We explored network architectures incorporating related ideas and obtained the best performance with a 3-track network in which information at the 1D sequence level, the 2D distance map level, and the 3D coordinate level is successively transformed and integrated. The 3-track network produces structure predictions with accuracies approaching those of DeepMind in CASP14, enables rapid solution of challenging X-ray crystallography and cryo-EM structure modeling problems, and provides insights into the functions of proteins of currently unknown structure. The network also enables rapid generation of accurate models of protein-protein complexes from sequence information alone, short circuiting traditional approaches which require modeling of individual subunits followed by docking. We make the method available to the scientific community to speed biological research. One-Sentence Summary: Accurate protein structure modeling enables rapid solution of structure determination problems and provides insights into biological function.

Topics: Network architecture (53%)

10 Citations

Open access - Posted Content - DOI: 10.1101/2021.11.17.468998
Neeladri Sen, Ivan Anishchenko1, Nicola Bordin2, Ian Sillitoe2 +3 more, Institutions (3)
19 Nov 2021-bioRxiv
Abstract: Mutations in human proteins lead to diseases. The structure of these proteins can help understand the mechanism of such diseases and develop therapeutics against them. With improved deep learning techniques such as RoseTTAFold and AlphaFold, we can predict the structure of these proteins even in the absence of structural homologues. We modeled and extracted the domains from 553 disease-associated human proteins. We noticed that the model quality was higher and the RMSD lower between AlphaFold and RoseTTAFold models for domains that could be assigned to CATH families as compared to those which could be assigned to Pfam families of unknown structure or could not be assigned to either. We predicted ligand-binding sites, protein-protein interfaces, conserved residues and destabilising effects caused by residue mutations in these predicted structures. We then explored whether the disease-associated mutations were in the proximity of these predicted functional sites or if they destabilized the protein structure based on ddG calculations. We could explain 80% of these disease-associated mutations based on proximity to functional sites or structural destabilization. Usage of models from the two state-of-the-art techniques provides better confidence in our predictions, and we explain 93 additional mutations based on RoseTTAFold models which could not be explained based solely on AlphaFold models.

Topics: Protein structure (53%)

Open access - Posted Content - DOI: 10.1101/2021.09.14.460228
Jimin Pei1, Jing Zhang1, Qian Cong1, Institutions (1)
14 Sep 2021-bioRxiv
Abstract: Recent development of deep-learning methods has led to a breakthrough in the prediction accuracy of 3-dimensional protein structures. Extending these methods to protein pairs is expected to allow large-scale detection of protein-protein interactions and modeling protein complexes at the proteome level. We applied RoseTTAFold and AlphaFold2, two of the latest deep-learning methods for structure predictions, to analyze coevolution of human proteins residing in mitochondria, an organelle of vital importance in many cellular processes including energy production, metabolism, cell death, and antiviral response. Variations in mitochondrial proteins have been linked to a plethora of human diseases and genetic conditions. RoseTTAFold, with high computational speed, was used to predict the coevolution of about 95% of mitochondrial protein pairs. Top-ranked pairs were further subject to the modeling of the complex structures by AlphaFold2, which also produced contact probability with high precision and in many cases consistent with RoseTTAFold. Most of the top ranked pairs with high contact probability were supported by known protein-protein interactions and/or similarities to experimental structural complexes. For high-scoring pairs without experimental complex structures, our coevolution analyses and structural models shed light on the details of their interfaces, including CHCHD4-AIFM1, MTERF3-TRUB2, FMC1-ATPAF2, ECSIT-NDUFAF1 and COQ7-COQ9, among others. We also identified novel PPIs (PYURF-NDUFAF5, LYRM1-MTRF1L and COA8-COX10) for several proteins without experimentally characterized interaction partners, leading to predictions of their molecular functions and the biological processes they are involved in.

Topics: Protein structure (51%)

50 results found

Open access - Journal Article - DOI: 10.1093/NAR/25.17.3389
Abstract: The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.

Topics: Substitution matrix (57%), Sequence database (54%), Sequence profiling tool (53%)

66,744 Citations

Open access - Journal Article - DOI: 10.1038/GIM.2015.30
Sue Richards1, Nazneen Aziz2, Nazneen Aziz3, Sherri J. Bale4 +9 more, Institutions (11)
Abstract: Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology

11,349 Citations

Open access - Journal Article - DOI: 10.1093/NAR/GKH131
Abstract: To provide the scientific community with a single, centralized, authoritative resource for protein sequences and functional information, the Swiss-Prot, TrEMBL and PIR protein database activities have united to form the Universal Protein Knowledgebase (UniProt) consortium. Our mission is to provide a comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase, with extensive cross-references and query interfaces. The central database will have two sections, corresponding to the familiar Swiss-Prot (fully manually curated entries) and TrEMBL (enriched with automated classification, annotation and extensive cross-references). For convenient sequence searches, UniProt also provides several non-redundant sequence databases. The UniProt NREF (UniRef) databases provide representative subsets of the knowledgebase suitable for efficient searching. The comprehensive UniProt Archive (UniParc) is updated daily from many public source databases. The UniProt databases can be accessed online or downloaded in several formats. The scientific community is encouraged to submit data for inclusion in UniProt.

Topics: UniProt (68%)

6,522 Citations

Open access - Journal Article - DOI: 10.1038/NG.2892
Martin Kircher1, Daniela Witten1, Preti Jain, Brian J. O'Roak2 +3 more, Institutions (2)
01 Mar 2014-Nature Genetics
Abstract: Our capacity to sequence human genomes has exceeded our ability to interpret genetic variation. Current genomic annotations tend to exploit a single information type (e.g. conservation) and/or are restricted in scope (e.g. to missense changes). Here, we describe Combined Annotation Dependent Depletion (CADD), a framework that objectively integrates many diverse annotations into a single, quantitative score. We implement CADD as a support vector machine trained to differentiate 14.7 million high-frequency human derived alleles from 14.7 million simulated variants. We pre-compute “C-scores” for all 8.6 billion possible human single nucleotide variants and enable scoring of short insertions/deletions. C-scores correlate with allelic diversity, annotations of functionality, pathogenicity, disease severity, experimentally measured regulatory effects, and complex trait associations, and highly rank known pathogenic variants within individual genomes. The ability of CADD to prioritize functional, deleterious, and pathogenic variants across many functional categories, effect sizes and genetic architectures is unmatched by any current annotation.

Topics: Genome-wide association study (54%), Genomics (51%)

4,148 Citations

Open access - Journal Article - DOI: 10.1093/NAR/GKY1049
Topics: UniProt (68%)

3,758 Citations