scispace - formally typeset
Search or ask a question
Author

Philipp Rentzsch

Bio: Philipp Rentzsch is an academic researcher from Charité. The author has contributed to research in topics: splice & Computer science. The author has an hindex of 2, co-authored 2 publications receiving 1235 citations.

Papers
More filters
Journal ArticleDOI
TL;DR: The latest updates to CADD are reviewed, including the most recent version, 1.4, which supports the human genome build GRCh38, and also present updates to the website that include simplified variant lookup, extended documentation, an Application Program Interface and improved mechanisms for integrating CADD scores into other tools or applications.
Abstract: Combined Annotation-Dependent Depletion (CADD) is a widely used measure of variant deleteriousness that can effectively prioritize causal variants in genetic analyses, particularly highly penetrant contributors to severe Mendelian disorders. CADD is an integrative annotation built from more than 60 genomic features, and can score human single nucleotide variants and short insertion and deletions anywhere in the reference assembly. CADD uses a machine learning model trained on a binary distinction between simulated de novo variants and variants that have arisen and become fixed in human populations since the split between humans and chimpanzees; the former are free of selective pressure and may thus include both neutral and deleterious alleles, while the latter are overwhelmingly neutral (or, at most, weakly deleterious) by virtue of having survived millions of years of purifying selection. Here we review the latest updates to CADD, including the most recent version, 1.4, which supports the human genome build GRCh38. We also present updates to our website that include simplified variant lookup, extended documentation, an Application Program Interface and improved mechanisms for integrating CADD scores into other tools or applications. CADD scores, software and documentation are available at https://cadd.gs.washington.edu.

2,091 citations

Journal ArticleDOI
TL;DR: The authors integrated two specialized splicing scores into CADD (Combined Annotation Dependent Depletion; cadd.gs.washington.edu ), a widely used tool for genome-wide variant effect prediction that was previously developed to weight and integrate diverse collections of genomic annotations.
Abstract: Splicing of genomic exons into mRNAs is a critical prerequisite for the accurate synthesis of human proteins. Genetic variants impacting splicing underlie a substantial proportion of genetic disease, but are challenging to identify beyond those occurring at donor and acceptor dinucleotides. To address this, various methods aim to predict variant effects on splicing. Recently, deep neural networks (DNNs) have been shown to achieve better results in predicting splice variants than other strategies. It has been unclear how best to integrate such process-specific scores into genome-wide variant effect predictors. Here, we use a recently published experimental data set to compare several machine learning methods that score variant effects on splicing. We integrate the best of those approaches into general variant effect prediction models and observe the effect on classification of known pathogenic variants. We integrate two specialized splicing scores into CADD (Combined Annotation Dependent Depletion; cadd.gs.washington.edu ), a widely used tool for genome-wide variant effect prediction that we previously developed to weight and integrate diverse collections of genomic annotations. With this new model, CADD-Splice, we show that inclusion of splicing DNN effect scores substantially improves predictions across multiple variant categories, without compromising overall performance. While splice effect scores show superior performance on splice variants, specialized predictors cannot compete with other variant scores in general variant interpretation, as the latter account for nonsense and missense effects that do not alter splicing. Although only shown here for splice scores, we believe that the applied approach will generalize to other specific molecular processes, providing a path for the further improvement of genome-wide variant effect prediction.

252 citations


Cited by
More filters
Posted ContentDOI
29 Apr 2019-bioRxiv
TL;DR: This work uses unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity, enabling state-of-the-art supervised prediction of mutational effect and secondary structure, and improving state- of- the-art features for long-range contact prediction.
Abstract: In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In biology, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Learning the natural distribution of evolutionary protein sequence variation is a logical step toward predictive and generative modeling for biology. To this end we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million sequences spanning evolutionary diversity. The resulting model maps raw sequences to representations of biological properties without labels or prior domain knowledge. The learned representation space organizes sequences at multiple levels of biological granularity from the biochemical to proteomic levels. Learning recovers information about protein structure: secondary structure and residue-residue contacts can be extracted by linear projections from learned representations. With small amounts of labeled data, the ability to identify tertiary contacts is further improved. Learning on full sequence diversity rather than individual protein families increases recoverable information about secondary structure. We show the networks generalize by adapting them to variant activity prediction from sequences only, with results that are comparable to a state-of-the-art variant predictor that uses evolutionary and structurally derived features.

748 citations

01 Nov 2017
TL;DR: ChromHMM combines multiple genome-wide epigenomic maps, and uses combinatorial and spatial mark patterns to infer a complete annotation for each cell type, and provides an automated enrichment analysis of the resulting annotations to facilitate the functional interpretations of each chromatin state.
Abstract: Noncoding DNA regions have central roles in human biology, evolution, and disease. ChromHMM helps to annotate the noncoding genome using epigenomic information across one or multiple cell types. It combines multiple genome-wide epigenomic maps, and uses combinatorial and spatial mark patterns to infer a complete annotation for each cell type. ChromHMM learns chromatin-state signatures using a multivariate hidden Markov model (HMM) that explicitly models the combinatorial presence or absence of each mark. ChromHMM uses these signatures to generate a genome-wide annotation for each cell type by calculating the most probable state for each genomic segment. ChromHMM provides an automated enrichment analysis of the resulting annotations to facilitate the functional interpretations of each chromatin state. ChromHMM is distinguished by its modeling emphasis on combinations of marks, its tight integration with downstream functional enrichment analyses, its speed, and its ease of use. Chromatin states are learned, annotations are produced, and enrichments are computed within 1 d.

364 citations

Journal ArticleDOI
TL;DR: The data provide a structural basis of potential resistance against SARS‐CoV‐2 infection driven by ACE2 allelic variants.
Abstract: The recent pandemic of COVID-19, caused by SARS-CoV-2, is unarguably the most fearsome compared with the earlier outbreaks caused by other coronaviruses, SARS-CoV and MERS-CoV. Human ACE2 is now established as a receptor for the SARS-CoV-2 spike protein. Where variations in the viral spike protein, in turn, lead to the cross-species transmission of the virus, genetic variations in the host receptor ACE2 may also contribute to the susceptibility and/or resistance against the viral infection. This study aims to explore the binding of the proteins encoded by different human ACE2 allelic variants with SARS-CoV-2 spike protein. Briefly, coding variants of ACE2 corresponding to the reported binding sites for its attachment with coronavirus spike protein were selected and molecular models of these variants were constructed by homology modeling. The models were then superimposed over the native ACE2 and ACE2-spike protein complex, to observe structural changes in the ACE2 variants and their intermolecular interactions with SARS-CoV-2 spike protein, respectively. Despite strong overall structural similarities, the spatial orientation of the key interacting residues varies in the ACE2 variants compared with the wild-type molecule. Most ACE2 variants showed a similar binding affinity for SARS-CoV-2 spike protein as observed in the complex structure of wild-type ACE2 and SARS-CoV-2 spike protein. However, ACE2 alleles, rs73635825 (S19P) and rs143936283 (E329G) showed noticeable variations in their intermolecular interactions with the viral spike protein. In summary, our data provide a structural basis of potential resistance against SARS-CoV-2 infection driven by ACE2 allelic variants.

288 citations

Journal ArticleDOI
TL;DR: A review of the Human Gene Mutation Database aims to highlight how to make the most out of HGMD data in each setting.
Abstract: The Human Gene Mutation Database (HGMD®) constitutes a comprehensive collection of published germline mutations in nuclear genes that are thought to underlie, or are closely associated with human inherited disease At the time of writing (June 2020), the database contains in excess of 289,000 different gene lesions identified in over 11,100 genes manually curated from 72,987 articles published in over 3100 peer-reviewed journals There are primarily two main groups of users who utilise HGMD on a regular basis; research scientists and clinical diagnosticians This review aims to highlight how to make the most out of HGMD data in each setting

268 citations

Journal ArticleDOI
TL;DR: The authors integrated two specialized splicing scores into CADD (Combined Annotation Dependent Depletion; cadd.gs.washington.edu ), a widely used tool for genome-wide variant effect prediction that was previously developed to weight and integrate diverse collections of genomic annotations.
Abstract: Splicing of genomic exons into mRNAs is a critical prerequisite for the accurate synthesis of human proteins. Genetic variants impacting splicing underlie a substantial proportion of genetic disease, but are challenging to identify beyond those occurring at donor and acceptor dinucleotides. To address this, various methods aim to predict variant effects on splicing. Recently, deep neural networks (DNNs) have been shown to achieve better results in predicting splice variants than other strategies. It has been unclear how best to integrate such process-specific scores into genome-wide variant effect predictors. Here, we use a recently published experimental data set to compare several machine learning methods that score variant effects on splicing. We integrate the best of those approaches into general variant effect prediction models and observe the effect on classification of known pathogenic variants. We integrate two specialized splicing scores into CADD (Combined Annotation Dependent Depletion; cadd.gs.washington.edu ), a widely used tool for genome-wide variant effect prediction that we previously developed to weight and integrate diverse collections of genomic annotations. With this new model, CADD-Splice, we show that inclusion of splicing DNN effect scores substantially improves predictions across multiple variant categories, without compromising overall performance. While splice effect scores show superior performance on splice variants, specialized predictors cannot compete with other variant scores in general variant interpretation, as the latter account for nonsense and missense effects that do not alter splicing. Although only shown here for splice scores, we believe that the applied approach will generalize to other specific molecular processes, providing a path for the further improvement of genome-wide variant effect prediction.

252 citations