Journal ArticleDOI

CADD: predicting the deleteriousness of variants throughout the human genome.

08 Jan 2019-Nucleic Acids Research (Oxford University Press)-Vol. 47
TL;DR: This review covers the latest updates to CADD, including the most recent version, 1.4, which supports the human genome build GRCh38, and presents website updates that include simplified variant lookup, extended documentation, an Application Program Interface and improved mechanisms for integrating CADD scores into other tools or applications.
Abstract: Combined Annotation-Dependent Depletion (CADD) is a widely used measure of variant deleteriousness that can effectively prioritize causal variants in genetic analyses, particularly highly penetrant contributors to severe Mendelian disorders. CADD is an integrative annotation built from more than 60 genomic features, and can score human single nucleotide variants and short insertions and deletions anywhere in the reference assembly. CADD uses a machine learning model trained on a binary distinction between simulated de novo variants and variants that have arisen and become fixed in human populations since the split between humans and chimpanzees; the former are free of selective pressure and may thus include both neutral and deleterious alleles, while the latter are overwhelmingly neutral (or, at most, weakly deleterious) by virtue of having survived millions of years of purifying selection. Here we review the latest updates to CADD, including the most recent version, 1.4, which supports the human genome build GRCh38. We also present updates to our website that include simplified variant lookup, extended documentation, an Application Program Interface and improved mechanisms for integrating CADD scores into other tools or applications. CADD scores, software and documentation are available at https://cadd.gs.washington.edu.
Citations
Posted ContentDOI
29 Apr 2019-bioRxiv
TL;DR: This work uses unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity, enabling state-of-the-art supervised prediction of mutational effect and secondary structure, and improving state-of-the-art features for long-range contact prediction.
Abstract: In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In biology, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Learning the natural distribution of evolutionary protein sequence variation is a logical step toward predictive and generative modeling for biology. To this end we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million sequences spanning evolutionary diversity. The resulting model maps raw sequences to representations of biological properties without labels or prior domain knowledge. The learned representation space organizes sequences at multiple levels of biological granularity from the biochemical to proteomic levels. Learning recovers information about protein structure: secondary structure and residue-residue contacts can be extracted by linear projections from learned representations. With small amounts of labeled data, the ability to identify tertiary contacts is further improved. Learning on full sequence diversity rather than individual protein families increases recoverable information about secondary structure. We show the networks generalize by adapting them to variant activity prediction from sequences only, with results that are comparable to a state-of-the-art variant predictor that uses evolutionary and structurally derived features.

748 citations


Cites background from "CADD: predicting the deleteriousnes..."

  • ...Computational variant effect predictors are useful for assessing the effect of point mutations (Gray et al., 2018; Adzhubei et al., 2013; Kumar et al., 2009; Hecht et al., 2015; Rentzsch et al., 2018)....

    [...]

01 Nov 2017
TL;DR: ChromHMM combines multiple genome-wide epigenomic maps, and uses combinatorial and spatial mark patterns to infer a complete annotation for each cell type, and provides an automated enrichment analysis of the resulting annotations to facilitate the functional interpretations of each chromatin state.
Abstract: Noncoding DNA regions have central roles in human biology, evolution, and disease. ChromHMM helps to annotate the noncoding genome using epigenomic information across one or multiple cell types. It combines multiple genome-wide epigenomic maps, and uses combinatorial and spatial mark patterns to infer a complete annotation for each cell type. ChromHMM learns chromatin-state signatures using a multivariate hidden Markov model (HMM) that explicitly models the combinatorial presence or absence of each mark. ChromHMM uses these signatures to generate a genome-wide annotation for each cell type by calculating the most probable state for each genomic segment. ChromHMM provides an automated enrichment analysis of the resulting annotations to facilitate the functional interpretations of each chromatin state. ChromHMM is distinguished by its modeling emphasis on combinations of marks, its tight integration with downstream functional enrichment analyses, its speed, and its ease of use. Chromatin states are learned, annotations are produced, and enrichments are computed within 1 d.
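The chromatin-state idea in this abstract can be illustrated with a small sketch. It assumes each state emits every mark independently as present or absent (a product of Bernoulli terms) and picks the state with the highest posterior probability for each segment using a scaled forward-backward pass. The two states, two marks, and all probabilities below are invented for illustration; this is not ChromHMM's code or its learned parameters.

```python
def emission(obs, mark_probs):
    """P(obs | state): product of independent Bernoulli terms, one per mark."""
    p = 1.0
    for present, q in zip(obs, mark_probs):
        p *= q if present else (1.0 - q)
    return p


def most_probable_states(observations, init, trans, emit_probs):
    """Return the highest-posterior state for each segment (forward-backward)."""
    n = len(init)
    T = len(observations)
    e = [[emission(o, emit_probs[s]) for s in range(n)] for o in observations]

    # Forward pass, rescaled at each step to avoid numerical underflow.
    fwd = []
    cur = [init[s] * e[0][s] for s in range(n)]
    for t in range(T):
        if t > 0:
            cur = [e[t][s] * sum(fwd[t - 1][r] * trans[r][s] for r in range(n))
                   for s in range(n)]
        z = sum(cur)
        fwd.append([x / z for x in cur])

    # Backward pass, also rescaled.
    bwd = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        cur = [sum(trans[s][r] * e[t + 1][r] * bwd[t + 1][r] for r in range(n))
               for s in range(n)]
        z = sum(cur)
        bwd[t] = [x / z for x in cur]

    # Posterior is proportional to fwd * bwd; take the argmax per segment.
    return [max(range(n), key=lambda s: fwd[t][s] * bwd[t][s]) for t in range(T)]


# Hypothetical toy model: state 0 is "active", state 1 is "quiescent".
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit_probs = [[0.9, 0.8], [0.05, 0.1]]   # P(mark present | state), two marks
segments = [(1, 1), (1, 0), (0, 0), (0, 0)]
states = most_probable_states(segments, init, trans, emit_probs)
```

In this toy run the first two segments are assigned to state 0 and the last two to state 1.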

364 citations

Journal ArticleDOI
TL;DR: The data provide a structural basis of potential resistance against SARS‐CoV‐2 infection driven by ACE2 allelic variants.
Abstract: The recent pandemic of COVID-19, caused by SARS-CoV-2, is unarguably the most fearsome compared with the earlier outbreaks caused by other coronaviruses, SARS-CoV and MERS-CoV. Human ACE2 is now established as a receptor for the SARS-CoV-2 spike protein. Where variations in the viral spike protein, in turn, lead to the cross-species transmission of the virus, genetic variations in the host receptor ACE2 may also contribute to the susceptibility and/or resistance against the viral infection. This study aims to explore the binding of the proteins encoded by different human ACE2 allelic variants with SARS-CoV-2 spike protein. Briefly, coding variants of ACE2 corresponding to the reported binding sites for its attachment with coronavirus spike protein were selected and molecular models of these variants were constructed by homology modeling. The models were then superimposed over the native ACE2 and ACE2-spike protein complex, to observe structural changes in the ACE2 variants and their intermolecular interactions with SARS-CoV-2 spike protein, respectively. Despite strong overall structural similarities, the spatial orientation of the key interacting residues varies in the ACE2 variants compared with the wild-type molecule. Most ACE2 variants showed a similar binding affinity for SARS-CoV-2 spike protein as observed in the complex structure of wild-type ACE2 and SARS-CoV-2 spike protein. However, ACE2 alleles, rs73635825 (S19P) and rs143936283 (E329G) showed noticeable variations in their intermolecular interactions with the viral spike protein. In summary, our data provide a structural basis of potential resistance against SARS-CoV-2 infection driven by ACE2 allelic variants.

288 citations

Journal ArticleDOI
TL;DR: A review of the Human Gene Mutation Database aims to highlight how to make the most out of HGMD data in each setting.
Abstract: The Human Gene Mutation Database (HGMD®) constitutes a comprehensive collection of published germline mutations in nuclear genes that are thought to underlie, or are closely associated with, human inherited disease. At the time of writing (June 2020), the database contains in excess of 289,000 different gene lesions identified in over 11,100 genes, manually curated from 72,987 articles published in over 3100 peer-reviewed journals. There are two main groups of users who utilise HGMD on a regular basis: research scientists and clinical diagnosticians. This review aims to highlight how to make the most out of HGMD data in each setting.

268 citations


Cites methods from "CADD: predicting the deleteriousnes..."

  • ...The score is computed by HGMD using a supervised machine learning approach known as Random Forest (Breiman 2001), and is based upon multiple lines of evidence, including HGMD literature support for pathogenicity (placed on a scale of 1–10, with 1 being the lowest score and 10 being the highest), evolutionary conservation (100-way vertebrate alignment), variant allele frequency and in silico pathogenicity prediction including CADD (Rentzsch et al. 2019), PolyPhen2 (Adzhubei et al....

    [...]


Journal ArticleDOI
TL;DR: The authors integrated two specialized splicing scores into CADD (Combined Annotation Dependent Depletion; cadd.gs.washington.edu), a widely used tool for genome-wide variant effect prediction that was previously developed to weight and integrate diverse collections of genomic annotations.
Abstract: Splicing of genomic exons into mRNAs is a critical prerequisite for the accurate synthesis of human proteins. Genetic variants impacting splicing underlie a substantial proportion of genetic disease, but are challenging to identify beyond those occurring at donor and acceptor dinucleotides. To address this, various methods aim to predict variant effects on splicing. Recently, deep neural networks (DNNs) have been shown to achieve better results in predicting splice variants than other strategies. It has been unclear how best to integrate such process-specific scores into genome-wide variant effect predictors. Here, we use a recently published experimental data set to compare several machine learning methods that score variant effects on splicing. We integrate the best of those approaches into general variant effect prediction models and observe the effect on classification of known pathogenic variants. We integrate two specialized splicing scores into CADD (Combined Annotation Dependent Depletion; cadd.gs.washington.edu), a widely used tool for genome-wide variant effect prediction that we previously developed to weight and integrate diverse collections of genomic annotations. With this new model, CADD-Splice, we show that inclusion of splicing DNN effect scores substantially improves predictions across multiple variant categories, without compromising overall performance. While splice effect scores show superior performance on splice variants, specialized predictors cannot compete with other variant scores in general variant interpretation, as the latter account for nonsense and missense effects that do not alter splicing. Although only shown here for splice scores, we believe that the applied approach will generalize to other specific molecular processes, providing a path for the further improvement of genome-wide variant effect prediction.

252 citations

References
Journal Article
TL;DR: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems, focusing on bringing machine learning to non-specialists using a general-purpose high-level language.
Abstract: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.

47,974 citations


"CADD: predicting the deleteriousnes..." refers methods in this paper

  • ...4, a logistic regression model was fit using a fully open source pipeline based on SciPy (44) and scikit-learn (45)....

    [...]
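As a rough illustration of the kind of model this excerpt refers to, the sketch below fits a logistic regression that separates "simulated" variants (label 1) from "fixed" variants (label 0) and then uses the linear predictor as a raw score. To stay self-contained it uses plain gradient descent rather than scikit-learn, and the two features, the toy data, and the hyperparameters are all invented for illustration; none of this is the CADD pipeline itself.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def raw_score(x, w, b):
    """Linear predictor (log-odds) for one feature vector."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(raw_score(xi, w, b)) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Hypothetical features: [conservation score, is_missense flag].
X = [[0.9, 1], [0.8, 1], [0.7, 0], [0.6, 1],   # simulated variants (label 1)
     [0.2, 0], [0.1, 0], [0.3, 1], [0.2, 0]]   # fixed variants (label 0)
y = [1, 1, 1, 1, 0, 0, 0, 0]

w, b = fit_logistic(X, y)
high = raw_score([0.95, 1.0], w, b)   # resembles the simulated class
low = raw_score([0.05, 0.0], w, b)    # resembles the fixed class
```

A higher raw score here means the variant looks more like the simulated class than like the fixed class.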

Posted Content
TL;DR: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems.
Abstract: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.

28,898 citations

Journal ArticleDOI
14 Jan 2005-Cell
TL;DR: In a four-genome analysis of 3' UTRs, approximately 13,000 regulatory relationships were detected above the estimate of false-positive predictions, thereby implicating as miRNA targets more than 5300 human genes, which represented 30% of the gene set.

11,624 citations

Journal ArticleDOI
TL;DR: Presents PolyPhen-2, a new method and software tool that differs from the earlier PolyPhen in its set of predictive features, alignment pipeline, and method of classification; its performance, as measured by receiver operating characteristic curves, was consistently superior to that of PolyPhen.
Abstract: To the Editor: Applications of rapidly advancing sequencing technologies exacerbate the need to interpret individual sequence variants. Sequencing of phenotyped clinical subjects will soon become a method of choice in studies of the genetic causes of Mendelian and complex diseases. New exon capture techniques will direct sequencing efforts towards the most informative and easily interpretable protein-coding fraction of the genome. Thus, the demand for computational predictions of the impact of protein sequence variants will continue to grow. Here we present a new method and the corresponding software tool, PolyPhen-2 (http://genetics.bwh.harvard.edu/pph2/), which is different from the early tool PolyPhen in the set of predictive features, alignment pipeline, and the method of classification (Fig. 1a). PolyPhen-2 uses eight sequence-based and three structure-based predictive features (Supplementary Table 1) which were selected automatically by an iterative greedy algorithm (Supplementary Methods). The majority of these features involve comparison of a property of the wild-type (ancestral, normal) allele and the corresponding property of the mutant (derived, disease-causing) allele, which together define an amino acid replacement. The most informative features characterize how well the two human alleles fit into the pattern of amino acid replacements within the multiple sequence alignment of homologous proteins, how distant the protein harboring the first deviation from the human wild-type allele is from the human protein, and whether the mutant allele originated at a hypermutable site. The alignment pipeline selects the set of homologous sequences for the analysis using a clustering algorithm and then constructs and refines their multiple alignment (Supplementary Fig. 1). The functional significance of an allele replacement is predicted from its individual features (Supplementary Figs. 2–4) by a Naive Bayes classifier (Supplementary Methods).
Figure 1: PolyPhen-2 pipeline and prediction accuracy. (a) Overview of the algorithm. (b) Receiver operating characteristic (ROC) curves for predictions made by PolyPhen-2 using five-fold cross-validation on HumDiv (red) and HumVar (light green). UniRef100 (solid ...
We used two pairs of datasets to train and test PolyPhen-2. We compiled the first pair, HumDiv, from all 3,155 damaging alleles with known effects on the molecular function causing human Mendelian diseases, present in the UniProt database, together with 6,321 differences between human proteins and their closely related mammalian homologs, assumed to be non-damaging (Supplementary Methods). The second pair, HumVar, consists of all the 13,032 human disease-causing mutations from UniProt, together with 8,946 human nsSNPs without annotated involvement in disease, which were treated as non-damaging. We found that PolyPhen-2 performance, as presented by its receiver operating characteristic curves, was consistently superior compared to PolyPhen (Fig. 1b) and it also compared favorably with the three other popular prediction tools (Fig. 1c). For a false positive rate of 20%, PolyPhen-2 achieves the rate of true positive predictions of 92% and 73% on HumDiv and HumVar, respectively (Supplementary Table 2). One reason for a lower accuracy of predictions on HumVar is that nsSNPs assumed to be non-damaging in HumVar contain a sizable fraction of mildly deleterious alleles. In contrast, most amino acid replacements assumed non-damaging in HumDiv must be close to selective neutrality. Because alleles that are even mildly but unconditionally deleterious cannot be fixed in the evolving lineage, no method based on comparative sequence analysis is ideal for discriminating between drastically and mildly deleterious mutations, which are assigned to the opposite categories in HumVar. Another reason is that HumDiv uses an extra criterion to avoid possible erroneous annotations of damaging mutations.
For a mutation, PolyPhen-2 calculates the Naive Bayes posterior probability that this mutation is damaging and reports estimates of false positive (the chance that the mutation is classified as damaging when it is in fact non-damaging) and true positive (the chance that the mutation is classified as damaging when it is indeed damaging) rates. A mutation is also appraised qualitatively, as benign, possibly damaging, or probably damaging (Supplementary Methods). The user can choose between HumDiv- and HumVar-trained PolyPhen-2. Diagnostics of Mendelian diseases requires distinguishing mutations with drastic effects from all the remaining human variation, including abundant mildly deleterious alleles. Thus, HumVar-trained PolyPhen-2 should be used for this task. In contrast, HumDiv-trained PolyPhen-2 should be used for evaluating rare alleles at loci potentially involved in complex phenotypes, dense mapping of regions identified by genome-wide association studies, and analysis of natural selection from sequence data, where even mildly deleterious alleles must be treated as damaging.
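The two quantities named in this abstract, a Naive Bayes posterior and the true/false positive rates at a score cutoff, can be sketched generically as follows. The feature names, likelihood values, scores, and threshold are hypothetical and chosen only to make the arithmetic easy to follow; PolyPhen-2's actual features and trained parameters are not reproduced here.

```python
def naive_bayes_posterior(features, likelihoods, prior_damaging=0.5):
    """Posterior P(damaging | features), assuming independent features.

    likelihoods[name][value] = (P(value | damaging), P(value | non-damaging)).
    """
    p_dam = prior_damaging
    p_ben = 1.0 - prior_damaging
    for name, value in features.items():
        l_dam, l_ben = likelihoods[name][value]
        p_dam *= l_dam
        p_ben *= l_ben
    return p_dam / (p_dam + p_ben)


def rates_at_threshold(scores, labels, threshold):
    """Return (true positive rate, false positive rate) for score >= threshold."""
    tp = sum(1 for s, l in zip(scores, labels) if l and s >= threshold)
    fp = sum(1 for s, l in zip(scores, labels) if not l and s >= threshold)
    positives = sum(1 for l in labels if l)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives


# Hypothetical class-conditional likelihoods for two binary features.
likelihoods = {
    "conserved_site": {True: (0.8, 0.2), False: (0.2, 0.8)},
    "hypermutable_site": {True: (0.3, 0.5), False: (0.7, 0.5)},
}
posterior = naive_bayes_posterior(
    {"conserved_site": True, "hypermutable_site": False}, likelihoods
)

# Hypothetical scores for three damaging (1) and three non-damaging (0) variants.
tpr, fpr = rates_at_threshold(
    [0.9, 0.8, 0.4, 0.7, 0.2, 0.1], [1, 1, 1, 0, 0, 0], threshold=0.5
)
```

With these toy numbers the posterior works out to 0.28 / 0.33, and the cutoff of 0.5 recovers two of three damaging variants while flagging one of three non-damaging ones.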

11,571 citations


"CADD: predicting the deleteriousnes..." refers background in this paper

  • ...
    # | Name | Type | Description
    1 | (Chrom) | string | Chromosome
    2 | (Pos) | integer | Position (1-based)
    3 | Ref | factor | Reference allele (default: N)
    4 | Alt | factor | Observed allele (default: N)
    5 | Type | factor | Event type (SNV, DEL, INS)
    6 | Length | integer | Number of inserted/deleted bases
    7 | (Annotype) | factor | CodingTranscript, Intergenic, MotifFeature, NonCodingTranscript, RegulatoryFeature, Transcript
    8 | Consequence | factor | VEP consequence, priority selected by potential impact (default: UNKNOWN)
    9 | (ConsScore) | integer | Custom deleterious score assigned to Consequence
    10 | (ConsDetail) | string | Trimmed VEP consequence prior to simplification
    11 | GC | float | Percent GC in a window of +/- 75bp (default: 0.42)
    12 | CpG | float | Percent CpG in a window of +/- 75bp (default: 0.02)
    13 | MotifECount | integer | Total number of overlapping motifs (default: 0)
    14 | (MotifEName) | string | Name of sequence motif the position overlaps
    15 | MotifEHIPos | bool | Is the position considered highly informative for an overlapping motif by VEP (default: 0)
    16 | MotifEScoreChng | float | VEP score change for the overlapping motif site (default: 0)
    17 | oAA | factor | Reference amino acid (default: unknown)
    18 | nAA | factor | Amino acid of observed variant (default: unknown)
    19 | (GeneID) | string | ENSEMBL GeneID
    20 | (FeatureID) | string | ENSEMBL feature ID (Transcript ID or regulatory feature ID)
    21 | (GeneName) | string | GeneName provided in ENSEMBL annotation
    22 | (CCDS) | string | Consensus Coding Sequence ID
    23 | (Intron) | string | Intron number/Total number of exons
    24 | (Exon) | string | Exon number/Total number of exons
    25 | cDNApos | float | Base position from transcription start (default: 0*)
    26 | relcDNApos | float | Relative position in transcript (default: 0)
    27 | CDSpos | float | Base position from coding start (default: 0*)
    28 | relCDSpos | float | Relative position in coding sequence (default: 0)
    29 | protPos | float | Amino acid position from coding start (default: 0*)
    30 | relprotPos | float | Relative position in protein codon (default: 0)
    31 | Domain | factor | Domain annotation inferred from VEP annotation (ncoils, sigp, lcompl, hmmpanther, ndomain = "other named domain") (default: UD)
    32 | Dst2Splice | float | Distance to splice site in 20bp; positive: exonic, negative: intronic (default: 0)
    33 | Dst2SplType | factor | Closest splice site is ACCEPTOR or DONOR (default: unknown)
    34 | MinDistTSS | float | Distance to closest Transcribed Sequence Start (TSS) (default: 5.5)
    35 | MinDistTSE | float | Distance to closest Transcribed Sequence End (TSE) (default: 5.5)
    36 | SIFTcat | factor | SIFT category of change (default: UD)
    37 | SIFTval | float | SIFT score (default: 0*)
    38 | PolyPhenCat | factor | PolyPhen2 category of change (default: UD)
    39 | PolyPhenVal | float | PolyPhen2 score (default: 0*)
    40 | priPhCons | float | Primate PhastCons conservation score (excl. human) (default: 0.115)
    41 | mamPhCons | float | Mammalian PhastCons conservation score (excl. human) (default: 0.079)
    42 | verPhCons | float | Vertebrate PhastCons conservation score (excl. human) (default: 0.094)
    43 | priPhyloP | float | Primate PhyloP score (excl. human) (default: -0.033)
    44 | mamPhyloP | float | Mammalian PhyloP score (excl. human) (default: -0.038)
    45 | verPhyloP | float | Vertebrate PhyloP score (excl. human) (default: 0.017)
    46 | bStatistic | integer | Background selection score (default: 800)
    47 | targetScan | integer | targetscan (default: 0*)
    48 | mirSVR-Score | float | mirSVR-Score (default: 0*)
    49 | mirSVR-E | float | mirSVR-E (default: 0)
    50 | mirSVR-Aln | integer | mirSVR-Aln (default: 0)
    51 | cHmmTssA | float | Proportion of 127 cell types in cHmmTssA state (default: 0.0667*)
    52 | cHmmTssAFlnk | float | Proportion of 127 cell types in cHmmTssAFlnk state (default: 0.0667)
    53 | cHmmTxFlnk | float | Proportion of 127 cell types in cHmmTxFlnk state (default: 0.0667)
    54 | cHmmTx | float | Proportion of 127 cell types in cHmmTx state (default: 0.0667)
    55 | cHmmTxWk | float | Proportion of 127 cell types in cHmmTxWk state (default: 0.0667)
    56 | cHmmEnhG | float | Proportion of 127 cell types in cHmmEnhG state (default: 0.0667)
    57 | cHmmEnh | float | Proportion of 127 cell types in cHmmEnh state (default: 0.0667)
    58 | cHmmZnfRpts | float | Proportion of 127 cell types in cHmmZnfRpts state (default: 0.0667)
    59 | cHmmHet | float | Proportion of 127 cell types in cHmmHet state (default: 0.0667)
    60 | cHmmTssBiv | float | Proportion of 127 cell types in cHmmTssBiv state (default: 0.0667)
    61 | cHmmBivFlnk | float | Proportion of 127 cell types in cHmmBivFlnk state (default: 0.0667)
    62 | cHmmEnhBiv | float | Proportion of 127 cell types in cHmmEnhBiv state (default: 0.0667)
    63 | cHmmReprPC | float | Proportion of 127 cell types in cHmmReprPC state (default: 0.0667)
    64 | cHmmReprPCWk | float | Proportion of 127 cell types in cHmmReprPCWk state (default: 0.0667)
    65 | cHmmQuies | float | Proportion of 127 cell types in cHmmQuies state (default: 0.0667)
    66 | GerpRS | float | Gerp element score (default: 0)
    67 | GerpRSpval | float | Gerp element p-Value (default: 0)
    68 | GerpN | float | Neutral evolution score defined by GERP++ (default: 1.91)
    69 | GerpS | float | Rejected Substitution score defined by GERP++ (default: -0.2)
    70 | TFBS | float | Number of different overlapping ChIP transcription factor binding sites (default: 0)
    71 | TFBSPeaks | float | Number of overlapping ChIP transcription factor binding site peaks summed over different cell types/tissue (default: 0)
    72 | TFBSPeaksMax | float | Maximum value of overlapping ChIP transcription factor binding site peaks across cell types/tissue (default: 0)
    73 | tOverlapMotifs | float | Number of overlapping predicted TF motifs (default: 0)
    74 | motifDist | float | Reference minus alternate allele difference in nucleotide frequency within a predicted overlapping motif (default: 0)
    75 | Segway | factor | Result of genomic segmentation algorithm (default: unknown)
    76 | EncH3K27Ac | float | Maximum ENCODE H3K27 acetylation level (default: 0)
    77 | EncH3K4Me1 | float | Maximum ENCODE H3K4 methylation level (default: 0)
    78 | EncH3K4Me3 | float | Maximum ENCODE H3K4 trimethylation level (default: 0)
    79 | EncExp | float | Maximum ENCODE expression value (default: 0)
    80 | EncNucleo | float | Maximum of ENCODE Nucleosome position track score (default: 0)
    81 | EncOCC | integer | ENCODE open chromatin code (default: 5)
    82 | EncOCCombPVal | float | ENCODE combined p-Value (PHRED-scale) of Faire, Dnase, polII, CTCF, Myc evidence for open chromatin (default: 0)
    83 | EncOCDNasePVal | float | p-Value (PHRED-scale) of Dnase evidence for open chromatin (default: 0)
    84 | EncOCFairePVal | float | p-Value (PHRED-scale) of Faire evidence for open chromatin (default: 0)
    85 | EncOCpolIIPVal | float | p-Value (PHRED-scale) of polII evidence for open chromatin (default: 0)
    86 | EncOCctcfPVal | float | p-Value (PHRED-scale) of CTCF evidence for open chromatin (default: 0)
    87 | EncOCmycPVal | float | p-Value (PHRED-scale) of Myc evidence for open chromatin (default: 0)
    88 | EncOCDNaseSig | float | Peak signal for Dnase evidence of open chromatin (default: 0)
    89 | EncOCFaireSig | float | Peak signal for Faire evidence of open chromatin (default: 0)
    90 | EncOCpolIISig | float | Peak signal for polII evidence of open chromatin (default: 0)
    91 | EncOCctcfSig | float | Peak signal for CTCF evidence of open chromatin (default: 0)
    92 | EncOCmycSig | float | Peak signal for Myc evidence of open chromatin (default: 0)
    93 | Grantham | float | Grantham score: oAA,nAA (default: 0*)
    94 | Dist2Mutation | float | Distance between the closest gnomAD SNV up and downstream (position itself excluded) (default: 0*)
    95 | Freq100bp | integer | Number of frequent (MAF > 0.05) gnomAD SNV in 100 bp window nearby (default: 0)
    96 | Rare100bp | integer | Number of rare (MAF < 0.05) gnomAD SNV in 100 bp window nearby (default: 0)
    97 | Sngl100bp | integer | Number of single occurrence gnomAD SNV in 100 bp window nearby (default: 0)
    98 | Freq1000bp | integer | Number of frequent (MAF > 0.05) gnomAD SNV in 1000 bp window nearby (default: 0)
    99 | Rare1000bp | integer | Number of rare (MAF < 0.05) gnomAD SNV in 1000 bp window nearby (default: 0)
    100 | Sngl1000bp | integer | Number of single occurrence gnomAD SNV in 1000 bp window nearby (default: 0)
    101 | Freq10000bp | integer | Number of frequent (MAF > 0.05) gnomAD SNV in 10000 bp window nearby (default: 0)
    102 | Rare10000bp | integer | Number of rare (MAF < 0.05) gnomAD SNV in 10000 bp window nearby (default: 0)
    103 | Sngl10000bp | integer | Number of single occurrence gnomAD SNV in 10000 bp window nearby (default: 0)
    104 | dbscSNV-ada_score | float | Adaboost classifier score from dbscSNV (default: 0*)
    105 | dbscSNV-rf_score | float | Random forest classifier score from dbscSNV (default: 0*)
    106 | RawScore | float | Raw score from the model
    107 | PHRED | float | CADD PHRED Score
    * A Boolean indicator variable was created in order to handle undefined values....

    [...]
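The table footnote says that annotations whose default is marked with "*" get a Boolean indicator variable for undefined values. One way to read that is sketched below: a missing value is replaced by the listed default, and a companion 0/1 column records that it was missing. The three column names and defaults are taken from the table above, but the `_undefined` suffix, the `impute` helper, and the use of `None` for missing values are assumptions made for this example, not the CADD implementation.

```python
# Defaults copied from the annotation table; only a few columns are shown.
DEFAULTS = {"SIFTval": 0.0, "PolyPhenVal": 0.0, "GerpN": 1.91}

# Columns whose default carries a "*" in the table get an indicator variable.
WITH_INDICATOR = {"SIFTval", "PolyPhenVal"}


def impute(record):
    """Fill undefined values with defaults and add 0/1 'undefined' indicators."""
    out = {}
    for name, default in DEFAULTS.items():
        value = record.get(name)
        missing = value is None
        out[name] = default if missing else float(value)
        if name in WITH_INDICATOR:
            # Hypothetical naming convention for the indicator column.
            out[name + "_undefined"] = 1 if missing else 0
    return out


# A non-coding variant has no SIFT or PolyPhen2 score but does have GerpN.
row = impute({"SIFTval": None, "GerpN": 4.2})
```

The indicator lets a linear model distinguish "score is genuinely 0" from "score was not available", which a bare default value cannot do.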

  • ...Examples of annotations include transcript information like distance to exon-intron boundaries, DNase hypersensitivity, transcription factor binding, expression levels in commonly studied cell lines and amino acid substitution scores for protein coding sequences like Grantham (20), SIFT (21) and PolyPhen2 (22)....

    [...]

  • ...
    # | Name | Type | Description
    1 | (Chrom) | string | Chromosome
    2 | (Pos) | integer | Position (1-based)
    3 | Ref | factor | Reference allele (default: N)
    4 | Alt | factor | Observed allele (default: N)
    5 | Type | factor | Event type (SNV, DEL, INS)
    6 | Length | integer | Number of inserted/deleted bases
    7 | (AnnoType) | factor | CodingTranscript, Intergenic, MotifFeature, NonCodingTranscript, RegulatoryFeature, Transcript
    8 | Consequence | factor | VEP consequence, priority selected by potential impact (default: UNKNOWN)
    9 | (ConsScore) | integer | Custom deleterious score assigned to Consequence
    10 | (ConsDetail) | string | Trimmed VEP consequence prior to simplification
    11 | GC | float | Percent GC in a window of +/- 75bp (default: 0.42)
    12 | CpG | float | Percent CpG in a window of +/- 75bp (default: 0.02)
    13 | motifECount | integer | Total number of overlapping motifs (default: 0)
    14 | (motifEName) | string | Name of sequence motif the position overlaps
    15 | motifEHIPos | bool | Is the position considered highly informative for an overlapping motif by VEP (default: 0)
    16 | motifEScoreChng | float | VEP score change for the overlapping motif site (default: 0)
    17 | oAA | factor | Reference amino acid (default: unknown)
    18 | nAA | factor | Amino acid of observed variant (default: unknown)
    19 | (GeneID) | string | ENSEMBL GeneID
    20 | (FeatureID) | string | ENSEMBL feature ID (Transcript ID or regulatory feature ID)
    21 | (GeneName) | string | GeneName provided in ENSEMBL annotation
    22 | (CCDS) | string | Consensus Coding Sequence ID
    23 | (Intron) | string | Intron number/Total number of exons
    24 | (Exon) | string | Exon number/Total number of exons
    25 | cDNApos | float | Base position from transcription start (default: 0*)
    26 | relcDNApos | float | Relative position in transcript (default: 0)
    27 | CDSpos | float | Base position from coding start (default: 0*)
    28 | relCDSpos | float | Relative position in coding sequence (default: 0)
    29 | protPos | float | Amino acid position from coding start (default: 0*)
    30 | relProtPos | float | Relative position in protein codon (default: 0)
    31 | Domain | factor | Domain annotation inferred from VEP annotation (ncoils, sigp, lcompl, hmmpanther, ndomain = "other named domain") (default: UD)
    32 | Dst2Splice | float | Distance to splice site in 20bp; positive: exonic, negative: intronic (default: 0)
    33 | Dst2SplType | factor | Closest splice site is ACCEPTOR or DONOR (default: unknown)
    34 | minDistTSS | float | Distance to closest Transcribed Sequence Start (TSS) (default: 5.5)
    35 | minDistTSE | float | Distance to closest Transcribed Sequence End (TSE) (default: 5.5)
    36 | SIFTcat | factor | SIFT category of change (default: UD)
    37 | SIFTval | float | SIFT score (default: 0*)
    38 | PolyPhenCat | factor | PolyPhen2 category of change (default: UD)
    39 | PolyPhenVal | float | PolyPhen2 score (default: 0*)
    40 | priPhCons | float | Primate PhastCons conservation score (excl. human) (default: 0.0)
    41 | mamPhCons | float | Mammalian PhastCons conservation score (excl. human) (default: 0.0)
    42 | verPhCons | float | Vertebrate PhastCons conservation score (excl. human) (default: 0.0)
    43 | priPhyloP | float | Primate PhyloP score (excl. human) (default: -0.029)
    44 | mamPhyloP | float | Mammalian PhyloP score (excl. human) (default: -0.005)
    45 | verPhyloP | float | Vertebrate PhyloP score (excl. human) (default: 0.042)
    46 | bStatistic | integer | Background selection score (default: 800)
    47 | targetScan | integer | targetscan (default: 0*)
    48 | mirSVR-Score | float | mirSVR-Score (default: 0*)
    49 | mirSVR-E | float | mirSVR-E (default: 0)
    50 | mirSVR-Aln | integer | mirSVR-Aln (default: 0)
    51 | cHmm_E1 | float | Number of 48 cell types in chromHMM state E1_poised (default: 1.92*)
    52 | cHmm_E2 | float | Number of 48 cell types in chromHMM state E2_repressed (default: 1.92)
    53 | cHmm_E3 | float | Number of 48 cell types in chromHMM state E3_dead (default: 1.92)
    54 | cHmm_E4 | float | Number of 48 cell types in chromHMM state E4_dead (default: 1.92)
    55 | cHmm_E5 | float | Number of 48 cell types in chromHMM state E5_repressed (default: 1.92)
    56 | cHmm_E6 | float | Number of 48 cell types in chromHMM state E6_repressed (default: 1.92)
    57 | cHmm_E7 | float | Number of 48 cell types in chromHMM state E7_weak (default: 1.92)
    58 | cHmm_E8 | float | Number of 48 cell types in chromHMM state E8_gene (default: 1.92)
    59 | cHmm_E9 | float | Number of 48 cell types in chromHMM state E9_gene (default: 1.92)
    60 | cHmm_E10 | float | Number of 48 cell types in chromHMM state E10_gene (default: 1.92)
    61 | cHmm_E11 | float | Number of 48 cell types in chromHMM state E11_gene (default: 1.92)
    62 | cHmm_E12 | float | Number of 48 cell types in chromHMM state E12_distal (default: 1.92)
    63 | cHmm_E13 | float | Number of 48 cell types in chromHMM state E13_distal (default: 1.92)
    64 | cHmm_E14 | float | Number of 48 cell types in chromHMM state E14_distal (default: 1.92)
    65 | cHmm_E15 | float | Number of 48 cell types in chromHMM state E15_weak (default: 1.92)
    66 | cHmm_E16 | float | Number of 48 cell types in chromHMM state E16_tss (default: 1.92)
    67 | cHmm_E17 | float | Number of 48 cell types in chromHMM state E17_proximal (default: 1.92)
    68 | cHmm_E18 | float | Number of 48 cell types in chromHMM state E18_proximal (default: 1.92)
    69 | cHmm_E19 | float | Number of 48 cell types in chromHMM state E19_tss (default: 1.92)
    70 | cHmm_E20 | float | Number of 48 cell types in chromHMM state E20_poised (default: 1.92)
    71 | cHmm_E21 | float | Number of 48 cell types in chromHMM state E21_dead (default: 1.92)
    72 | cHmm_E22 | float | Number of 48 cell types in chromHMM state E22_repressed (default: 1.92)
    73 | cHmm_E23 | float | Number of 48 cell types in chromHMM state E23_weak (default: 1.92)
    74 | cHmm_E24 | float | Number of 48 cell types in chromHMM state E24_distal (default: 1.92)
    75 | cHmm_E25 | float | Number of 48 cell types in chromHMM state E25_distal (default: 1.92)
    76 | GerpRS | float | Gerp element score (default: 0)
    77 | GerpRSpval | float | Gerp element p-Value (default: 0)
    78 | GerpN | float | Neutral evolution score defined by GERP++ (default: 3.0)
    79 | GerpS | float | Rejected Substitution score defined by GERP++ (default: -0.2)
    80 | tOverlapMotifs | float | Number of overlapping predicted TF motifs
    81 | motifDist | float | Reference minus alternate allele difference in nucleotide frequency within a predicted overlapping motif (default: 0)
    82 | EncodeH3K4me1-sum | float | Sum of Encode H3K4me1 levels (from 13 cell lines) (default: 0.76)
    83 | EncodeH3K4me1-max | float | Maximum Encode H3K4me1 level (from 13 cell lines) (default: 0.37)
    84 | EncodeH3K4me2-sum | float | Sum of Encode H3K4me2 levels (from 14 cell lines) (default: 0.73)
    85 | EncodeH3K4me2-max | float | Maximum Encode H3K4me2 level (from 14 cell lines) (default: 0.37)
    86 | EncodeH3K4me3-sum | float | Sum of Encode H3K4me3 levels (from 14 cell lines) (default: 0.81)
    87 | EncodeH3K4me3-max | float | Maximum Encode H3K4me3 level (from 14 cell lines) (default: 0.38)
    88 | EncodeH3K9ac-sum | float | Sum of Encode H3K9ac levels (from 13 cell lines) (default: 0.82)
    89 | EncodeH3K9ac-max | float | Maximum Encode H3K9ac level (from 13 cell lines) (default: 0.41)
    90 | EncodeH3K9me3-sum | float | Sum of Encode H3K9me3 levels (from 14 cell lines) (default: 0.81)
    91 | EncodeH3K9me3-max | float | Maximum Encode H3K9me3 level (from 14 cell lines) (default: 0.38)
    92 | EncodeH3K27ac-sum | float | Sum of Encode H3K27ac levels (from 14 cell lines) (default: 0.74)
    93 | EncodeH3K27ac-max | float | Maximum Encode H3K27ac level (from 14 cell lines) (default: 0.36)
    94 | EncodeH3K27me3-sum | float | Sum of Encode H3K27me3 levels (from 14 cell lines) (default: 0.93)
    95 | EncodeH3K27me3-max | float | Maximum Encode H3K27me3 level (from 14 cell lines) (default: 0.47)
    96 | EncodeH3K36me3-sum | float | Sum of Encode H3K36me3 levels (from 10 cell lines) (default: 0.71)
    97 | EncodeH3K36me3-max | float | Maximum Encode H3K36me3 level (from 10 cell lines) (default: 0.39)
    98 | EncodeH3K79me2-sum | float | Sum of Encode H3K79me2 levels (from 13 cell lines) (default: 0.64)
    99 | EncodeH3K79me2-max | float | Maximum Encode H3K79me2 level (from 13 cell lines) (default: 0.34)
    100 | EncodeH4K20me1-sum | float | Sum of Encode H4K20me1 levels (from 11 cell lines) (default: 0.88)
    101 | EncodeH4K20me1-max | float | Maximum Encode H4K20me1 level (from 11 cell lines) (default: 0.47)
    102 | EncodeH2AFZ-sum | float | Sum of Encode H2AFZ levels (from 13 cell lines) (default: 0.9)
    103 | EncodeH2AFZ-max | float | Maximum Encode H2AFZ level (from 13 cell lines) (default: 0.42)
    104 | EncodeDNase-sum | float | Sum of Encode DNase-seq levels (from 12 cell lines) (default: 0.0)
    105 | EncodeDNase-max | float | Maximum Encode DNase-seq level (from 12 cell lines) (default: 0.0)
    106 | EncodetotalRNA-sum | float | Sum of Encode totalRNA-seq levels (from 10 cell lines always minus and plus strand) (default: 0.0)
    107 | EncodetotalRNA-max | float | Maximum Encode totalRNA-seq level (from 10 cell lines, minus and plus strand separately) (default: 0.0)
    108 | Grantham | float | Grantham score: oAA,nAA (default: 0*)
    109 | Dist2Mutation | float | Distance between the closest BRAVO SNV up and downstream (position itself excluded) (default: 0*)
    110 | Freq100bp | integer | Number of frequent (MAF > 0.05) BRAVO SNV in 100 bp window nearby (default: 0)
    111 | Rare100bp | integer | Number of rare (MAF < 0.05) BRAVO SNV in 100 bp window nearby (default: 0)
    112 | Sngl100bp | integer | Number of single occurrence BRAVO SNV in 100 bp window nearby (default: 0)
    113 | Freq1000bp | integer | Number of frequent (MAF > 0.05) BRAVO SNV in 1000 bp window nearby (default: 0)
    114 | Rare1000bp | integer | Number of rare
EncodeH3K27me3-sum float Sum of Encode H3K27me3 levels (from 14 cell lines) (default: 0.93) 95 EncodeH3K27me3-max float Maximum Encode H3K27me3 level (from 14 cell lines) (default: 0.47) 96 EncodeH3K36me3-sum float Sum of Encode H3K36me3 levels (from 10 cell lines) (default: 0.71) 97 EncodeH3K36me3-max float Maximum Encode H3K36me3 level (from 10 cell lines) (default: 0.39) 98 EncodeH3K79me2-sum float Sum of Encode H3K79me2 levels (from 13 cell lines) (default: 0.64) 99 EncodeH3K79me2-max float Maximum Encode H3K79me2 level (from 13 cell lines) (default: 0.34) 100 EncodeH4K20me1-sum float Sum of Encode H4K20me1 levels (from 11 cell lines) (default: 0.88) 101 EncodeH4K20me1-max float Maximum Encode H4K20me1 level (from 11 cell lines) (default: 0.47) 102 EncodeH2AFZ-sum float Sum of Encode H2AFZ levels (from 13 cell lines) (default: 0.9) 103 EncodeH2AFZ-max float Maximum Encode H2AFZ level (from 13 cell lines) (default: 0.42) 104 EncodeDNase-sum float Sum of Encode DNase-seq levels (from 12 cell lines) (default: 0.0) 105 EncodeDNase-max float Maximum Encode DNase-seq level (from 12 cell lines) (default: 0.0) 106 EncodetotalRNA-sum float Sum of Encode totalRNA-seq levels (from 10 cell lines always minus and plus strand) (default: 0.0) 107 EncodetotalRNA-max float Maximum Encode totalRNA-seq level (from 10 cell lines, minus and plus strand separately) (default: 0.0) 108 Grantham float Grantham score: oAA,nAA (default: 0*) 109 Dist2Mutation float Distance between the closest BRAVO SNV up and downstream (position itself excluded) (default: 0*) 110 Freq100bp integer Number of frequent (MAF > 0.05) BRAVO SNV in 100 bp window nearby (default: 0) 111 Rare100bp integer Number of rare (MAF < 0.05) BRAVO SNV in 100 bp window nearby (default: 0) 112 Sngl100bp integer Number of single occurrence BRAVO SNV in 100 bp window nearby (default: 0) 113 Freq1000bp integer Number of frequent (MAF > 0.05) BRAVO SNV in 1000 bp window nearby (default: 0) 114 Rare1000bp integer Number of rare 
(MAF < 0.05) BRAVO SNV in 1000 bp window nearby (default: 0) 115 Sngl1000bp integer Number of single occurrence BRAVO SNV in 1000 bp window nearby (default: 0) 116 Freq10000bp integer Number of frequent (MAF > 0.05) BRAVO SNV in 10000 bp window nearby (default: 0) 117 Rare10000bp integer Number of rare (MAF < 0.05) BRAVO SNV in 10000 bp window nearby (default: 0) 118 Sngl10000bp integer Number of single occurrence BRAVO SNV in 10000 bp window nearby (default: 0) 119 EnsembleRegulatoryFeature factor Matches in the Ensemble Regulatory Built (similar to annotype) (default: NA) 120 dbscSNV-ada_score float Adaboost classifier score from dbscSNV (default: 0*) 121 dbscSNV-rf_score float Random forest classifier score from dbscSNV (default: 0*) 122 RemapOverlapTF integer Remap number of different transcription factors binding (default: -0.5) 123 RemapOverlapCL integer Remap number of different transcription factor - cell line combinations binding (default: -0.5) 124 RawScore float Raw score from the model 125 PHRED float CADD PHRED Score * A Boolean indicator variable was created in order to handle undefined values....

    [...]
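
The annotation table excerpted above lists a default for most columns, and its footnote says that a Boolean indicator variable was created for starred columns to handle undefined values. A minimal sketch of that idea in Python follows. It is an illustration under stated assumptions, not CADD's actual code: the function name `impute_with_indicators`, the `_undefined` suffix, and the use of `None` for an undefined value are all hypothetical, and only a handful of columns from the table are included.

```python
from typing import Dict, Optional, Set

# Defaults copied from the annotation table; only a few columns are shown.
DEFAULTS: Dict[str, float] = {
    "SIFTval": 0.0,
    "PolyPhenVal": 0.0,
    "priPhCons": 0.0,
    "bStatistic": 800,
    "GerpN": 3.0,
    "GerpS": -0.2,
}

# Columns whose default is starred in the table (indicator variable created).
FLAGGED: Set[str] = {"SIFTval", "PolyPhenVal"}


def impute_with_indicators(
    row: Dict[str, Optional[float]],
    defaults: Dict[str, float] = DEFAULTS,
    flagged: Set[str] = FLAGGED,
) -> Dict[str, float]:
    """Fill undefined annotations with defaults and add 0/1 indicator columns.

    The "<column>_undefined" naming is an assumption made for this sketch.
    """
    out: Dict[str, float] = {}
    for column, default in defaults.items():
        value = row.get(column)
        missing = value is None
        out[column] = default if missing else value
        if column in flagged:
            # 1 marks a value that was undefined and replaced by the default.
            out[column + "_undefined"] = 1 if missing else 0
    return out


# Hypothetical variant with no SIFT score and no background selection score.
variant = {"SIFTval": None, "PolyPhenVal": 0.98, "GerpN": 5.1}
features = impute_with_indicators(variant)
```

Keeping the indicator next to the imputed value lets a downstream model tell a true score of 0 apart from a score that was never defined, which is the reason the footnote gives for creating it.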

Journal ArticleDOI
TL;DR: ANNOVAR annotates single nucleotide variants and insertions/deletions: it examines their functional consequences on genes, infers cytogenetic bands, reports functional importance scores, finds variants in conserved regions, and identifies variants reported in the 1000 Genomes Project and dbSNP.
Abstract: High-throughput sequencing platforms are generating massive amounts of genetic variation data for diverse genomes, but it remains a challenge to pinpoint a small subset of functionally important variants. To fill these unmet needs, we developed the ANNOVAR tool to annotate single nucleotide variants (SNVs) and insertions/deletions, such as examining their functional consequence on genes, inferring cytogenetic bands, reporting functional importance scores, finding variants in conserved regions, or identifying variants reported in the 1000 Genomes Project and dbSNP. ANNOVAR can utilize annotation databases from the UCSC Genome Browser or any annotation data set conforming to Generic Feature Format version 3 (GFF3). We also illustrate a 'variants reduction' protocol on 4.7 million SNVs and indels from a human genome, including two causal mutations for Miller syndrome, a rare recessive disease. Through a stepwise procedure, we excluded variants that are unlikely to be causal, and identified 20 candidate genes including the causal gene. Using a desktop computer, ANNOVAR requires ∼4 min to perform gene-based annotation and ∼15 min to perform variants reduction on 4.7 million variants, making it practical to handle hundreds of human genomes in a day. ANNOVAR is freely available at http://www.openbioinformatics.org/annovar/.

10,461 citations


"CADD: predicting the deleteriousnes..." refers methods in this paper

  • ...In addition, our SNV scores are available through a number of third-party sources, such as dbNSFP (48), as a plugin for Ensembl VEP, ANNOVAR (49), SeattleSeq (50), ExAC/gnomAD (8) and PopViz (51)....

    [...]

Related Papers (5)
18 Aug 2016-Nature
Monkol Lek, Konrad J. Karczewski, Eric Vallabh Minikel, Kaitlin E. Samocha, Eric Banks, Timothy Fennell, Anne H. O’Donnell-Luria, James S. Ware, Andrew J. Hill, Beryl B. Cummings, Taru Tukiainen, Daniel P. Birnbaum, Jack A. Kosmicki, Laramie E. Duncan, Karol Estrada, Fengmei Zhao, James Zou, Emma Pierce-Hoffman, Joanne Berghout, David Neil Cooper, Nicole A. Deflaux, Mark A. DePristo, Ron Do, Jason Flannick, Menachem Fromer, Laura D. Gauthier, Jackie Goldstein, Namrata Gupta, Daniel P. Howrigan, Adam Kiezun, Mitja I. Kurki, Ami Levy Moonshine, Pradeep Natarajan, Lorena Orozco, Gina M. Peloso, Ryan Poplin, Manuel A. Rivas, Valentin Ruano-Rubio, Samuel A. Rose, Douglas M. Ruderfer, Khalid Shakir, Peter D. Stenson, Christine Stevens, Brett Thomas, Grace Tiao, María Teresa Tusié-Luna, Ben Weisburd, Hong-Hee Won, Dongmei Yu, David Altshuler, Diego Ardissino, Michael Boehnke, John Danesh, Stacey Donnelly, Roberto Elosua, Jose C. Florez, Stacey Gabriel, Gad Getz, Stephen J. Glatt, Christina M. Hultman, Sekar Kathiresan, Markku Laakso, Steven A. McCarroll, Mark I. McCarthy, Dermot P.B. McGovern, Ruth McPherson, Benjamin M. Neale, Aarno Palotie, Shaun Purcell, Danish Saleheen, Jeremiah M. Scharf, Pamela Sklar, Patrick F. Sullivan, Jaakko Tuomilehto, Ming T. Tsuang, Hugh Watkins, James G. Wilson, Mark J. Daly, Daniel G. MacArthur