Journal ArticleDOI

Predicting Functional Effect of Human Missense Mutations Using PolyPhen-2

Abstract: PolyPhen-2 (Polymorphism Phenotyping v2), available as software and via a Web server, predicts the possible impact of amino acid substitutions on the stability and function of human proteins using structural and comparative evolutionary considerations. It performs functional annotation of single-nucleotide polymorphisms (SNPs), maps coding SNPs to gene transcripts, extracts protein sequence annotations and structural attributes, and builds conservation profiles. It then estimates the probability of the missense mutation being damaging based on a combination of all these properties. PolyPhen-2 features include a high-quality multiple protein sequence alignment pipeline and a prediction method employing machine-learning classification. The software also integrates the UCSC Genome Browser's human genome annotations and MultiZ multiple alignments of vertebrate genomes with the human genome. PolyPhen-2 is capable of analyzing large volumes of data produced by next-generation sequencing projects, thanks to built-in support for high-performance computing environments like Grid Engine and Platform LSF.


Citations
Journal ArticleDOI
05 Apr 2018-Cell
TL;DR: This study reports a PanCancer and PanSoftware analysis spanning 9,423 tumor exomes (comprising all 33 of The Cancer Genome Atlas projects) and using 26 computational tools to catalog driver genes and mutations, identifying 299 driver genes with implications regarding their anatomical sites and cancer/cell types.

1,623 citations


Cites background or methods from "Predicting Functional Effect of Hum..."

  • ...PolyPhen2 Polymorphism Phenotyping v2 (PolyPhen2) (Adzhubei et al., 2013) is a machine learning approach that computes the functional impact of missense mutations....

    [...]

  • ...The collection was comprised of 8 mutation-level algorithms (SIFT [Ng and Henikoff, 2002], PolyPhen2 [Adzhubei et al., 2013], MutationAssessor [Reva et al., 2011], transFIC [Gonzalez-Perez et al., 2012], fathmm [Shihab et al., 2013], CHASM [Carter et al., 2009], CanDrA [Mao et al., 2013] and VEST [Carter et al., 2013]), 4 structure-based (HotSpot3D [Niu et al., 2016], HotMAPS [Tokheim et al., 2016a], 3DHotSpots.org [Gao et al., 2017] and e-Driver3D [Porta-Pardo et al., 2015]), 2 network and –omic integration tools (OncoIMPACT [Bertrand et al., 2015], DriverNet [Bashashati et al., 2012]), and 2 algorithms to identify clinically-actionable events (PHIAL [Van Allen et al., 2014] and DEPO [S.Q. Sun, R.J. Mashl, S. Sengupta, A.D. Scott, W. Wang, P. Batra, L.-B. Wang, M.A. Wyczalkowski, L. Ding, unpublished data])....

    [...]

  • ...We utilized four tools that distinguish pathogenic mutations from benign polymorphisms on a population level (SIFT [Ng and Henikoff, 2002], PolyPhen2 [Adzhubei et al., 2013], VEST (version 3 scores) [Carter et al., 2013] and MutationAssessor [Reva et al., 2011]), four tools specifically designed to distinguish between driver and passenger somatic mutations (CHASM [Wong et al., 2011], CanDrA [Carter et al., 2013], fathmm [Shihab et al., 2013] and transFIC [Gonzalez-Perez et al., 2012]) and four tools that leverage information from protein structures (HotSpot3D [Niu et al., 2016], HotMAPS [Tokheim et al., 2016a], 3DHotSpot.org [Gao et al., 2017] and e-Driver3D [Porta-Pardo et al., 2015])....

    [...]

  • ...REAGENT or RESOURCE | SOURCE | IDENTIFIER
    Deposited Data
    Public MC3 MAF | Ellrott et al., 2018 | https://gdc.cancer.gov/about-data/publications/mc3-2017
    Clinical Data | Liu et al., 2018 | https://gdc.cancer.gov/about-data/publications/pancanatlas
    Target Drug Database - Phial | Van Allen et al., 2014 | https://github.com/vanallenlab/2017-tcga-mc3_phial
    DEPO | S.S., L.D., S.Q. Sun, R.J. Mashl, A.D. Scott, W. Wang, P. Batra, L.-B. Wang, and M.A. Wyczalkowski, unpublished data | http://depo-dinglab.ddns.net
    OncoKB | Chakravarty et al., 2017 | http://oncokb.org
    Mutation Validation | Ng et al., 2018 | N/A
    Software and Algorithms
    20/20+ | Tokheim et al., 2016b | https://github.com/KarchinLab/2020plus
    MutSig2CV | Lawrence et al., 2014 | http://archive.broadinstitute.org/cancer/cga/mutsig_run
    MuSiC2 | Dees et al., 2012 | https://github.com/ding-lab/MuSiC2
    OncodriveCLUST | Tamborero et al., 2013a | http://bg.upf.edu/group/projects/oncodrive-clust.php
    OncodriveFML | Mularoni et al., 2016 | http://bbglab.irbbarcelona.org/oncodrivefml/home
    ActiveDriver | Reimand and Bader, 2013 | http://individual.utoronto.ca/reimand/ActiveDriver/
    CompositeDriver | This paper | https://github.com/khuranalab/CompositeDriver
    HotMAPS | Tokheim et al., 2016a | https://github.com/KarchinLab/HotMAPS
    CHASM | Carter et al., 2009 | http://www.cravat.us/CRAVAT/
    VEST | Carter et al., 2013 | http://www.cravat.us/CRAVAT/
    e-Driver | Porta-Pardo and Godzik, 2014 | https://github.com/eduardporta/e-Driver
    CanDrA | Mao et al., 2013 | http://bioinformatics.mdanderson.org/main/CanDrA
    HotSpot3D | Niu et al., 2016 | https://github.com/ding-lab/hotspot3d
    3DHotSpots.org | Gao et al., 2017 | http://3dhotspots.org/3d/
    e-Driver3D | Porta-Pardo et al., 2015 | https://github.com/eduardporta/e-Driver
    DriverNET | Bashashati et al., 2012 | http://www.shahlab.ca
    OncoIMPACT | Bertrand et al., 2015 | https://github.com/CSB5/OncoIMPACT
    MutationAssessor | Reva et al., 2011 | http://mutationassessor.org/r3/
    SIFT | Ng and Henikoff, 2002 | http://sift.jcvi.org
    PolyPhen2 | Adzhubei et al., 2013 | http://genetics.bwh.harvard.edu/pph2/
    fathmm | Shihab et al., 2013 | http://fathmm.biocompute.org.uk
    transFIC | Gonzalez-Perez et al., 2012 | http://bbglab.irbbarcelona.org/transfic/home
    CTAT-score | This Paper | https://gdc.cancer.gov
    MSIsensor | Niu et al., 2014 | https://github.com/ding-lab/msisensor...

    [...]

Posted ContentDOI
29 Apr 2019-bioRxiv
TL;DR: This work uses unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity, enabling state-of-the-art supervised prediction of mutational effect and secondary structure, and improving state-of-the-art features for long-range contact prediction.
Abstract: In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In biology, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Learning the natural distribution of evolutionary protein sequence variation is a logical step toward predictive and generative modeling for biology. To this end we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million sequences spanning evolutionary diversity. The resulting model maps raw sequences to representations of biological properties without labels or prior domain knowledge. The learned representation space organizes sequences at multiple levels of biological granularity from the biochemical to proteomic levels. Learning recovers information about protein structure: secondary structure and residue-residue contacts can be extracted by linear projections from learned representations. With small amounts of labeled data, the ability to identify tertiary contacts is further improved. Learning on full sequence diversity rather than individual protein families increases recoverable information about secondary structure. We show the networks generalize by adapting them to variant activity prediction from sequences only, with results that are comparable to a state-of-the-art variant predictor that uses evolutionary and structurally derived features.

748 citations


Cites background from "Predicting Functional Effect of Hum..."

  • ...Computational variant effect predictors are useful for assessing the effect of point mutations (Gray et al., 2018; Adzhubei et al., 2013; Kumar et al., 2009; Hecht et al., 2015; Rentzsch et al., 2018)....

    [...]

Journal ArticleDOI
TL;DR: Reassessment of assumptions about the complexity of the genomic and phenomic architecture of DCM is warranted, which will require comprehensive genomic studies in much larger cohorts of rigorously phenotyped probands and family members than previously examined.
Abstract: Remarkable progress has been made in understanding the genetic basis of dilated cardiomyopathy (DCM). Rare variants in >30 genes, some also involved in other cardiomyopathies, muscular dystrophy, or syndromic disease, perturb a diverse set of important myocardial proteins to produce a final DCM phenotype. Large, publicly available datasets have provided the opportunity to evaluate previously identified DCM-causing mutations, and to examine the population frequency of sequence variants similar to those that have been observed to cause DCM. The frequency of these variants, whether associated with dilated or hypertrophic cardiomyopathy, is greater than estimates of disease prevalence. This mismatch might be explained by one or more of the following possibilities: that the penetrance of DCM-causing mutations is lower than previously thought, that some variants are noncausal, that DCM prevalence is higher than previously estimated, or that other more-complex genomics underlie DCM. Reassessment of our assumptions about the complexity of the genomic and phenomic architecture of DCM is warranted. Much about the genomic basis of DCM remains to be investigated, which will require comprehensive genomic studies in much larger cohorts of rigorously phenotyped probands and family members than previously examined.

728 citations

Journal ArticleDOI
TL;DR: These frequent DDR gene alterations in many human cancers have functional consequences that may determine cancer progression and guide therapy, and a new machine-learning-based classifier developed from gene expression data made it possible to identify alterations that phenocopy deleterious TP53 mutations.

706 citations


Cites methods from "Predicting Functional Effect of Hum..."

  • ...To estimate the probability of missense mutations being damaging, we further annotated these missense mutations using six commonly used functional prediction algorithms (Figure S1D): PolyPhen-2 (Adzhubei et al., 2013), SIFT (Kumar et al., 2009), Mutation Taster (Schwarz et al., 2014), Mutation Assessor (Reva et al., 2011), LR and LRT (Chun…...

    [...]

References
Journal ArticleDOI
TL;DR: CLUSTAL X is a new windows interface for the widely-used progressive multiple sequence alignment program CLUSTAL W, providing an integrated system for performing multiple sequence and profile alignments and analysing the results.
Abstract: CLUSTAL X is a new windows interface for the widely-used progressive multiple sequence alignment program CLUSTAL W. The new system is easy to use, providing an integrated system for performing multiple sequence and profile alignments and analysing the results. CLUSTAL X displays the sequence alignment in a window on the screen. A versatile sequence colouring scheme allows the user to highlight conserved features in the alignment. Pull-down menus provide all the options required for traditional multiple sequence and profile alignment. New features include: the ability to cut-and-paste sequences to change the order of the alignment, selection of a subset of the sequences to be realigned, and selection of a sub-range of the alignment to be realigned and inserted back into the original alignment. Alignment quality analysis can be performed and low-scoring segments or exceptional residues can be highlighted. Quality analysis and realignment of selected residue ranges provide the user with a powerful tool to improve and refine difficult alignments and to trap errors in input sequences. CLUSTAL X has been compiled on SUN Solaris, IRIX5.3 on Silicon Graphics, Digital UNIX on DECstations, Microsoft Windows (32 bit) for PCs, Linux ELF for x86 PCs, and Macintosh PowerMac.

38,522 citations


"Predicting Functional Effect of Hum..." refers methods in this paper

  • ...[Searching for Mutations, 7.20.5; Current Protocols in Human Genetics, Supplement 76] Figure 7.20.3: Detailed results of the PolyPhen-2 analysis for a single variant query with the multiple sequence alignment and 3-D-structure protein viewer panels expanded. The multiple sequence alignment panel displays a fixed 75-residue wide window surrounding the variant’s position (the column indicated by black frame), with the alignment colored using the ClustalX (Thompson et al., 1997) scheme for all columns above 50% conservation threshold....

    [...]

Journal ArticleDOI
TL;DR: A new method and the corresponding software tool, PolyPhen-2, are presented; the tool differs from the earlier tool PolyPhen in its set of predictive features, its alignment pipeline, and its method of classification, and its performance, as shown by its receiver operating characteristic curves, was consistently superior.
Abstract: To the Editor: Applications of rapidly advancing sequencing technologies exacerbate the need to interpret individual sequence variants. Sequencing of phenotyped clinical subjects will soon become a method of choice in studies of the genetic causes of Mendelian and complex diseases. New exon capture techniques will direct sequencing efforts towards the most informative and easily interpretable protein-coding fraction of the genome. Thus, the demand for computational predictions of the impact of protein sequence variants will continue to grow. Here we present a new method and the corresponding software tool, PolyPhen-2 (http://genetics.bwh.harvard.edu/pph2/), which is different from the early tool PolyPhen [1] in the set of predictive features, alignment pipeline, and the method of classification (Fig. 1a). PolyPhen-2 uses eight sequence-based and three structure-based predictive features (Supplementary Table 1) which were selected automatically by an iterative greedy algorithm (Supplementary Methods). The majority of these features involve comparison of a property of the wild-type (ancestral, normal) allele and the corresponding property of the mutant (derived, disease-causing) allele, which together define an amino acid replacement. The most informative features characterize how well the two human alleles fit into the pattern of amino acid replacements within the multiple sequence alignment of homologous proteins, how distant the protein harboring the first deviation from the human wild-type allele is from the human protein, and whether the mutant allele originated at a hypermutable site [2]. The alignment pipeline selects the set of homologous sequences for the analysis using a clustering algorithm and then constructs and refines their multiple alignment (Supplementary Fig. 1). The functional significance of an allele replacement is predicted from its individual features (Supplementary Figs. 2–4) by a Naive Bayes classifier (Supplementary Methods).
Figure 1: PolyPhen-2 pipeline and prediction accuracy. (a) Overview of the algorithm. (b) Receiver operating characteristic (ROC) curves for predictions made by PolyPhen-2 using five-fold cross-validation on HumDiv (red) and HumVar [3] (light green). UniRef100 (solid ...
We used two pairs of datasets to train and test PolyPhen-2. We compiled the first pair, HumDiv, from all 3,155 damaging alleles with known effects on the molecular function causing human Mendelian diseases, present in the UniProt database, together with 6,321 differences between human proteins and their closely related mammalian homologs, assumed to be non-damaging (Supplementary Methods). The second pair, HumVar [3], consists of all the 13,032 human disease-causing mutations from UniProt, together with 8,946 human nsSNPs without annotated involvement in disease, which were treated as non-damaging. We found that PolyPhen-2 performance, as presented by its receiver operating characteristic curves, was consistently superior compared to PolyPhen (Fig. 1b) and it also compared favorably with the three other popular prediction tools [4–6] (Fig. 1c). For a false positive rate of 20%, PolyPhen-2 achieves the rate of true positive predictions of 92% and 73% on HumDiv and HumVar, respectively (Supplementary Table 2). One reason for a lower accuracy of predictions on HumVar is that nsSNPs assumed to be non-damaging in HumVar contain a sizable fraction of mildly deleterious alleles. In contrast, most of the amino acid replacements assumed non-damaging in HumDiv must be close to selective neutrality. Because alleles that are even mildly but unconditionally deleterious cannot be fixed in the evolving lineage, no method based on comparative sequence analysis is ideal for discriminating between drastically and mildly deleterious mutations, which are assigned to the opposite categories in HumVar. Another reason is that HumDiv uses an extra criterion to avoid possible erroneous annotations of damaging mutations.
For a mutation, PolyPhen-2 calculates the Naive Bayes posterior probability that this mutation is damaging and reports estimates of the false positive rate (the chance that the mutation is classified as damaging when it is in fact non-damaging) and the true positive rate (the chance that the mutation is classified as damaging when it is indeed damaging). A mutation is also appraised qualitatively, as benign, possibly damaging, or probably damaging (Supplementary Methods). The user can choose between HumDiv- and HumVar-trained PolyPhen-2. Diagnostics of Mendelian diseases requires distinguishing mutations with drastic effects from all the remaining human variation, including abundant mildly deleterious alleles. Thus, HumVar-trained PolyPhen-2 should be used for this task. In contrast, HumDiv-trained PolyPhen-2 should be used for evaluating rare alleles at loci potentially involved in complex phenotypes, dense mapping of regions identified by genome-wide association studies, and analysis of natural selection from sequence data, where even mildly deleterious alleles must be treated as damaging.
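As an illustration of the Naive Bayes posterior and the three qualitative labels described in the abstract above, here is a minimal Python sketch. It is not PolyPhen-2's implementation: the function names, the per-feature likelihood values, and the two cutoffs (0.5 and 0.9) are hypothetical placeholders chosen for the example, and the tool's actual features and thresholds are defined in its Supplementary Methods.

```python
import math


def naive_bayes_posterior(prior_damaging, lik_damaging, lik_benign):
    """Posterior P(damaging | features) under the Naive Bayes assumption
    that features are conditionally independent given the class.

    lik_damaging[i] = P(feature_i | damaging), lik_benign[i] = P(feature_i | benign).
    All values here are illustrative, not taken from PolyPhen-2.
    """
    log_d = math.log(prior_damaging) + sum(math.log(p) for p in lik_damaging)
    log_b = math.log(1.0 - prior_damaging) + sum(math.log(p) for p in lik_benign)
    # Normalize in log space to avoid underflow when there are many features.
    m = max(log_d, log_b)
    d = math.exp(log_d - m)
    b = math.exp(log_b - m)
    return d / (d + b)


def qualitative_call(posterior, possibly_cutoff=0.5, probably_cutoff=0.9):
    """Map a posterior to one of the three labels named in the abstract.
    The cutoffs are hypothetical and only serve the example."""
    if posterior >= probably_cutoff:
        return "probably damaging"
    if posterior >= possibly_cutoff:
        return "possibly damaging"
    return "benign"


# Two made-up features, equal class priors.
posterior = naive_bayes_posterior(0.5, [0.8, 0.6], [0.2, 0.4])
label = qualitative_call(posterior)
print(round(posterior, 3), label)
```

With these made-up inputs the posterior is 0.24 / (0.24 + 0.04), so the variant falls between the two hypothetical cutoffs.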

11,571 citations


"Predicting Functional Effect of Hum..." refers background or methods in this paper

  • ...PolyPhen-2 (Adzhubei et al., 2010) is an automatic tool for prediction of the possible impact of an amino acid substitution on the structure and function of a human protein....

    [...]

  • ...…published estimate (for version 2.0.0) is that, for a false positive rate of 20%, PolyPhen-2 achieves true positive prediction rates of 92% on the HumDiv dataset and 73% on the HumVar dataset (Adzhubei et al. 2010), and our unpublished estimates for newer versions show slightly better performance....

    [...]
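The "true positive rate at a false positive rate of 20%" figure quoted above is a single point read off a ROC curve. The Python sketch below shows one way such a point could be computed from prediction scores and known labels; the function name `tpr_at_fpr` and the toy data are assumptions for illustration and do not reproduce the HumDiv or HumVar evaluation.

```python
def tpr_at_fpr(scores, labels, max_fpr=0.20):
    """Highest true positive rate achievable while keeping the false
    positive rate at or below max_fpr.

    scores: higher means more likely damaging.
    labels: truthy for damaging, falsy for non-damaging.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one damaging and one non-damaging example")

    best_tpr = 0.0
    # Lowering the threshold can only add predictions, so TPR and FPR both
    # grow monotonically; stop once the FPR budget is exceeded.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= threshold)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
        if fp / negatives > max_fpr:
            break
        best_tpr = tp / positives
    return best_tpr


# Toy data: 5 damaging and 5 non-damaging variants with made-up scores.
toy_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
toy_labels = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]
result = tpr_at_fpr(toy_scores, toy_labels, max_fpr=0.20)
print(result)
```

On the toy data, one of five non-damaging variants may be misclassified within the 20% budget, which admits four of the five damaging variants.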

Journal ArticleDOI
TL;DR: Jalview 2 is a system for interactive WYSIWYG editing, analysis and annotation of multiple sequence alignments that employs web services for sequence alignment, secondary structure prediction and the retrieval of alignments, sequences, annotation and structures from public databases and any DAS 1.53 compliant sequence or annotation server.
Abstract: Summary: Jalview Version 2 is a system for interactive WYSIWYG editing, analysis and annotation of multiple sequence alignments. Core features include keyboard and mouse-based editing, multiple views and alignment overviews, and linked structure display with Jmol. Jalview 2 is available in two forms: a lightweight Java applet for use in web applications, and a powerful desktop application that employs web services for sequence alignment, secondary structure prediction and the retrieval of alignments, sequences, annotation and structures from public databases and any DAS 1.53 compliant sequence or annotation server. Availability: The Jalview 2 Desktop application and JalviewLite applet are made freely available under the GPL, and can be downloaded from www.jalview.org Contact: g.j.barton@dundee.ac.uk

7,926 citations


"Predicting Functional Effect of Hum..." refers background in this paper

  • ...Jalview Version 2-a multiple sequence alignment editor and analysis workbench....

    [...]

  • ...Clicking on the link at the bottom of the alignment panel opens the Jalview (Waterhouse et al., 2009) alignment viewer applet with the complete multiple alignment loaded....

    [...]

  • ...Click the link at the bottom of the panel to open an interactive alignment viewer (Jalview, http://www.jalview.org/) (Waterhouse et al., 2009) to scroll through the complete alignment....

    [...]

Book
15 Aug 2000
TL;DR: This chapter discusses the molecular basis of evolution, the evolution of organisms based on the fossil record, and the implications of these events for phylogenetic inference.
Abstract: 1. Molecular basis of evolution 2. Evolutionary changes of amino acid sequences 3. Evolutionary changes of DNA sequences 4. Synonymous and nonsynonymous nucleotide substitutions 5. Phylogenetic trees 6. Phylogenetic inference: Distance methods 7. Phylogenetic inference: Maximum parsimony methods 8. Phylogenetic inference: Maximum likelihood methods 9. Accuracies and statistical tests of phylogenetic trees 10. Molecular clocks and linearized trees 11. Ancestral nucleotide and amino acid sequences 12. Genetic polymorphism and evolution 13. Population trees from genetic markers 14. Perspectives Appendices A. Mathematical symbols and notations B. Geological timescale C. Geological events in the Cenozoic and Mesozoic eras D. Evolution of organisms based on the fossil record

5,629 citations

Journal ArticleDOI
06 Jul 2012-Science
TL;DR: The findings suggest that most human variation is rare, not shared between populations, and that rare variants are likely to play a role in human health, and show that large sample sizes will be required to associate rare variants with complex traits.
Abstract: As a first step toward understanding how rare variants contribute to risk for complex diseases, we sequenced 15,585 human protein-coding genes to an average median depth of 111× in 2440 individuals of European (n = 1351) and African (n = 1088) ancestry. We identified over 500,000 single-nucleotide variants (SNVs), the majority of which were rare (86% with a minor allele frequency less than 0.5%), previously unknown (82%), and population-specific (82%). On average, 2.3% of the 13,595 SNVs each person carried were predicted to affect protein function of ~313 genes per genome, and ~95.7% of SNVs predicted to be functionally important were rare. This excess of rare functional variants is due to the combined effects of explosive, recent accelerated population growth and weak purifying selection. Furthermore, we show that large sample sizes will be required to associate rare variants with complex traits.

1,680 citations


"Predicting Functional Effect of Hum..." refers background in this paper

  • ...…rare alleles that cause Mendelian disease (Bamshad et al., 2011), scanning for potentially medically actionable alleles in an individual’s genome (Ashley et al., 2010), and profiling the spectrum of rare variation uncovered by deep sequencing of large populations (Tennessen et al., 2012)....

    [...]

Related Papers (5)
18 Aug 2016-Nature
Monkol Lek, Konrad J. Karczewski, Eric Vallabh Minikel, Kaitlin E. Samocha, Eric Banks, Timothy Fennell, Anne H. O’Donnell-Luria, James S. Ware, Andrew J. Hill, Beryl B. Cummings, Taru Tukiainen, Daniel P. Birnbaum, Jack A. Kosmicki, Laramie E. Duncan, Karol Estrada, Fengmei Zhao, James Zou, Emma Pierce-Hoffman, Joanne Berghout, David Neil Cooper, Nicole A. Deflaux, Mark A. DePristo, Ron Do, Jason Flannick, Menachem Fromer, Laura D. Gauthier, Jackie Goldstein, Namrata Gupta, Daniel P. Howrigan, Adam Kiezun, Mitja I. Kurki, Ami Levy Moonshine, Pradeep Natarajan, Lorena Orozco, Gina M. Peloso, Ryan Poplin, Manuel A. Rivas, Valentin Ruano-Rubio, Samuel A. Rose, Douglas M. Ruderfer, Khalid Shakir, Peter D. Stenson, Christine Stevens, Brett Thomas, Grace Tiao, María Teresa Tusié-Luna, Ben Weisburd, Hong-Hee Won, Dongmei Yu, David Altshuler, Diego Ardissino, Michael Boehnke, John Danesh, Stacey Donnelly, Roberto Elosua, Jose C. Florez, Stacey Gabriel, Gad Getz, Stephen J. Glatt, Christina M. Hultman, Sekar Kathiresan, Markku Laakso, Steven A. McCarroll, Mark I. McCarthy, Dermot P.B. McGovern, Ruth McPherson, Benjamin M. Neale, Aarno Palotie, Shaun Purcell, Danish Saleheen, Jeremiah M. Scharf, Pamela Sklar, Patrick F. Sullivan, Jaakko Tuomilehto, Ming T. Tsuang, Hugh Watkins, James G. Wilson, Mark J. Daly, Daniel G. MacArthur