scispace - formally typeset
Search or ask a question
Journal ArticleDOI

SIFT: predicting amino acid changes that affect protein function

01 Jul 2003-Nucleic Acids Research (Oxford University Press)-Vol. 31, Iss: 13, pp 3812-3814
TL;DR: SIFT is a program that predicts whether an amino acid substitution affects protein function so that users can prioritize substitutions for further study and can distinguish between functionally neutral and deleterious amino acid changes in mutagenesis studies and on human polymorphisms.
Abstract: Single nucleotide polymorphism (SNP) studies and random mutagenesis projects identify amino acid substitutions in protein-coding regions. Each substitution has the potential to affect protein function. SIFT (Sorting Intolerant From Tolerant) is a program that predicts whether an amino acid substitution affects protein function so that users can prioritize substitutions for further study. We have shown that SIFT can distinguish between functionally neutral and deleterious amino acid changes in mutagenesis studies and on human polymorphisms. SIFT is available at http://blocks.fhcrc.org/sift/SIFT.html.
Citations
More filters
Journal ArticleDOI
TL;DR: A new method and the corresponding software tool, PolyPhen-2, which is different from the early tool polyPhen1 in the set of predictive features, alignment pipeline, and the method of classification is presented and performance, as presented by its receiver operating characteristic curves, was consistently superior.
Abstract: To the Editor: Applications of rapidly advancing sequencing technologies exacerbate the need to interpret individual sequence variants. Sequencing of phenotyped clinical subjects will soon become a method of choice in studies of the genetic causes of Mendelian and complex diseases. New exon capture techniques will direct sequencing efforts towards the most informative and easily interpretable protein-coding fraction of the genome. Thus, the demand for computational predictions of the impact of protein sequence variants will continue to grow. Here we present a new method and the corresponding software tool, PolyPhen-2 (http://genetics.bwh.harvard.edu/pph2/), which is different from the early tool PolyPhen1 in the set of predictive features, alignment pipeline, and the method of classification (Fig. 1a). PolyPhen-2 uses eight sequence-based and three structure-based predictive features (Supplementary Table 1) which were selected automatically by an iterative greedy algorithm (Supplementary Methods). Majority of these features involve comparison of a property of the wild-type (ancestral, normal) allele and the corresponding property of the mutant (derived, disease-causing) allele, which together define an amino acid replacement. Most informative features characterize how well the two human alleles fit into the pattern of amino acid replacements within the multiple sequence alignment of homologous proteins, how distant the protein harboring the first deviation from the human wild-type allele is from the human protein, and whether the mutant allele originated at a hypermutable site2. The alignment pipeline selects the set of homologous sequences for the analysis using a clustering algorithm and then constructs and refines their multiple alignment (Supplementary Fig. 1). The functional significance of an allele replacement is predicted from its individual features (Supplementary Figs. 2–4) by Naive Bayes classifier (Supplementary Methods). Figure 1 PolyPhen-2 pipeline and prediction accuracy. (a) Overview of the algorithm. (b) Receiver operating characteristic (ROC) curves for predictions made by PolyPhen-2 using five-fold cross-validation on HumDiv (red) and HumVar3 (light green). UniRef100 (solid ... We used two pairs of datasets to train and test PolyPhen-2. We compiled the first pair, HumDiv, from all 3,155 damaging alleles with known effects on the molecular function causing human Mendelian diseases, present in the UniProt database, together with 6,321 differences between human proteins and their closely related mammalian homologs, assumed to be non-damaging (Supplementary Methods). The second pair, HumVar3, consists of all the 13,032 human disease-causing mutations from UniProt, together with 8,946 human nsSNPs without annotated involvement in disease, which were treated as non-damaging. We found that PolyPhen-2 performance, as presented by its receiver operating characteristic curves, was consistently superior compared to PolyPhen (Fig. 1b) and it also compared favorably with the three other popular prediction tools4–6 (Fig. 1c). For a false positive rate of 20%, PolyPhen-2 achieves the rate of true positive predictions of 92% and 73% on HumDiv and HumVar, respectively (Supplementary Table 2). One reason for a lower accuracy of predictions on HumVar is that nsSNPs assumed to be non-damaging in HumVar contain a sizable fraction of mildly deleterious alleles. In contrast, most of amino acid replacements assumed non-damaging in HumDiv must be close to selective neutrality. Because alleles that are even mildly but unconditionally deleterious cannot be fixed in the evolving lineage, no method based on comparative sequence analysis is ideal for discriminating between drastically and mildly deleterious mutations, which are assigned to the opposite categories in HumVar. Another reason is that HumDiv uses an extra criterion to avoid possible erroneous annotations of damaging mutations. For a mutation, PolyPhen-2 calculates Naive Bayes posterior probability that this mutation is damaging and reports estimates of false positive (the chance that the mutation is classified as damaging when it is in fact non-damaging) and true positive (the chance that the mutation is classified as damaging when it is indeed damaging) rates. A mutation is also appraised qualitatively, as benign, possibly damaging, or probably damaging (Supplementary Methods). The user can choose between HumDiv- and HumVar-trained PolyPhen-2. Diagnostics of Mendelian diseases requires distinguishing mutations with drastic effects from all the remaining human variation, including abundant mildly deleterious alleles. Thus, HumVar-trained PolyPhen-2 should be used for this task. In contrast, HumDiv-trained PolyPhen-2 should be used for evaluating rare alleles at loci potentially involved in complex phenotypes, dense mapping of regions identified by genome-wide association studies, and analysis of natural selection from sequence data, where even mildly deleterious alleles must be treated as damaging.

11,571 citations

Journal ArticleDOI
TL;DR: The ANNOVAR tool to annotate single nucleotide variants and insertions/deletions, such as examining their functional consequence on genes, inferring cytogenetic bands, reporting functional importance scores, finding variants in conserved regions, or identifying variants reported in the 1000 Genomes Project and dbSNP is developed.
Abstract: High-throughput sequencing platforms are generating massive amounts of genetic variation data for diverse genomes, but it remains a challenge to pinpoint a small subset of functionally important variants. To fill these unmet needs, we developed the ANNOVAR tool to annotate single nucleotide variants (SNVs) and insertions/deletions, such as examining their functional consequence on genes, inferring cytogenetic bands, reporting functional importance scores, finding variants in conserved regions, or identifying variants reported in the 1000 Genomes Project and dbSNP. ANNOVAR can utilize annotation databases from the UCSC Genome Browser or any annotation data set conforming to Generic Feature Format version 3 (GFF3). We also illustrate a 'variants reduction' protocol on 4.7 million SNVs and indels from a human genome, including two causal mutations for Miller syndrome, a rare recessive disease. Through a stepwise procedure, we excluded variants that are unlikely to be causal, and identified 20 candidate genes including the causal gene. Using a desktop computer, ANNOVAR requires ∼4 min to perform gene-based annotation and ∼15 min to perform variants reduction on 4.7 million variants, making it practical to handle hundreds of human genomes in a day. ANNOVAR is freely available at http://www.openbioinformatics.org/annovar/.

10,461 citations


Cites background or methods from "SIFT: predicting amino acid changes..."

  • ...Nevertheless, for human genomes, SIFT/PolyPhen scores for all possible non-synonymous mutations can be *To whom correspondence should be addressed....

    [...]

  • ...One additional useful feature of ANNOVAR is the ability to filter variants against pre-computed functional importance scores, such as SIFT scores (9), for all possible non-synonymous mutations in the human genome....

    [...]

  • ...ANNOVAR can also download pre-computed SIFT scores for all human nonsynonymous mutations, to help annotate human exomes by filter-based annotation procedure....

    [...]

  • ...We confirmed that it was predicted as benign by SIFT as well (SIFT score=0.18)....

    [...]

  • ...Since pre-computed SIFT scores can be used, ANNOVAR is very efficient in the annotation, requiring several minutes to handle an exome using a modern desktop computer....

    [...]

Journal ArticleDOI
TL;DR: This protocol describes the use of the 'Sorting Tolerant From Intolerant' (SIFT) algorithm in predicting whether an AAS affects protein function.
Abstract: The effect of genetic mutation on phenotype is of significant interest in genetics. The type of genetic mutation that causes a single amino acid substitution (AAS) in a protein sequence is called a non-synonymous single nucleotide polymorphism (nsSNP). An nsSNP could potentially affect the function of the protein, subsequently altering the carrier's phenotype. This protocol describes the use of the 'Sorting Tolerant From Intolerant' (SIFT) algorithm in predicting whether an AAS affects protein function. To assess the effect of a substitution, SIFT assumes that important positions in a protein sequence have been conserved throughout evolution and therefore substitutions at these positions may affect protein function. Thus, by using sequence homology, SIFT predicts the effects of all possible substitutions at each position in the protein sequence. The protocol typically takes 5–20 min, depending on the input. SIFT is available as an online tool ( http://sift-dna.org ).

6,154 citations

Journal ArticleDOI
26 Sep 2008-Science
TL;DR: Recurrent mutations in the active site of isocitrate dehydrogenase 1 (IDH1) occurred in a large fraction of young patients and in most patients with secondary GBMs and were associated with an increase in overall survival.
Abstract: Glioblastoma multiforme (GBM) is the most common and lethal type of brain cancer. To identify the genetic alterations in GBMs, we sequenced 20,661 protein coding genes, determined the presence of amplifications and deletions using high-density oligonucleotide arrays, and performed gene expression analyses using next-generation sequencing technologies in 22 human tumor samples. This comprehensive analysis led to the discovery of a variety of genes that were not known to be altered in GBMs. Most notably, we found recurrent mutations in the active site of isocitrate dehydrogenase 1 (IDH1) in 12% of GBM patients. Mutations in IDH1 occurred in a large fraction of young patients and in most patients with secondary GBMs and were associated with an increase in overall survival. These studies demonstrate the value of unbiased genomic analyses in the characterization of human brain cancer and identify a potentially useful genetic alteration for the classification and targeted therapy of GBMs.

5,250 citations

Journal ArticleDOI
TL;DR: The ability of CADD to prioritize functional, deleterious and pathogenic variants across many functional categories, effect sizes and genetic architectures is unmatched by any current single-annotation method.
Abstract: Our capacity to sequence human genomes has exceeded our ability to interpret genetic variation. Current genomic annotations tend to exploit a single information type (e.g. conservation) and/or are restricted in scope (e.g. to missense changes). Here, we describe Combined Annotation Dependent Depletion (CADD), a framework that objectively integrates many diverse annotations into a single, quantitative score. We implement CADD as a support vector machine trained to differentiate 14.7 million high-frequency human derived alleles from 14.7 million simulated variants. We pre-compute “C-scores” for all 8.6 billion possible human single nucleotide variants and enable scoring of short insertions/deletions. C-scores correlate with allelic diversity, annotations of functionality, pathogenicity, disease severity, experimentally measured regulatory effects, and complex trait associations, and highly rank known pathogenic variants within individual genomes. The ability of CADD to prioritize functional, deleterious, and pathogenic variants across many functional categories, effect sizes and genetic architectures is unmatched by any current annotation.

4,956 citations

References
More filters
Journal ArticleDOI
TL;DR: In addition to maintaining the GenBank(R) nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides data analysis and retrieval resources for the data in GenBank and other biological data made available through NCBI’s website.
Abstract: In addition to maintaining the GenBank(R) nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides data analysis and retrieval resources for the data in GenBank and other biological data made available through NCBI's website. NCBI resources include Entrez, PubMed, PubMed Central, LocusLink, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Electronic PCR, OrfFinder, Spidey, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, Cancer Chromosome Aberration Project (CCAP), Entrez Genomes and related tools, the Map Viewer, Model Maker, Evidence Viewer, Clusters of Orthologous Groups (COGs) database, Retroviral Genotyping Tools, SARS Coronavirus Resource, SAGEmap, Gene Expression Omnibus (GEO), Online Mendelian Inheritance in Man (OMIM), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD) and the Conserved Domain Architecture Retrieval Tool (CDART). Augmenting many of the web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of the resources can be accessed through the NCBI home page at: http://www.ncbi.nlm.nih.gov.

9,604 citations

Journal ArticleDOI
TL;DR: The dbSNP database is a general catalog of genome variation to address the large-scale sampling designs required by association studies, gene mapping and evolutionary biology, and is integrated with other sources of information at NCBI such as GenBank, PubMed, LocusLink and the Human Genome Project data.
Abstract: In response to a need for a general catalog of genome variation to address the large-scale sampling designs required by association studies, gene mapping and evolutionary biology, the National Center for Biotechnology Information (NCBI) has established the dbSNP database [S.T.Sherry, M.Ward and K.Sirotkin (1999) Genome Res., 9, 677–679]. Submissions to dbSNP will be integrated with other sources of information at NCBI such as GenBank, PubMed, LocusLink and the Human Genome Project data. The complete contents of dbSNP are available to the public at website: http://www.ncbi.nlm.nih.gov/SNP. The complete contents of dbSNP can also be downloaded in multiple formats via anonymous FTP at ftp:// ncbi.nlm.nih.gov/snp/.

6,449 citations


"SIFT: predicting amino acid changes..." refers methods in this paper

  • ...When SIFT was applied to the non-synonymous SNPs in dbSNP (5), a database of putative SNPs, 25% of the variants were predicted to be deleterious....

    [...]

Journal ArticleDOI
TL;DR: The Human Proteomics Initiative (HPI), a major project to annotate all known human sequences according to the quality standards of SWISS-PROT, is described.
Abstract: SWISS-PROT is a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and high level of integration with other databases. Recent developments of the database include: cross-references to additional databases; a variety of new documentation files and improvements to TrEMBL, a computer annotated supplement to SWISS-PROT. TrEMBL consists of entries in SWISS-PROT-like format derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except the CDS already included in SWISS-PROT. The URLs for SWISS-PROT on the WWW are: http://www.expasy.ch/sprot and http://www. ebi.ac.uk/sprot

3,244 citations

Journal ArticleDOI
TL;DR: A tool that uses sequence homology to predict whether a substitution affects protein function is constructed, which may be used to identify plausible disease candidates among the SNPs that cause missense substitutions.
Abstract: Many missense substitutions are identified in single nucleotide polymorphism (SNP) data and large-scale random mutagenesis projects. Each amino acid substitution potentially affects protein function. We have constructed a tool that uses sequence homology to predict whether a substitution affects protein function. SIFT, which sorts intolerant from tolerant substitutions, classifies substitutions as tolerated or deleterious. A higher proportion of substitutions predicted to be deleterious by SIFT gives an affected phenotype than substitutions predicted to be deleterious by substitution scoring matrices in three test cases. Using SIFT before mutagenesis studies could reduce the number of functional assays required and yield a higher proportion of affected phenotypes. may be used to identify plausible disease candidates among the SNPs that cause missense substitutions.

2,374 citations


"SIFT: predicting amino acid changes..." refers background or methods in this paper

  • ...The SIFT algorithm and software have been described previously (2,3)....

    [...]

  • ...We have shown that with only one sequence homologous to the test protein, SIFT can predict twice as many neutral substitutions correctly compared to a substitution scoring matrix (2)....

    [...]

  • ...SIFT (Sorting Intolerant From Tolerant) uses sequence homology to predict whether an amino acid substitution will affect protein function and hence, potentially alter phenotype (2,3)....

    [...]

  • ...If this normalized value is less than a cutoff, the substitution is predicted to be deleterious (2)....

    [...]

Journal ArticleDOI
TL;DR: A World Wide Web server is presented to predict the effect of an nsSNP on protein structure and function and the dependence of selective pressure on the structural and functional properties of proteins is studied.
Abstract: Human single nucleotide polymorphisms (SNPs) represent the most frequent type of human population DNA variation. One of the main goals of SNP research is to understand the genetics of the human phenotype variation and especially the genetic basis of human complex diseases. Non-synonymous coding SNPs (nsSNPs) comprise a group of SNPs that, together with SNPs in regulatory regions, are believed to have the highest impact on phenotype. Here we present a World Wide Web server to predict the effect of an nsSNP on protein structure and function. The prediction method enabled analysis of the publicly available SNP database HGVbase, which gave rise to a dataset of nsSNPs with predicted functionality. The dataset was further used to compare the effect of various structural and functional characteristics of amino acid substitutions responsible for phenotypic display of nsSNPs. We also studied the dependence of selective pressure on the structural and functional properties of proteins. We found that in our dataset the selection pressure against deleterious SNPs depends on the molecular function of the protein, although it is insensitive to several other protein features considered. The strongest selective pressure was detected for proteins involved in transcription regulation.

2,276 citations


"SIFT: predicting amino acid changes..." refers methods in this paper

  • ...When it is not feasible to conduct experiments on all substitutions, SIFT and other similar prediction tools (13) may be useful in prioritizing which Figure 1....

    [...]