Open access | Journal Article | DOI: 10.1038/GIM.2015.30

Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology.

05 Mar 2015 | Genetics in Medicine (Springer Nature) | Vol. 17, Iss. 5, pp. 405-424

Citations

Open access | Journal Article | DOI: 10.1038/NATURE19057
18 Aug 2016 | Nature
Abstract: Large-scale reference data sets of human genetic variation are critical for the medical and functional interpretation of DNA sequence changes. Here we describe the aggregation and analysis of high-quality exome (protein-coding region) DNA sequence data for 60,706 individuals of diverse ancestries generated as part of the Exome Aggregation Consortium (ExAC). This catalogue of human genetic diversity contains an average of one variant every eight bases of the exome, and provides direct evidence for the presence of widespread mutational recurrence. We have used this catalogue to calculate objective metrics of pathogenicity for sequence variants, and to identify genes subject to strong selection against various classes of mutation; identifying 3,230 genes with near-complete depletion of predicted protein-truncating variants, with 72% of these genes having no currently established human disease phenotype. Finally, we demonstrate that these data can be used for the efficient filtering of candidate disease-causing variants, and for the discovery of human 'knockout' variants in protein-coding genes.

Topics: Exome (62%), Genomics (54%), Genetic variation (53%)

7,679 Citations


Open access | Journal Article | DOI: 10.1093/NAR/GKV1222
Melissa J. Landrum, Jennifer M. Lee, Mark L. Benson, Garth Brown, +15 more | Institutions (1)
Abstract: ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/) at the National Center for Biotechnology Information (NCBI) is a freely available archive for interpretations of clinical significance of variants for reported conditions. The database includes germline and somatic variants of any size, type or genomic location. Interpretations are submitted by clinical testing laboratories, research laboratories, locus-specific databases, OMIM®, GeneReviews™, UniProt, expert panels and practice guidelines. In NCBI's Variation submission portal, submitters upload batch submissions or use the Submission Wizard for single submissions. Each submitted interpretation is assigned an accession number prefixed with SCV. ClinVar staff review validation reports with data types such as HGVS (Human Genome Variation Society) expressions; however, clinical significance is reported directly from submitters. Interpretations are aggregated by variant-condition combination and assigned an accession number prefixed with RCV. Clinical significance is calculated for the aggregate record, indicating consensus or conflict in the submitted interpretations. ClinVar uses data standards, such as HGVS nomenclature for variants and MedGen identifiers for conditions. The data are available on the web as variant-specific views; the entire data set can be downloaded via ftp. Programmatic access for ClinVar records is available through NCBI's E-utilities. Future development includes providing a variant-centric XML archive and a web page for details of SCV submissions.
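The aggregation step described in this abstract (individual SCV submissions grouped by variant-condition combination into an RCV-style record that indicates consensus or conflict) can be sketched roughly as follows. This is an illustration only, not ClinVar's actual algorithm: the `Submission` type, the `aggregate` function, the placeholder identifiers, and the rule "all submitters agree, otherwise conflicting" are assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    """One submitted interpretation (hypothetical stand-in for an SCV record)."""
    scv: str           # placeholder accession label, not a real accession
    variant: str       # e.g. an HGVS expression
    condition: str     # e.g. a MedGen identifier
    significance: str  # clinical significance as reported by the submitter


def aggregate(submissions):
    """Group submissions by (variant, condition) and summarise agreement.

    Assumed rule for this sketch: if every submitter reports the same
    significance, the aggregate takes that value; otherwise "conflicting".
    """
    groups = defaultdict(list)
    for sub in submissions:
        groups[(sub.variant, sub.condition)].append(sub)

    records = {}
    for key, subs in groups.items():
        values = {s.significance for s in subs}
        summary = next(iter(values)) if len(values) == 1 else "conflicting"
        records[key] = {
            "scvs": sorted(s.scv for s in subs),
            "significance": summary,
        }
    return records


example = [
    Submission("SCV_A", "var1", "cond1", "pathogenic"),
    Submission("SCV_B", "var1", "cond1", "pathogenic"),
    Submission("SCV_C", "var2", "cond1", "benign"),
    Submission("SCV_D", "var2", "cond1", "pathogenic"),
]
result = aggregate(example)
```

In this toy input, the two submissions for `var1` agree, while those for `var2` disagree and are flagged as conflicting.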


1,680 Citations


Open access | Posted Content | DOI: 10.1101/030338
30 Oct 2015 | bioRxiv
Abstract: Large-scale reference data sets of human genetic variation are critical for the medical and functional interpretation of DNA sequence changes. Here we describe the aggregation and analysis of high-quality exome (protein-coding region) sequence data for 60,706 individuals of diverse ethnicities. The resulting catalogue of human genetic diversity has unprecedented resolution, with an average of one variant every eight bases of coding sequence and the presence of widespread mutational recurrence. The deep catalogue of variation provided by the Exome Aggregation Consortium (ExAC) can be used to calculate objective metrics of pathogenicity for sequence variants, and to identify genes subject to strong selection against various classes of mutation; we identify 3,230 genes with near-complete depletion of truncating variants, 79% of which have no currently established human disease phenotype. Finally, we show that these data can be used for the efficient filtering of candidate disease-causing variants, and for the discovery of human knockout variants in protein-coding genes.


  • Figure 1 | Patterns of genetic variation in 60,706 humans. a, The size and diversity of public reference exome data sets. ExAC exceeds previous data sets in size for all studied populations. b, Principal component analysis (PCA) dividing ExAC individuals into five continental populations. PC2 and PC3 are shown; additional PCs are in Extended Data Fig. 5a. c, The allele frequency spectrum of ExAC highlights that the majority of genetic variants are rare and novel (absent from prior databases of genetic variation, such as dbSNP). d, The proportion of possible variation observed by mutational context and functional class. Over half of all possible CpG transitions are observed. Error bars represent standard error of the mean. e, f, The number (e), and frequency distribution (proportion singleton; f) of indels, by size. Compared to in-frame indels, frameshift variants are less common (have a higher proportion of singletons, a proxy for predicted deleteriousness on gene product). Error bars indicate 95% confidence intervals.
  • Figure 2 | Mutational recurrence at large sample sizes. a, Proportion of validated de novo variants from two external data sets that are independently found in ExAC, separated by functional class and mutational context. Error bars represent standard error of the mean. Colours are consistent in a–d. b, Number of unique variants observed, by mutational context, as a function of number of individuals (downsampled from ExAC). CpG transitions, the most likely mutational event, begin reaching saturation at ~ 20,000 individuals. c, The site frequency spectrum is shown for each mutational context. d, For doubletons (variants with an allele count (AC) of 2), mutation rate is positively correlated with the likelihood of being found in two individuals of different continental populations. e, The mutability-adjusted proportion of singletons (MAPS) is shown across functional classes. Error bars represent standard error of the mean of the proportion of singletons.
  • Figure 4 | Filtering for Mendelian variant discovery. a, Predicted missense and protein-truncating variants in 500 randomly chosen ExAC individuals were filtered based on allele frequency (AF) information from ESP, or from the remaining ExAC individuals. At a 0.1% allele frequency filter, ExAC provides greater power to remove candidate variants, leaving an average of 154 variants for analysis, compared to 1,090 after filtering against ESP. Popmax allele frequency also provides greater power than global allele frequency, particularly when populations are unequally sampled. b, Estimates of allele frequency in Europeans based on ESP are more precise at higher allele frequencies. Sampling variance and ascertainment bias make allele frequency estimates unreliable, posing problems for Mendelian variant filtration. 69% of ESP European singletons are not seen a second time in ExAC (tall bar at left), illustrating the dangers of filtering on very low allele counts. c, Allele frequency spectrum of disease-causing variants in the Human Gene Mutation Database (HGMD) and/or pathogenic or probable pathogenic variants in ClinVar for well-characterized autosomal dominant and autosomal recessive disease genes28. Most are not found in ExAC; however, many of the reportedly pathogenic variants found in ExAC are at too high a frequency to be consistent with disease prevalence and penetrance. d, Literature review of variants with > 1% global allele frequency or > 1% Latin American or South Asian population allele frequency confirmed there is insufficient evidence for pathogenicity for the majority of these variants. Variants were reclassified by American College of Medical Genetics and Genomics (ACMG) guidelines24.
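The allele-frequency filtering described in the Figure 4a caption can be sketched as below. The function names, the dictionary layout, and the toy counts are assumptions for illustration; "popmax" is taken here to mean the highest allele frequency observed in any single population, and the default threshold corresponds to the 0.1% filter mentioned in the caption.

```python
def popmax_af(pop_counts):
    """Return the highest per-population allele frequency.

    pop_counts maps a population label to (allele_count, allele_number).
    Populations with no called alleles are skipped.
    """
    freqs = [ac / an for ac, an in pop_counts.values() if an > 0]
    return max(freqs, default=0.0)


def global_af(pop_counts):
    """Return the allele frequency pooled over all populations."""
    total_ac = sum(ac for ac, _ in pop_counts.values())
    total_an = sum(an for _, an in pop_counts.values())
    return total_ac / total_an if total_an else 0.0


def passes_filter(pop_counts, threshold=0.001, use_popmax=True):
    """Keep a candidate variant only if its frequency is at or below the threshold."""
    af = popmax_af(pop_counts) if use_popmax else global_af(pop_counts)
    return af <= threshold


# Toy variant (made-up counts): rare overall, but common in one small population.
variant = {"pop_a": (0, 60000), "pop_b": (30, 2000)}
keep_by_global = passes_filter(variant, use_popmax=False)
keep_by_popmax = passes_filter(variant, use_popmax=True)
```

With these made-up counts the pooled frequency is below 0.1%, so a global filter keeps the variant, whereas the popmax filter removes it because the frequency in `pop_b` is 1.5%. This is the unequal-sampling situation the caption refers to.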
Topics: Exome (57%), Genetic variation (54%), Human genetic variation (53%)

1,552 Citations


Open access | Journal Article | DOI: 10.1038/GIM.2016.190
Sarah S. Kalia, Kathy Adelman, Sherri J. Bale, Wendy K. Chung, +13 more | Institutions (14)
Abstract: Disclaimer: These recommendations are designed primarily as an educational resource for medical geneticists and other healthcare providers to help them provide quality medical services. Adherence to these recommendations is completely voluntary and does not necessarily assure a successful medical outcome. These recommendations should not be considered inclusive of all proper procedures and tests or exclusive of other procedures and tests that are reasonably directed toward obtaining the same results. In determining the propriety of any specific procedure or test, the clinician should apply his or her own professional judgment to the specific clinical circumstances presented by the individual patient or specimen. Clinicians are encouraged to document the reasons for the use of a particular procedure or test, whether or not it is in conformance with this statement. Clinicians also are advised to take notice of the date this statement was adopted and to consider other medical and scientific information that becomes available after that date. It also would be prudent to consider whether intellectual property interests may restrict the performance of certain tests and other procedures.
To promote standardized reporting of actionable information from clinical genomic sequencing, in 2013, the American College of Medical Genetics and Genomics (ACMG) published a minimum list of genes to be reported as incidental or secondary findings. The goal was to identify and manage risks for selected highly penetrant genetic disorders through established interventions aimed at preventing or significantly reducing morbidity and mortality. The ACMG subsequently established the Secondary Findings Maintenance Working Group to develop a process for curating and updating the list over time. We describe here the new process for accepting and evaluating nominations for updates to the secondary findings list. We also report outcomes from six nominations received in the initial 15 months after the process was implemented. Applying the new process while upholding the core principles of the original policy statement resulted in the addition of four genes and removal of one gene; one gene did not meet criteria for inclusion. The updated secondary findings minimum list includes 59 medically actionable genes recommended for return in clinical genomic sequencing. We discuss future areas of focus, encourage continued input from the medical community, and call for research on the impact of returning genomic secondary findings. Genet Med 19(2), 249-255.


Topics: Personal genomics (53%), Return of results (51%)

1,069 Citations


Open access | Journal Article | DOI: 10.1200/PO.17.00011
16 May 2017
Abstract: Purpose: With prospective clinical sequencing of tumors emerging as a mainstay in cancer care, an urgent need exists for a clinical support tool that distills the clinical implications associated with specific mutation events into a standardized and easily interpretable format. To this end, we developed OncoKB, an expert-guided precision oncology knowledge base. Methods: OncoKB annotates the biologic and oncogenic effects and prognostic and predictive significance of somatic molecular alterations. Potential treatment implications are stratified by the level of evidence that a specific molecular alteration is predictive of drug response on the basis of US Food and Drug Administration labeling, National Comprehensive Cancer Network guidelines, disease-focused expert group recommendations, and scientific literature. Results: To date, > 3,000 unique mutations, fusions, and copy number alterations in 418 cancer-associated genes have been annotated. To test the utility of OncoKB, we annotated all genomic events in 5,98...


887 Citations


References

Open access | Journal Article | DOI: 10.1038/NMETH0410-248
Ivan Adzhubei, Steffen Schmidt, Leonid Peshkin, Vasily Ramensky, +4 more | Institutions (5)
01 Apr 2010 | Nature Methods
Abstract: To the Editor: Applications of rapidly advancing sequencing technologies exacerbate the need to interpret individual sequence variants. Sequencing of phenotyped clinical subjects will soon become a method of choice in studies of the genetic causes of Mendelian and complex diseases. New exon capture techniques will direct sequencing efforts towards the most informative and easily interpretable protein-coding fraction of the genome. Thus, the demand for computational predictions of the impact of protein sequence variants will continue to grow. Here we present a new method and the corresponding software tool, PolyPhen-2 (http://genetics.bwh.harvard.edu/pph2/), which is different from the early tool PolyPhen (ref. 1) in the set of predictive features, alignment pipeline, and the method of classification (Fig. 1a). PolyPhen-2 uses eight sequence-based and three structure-based predictive features (Supplementary Table 1) which were selected automatically by an iterative greedy algorithm (Supplementary Methods). The majority of these features involve comparison of a property of the wild-type (ancestral, normal) allele and the corresponding property of the mutant (derived, disease-causing) allele, which together define an amino acid replacement. The most informative features characterize how well the two human alleles fit into the pattern of amino acid replacements within the multiple sequence alignment of homologous proteins, how distant the protein harboring the first deviation from the human wild-type allele is from the human protein, and whether the mutant allele originated at a hypermutable site (ref. 2). The alignment pipeline selects the set of homologous sequences for the analysis using a clustering algorithm and then constructs and refines their multiple alignment (Supplementary Fig. 1). The functional significance of an allele replacement is predicted from its individual features (Supplementary Figs. 2–4) by a Naive Bayes classifier (Supplementary Methods).
Figure 1 | PolyPhen-2 pipeline and prediction accuracy. (a) Overview of the algorithm. (b) Receiver operating characteristic (ROC) curves for predictions made by PolyPhen-2 using five-fold cross-validation on HumDiv (red) and HumVar (ref. 3) (light green). UniRef100 (solid ...
We used two pairs of datasets to train and test PolyPhen-2. We compiled the first pair, HumDiv, from all 3,155 damaging alleles with known effects on the molecular function causing human Mendelian diseases, present in the UniProt database, together with 6,321 differences between human proteins and their closely related mammalian homologs, assumed to be non-damaging (Supplementary Methods). The second pair, HumVar (ref. 3), consists of all the 13,032 human disease-causing mutations from UniProt, together with 8,946 human nsSNPs without annotated involvement in disease, which were treated as non-damaging. We found that PolyPhen-2 performance, as presented by its receiver operating characteristic curves, was consistently superior compared to PolyPhen (Fig. 1b) and it also compared favorably with the three other popular prediction tools (refs. 4–6) (Fig. 1c). For a false positive rate of 20%, PolyPhen-2 achieves the rate of true positive predictions of 92% and 73% on HumDiv and HumVar, respectively (Supplementary Table 2). One reason for a lower accuracy of predictions on HumVar is that nsSNPs assumed to be non-damaging in HumVar contain a sizable fraction of mildly deleterious alleles. In contrast, most amino acid replacements assumed non-damaging in HumDiv must be close to selective neutrality. Because alleles that are even mildly but unconditionally deleterious cannot be fixed in the evolving lineage, no method based on comparative sequence analysis is ideal for discriminating between drastically and mildly deleterious mutations, which are assigned to the opposite categories in HumVar. Another reason is that HumDiv uses an extra criterion to avoid possible erroneous annotations of damaging mutations.
For a mutation, PolyPhen-2 calculates Naive Bayes posterior probability that this mutation is damaging and reports estimates of false positive (the chance that the mutation is classified as damaging when it is in fact non-damaging) and true positive (the chance that the mutation is classified as damaging when it is indeed damaging) rates. A mutation is also appraised qualitatively, as benign, possibly damaging, or probably damaging (Supplementary Methods). The user can choose between HumDiv- and HumVar-trained PolyPhen-2. Diagnostics of Mendelian diseases requires distinguishing mutations with drastic effects from all the remaining human variation, including abundant mildly deleterious alleles. Thus, HumVar-trained PolyPhen-2 should be used for this task. In contrast, HumDiv-trained PolyPhen-2 should be used for evaluating rare alleles at loci potentially involved in complex phenotypes, dense mapping of regions identified by genome-wide association studies, and analysis of natural selection from sequence data, where even mildly deleterious alleles must be treated as damaging.
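The Naive Bayes posterior mentioned in this abstract can be illustrated with a short sketch. It shows only the general form of the calculation under the naive independence assumption; the function name, the prior, and the per-feature likelihoods are made-up values for this example and do not reflect PolyPhen-2's actual features or trained parameters, which are described in its Supplementary Methods.

```python
import math


def naive_bayes_posterior(prior_damaging, likelihoods):
    """Posterior P(damaging | features) under the naive independence assumption.

    likelihoods is a list of (P(feature | damaging), P(feature | benign))
    pairs, one per observed feature value. Working in log space avoids
    numerical underflow when many features are multiplied together.
    """
    log_d = math.log(prior_damaging)
    log_b = math.log(1.0 - prior_damaging)
    for p_d, p_b in likelihoods:
        log_d += math.log(p_d)
        log_b += math.log(p_b)
    # Normalise: P(d | x) = e^log_d / (e^log_d + e^log_b)
    m = max(log_d, log_b)
    num = math.exp(log_d - m)
    return num / (num + math.exp(log_b - m))


# Toy numbers (assumed): two features, each more likely under "damaging".
posterior = naive_bayes_posterior(0.5, [(0.8, 0.2), (0.6, 0.3)])
```

Here the unnormalised scores are 0.5 × 0.8 × 0.6 = 0.24 for "damaging" and 0.5 × 0.2 × 0.3 = 0.03 for "benign", so the posterior is 0.24 / 0.27, roughly 0.89.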

Topics: Multiple sequence alignment (54%), Mutation (genetic algorithm) (53%), Sequence analysis (50%)

10,175 Citations


Journal Article | DOI: 10.1038/NPROT.2009.86
Priyank Kumar, Steven Henikoff, Pauline C. Ng, +1 more | Institutions (3)
25 Jun 2009 | Nature Protocols
Abstract: The effect of genetic mutation on phenotype is of significant interest in genetics. The type of genetic mutation that causes a single amino acid substitution (AAS) in a protein sequence is called a non-synonymous single nucleotide polymorphism (nsSNP). An nsSNP could potentially affect the function of the protein, subsequently altering the carrier's phenotype. This protocol describes the use of the 'Sorting Tolerant From Intolerant' (SIFT) algorithm in predicting whether an AAS affects protein function. To assess the effect of a substitution, SIFT assumes that important positions in a protein sequence have been conserved throughout evolution and therefore substitutions at these positions may affect protein function. Thus, by using sequence homology, SIFT predicts the effects of all possible substitutions at each position in the protein sequence. The protocol typically takes 5–20 min, depending on the input. SIFT is available as an online tool ( http://sift-dna.org ).
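The conservation idea behind this abstract (substitutions at positions that are conserved across homologs are more likely to affect function) can be illustrated with a deliberately simplified sketch. It uses raw residue frequencies in one alignment column; the real SIFT algorithm computes normalized probabilities with further weighting that is not reproduced here. The function name and the toy alignment columns are assumptions.

```python
from collections import Counter


def substitution_tolerance(column, residue):
    """Fraction of aligned homologs carrying `residue` at this position.

    Gaps ("-") are ignored. A residue never seen at a conserved position
    scores 0.0, suggesting the substitution may affect protein function.
    """
    observed = [aa for aa in column if aa != "-"]
    if not observed:
        return 0.0
    return Counter(observed)[residue] / len(observed)


conserved = "LLLLLLLL-L"   # toy alignment column from ten homologs, one gap
variable = "LIVLMIVLIA"    # toy column where several residues are tolerated

score_conserved = substitution_tolerance(conserved, "P")
score_variable = substitution_tolerance(variable, "I")
```

In the conserved column, proline is never observed and scores 0.0, while isoleucine in the variable column is seen in 3 of 10 homologs.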


Topics: Protein sequencing (51%)

5,634 Citations


Open access | Journal Article | DOI: 10.1038/NG.2892
Martin Kircher, Daniela Witten, Preti Jain, Brian J. O'Roak, +3 more | Institutions (2)
01 Mar 2014 | Nature Genetics
Abstract: Our capacity to sequence human genomes has exceeded our ability to interpret genetic variation. Current genomic annotations tend to exploit a single information type (e.g. conservation) and/or are restricted in scope (e.g. to missense changes). Here, we describe Combined Annotation Dependent Depletion (CADD), a framework that objectively integrates many diverse annotations into a single, quantitative score. We implement CADD as a support vector machine trained to differentiate 14.7 million high-frequency human derived alleles from 14.7 million simulated variants. We pre-compute “C-scores” for all 8.6 billion possible human single nucleotide variants and enable scoring of short insertions/deletions. C-scores correlate with allelic diversity, annotations of functionality, pathogenicity, disease severity, experimentally measured regulatory effects, and complex trait associations, and highly rank known pathogenic variants within individual genomes. The ability of CADD to prioritize functional, deleterious, and pathogenic variants across many functional categories, effect sizes and genetic architectures is unmatched by any current annotation.


Topics: Genome-wide association study (54%), Genomics (51%)

4,148 Citations


Journal Article | DOI: 10.1038/NMETH0810-575
01 Aug 2010 | Nature Methods
Abstract: (simple_aae) or at alterations causing complex changes in the amino acid sequence (complex_aae). To train the classifier, we generated a dataset with all available and suitable common polymorphisms and known disease-causing mutations extracted from common databases and the literature. We cross-validated the classifier five times including all three prediction models and obtained an overall accuracy of 91.1 ± 0.1%. We also compared MutationTaster with similar applications (Panther (ref. 3), Pmut (ref. 4), PolyPhen and PolyPhen-2 (ref. 5) and ‘screening for non-acceptable polymorphisms’ (SNAP; ref. 6)) and analyzed the identical 1,000 disease-linked mutations and 1,000 polymorphisms with all programs. For this comparison, we used only alterations causing single amino acid exchanges. MutationTaster performed best in terms of accuracy and speed (Table 1). A description of all training and validation procedures and detailed statistics are available in Supplementary Methods. MutationTaster can be used via an intuitive web interface to analyze single mutations as well as in batch mode. To streamline and to standardize the analysis of NGS data, we provide Perl scripts that can process data from all major platforms (Roche 454, Illumina Genome Analyzer and ABI SOLiD). MutationTaster hence allows the efficient filtering of NGS data for alterations with high disease-causing potential (see Supplementary Methods for an example). Present limitations of the software comprise its inability to analyze insertion-deletions greater than 12 base pairs and alterations spanning an intron-exon border. Also, analysis of non-exonic alterations is restricted to Kozak consensus sequence, splice sites and poly(A) signal. We will add tests for other sequence motifs in the near future. MutationTaster is available at http://www.mutationtaster.org/.


Topics: Sequence (medicine) (73%)

2,345 Citations


Open access | Journal Article | DOI: 10.1371/JOURNAL.PONE.0046688
08 Oct 2012 | PLOS ONE
Abstract: As next-generation sequencing projects generate massive genome-wide sequence variation data, bioinformatics tools are being developed to provide computational predictions on the functional effects of sequence variations and narrow down the search of causal variants for disease phenotypes. Different classes of sequence variations at the nucleotide level are involved in human diseases, including substitutions, insertions, deletions, frameshifts, and non-sense mutations. Frameshifts and non-sense mutations are likely to cause a negative effect on protein function. Existing prediction tools primarily focus on studying the deleterious effects of single amino acid substitutions through examining amino acid conservation at the position of interest among related sequences, an approach that is not directly applicable to insertions or deletions. Here, we introduce a versatile alignment-based score as a new metric to predict the damaging effects of variations not limited to single amino acid substitutions but also in-frame insertions, deletions, and multiple amino acid substitutions. This alignment-based score measures the change in sequence similarity of a query sequence to a protein sequence homolog before and after the introduction of an amino acid variation to the query sequence. Our results showed that the scoring scheme performs well in separating disease-associated variants (n = 21,662) from common polymorphisms (n = 37,022) for UniProt human protein variations, and also in separating deleterious variants (n = 15,179) from neutral variants (n = 17,891) for UniProt non-human protein variations. In our approach, the area under the receiver operating characteristic curve (AUC) for the human and non-human protein variation datasets is ∼0.85. We also observed that the alignment-based score correlates with the deleteriousness of a sequence variation.
In summary, we have developed a new algorithm, PROVEAN (Protein Variation Effect Analyzer), which provides a generalized approach to predict the functional effects of protein sequence variations including single or multiple amino acid substitutions, and in-frame insertions and deletions. The PROVEAN tool is available online at http://provean.jcvi.org.
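The alignment-based score described in this abstract (the change in similarity between a query and a homolog before and after a variation is introduced) can be sketched as follows. This is a toy version under stated assumptions: sequences are taken to be pre-aligned and of equal length, similarity is a simple match/mismatch count standing in for a real alignment score, and the function names and example sequences are invented for illustration. It is not the PROVEAN implementation.

```python
def similarity(seq, homolog, match=1, mismatch=-1):
    """Toy similarity: position-wise score of two equal-length sequences."""
    if len(seq) != len(homolog):
        raise ValueError("this sketch assumes pre-aligned, equal-length sequences")
    return sum(match if a == b else mismatch for a, b in zip(seq, homolog))


def delta_score(query, variant, homolog):
    """Change in similarity to one homolog caused by the variation."""
    return similarity(variant, homolog) - similarity(query, homolog)


def mean_delta_score(query, variant, homologs):
    """Average the delta score over a set of homologs."""
    return sum(delta_score(query, variant, h) for h in homologs) / len(homologs)


query = "MKTAY"
variant = "MKPAY"  # T -> P at position 3 (made-up example)
homologs = ["MKTAY", "MKTAF", "MRTAY"]
score = mean_delta_score(query, variant, homologs)
```

In this toy case every homolog carries T at position 3, so the substitution lowers the similarity by 2 against each of them and the mean delta score is -2.0. A more negative value indicates that the variation moves the query further from its homologs.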

Topics: Sequence alignment (56%), Protein sequencing (54%), Sequence (medicine) (52%)

2,097 Citations


Performance Metrics
No. of citations received by the paper in previous years

Year    Citations
2022    15
2021    3,590
2020    2,774
2019    2,040
2018    1,402
2017    955