Showing papers on "Variant Call Format" published in 2018


Journal ArticleDOI
TL;DR: ClinVar continues to make improvements to its search and retrieval functions.
Abstract: ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/) is a freely available, public archive of human genetic variants and interpretations of their significance to disease, maintained at the National Institutes of Health. Interpretations of the clinical significance of variants are submitted by clinical testing laboratories, research laboratories, expert panels and other groups. ClinVar aggregates data by variant-disease pairs, and by variant (or set of variants). Data aggregated by variant are accessible on the website, in an improved set of variant call format files and as a new comprehensive XML report. ClinVar recently started accepting submissions that are focused primarily on providing phenotypic information for individuals who have had genetic testing. Submissions may come from clinical providers providing their own interpretation of the variant ('provider interpretation') or from groups such as patient registries that primarily provide phenotypic information from patients ('phenotyping only'). ClinVar continues to make improvements to its search and retrieval functions. Several new fields are now indexed for more precise searching, and filters allow the user to narrow down a large set of search results.

2,345 citations


Journal ArticleDOI
TL;DR: VCF2SM is a Python script that integrates sequencing depth information of polymorphisms in variant call format (VCF) files and SuperMASSA software for quantitative genotype calling and was successfully applied in analyzing GBS data from diverse panels and full-sib mapping populations of polyploid species.
Abstract: Genotyping-by-sequencing (GBS) has been used broadly in genetic studies for several species, especially those with agricultural importance. However, its use is still limited in autopolyploid species because genotype calling software generally fails to properly distinguish heterozygous classes based on allele dosage. VCF2SM is a Python script that integrates sequencing depth information of polymorphisms in variant call format (VCF) files and SuperMASSA software for quantitative genotype calling. VCFs can be obtained from any variant discovery software that outputs exact allele sequencing depth, such as a modified version of the Tassel-GBS pipeline provided here. VCF2SM was successfully applied in analyzing GBS data from diverse panels (alfalfa and potato) and full-sib mapping populations (alfalfa and switchgrass) of polyploid species. We demonstrate that our approach can help plant geneticists working with autopolyploid species to advance their studies by distinguishing allele dosage from GBS data.
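To illustrate the kind of per-allele depth information the abstract refers to, the following is a minimal Python sketch that reads the `AD` (allelic depth) entry of each sample on one VCF data line and returns the fraction of reads supporting the reference allele. It is not VCF2SM's code and does not call SuperMASSA; the function name `reference_read_fractions` is hypothetical, and the sketch assumes the variant caller wrote an `AD` key in the FORMAT column.

```python
def reference_read_fractions(vcf_line):
    """Return, per sample, the fraction of reads supporting the REF allele.

    Assumes a tab-separated VCF data line whose FORMAT column contains "AD"
    (comma-separated read depths, reference allele first). Samples with
    missing or zero depth yield None.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")   # column 9 is FORMAT
    ad_pos = format_keys.index("AD")     # raises ValueError if AD is absent

    fractions = []
    for sample in fields[9:]:            # columns 10+ are samples
        values = sample.split(":")
        # Trailing sample fields may be omitted, and "." marks missing data.
        if ad_pos >= len(values) or "." in values[ad_pos].split(","):
            fractions.append(None)
            continue
        depths = [int(d) for d in values[ad_pos].split(",")]
        total = sum(depths)
        fractions.append(depths[0] / total if total else None)
    return fractions
```

In an autotetraploid, a reference-read fraction near 0.75 versus 0.5 would hint at three versus two reference copies, which is the dosage distinction that quantitative genotype calling tries to make from such depths.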

42 citations


Journal ArticleDOI
TL;DR: SV-plaudit is a framework for rapidly curating structural variant (SV) predictions; its authors anticipate that it will become a standard step in variant calling pipelines and in the crowd-sourced curation of other biological results.
Abstract: SV-plaudit is a framework for rapidly curating structural variant (SV) predictions. For each SV, we generate an image that visualizes the coverage and alignment signals from a set of samples. Images are uploaded to our cloud framework where users assess the quality of each image using a client-side web application. Reports can then be generated as a tab-delimited file or annotated Variant Call Format (VCF) file. As a proof of principle, nine researchers collaborated for 1 hour to evaluate 1,350 SVs each. We anticipate that SV-plaudit will become a standard step in variant calling pipelines and the crowd-sourced curation of other biological results. Code is available at https://github.com/jbelyeu/SV-plaudit and a demonstration video at https://www.youtube.com/watch?v=ono8kHMKxDs

32 citations


Journal ArticleDOI
TL;DR: A user-friendly web server called “m6ASNP” is presented that is dedicated to the identification of genetic variants that target m6A modification sites and is believed to be a very convenient tool that can be used to boost further functional studies investigating genetic variants.
Abstract: Background: Large-scale genome sequencing projects have identified many genetic variants for diverse diseases. A major goal of these projects is to characterize these genetic variants to provide insight into their function and roles in diseases. N6-methyladenosine (m6A) is one of the most abundant RNA modifications in eukaryotes. Recent studies have revealed that aberrant m6A modifications are involved in many diseases. Findings: In this study, we present a user-friendly web server called "m6ASNP" that is dedicated to the identification of genetic variants that target m6A modification sites. A random forest model was implemented in m6ASNP to predict whether the methylation status of an m6A site is altered by the variants that surround the site. In m6ASNP, genetic variants in a standard variant call format (VCF) are accepted as the input data, and the output includes an interactive table that contains the genetic variants annotated by m6A function. In addition, statistical diagrams and a genome browser are provided to visualize the characteristics and to annotate the genetic variants. Conclusions: We believe that m6ASNP is a very convenient tool that can be used to boost further functional studies investigating genetic variants. The web server "m6ASNP" is implemented in JAVA and PHP and is freely available at [60].

32 citations


Journal ArticleDOI
TL;DR: A simple bioinformatic tool is developed that identifies variation at RMA sites and provides correct annotations for all such substitutions, which enhances the accuracy of next-generation sequencing–based methods in clinical practice.

24 citations


Journal ArticleDOI
TL;DR: Seshat, a Web service for annotating TP53 information derived from sequencing data, provides multiple statistical information for each TP53 variant including database frequency, functional activity, or pathogenicity.
Abstract: Accurate annotation of genomic variants in human diseases is essential to allow personalized medicine. Assessment of somatic and germline TP53 alterations has now reached the clinic and is required in several circumstances such as the identification of the most effective cancer therapy for patients with chronic lymphocytic leukemia (CLL). Here, we present Seshat, a Web service for annotating TP53 information derived from sequencing data. A flexible framework allows the use of standard file formats such as Mutation Annotation Format (MAF) or Variant Call Format (VCF), as well as common TXT files. Seshat performs accurate variant annotations using the Human Genome Variation Society (HGVS) nomenclature and the stable TP53 genomic reference provided by the Locus Reference Genomic (LRG). In addition, using the 2017 release of the UMD_TP53 database, Seshat provides multiple statistical information for each TP53 variant including database frequency, functional activity, or pathogenicity. The information is delivered in standardized output tables that minimize errors and facilitate comparison of mutational data across studies. Seshat is a beneficial tool to interpret the ever-growing TP53 sequencing data generated by multiple sequencing platforms and it is freely available via the TP53 Website, http://p53.fr or directly at http://vps338341.ovh.net/.

20 citations


Journal ArticleDOI
TL;DR: The proposed model allows genetic test results to be represented in health records in a structured format that supports both automated processing and clinical decision support, and it is extensible via external references, making it possible to track data provenance and adapt to future domain changes.

16 citations


Journal ArticleDOI
TL;DR: A method to infer copy number that uses variant call format (VCF) data as input, is implemented in the R package vcfR, was validated with the model system Saccharomyces cerevisiae, and was applied to the oomycete Phytophthora infestans.
Abstract: Inference of copy number variation presents a technical challenge because variant callers typically require the copy number of a genome or genomic region to be known a priori. Here we present a method to infer copy number that uses variant call format (VCF) data as input and is implemented in the R package vcfR. This method is based on the relative frequency of each allele (in both genic and non-genic regions) sequenced at heterozygous positions throughout a genome. These heterozygous positions are summarized by using arbitrarily sized windows of heterozygous positions, binning the allele frequencies, and selecting the bin with the greatest abundance of positions. This provides a non-parametric summary of the frequency that alleles were sequenced at. The method is applicable to organisms that have reference genomes that consist of full chromosomes or sub-chromosomal contigs. In contrast to other software designed to detect copy number variation, our method does not rely on an assumption of base ploidy, but instead infers it. We validated these approaches with the model system of Saccharomyces cerevisiae and applied it to the oomycete Phytophthora infestans, both known to vary in copy number. This functionality has been incorporated into the current release of the R package vcfR to provide modular and flexible methods to investigate copy number variation in genomic projects.
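The windowing-and-binning step described in the abstract can be sketched as follows. This is an illustrative Python reimplementation of the idea, not the vcfR code (which is written in R); the function name `window_peak_frequencies`, the window size, the number of bins, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def window_peak_frequencies(allele_freqs, window_size=4, n_bins=10):
    """Summarise allele frequencies at heterozygous positions per window.

    allele_freqs: allele frequencies (0..1) at consecutive heterozygous
    positions. For each window of `window_size` positions, the frequencies
    are binned into `n_bins` equal-width bins and the midpoint of the most
    populated bin is returned. Ties go to the lower bin (an arbitrary choice).
    """
    peaks = []
    for start in range(0, len(allele_freqs), window_size):
        window = allele_freqs[start:start + window_size]
        # Bin index for each frequency; a frequency of 1.0 goes in the last bin.
        counts = Counter(min(int(f * n_bins), n_bins - 1) for f in window)
        top_bin = max(counts, key=lambda b: (counts[b], -b))
        peaks.append((top_bin + 0.5) / n_bins)
    return peaks
```

Because the summary is just the most populated bin, it makes no distributional assumption. Under this reading, a window peaking near 1/2 would be consistent with two copies, and peaks near 1/3 or 2/3 with three.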

15 citations


Journal ArticleDOI
TL;DR: SNitty is a web application that allows interactive visualization and interrogation of variant call format files using B-allele frequencies of single-nucleotide polymorphisms and single-nucleotide variants, coverage metrics, and copy number analysis results.

13 citations


Posted ContentDOI
06 Jun 2018-bioRxiv
TL;DR: PeCanPIE is a web- and cloud-based platform for annotation, identification, and classification of variations in known or putative disease genes, applied to classify variant pathogenicity in cancer predisposition genes in two large-scale investigations involving >4,000 pediatric cancer patients.
Abstract: Variant interpretation in the era of next-generation sequencing (NGS) is challenging. While many resources and guidelines are available to assist with this task, few integrated end-to-end tools exist. Here we present "PeCanPIE" (the Pediatric Cancer Variant Pathogenicity Information Exchange), a web- and cloud-based platform for annotation, identification, and classification of variations in known or putative disease genes. Starting from a set of variants in Variant Call Format (VCF), variants are annotated, ranked by putative pathogenicity, and presented for formal classification using a decision-support interface based on published guidelines from the American College of Medical Genetics and Genomics (ACMG). The system can accept files containing millions of variants and handle single-nucleotide variants (SNVs), simple insertions/deletions (indels), multiple-nucleotide variants (MNVs), and complex substitutions. PeCanPIE has been applied to classify variant pathogenicity in cancer predisposition genes in two large-scale investigations involving >4,000 pediatric cancer patients, and serves as a repository for the expert-reviewed results. While PeCanPIE's web-based interface was designed to be accessible to non-bioinformaticians, its back end pipelines may also be run independently on the cloud, facilitating direct integration and broader adoption. PeCanPIE is publicly available and free for research use.

11 citations


Journal ArticleDOI
15 Nov 2018
TL;DR: It is shown that the pipeline is efficient in RADSeq-based marker selection for Arabidopsis thaliana, and the visualization of SNPs and Indels has been very helpful and has provided valuable insights on marker selection.
Abstract: The discovery and assessment of genetic variants for Next Generation Sequencing (NGS), including Restriction site Associated DNA sequencing (RADSeq), is an important task in bioinformatics and comparative genetics. The genetic variants can be single-nucleotide polymorphisms (SNPs) or insertions and deletions (Indels) when compared to a reference genome. Usually, the short reads are first aligned to a reference genome using NGS alignment software, such as the Burrows-Wheeler Aligner (BWA). The alignment is usually stored in a BAM file, the binary form of the standard SAM (Sequence Alignment/Map) format. Then analysis software, such as the Genome Analysis Toolkit (GATK) or SAMTools, together with scripts written in the R programming language, can provide an efficient solution for calling variants. In this project, we focus on RADSeq-based marker selection for Arabidopsis thaliana. RADSeq consists of short reads which do not cover the whole reference genome. SNPs were called by GATK or SAMTools to obtain four call-sets in Variant Call Format (VCF). The VCF files were then visualized with the Integrative Genomics Viewer (IGV) software. We found that the visualization of SNPs and Indels was very helpful and provided us with valuable insights on marker selection. We also found that applying a chi-square test of Hardy-Weinberg Equilibrium (HWE) to all target genotypes (homozygous reference 0/0, heterozygous variant 0/1 and homozygous variant 1/1) significantly reduced the false positive rate. We show that our pipeline is efficient in RADSeq-based marker selection.
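The chi-square test of Hardy-Weinberg Equilibrium mentioned in the abstract can be sketched in Python as below. The authors describe using R scripts, so this is only an illustration of the standard test for a biallelic site, not their pipeline; the function names and the choice of the 5% critical value for one degree of freedom (3.841) are assumptions.

```python
def hwe_chi_square(n_ref_hom, n_het, n_alt_hom):
    """Chi-square statistic comparing observed genotype counts (0/0, 0/1, 1/1)
    with the counts expected under Hardy-Weinberg Equilibrium."""
    n = n_ref_hom + n_het + n_alt_hom
    if n == 0:
        raise ValueError("at least one genotyped sample is required")
    p = (2 * n_ref_hom + n_het) / (2 * n)   # reference allele frequency
    q = 1.0 - p                             # alternate allele frequency
    observed = (n_ref_hom, n_het, n_alt_hom)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    # Skip classes with zero expectation (monomorphic sites).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def passes_hwe(n_ref_hom, n_het, n_alt_hom, critical_value=3.841):
    """True if the site is not rejected at the 5% level (1 degree of freedom)."""
    return hwe_chi_square(n_ref_hom, n_het, n_alt_hom) < critical_value
```

A site with 25, 50 and 25 samples in the three genotype classes matches the expected proportions exactly, whereas a site with 50, 0 and 50 has no heterozygotes at all and would be flagged, which is the kind of pattern a false positive call can produce.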

Journal Article
Ahmed Alfares
TL;DR: A custom filtration process and strategy targeting a specific population provide excellent detection rates in less time and should be considered as a first-tier laboratory workflow for analysis.
Abstract: Objective: Interpreting whole-exome sequencing (WES) data is challenging, requiring extensive time and effort to review all the variants in the variant call format. Here, we examined the application of custom filters to narrow the number of candidate variants in a consanguineous population that requires further analysis. Methods: In 100 cases undergoing WES, we applied a custom filtration process to look primarily for homozygous variants in autosomal recessive (AR) disorders, and secondarily for variants in either autosomal dominant or X-linked disorders. Results: Most identified disease-causing variants were homozygous in AR disorders. By applying our custom filtration process, we narrowed the number of candidate variants requiring further analysis to 5-15 per case while maintaining a high detection rate and completing analysis in around 45 min. Conclusion: A custom filtration process and strategy targeting a specific population provide excellent detection rates in less time and should be considered as a first-tier laboratory workflow for analysis.
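The first filtering step described above, keeping homozygous variants, can be sketched in Python against raw VCF lines. This is a hypothetical illustration only: the abstract does not describe the authors' software, and the function names and the decision to treat any genotype with two identical non-reference alleles as homozygous are assumptions.

```python
def is_homozygous_alt(vcf_line, sample_index=0):
    """True if the chosen sample carries two copies of the same ALT allele."""
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")                # FORMAT column
    sample_values = fields[9 + sample_index].split(":")
    genotype = sample_values[format_keys.index("GT")]
    alleles = genotype.replace("|", "/").split("/")   # phased or unphased
    return (
        len(alleles) == 2
        and alleles[0] == alleles[1]
        and alleles[0] not in ("0", ".")              # not REF, not missing
    )


def keep_homozygous_alt(vcf_lines, sample_index=0):
    """Yield data lines with a homozygous ALT genotype; header lines are skipped."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        if is_homozygous_alt(line, sample_index):
            yield line
```

A real workflow would combine such a genotype filter with frequency, quality and gene-level criteria; the sketch covers only the zygosity check.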

Journal ArticleDOI
TL;DR: OVAS is an offline open-source modular-driven analysis environment designed to annotate and extract useful variants from Variant Call Format files, and process them under an inheritance context through a top-down filtering schema of swappable modules, run entirely off a live bootable medium and accessed locally through a web-browser.
Abstract: The advent of modern high-throughput genetics continually broadens the gap between the rising volume of sequencing data and the tools required to process them. The need to pinpoint a small subset of functionally important variants has now shifted towards identifying the critical differences between normal variants and disease-causing ones. The ever-increasing reliance on cloud-based services for sequence analysis and the non-transparent methods they utilize has prompted the need for more in-situ services that can provide a safer and more accessible environment to process patient data, especially in circumstances where continuous internet usage is limited. To address these issues, we herein propose our standalone Open-source Variant Analysis Sequencing (OVAS) pipeline, consisting of three key stages of processing that pertain to the separate modes of annotation, filtering, and interpretation. Core annotation performs variant-mapping to gene-isoforms at the exon/intron level, appends functional data pertaining to the type of variant mutation, and determines hetero/homozygosity. An extensive inheritance-modelling module in conjunction with 11 other filtering components can be used in sequence, ranging from single quality control to multi-file penetrance model specifics such as X-linked recessive or mosaicism. Depending on the type of interpretation required, additional annotation is performed to identify organ specificity through gene expression and protein domains. In the course of this paper we analysed an autosomal recessive case study. OVAS made effective use of the filtering modules to recapitulate the results of the study by identifying the prescribed compound-heterozygous disease pattern from exome-capture sequence input samples.
OVAS is an offline open-source modular-driven analysis environment designed to annotate and extract useful variants from Variant Call Format (VCF) files, and process them under an inheritance context through a top-down filtering schema of swappable modules, run entirely off a live bootable medium and accessed locally through a web-browser.

25 Jul 2018
TL;DR: Splice Site Variant Analyzer (SSVA) fills a void in splice site variant analysis by merging the output from several databases to provide researchers with a free and comprehensive analysis of the pathogenicity of splice site variants in a single step at runtime.
Abstract: We present Splice Site Variant Analyzer (SSVA) to simplify the characterization of deleterious and benign variants in or around splice sites. SSVA uses a Variant Call Format (VCF) file to query variants in humans against the Annovar database, MaxEntScan software, and the Conserved Domain Database. From Annovar, SSVA calculates the GERP score, the ExAC score for each population, the allele frequency from the 1000 Genomes Project, and the likelihood score that the variant affects splicing. From MaxEntScan, SSVA calculates a splice site efficiency score based on the sequence. Finally, SSVA uses the Conserved Domain Database through rpsblast to determine if conserved domains are affected by the variant. SSVA presents each of these scores in a single output file that allows researchers to easily classify each splice site variant as pathogenic or benign. SSVA fills a void in splice site variant analysis by merging the output from several databases to provide researchers with a free and comprehensive analysis of the pathogenicity of splice site variants in a single step at runtime.

Journal ArticleDOI
20 Jan 2018
TL;DR: SNPs as output in Variant Call Format (VCF) have been visualized by Integrative Genomics Viewer (IGV) software and it is found that the visualization of SNPs and Indels is helpful and provides valuable insights on marker selection.
Abstract: Motivation: The discovery and assessment of genetic variants for Next Generation Sequencing (NGS), including Restriction site Associated DNA sequencing (RADSeq), is an important task in bioinformatics and comparative genetics. The genetic variants can be single-nucleotide polymorphisms (SNPs) or insertions and deletions (Indels) when compared to a reference genome. Usually, the short reads are first aligned to a reference genome using NGS alignment software, such as the Burrows-Wheeler Aligner (BWA). The alignment is usually stored in a BAM file, the binary form of the standard SAM (Sequence Alignment/Map) format. Then analysis software, such as the Genome Analysis Toolkit (GATK) or SAMTools [30] [31], together with scripts written in the R programming language, can provide an efficient solution for calling variants. We focused on RADSeq-based marker selection for Arabidopsis thaliana. RADSeq consists of short reads that do not cover the whole reference genome. Finally, SNPs output in Variant Call Format (VCF) were visualized with the Integrative Genomics Viewer (IGV) software. We found that the visualization of SNPs and Indels is helpful and provides us with valuable insights on marker selection. We also found that applying a chi-square test of Hardy-Weinberg Equilibrium (HWE) to all target genotypes (homozygous reference 0/0, heterozygous variant 0/1 and homozygous variant 1/1) significantly reduced the false positive rate, and we showed that our pipeline is efficient in RADSeq-based marker selection.

Posted ContentDOI
21 Dec 2018-bioRxiv
TL;DR: This work demonstrates that existing population-wide WGS call-sets can be mined for CNVs with minimal computational overhead, delivering insight into a less well-studied, yet potentially impactful class of genetic variant.
Abstract: Copy number variants (CNVs) are large deletions or duplications at least 50 to 200 base pairs long. They play an important role in multiple disorders, but accurate calling of CNVs remains challenging. Most current approaches to CNV detection use raw read alignments, which are computationally intensive to process. We use a regression tree-based approach to call CNVs from whole-genome sequencing (WGS, >18x) variant call-sets in 6,898 samples across four European cohorts, and describe a rich large variation landscape comprising 1,320 CNVs. 61.8% of detected events have been previously reported in the Database of Genomic Variants. 23% of high-quality deletions affect entire genes, and we recapitulate known events such as the GSTM1 and RHD gene deletions. We test for association between the detected deletions and 275 protein levels in 1,457 individuals to assess the potential clinical impact of the detected CNVs. We describe the LD structure and copy number variation underlying the association between levels of the CCL3 protein and a complex structural variant (MAF=0.15, p=3.6×10^-12) affecting CCL3L3, a paralog of the CCL3 gene. We also identify a cis-association between a low-frequency NOMO1 deletion and the protein product of this gene (MAF=0.02, p=2.2×10^-7), for which no cis- or trans- single nucleotide variant-driven protein quantitative trait locus (pQTL) has been documented to date. This work demonstrates that existing population-wide WGS call-sets can be mined for CNVs with minimal computational overhead, delivering insight into a less well-studied, yet potentially impactful class of genetic variant. The regression tree-based approach, UN-CNVc, is available as an R and bash executable on GitHub at https://github.com/agilly/un-cnvc. Supplementary information is appended.

Posted ContentDOI
14 Nov 2018-bioRxiv
TL;DR: A number of features make VCF/Plotein especially suited for the medical community, such as its speed, security, the ability to filter by disease or gene function, and the ease with which information may be shared with collaborators/co-workers.
Abstract: Purpose: To create a user-friendly web application that allows researchers, medical professionals and patients to easily and securely view, filter and interact with human exome sequencing data in the Variant Call Format (VCF). Methods: We have created VCF/Plotein, a web application written entirely in JavaScript using the Vue.js framework, available at http://vcfplotein.liigh.unam.mx. After a VCF is loaded, gene and variant information is extracted from Ensembl, and cross-referencing with external databases is performed via the Elasticsearch search engine. Support for application-based gene and variant filtering has also been implemented. Interactive graphs are created using the D3.js library. All data processing is done locally in the user’s CPU to ensure the security of patient data. Results: VCF/Plotein allows users to interactively view and filter VCF files without needing any bioinformatics knowledge. A number of features make it especially suited for the medical community, such as its speed, security, the ability to filter by disease or gene function, and the ease with which information may be shared with collaborators/co-workers. Conclusion: VCF/Plotein is a novel web application that allows users to easily and interactively filter and display exome sequencing information, and that is especially suited for bench researchers, medical professionals and patients.