Journal ArticleDOI

vcfR: a package to manipulate and visualize variant call format data in R.

TL;DR: The R package vcfR provides essential, novel tools currently not available in R to facilitate VCF data exploration, including intuitive methods for data quality control and easy export to other R packages for further analysis.
Abstract: Software to call single-nucleotide polymorphisms or related genetic variants has converged on the variant call format (VCF) as the output format of choice. This has created a need for tools to work with VCF files. While a growing number of programs can read VCF data, many extract only the genotypes and omit the per-genotype data that describe their quality. We created the R package vcfR to address this issue. We implemented this VCF file exploration tool in the R language because R provides an interactive experience and an environment that is commonly used for genetic data analysis. The package implements functions to read VCF files into R and write them back out, to extract portions of the data, and to plot summary statistics. vcfR further provides the ability to visualize how various parameterizations of the data affect the results. Additional tools integrate sequence (FASTA) and annotation (GFF) data for visualization of genomic regions such as chromosomes. Conversion functions translate data from the vcfR data structure to formats used by other R genetics packages. Computationally intensive functions are implemented in C++ to improve performance. These tools are intended to facilitate VCF data exploration, including intuitive methods for data quality control and easy export to other R packages for further analysis. vcfR thus provides essential, novel tools currently not available in R.
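The abstract's point about keeping the per-genotype quality data alongside the genotype can be illustrated with a short sketch. It is written in plain Python rather than with vcfR itself, and it is not the package's implementation. The function names, sample names, depth threshold and example record are all hypothetical. Only the tab-delimited column layout and the colon-separated FORMAT keys (GT, DP, GQ) come from the VCF format.

```python
# Minimal sketch: keep the quality fields that accompany each genotype.
# Function names, sample names and the example record are hypothetical.

def parse_vcf_record(line, sample_names):
    """Return (site, genotypes) for one tab-delimited VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, qual = fields[:6]
    format_keys = fields[8].split(":")  # e.g. ["GT", "DP", "GQ"]
    site = {
        "CHROM": chrom,
        "POS": int(pos),
        "REF": ref,
        "ALT": alt.split(","),
        "QUAL": None if qual == "." else float(qual),
    }
    genotypes = {}
    for name, column in zip(sample_names, fields[9:]):
        # Pair each FORMAT key with the value in the sample column.
        genotypes[name] = dict(zip(format_keys, column.split(":")))
    return site, genotypes


def mask_low_depth(genotypes, min_dp):
    """Report GT as missing ('./.') where DP is absent or below min_dp."""
    masked = {}
    for name, values in genotypes.items():
        dp = values.get("DP", ".")
        passes = dp != "." and int(dp) >= min_dp
        masked[name] = values["GT"] if passes else "./."
    return masked


record = "chr1\t1042\t.\tA\tG\t50.0\tPASS\t.\tGT:DP:GQ\t0/1:18:99\t1/1:3:12\t./.:.:."
samples = ["s1", "s2", "s3"]
site, gts = parse_vcf_record(record, samples)
masked = mask_low_depth(gts, min_dp=10)
print(site["POS"], masked)
```

Because the depth (DP) travels with each genotype, a low-coverage call such as the second sample's can be set to missing instead of being trusted. A reader that keeps only GT cannot apply this kind of quality control.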
Citations
01 Jan 2011
TL;DR: The sheer volume and scope of genome-wide data pose a significant challenge to developing efficient and intuitive visualization tools that scale to very large data sets and flexibly integrate multiple data types, including clinical data.
Abstract: Rapid improvements in sequencing and array-based platforms are resulting in a flood of diverse genome-wide data, including data from exome and whole-genome sequencing, epigenetic surveys, expression profiling of coding and noncoding RNAs, single nucleotide polymorphism (SNP) and copy number profiling, and functional assays. Analysis of these large, diverse data sets holds the promise of a more comprehensive understanding of the genome and its relation to human disease. Experienced and knowledgeable human review is an essential component of this process, complementing computational approaches. This calls for efficient and intuitive visualization tools able to scale to very large data sets and to flexibly integrate multiple data types, including clinical data. However, the sheer volume and scope of data pose a significant challenge to the development of such tools.

2,187 citations

Journal ArticleDOI
01 Sep 1993-Nature

413 citations

12 Jan 2015
TL;DR: In this paper, the authors demonstrate that amplicon sequencing with GT-seq greatly reduces the cost of genotyping hundreds of targeted SNPs relative to existing methods by utilizing a simple library preparation method and massive efficiency of scale.
Abstract: Genotyping‐in‐Thousands by sequencing (GT‐seq) is a method that uses next‐generation sequencing of multiplexed PCR products to generate genotypes from relatively small panels (50–500) of targeted single‐nucleotide polymorphisms (SNPs) for thousands of individuals in a single Illumina HiSeq lane. This method uses only unlabelled oligos and PCR master mix in two thermal cycling steps for amplification of targeted SNP loci. During this process, sequencing adapters and dual barcode sequence tags are incorporated into the amplicons, enabling thousands of individuals to be pooled into a single sequencing library. After sequencing, reads from individual samples are split into individual files using their unique combination of barcode sequences. Genotyping is performed with a simple Perl script that counts amplicon‐specific sequences for each allele, and allele ratios are used to determine the genotypes. We demonstrate this technique by genotyping 2068 individual steelhead trout (Oncorhynchus mykiss) samples with a set of 192 SNP markers in a single library sequenced in a single Illumina HiSeq lane. Genotype data were 99.9% concordant with previously collected TaqMan™ genotypes at the same 192 loci, but call rates were slightly lower with GT‐seq (96.4%) than with TaqMan (99.0%). Of the 192 SNPs, 187 were genotyped in ≥90% of the individual samples and only 3 SNPs were genotyped in <70% of samples. This study demonstrates that amplicon sequencing with GT‐seq greatly reduces the cost of genotyping hundreds of targeted SNPs relative to existing methods by utilizing a simple library preparation method and massive efficiency of scale.
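The genotyping step described above, counting allele-specific sequences and calling genotypes from the allele ratio, might look roughly like the following sketch. It is written in Python and is not the authors' Perl script. The probe sequences, reads, function names and every threshold (min_reads, hom_ratio, het_ratio) are hypothetical placeholders, not values from the paper.

```python
# Sketch of ratio-based genotype calling for one SNP in one sample.
# All thresholds, probe strings and reads are hypothetical placeholders.

def count_alleles(reads, probe_a1, probe_a2):
    """Count reads containing each allele-specific probe sequence."""
    a1 = sum(1 for read in reads if probe_a1 in read)
    a2 = sum(1 for read in reads if probe_a2 in read)
    return a1, a2


def call_genotype(a1, a2, min_reads=10, hom_ratio=10.0, het_ratio=5.0):
    """Call a genotype from allele counts using the a1:a2 ratio."""
    if a1 + a2 < min_reads:
        return "NA"  # too few reads to call
    # A zero a2 count is treated as an infinite ratio.
    ratio = float("inf") if a2 == 0 else a1 / a2
    if ratio >= hom_ratio:
        return "A1/A1"
    if ratio <= 1.0 / hom_ratio:
        return "A2/A2"
    if 1.0 / het_ratio <= ratio <= het_ratio:
        return "A1/A2"
    return "NA"  # ambiguous ratio, left uncalled


# Made-up reads: 12 carry the allele-1 probe, 9 carry the allele-2 probe.
reads = ["TTGACCA"] * 12 + ["TTGATCA"] * 9
counts = count_alleles(reads, probe_a1="GACC", probe_a2="GATC")
print(counts, call_genotype(*counts))
```

Leaving low-count and intermediate-ratio samples uncalled is one plausible reason a ratio-based caller would report a call rate below 100%, as the abstract does for GT-seq.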

200 citations

Journal ArticleDOI
07 Jan 2021-Nature
TL;DR: Analyses of molecular, anatomical, pigmentation and ecological characteristics of nearly all of the approximately 240 species of cichlid fishes in Lake Tanganyika show that the massive adaptive radiation occurred within the confines of the lake through trait-specific pulses of accelerated evolution.
Abstract: Adaptive radiation is the likely source of much of the ecological and morphological diversity of life. How adaptive radiations proceed and what determines their extent remains unclear in most cases. Here we report the in-depth examination of the spectacular adaptive radiation of cichlid fishes in Lake Tanganyika. On the basis of whole-genome phylogenetic analyses, multivariate morphological measurements of three ecologically relevant trait complexes (body shape, upper oral jaw morphology and lower pharyngeal jaw shape), scoring of pigmentation patterns and approximations of the ecology of nearly all of the approximately 240 cichlid species endemic to Lake Tanganyika, we show that the radiation occurred within the confines of the lake and that morphological diversification proceeded in consecutive trait-specific pulses of rapid morphospace expansion. We provide empirical support for two theoretical predictions of how adaptive radiations proceed, the ‘early-burst’ scenario (for body shape) and the stages model (for all traits investigated). Through the analysis of two genomes per species and by taking advantage of the uneven distribution of species in subclades of the radiation, we further show that species richness scales positively with per-individual heterozygosity, but is not correlated with transposable element content, number of gene duplications or genome-wide levels of selection in coding sequences.

114 citations

Journal ArticleDOI
TL;DR: In this article, the authors show that diverse poplar species carry partial duplicates of the ARABIDOPSIS RESPONSE REGULATOR 17 (ARR17) orthologue in the male-specific region of the Y chromosome.
Abstract: Although hundreds of plant lineages have independently evolved dioecy (that is, separation of the sexes), the underlying genetic basis remains largely elusive. Here we show that diverse poplar species carry partial duplicates of the ARABIDOPSIS RESPONSE REGULATOR 17 (ARR17) orthologue in the male-specific region of the Y chromosome. These duplicates give rise to small RNAs apparently causing male-specific DNA methylation and silencing of the ARR17 gene. CRISPR-Cas9-induced mutations demonstrate that ARR17 functions as a sex switch, triggering female development when on and male development when off. Despite repeated turnover events, including a transition from the XY system to a ZW system, the sex-specific regulation of ARR17 is conserved across the poplar genus and probably beyond. Our data reveal how a single-gene-based mechanism of dioecy can enable highly dynamic sex-linked regions and contribute to maintaining recombination and integrity of sex chromosomes.

106 citations

References
Journal ArticleDOI
TL;DR: SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant calling and alignment viewing, and thus provides universal tools for processing read alignments.
Abstract: Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments. Availability: http://samtools.sourceforge.net

45,957 citations


"vcfr: a package to manipulate and v..." refers to methods in this paper

  • ...Bioinformatic tools for calling variants such as SAMTOOLs (Li et al. 2009) or the GATK’s haplotype caller (McKenna et al. 2010; DePristo et al. 2011; Auwera et al. 2013) have all converged on VCF (Danecek et al. 2011; Samtools 2016) as an output file format....

Journal ArticleDOI
TL;DR: The Burrows-Wheeler Alignment tool (BWA) is a new read alignment package based on backward search with the Burrows–Wheeler Transform (BWT) that efficiently aligns short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps.
Abstract: Motivation: The enormous amount of short reads generated by the new DNA sequencing technologies calls for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for alignment of longer reads where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals. Results: We implemented Burrows-Wheeler Alignment tool (BWA), a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. Evaluations on both simulated and real data suggest that BWA is ~10–20× faster than MAQ, while achieving similar accuracy. In addition, BWA outputs alignment in the new standard SAM (Sequence Alignment/Map) format. Variant calling and other downstream analyses after the alignment can be achieved with the open source SAMtools software package. Availability: http://maq.sourceforge.net

43,862 citations


"vcfr: a package to manipulate and v..." refers to methods in this paper

  • ...This short-read data were mapped to the T30-4 reference with BWA-MEM (Li & Durbin 2009), while BAM improvement and variant calling were performed according to the GATK’s best practices (McKenna et al. 2010; DePristo et al. 2011; Auwera et al. 2013)....

Journal ArticleDOI
TL;DR: The GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
Abstract: Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS—the 1000 Genome pilot alone includes nearly five terabases—make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.

20,557 citations


"vcfr: a package to manipulate and v..." refers to background or methods in this paper

  • ...Bioinformatic tools for calling variants such as SAMTOOLs (Li et al. 2009) or the GATK’s haplotype caller (McKenna et al. 2010; DePristo et al. 2011; Auwera et al. 2013) have all converged on VCF (Danecek et al. 2011; Samtools 2016) as an output file format....

  • ...This short-read data were mapped to the T30-4 reference with BWA-MEM (Li & Durbin 2009), while BAM improvement and variant calling were performed according to the GATK’s best practices (McKenna et al. 2010; DePristo et al. 2011; Auwera et al. 2013)....

Journal ArticleDOI
TL;DR: Analysis of Phylogenetics and Evolution (APE) is a package written in the R language for use in molecular evolution and phylogenetics that provides both utility functions for reading and writing data and manipulating phylogenetic trees.
Abstract: Summary: Analysis of Phylogenetics and Evolution (APE) is a package written in the R language for use in molecular evolution and phylogenetics. APE provides both utility functions for reading and writing data and manipulating phylogenetic trees, as well as several advanced methods for phylogenetic and evolutionary analysis (e.g. comparative and population genetic methods). APE takes advantage of the many R functions for statistics and graphics, and also provides a flexible framework for developing and implementing further statistical methods for the analysis of evolutionary processes. Availability: The program is free and available from the official R package archive at http://cran.r-project.org/src/contrib/PACKAGES.html#ape. APE is licensed under the GNU General Public License.

10,818 citations


"vcfr: a package to manipulate and v..." refers to methods in this paper

  • ...When sequence information is provided, the VCF data can be converted into an object of class DNAbin using vcfR2DNAbin() for analysis in APE (Paradis et al. 2004) or PEGAS (Paradis 2010)....

  • ...This package also does not appear to include conversion functions to translate data into data structures used in commonly used R packages such as APE (Paradis et al. 2004), ADEGENET (Jombart 2008), PEGAS (Paradis 2010) and POPPR (Kamvar et al. 2014, 2015)....

  • ...The package also includes functions to convert this information to formats used by existing R packages specifically designed to work with population genetic data [e.g. APE (Paradis et al. 2004), ADEGENET (Jombart 2008), PEGAS (Paradis 2010) and POPPR (Kamvar et al. 2014, 2015)]....

Journal ArticleDOI
TL;DR: Rapidly growing, diverse genome-wide data sets call for efficient and intuitive visualization tools that scale to very large data sets and flexibly integrate multiple data types, including clinical data.
Abstract: Rapid improvements in sequencing and array-based platforms are resulting in a flood of diverse genome-wide data, including data from exome and whole-genome sequencing, epigenetic surveys, expression profiling of coding and noncoding RNAs, single nucleotide polymorphism (SNP) and copy number profiling, and functional assays. Analysis of these large, diverse data sets holds the promise of a more comprehensive understanding of the genome and its relation to human disease. Experienced and knowledgeable human review is an essential component of this process, complementing computational approaches. This calls for efficient and intuitive visualization tools able to scale to very large data sets and to flexibly integrate multiple data types, including clinical data. However, the sheer volume and scope of data pose a significant challenge to the development of such tools.

10,798 citations