Journal ArticleDOI

Performance comparison of exome DNA sequencing technologies

01 Oct 2011-Nature Biotechnology (Nat Biotechnol)-Vol. 29, Iss: 10, pp 908-914
TL;DR: The results suggest that the Nimblegen platform, which is the only one to use high-density overlapping baits, covers fewer genomic regions than the other platforms but requires the least amount of sequencing to sensitively detect small variants.
Abstract: Whole exome sequencing by high-throughput sequencing of target-enriched genomic DNA (exome-seq) has become common in basic and translational research as a means of interrogating the interpretable part of the human genome at relatively low cost. We present a comparison of three major commercial exome sequencing platforms from Agilent, Illumina and Nimblegen applied to the same human blood sample. Our results suggest that the Nimblegen platform, which is the only one to use high-density overlapping baits, covers fewer genomic regions than the other platforms but requires the least amount of sequencing to sensitively detect small variants. Agilent and Illumina are able to detect a greater total number of variants with additional sequencing. Illumina captures untranslated regions, which are not targeted by the Nimblegen and Agilent platforms. We also compare exome sequencing and whole genome sequencing (WGS) of the same sample, demonstrating that exome sequencing can detect additional small variants missed by WGS.
Citations
Journal ArticleDOI
TL;DR: Off-target effects of RGENs can be reduced below the detection limits of deep sequencing by choosing unique target sequences in the genome and modifying both guide RNA and Cas9, and paired nickases induced chromosomal deletions in a targeted manner without causing unwanted translocations.
Abstract: RNA-guided endonucleases (RGENs), derived from the prokaryotic adaptive immune system known as CRISPR/Cas, enable targeted genome engineering in cells and organisms. RGENs are ribonucleoproteins consisting of a guide RNA and Cas9, a protein component originating from Streptococcus pyogenes. These enzymes cleave chromosomal DNA whose sequence is complementary to the guide RNA, producing site-specific DNA double-strand breaks (DSBs), the repair of which gives rise to targeted genome modifications. Despite broad interest in RGEN-mediated genome editing, these nucleases are limited by off-target mutations and unwanted chromosomal translocations associated with off-target DNA cleavage. Here, we show that off-target effects of RGENs can be reduced below the detection limits of deep sequencing by choosing unique target sequences in the genome and modifying both the guide RNA and Cas9. We found that both the composition and structure of the guide RNA can affect RGEN activities in cells to reduce off-target effects. RGENs efficiently discriminated on-target sites from off-target sites that differ by two bases. Furthermore, exome sequencing analysis showed that no off-target mutations were induced by two RGENs in four clonal populations of mutant cells. In addition, paired Cas9 nickases, composed of D10A Cas9 and guide RNA, which generate two single-strand breaks (SSBs) or nicks on different DNA strands, were highly specific in human cells, avoiding off-target mutations without sacrificing genome-editing efficiency. Interestingly, paired nickases induced chromosomal deletions in a targeted manner without causing unwanted translocations. Our results highlight the importance of choosing unique target sequences and optimizing the guide RNA and Cas9 to avoid or reduce RGEN-induced off-target mutations.

1,332 citations

Journal ArticleDOI
TL;DR: The issue of sequencing depth in the design of next-generation sequencing experiments is discussed and current guidelines and precedents on the issue of coverage are reviewed for four major study designs, including de novo genome sequencing, genome resequencing, transcriptome sequencing and genomic location analyses.
Abstract: Sequencing technologies have placed a wide range of genomic analyses within the capabilities of many laboratories. However, sequencing costs often set limits to the amount of sequences that can be generated and, consequently, the biological outcomes that can be achieved from an experimental design. In this Review, we discuss the issue of sequencing depth in the design of next-generation sequencing experiments. We review current guidelines and precedents on the issue of coverage, as well as their underlying considerations, for four major study designs, which include de novo genome sequencing, genome resequencing, transcriptome sequencing and genomic location analyses (for example, chromatin immunoprecipitation followed by sequencing (ChIP-seq) and chromosome conformation capture (3C)).

1,156 citations
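
The review above is concerned with how much sequencing a given design needs. As a rough illustration of the depth arithmetic behind such coverage guidelines, the sketch below computes expected mean coverage from read count, read length and genome size; all numbers are assumed example values, not figures taken from the review.

```python
# Illustrative only: the back-of-envelope depth calculation that underlies
# coverage guidelines. All numbers are assumed example values.

def mean_depth(num_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Expected mean coverage = total sequenced bases / genome size."""
    return num_reads * read_length_bp / genome_size_bp

# Example: 600 million 100 bp reads against a ~3.1 Gb human genome (assumed values).
depth = mean_depth(num_reads=600_000_000, read_length_bp=100,
                   genome_size_bp=3_100_000_000)
print(f"Expected mean depth: ~{depth:.0f}x")  # roughly 19x
```

Real designs also have to discount duplicate, off-target and low-quality reads, so achieved depth is lower than this idealized figure.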

Journal ArticleDOI
16 Mar 2012-Cell
TL;DR: This study demonstrates that longitudinal iPOP can be used to interpret healthy and diseased states by connecting genomic information with additional dynamic omics activity and reveals extensive heteroallelic changes during healthy and disease states and an unexpected RNA editing mechanism.

1,142 citations

Journal ArticleDOI
TL;DR: This work presents an approach in which 192 sequencing libraries can be produced in a single day of technician time at a cost of about $15 per sample, effective not only for low-pass whole-genome sequencing, but also for simultaneously enriching them in pools of approximately 100 individually barcoded samples for a subset of the genome.
Abstract: Improvements in technology have reduced the cost of DNA sequencing to the point that the limiting factor for many experiments is the time and reagent cost of sample preparation. We present an approach in which 192 sequencing libraries can be produced in a single day of technician time at a cost of about $15 per sample. These libraries are effective not only for low-pass whole-genome sequencing, but also for simultaneously enriching them in pools of approximately 100 individually barcoded samples for a subset of the genome without substantial loss in efficiency of target capture. We illustrate the power and effectiveness of this approach on about 2000 samples from a prostate cancer study.

924 citations


Cites methods from "Performance comparison of exome DNA..."

  • ...…platform for hybrid selections, we expect that similar hybridization-based target enrichment systems, such as the Illumina TruSeq Enrichment kits (Clark et al. 2011), the Roche/NimbleGen SeqCap EZ Hybridization kits, and array-based hybridization (Hodges et al. 2007), would enrich multiplexed…...

    [...]
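
As a quick sanity check of the throughput and cost figures quoted in the abstract above (192 libraries per technician-day, about $15 per sample, pools of roughly 100 barcoded samples, applied to about 2000 prostate cancer samples), a small illustrative calculation:

```python
# Quick arithmetic on the throughput and cost figures quoted in the abstract.
# Purely illustrative; all inputs are the approximate figures given there.

SAMPLES = 2000
COST_PER_SAMPLE = 15      # USD per sample, from the abstract
LIBRARIES_PER_DAY = 192   # libraries per technician-day, from the abstract
POOL_SIZE = 100           # barcoded samples per capture pool, from the abstract

print(f"Library prep cost: ~${SAMPLES * COST_PER_SAMPLE:,}")        # ~$30,000
print(f"Technician days:   ~{SAMPLES / LIBRARIES_PER_DAY:.1f}")     # ~10.4
print(f"Capture pools:     ~{SAMPLES // POOL_SIZE}")                # ~20
```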

Journal ArticleDOI
Heng Li
TL;DR: By investigating false heterozygous calls in the haploid genome, the erroneous realignment in low-complexity regions and the incomplete reference genome with respect to the sample are identified as the two major sources of errors, which press for continued improvements in these two areas.
Abstract: Motivation: Whole-genome high-coverage sequencing has been widely used for personal and cancer genomics as well as in various research areas. However, in the absence of an unbiased whole-genome truth set, the global error rate of variant calls and the leading causal artifacts remain unclear, despite great efforts to evaluate variant calling methods. Results: We made 10 single nucleotide polymorphism and INDEL call sets with two read mappers and five variant callers, on both a haploid human genome and a diploid genome at similar coverage. By investigating false heterozygous calls in the haploid genome, we identified erroneous realignment in low-complexity regions and the incompleteness of the reference genome with respect to the sample as the two major sources of errors, which press for continued improvements in these two areas. We estimated that the error rate of raw genotype calls is as high as 1 in 10–15 kb, but the error rate of post-filtered calls is reduced to 1 in 100–200 kb without a significant compromise in sensitivity. Availability and implementation: BWA-MEM alignments and raw variant calls are available at http://bit.ly/1g8XqRt; scripts and miscellaneous data at https://github.com/lh3/varcmp. Contact: hengli@broadinstitute.org. Supplementary information: Supplementary data are available at Bioinformatics online.

801 citations


Additional excerpts

  • ...…either by comparing variant calls from different pipelines, or by comparing calls to variants ascertained with array genotyping or in another study (Clark et al., 2011; Li, 2012; Lam et al., 2012a,b; Boland et al., 2013; Liu et al., 2013; Goode et al., 2013; O’Rawe et al., 2013; Zook et al., 2014;…...

    [...]
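
The abstract above quotes error spacings rather than genome-wide counts. A minimal sketch of how those rates translate into approximate numbers of erroneous genotype calls, assuming a ~3.1 Gb callable genome (an assumption for illustration; the paper's callable fraction may differ):

```python
# Rough translation of the error rates quoted in the abstract into genome-wide
# counts. The ~3.1 Gb callable-genome size is an assumption for illustration.

GENOME_BP = 3_100_000_000

def errors_for_spacing(one_error_per_bp: int) -> float:
    """Expected number of erroneous genotype calls at a given error spacing."""
    return GENOME_BP / one_error_per_bp

raw = [errors_for_spacing(r) for r in (10_000, 15_000)]          # 1 in 10-15 kb
filtered = [errors_for_spacing(r) for r in (100_000, 200_000)]   # 1 in 100-200 kb

print(f"Raw calls:      ~{raw[1]:,.0f} to {raw[0]:,.0f} errors")            # ~207,000-310,000
print(f"Filtered calls: ~{filtered[1]:,.0f} to {filtered[0]:,.0f} errors")  # ~15,500-31,000
```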

References
Journal ArticleDOI
TL;DR: SAMtools, as described by the authors, implements various utilities for post-processing alignments in the SAM format, such as indexing, variant calling and alignment viewing, and thus provides universal tools for processing read alignments.
Abstract: Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments. Availability: http://samtools.sourceforge.net Contact: [email protected]

45,957 citations


"Performance comparison of exome DNA..." refers methods in this paper

  • ...3%) Aligned reads were processed and sorted with SAMtool...

    [...]
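
The excerpt above notes only that aligned reads were processed and sorted with SAMtools. A minimal sketch of that kind of post-processing step is shown below; the file names are placeholders, and the flags follow current samtools 1.x conventions, which differ from the 0.1.x releases available when the paper was written.

```python
# Minimal sketch of SAMtools-style post-processing: coordinate-sort a BAM file
# and build a .bai index for fast random access. Assumes samtools is installed
# and on PATH; file names are placeholders.
import subprocess

def sort_and_index(aligned_bam: str, sorted_bam: str) -> None:
    """Coordinate-sort an aligned BAM and index the result."""
    subprocess.run(["samtools", "sort", "-o", sorted_bam, aligned_bam], check=True)
    subprocess.run(["samtools", "index", sorted_bam], check=True)

if __name__ == "__main__":
    sort_and_index("aligned_reads.bam", "aligned_reads.sorted.bam")
```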

Journal ArticleDOI
TL;DR: The GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
Abstract: Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS (the 1000 Genomes pilot alone includes nearly five terabases) make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.

20,557 citations


"Performance comparison of exome DNA..." refers methods in this paper

  • ...To evaluate SNV detection performance, we called variants in each normalized data set using the Genome Analysis Toolkit (GATK...

    [...]
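
The excerpt above states that variants were called with the GATK but does not give the command. A hedged sketch of a single-sample SNP-calling invocation from that era follows; UnifiedGenotyper was the SNP caller in contemporaneous GATK releases, and the file names and version-specific flags here are assumptions rather than the authors' exact pipeline.

```python
# Hedged sketch of a GATK 1.x-3.x era SNP-calling invocation (UnifiedGenotyper).
# Assumes GenomeAnalysisTK.jar, a reference FASTA and a sorted, indexed BAM are
# available; paths and file names are placeholders.
import subprocess

def call_snps(reference_fa: str, input_bam: str, output_vcf: str) -> None:
    """Call SNPs on a single sample with the UnifiedGenotyper walker."""
    subprocess.run([
        "java", "-jar", "GenomeAnalysisTK.jar",
        "-T", "UnifiedGenotyper",
        "-R", reference_fa,
        "-I", input_bam,
        "-o", output_vcf,
    ], check=True)

if __name__ == "__main__":
    call_snps("hg19.fa", "aligned_reads.sorted.bam", "raw_snps.vcf")
```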

Journal ArticleDOI
28 Oct 2010-Nature
TL;DR: The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype; this paper presents results of the project's pilot phase, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms.
Abstract: The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother-father-child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10⁻⁸ per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.

7,538 citations
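
The abstract above estimates the de novo germline substitution rate at roughly 10⁻⁸ per base pair per generation. A back-of-envelope conversion of that rate into expected de novo substitutions per child, assuming a ~6.2 Gb diploid genome (an assumed round figure, not from the paper):

```python
# Back-of-envelope use of the mutation rate quoted in the abstract
# (~1e-8 substitutions per base pair per generation). The diploid genome
# size is an assumed round figure for illustration.

MUTATION_RATE = 1e-8        # per base pair per generation, from the abstract
DIPLOID_GENOME_BP = 6.2e9   # ~2 x 3.1 Gb, assumed

expected_de_novo = MUTATION_RATE * DIPLOID_GENOME_BP
print(f"Expected de novo substitutions per child: ~{expected_de_novo:.0f}")  # ~62
```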

Journal ArticleDOI
TL;DR: The overlap of miRNA sequences with annotated transcripts, both protein-coding and non-coding, is described, and graphical views of the locations of a wide range of genomic features in model organisms allow for the first time the prediction of the likely boundaries of many miRNA primary transcripts.
Abstract: miRBase is the central online repository for microRNA (miRNA) nomenclature, sequence data, annotation and target prediction. The current release (10.0) contains 5071 miRNA loci from 58 species, expressing 5922 distinct mature miRNA sequences: a growth of over 2000 sequences in the past 2 years. miRBase provides a range of data to facilitate studies of miRNA genomics: all miRNAs are mapped to their genomic coordinates. Clusters of miRNA sequences in the genome are highlighted, and can be defined and retrieved with any inter-miRNA distance. The overlap of miRNA sequences with annotated transcripts, both protein-coding and non-coding, is described. Finally, graphical views of the locations of a wide range of genomic features in model organisms allow for the first time the prediction of the likely boundaries of many miRNA primary transcripts. miRBase is available at http://microrna.sanger.ac.uk/.

4,493 citations


"Performance comparison of exome DNA..." refers background in this paper

  • ...We first examined coverage of major RNA databases—RefSeq (coding and untranslated region (UTR)), Ensembl (total and coding sequence (CDS)) and the microRNA (miRNA) database miRBas...

    [...]
