Journal ArticleDOI

Error correction of next-generation sequencing data and reliable estimation of HIV quasispecies

01 Nov 2010-Nucleic Acids Research (Oxford University Press)-Vol. 38, Iss: 21, pp 7400-7409
TL;DR: It is concluded that pyrosequencing can be used to investigate genetically diverse samples with high accuracy if technical errors are properly treated, and that probabilistic haplotype inference outperforms the counting-based calling method in both precision and recall.
Abstract: Next-generation sequencing technologies can be used to analyse genetically heterogeneous samples at unprecedented detail. The high coverage achievable with these methods enables the detection of many low-frequency variants. However, sequencing errors complicate the analysis of mixed populations and result in inflated estimates of genetic diversity. We developed a probabilistic Bayesian approach to minimize the effect of errors on the detection of minority variants. We applied it to pyrosequencing data obtained from a 1.5-kb fragment of the HIV-1 gag/pol gene in two control and two clinical samples. The effect of PCR amplification was analysed. Error correction resulted in a two- and five-fold decrease of the pyrosequencing base substitution rate, from 0.05% to 0.03% and from 0.25% to 0.05% in the non-PCR and PCR-amplified samples, respectively. We were able to detect viral clones as rare as 0.1% with perfect sequence reconstruction. Probabilistic haplotype inference outperforms the counting-based calling method in both precision and recall. Genetic diversity observed within and between two clinical samples resulted in various patterns of phenotypic drug resistance and suggests a close epidemiological link. We conclude that pyrosequencing can be used to investigate genetically diverse samples with high accuracy if technical errors are properly treated.
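A minimal sketch of the statistical idea behind probabilistic minority-variant calling, not the paper's actual Bayesian clustering model: a candidate variant is accepted only if its read count is unlikely under the technical error rate alone. The function name, binomial error model, and cutoffs below are illustrative assumptions.

```python
# Toy minority-variant caller: accept a variant only if observing k
# mismatches among n reads is improbable under the per-base technical
# error rate eps alone. A sketch, not the paper's clustering algorithm.
from scipy.stats import binom

def call_variant(k: int, n: int, eps: float, alpha: float = 1e-6) -> bool:
    # Tail probability P(X >= k) if all mismatches were sequencing errors.
    p_errors_only = binom.sf(k - 1, n, eps)
    return p_errors_only < alpha

# Example: a 0.5% variant at 10,000x coverage against the corrected
# 0.05% error rate is comfortably called.
print(call_variant(k=50, n=10_000, eps=0.0005))  # True
```

The point of error correction is precisely to shrink eps so that this kind of test gains power for rare variants.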


Citations
Journal ArticleDOI
TL;DR: It is determined that Duplex Sequencing has a theoretical background error rate of less than one artifactual mutation per billion nucleotides sequenced and that detection of mutations present in only one of the two strands of duplex DNA can be used to identify sites of DNA damage.
Abstract: Next-generation DNA sequencing promises to revolutionize clinical medicine and basic research. However, while this technology has the capacity to generate hundreds of billions of nucleotides of DNA sequence in a single experiment, the error rate of ∼1% results in hundreds of millions of sequencing mistakes. These scattered errors can be tolerated in some applications but become extremely problematic when “deep sequencing” genetically heterogeneous mixtures, such as tumors or mixed microbial populations. To overcome limitations in sequencing accuracy, we have developed a method termed Duplex Sequencing. This approach greatly reduces errors by independently tagging and sequencing each of the two strands of a DNA duplex. As the two strands are complementary, true mutations are found at the same position in both strands. In contrast, PCR or sequencing errors result in mutations in only one strand and can thus be discounted as technical error. We determine that Duplex Sequencing has a theoretical background error rate of less than one artifactual mutation per billion nucleotides sequenced. In addition, we establish that detection of mutations present in only one of the two strands of duplex DNA can be used to identify sites of DNA damage. We apply the method to directly assess the frequency and pattern of random mutations in mitochondrial DNA from human cells.
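A minimal sketch of the duplex-consensus logic described above, assuming equal-length reads already grouped by molecular tag and oriented to the same strand coordinates; toy code, not the published Duplex Sequencing pipeline.

```python
# Toy duplex consensus: a base is kept only where the consensus of the
# top-strand family agrees with that of the bottom-strand family, so
# single-strand errors (PCR, sequencing, DNA damage) are masked.
from collections import Counter

def strand_consensus(reads: list[str], min_reads: int = 3) -> str | None:
    """Per-position majority base within one strand family."""
    if len(reads) < min_reads:
        return None
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))

def duplex_consensus(top_reads: list[str], bottom_reads: list[str]) -> str:
    top = strand_consensus(top_reads)
    bottom = strand_consensus(bottom_reads)
    if top is None or bottom is None:
        return ""  # too few reads on one strand to form a duplex call
    return "".join(t if t == b else "N" for t, b in zip(top, bottom))
```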

944 citations

Journal ArticleDOI
TL;DR: The understanding of viruses as quasispecies has led to new antiviral designs, such as lethal mutagenesis, whose aim is to drive viruses toward low fitness values with limited chances of fitness recovery.
Abstract: Summary: Evolution of RNA viruses occurs through disequilibria of collections of closely related mutant spectra or mutant clouds termed viral quasispecies. Here we review the origin of the quasispecies concept and some biological implications of quasispecies dynamics. Two main aspects are addressed: (i) mutant clouds as reservoirs of phenotypic variants for virus adaptability and (ii) the internal interactions that are established within mutant spectra that render a virus ensemble the unit of selection. The understanding of viruses as quasispecies has led to new antiviral designs, such as lethal mutagenesis, whose aim is to drive viruses toward low fitness values with limited chances of fitness recovery. The impact of quasispecies for three salient human pathogens, human immunodeficiency virus and the hepatitis B and C viruses, is reviewed, with emphasis on antiviral treatment strategies. Finally, extensions of quasispecies to nonviral systems are briefly mentioned to emphasize the broad applicability of quasispecies theory.
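For context, the mutation-selection dynamics underlying the quasispecies concept are conventionally written as Eigen's replicator-mutator equation; the standard textbook form is sketched below (not reproduced from this abstract).

```latex
% Eigen's quasispecies equation: x_i is the frequency of sequence i,
% f_j the fitness (replication rate) of sequence j, q_{ij} the
% probability that replication of j yields i, and phi the mean fitness,
% which keeps the frequencies normalized to sum to one.
\[
  \dot{x}_i = \sum_j q_{ij} f_j x_j - \phi(t)\, x_i,
  \qquad
  \phi(t) = \sum_j f_j x_j .
\]
```

In this formalism, raising the mutation rate past the error threshold delocalizes the population from the fittest (master) sequence, which is the intuition behind lethal mutagenesis.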

852 citations

Journal ArticleDOI
Abstract: Kepa Ruiz-Mirazo, Carlos Briones, and Andrés de la Escosura. Biophysics Unit (CSIC-UPV/EHU), Leioa, and Department of Logic and Philosophy of Science, University of the Basque Country, Avenida de Tolosa 70, 20080 Donostia-San Sebastián, Spain; Department of Molecular Evolution, Centro de Astrobiología (CSIC-INTA, associated to the NASA Astrobiology Institute), Carretera de Ajalvir, Km 4, 28850 Torrejón de Ardoz, Madrid, Spain; Organic Chemistry Department, Universidad Autónoma de Madrid, Cantoblanco, 28049 Madrid, Spain

616 citations

Journal ArticleDOI
TL;DR: This work identified major and minor polymorphisms at coding and noncoding positions in the HIV-1 protease (pro) gene and observed dynamic genetic changes within the population during intermittent drug exposure, including the emergence of multiple resistant alleles.
Abstract: Viruses can create complex genetic populations within a host, and deep sequencing technologies allow extensive sampling of these populations. Limitations of these technologies, however, potentially bias this sampling, particularly when a PCR step precedes the sequencing protocol. Typically, an unknown number of templates are used in initiating the PCR amplification, and this can lead to unrecognized sequence resampling creating apparent homogeneity; also, PCR-mediated recombination can disrupt linkage, and differential amplification can skew allele frequency. Finally, misincorporation of nucleotides during PCR and errors during the sequencing protocol can inflate diversity. We have solved these problems by including a random sequence tag in the initial primer such that each template receives a unique Primer ID. After sequencing, repeated identification of a Primer ID reveals sequence resampling. These resampled sequences are then used to create an accurate consensus sequence for each template, correcting for recombination, allelic skewing, and misincorporation/sequencing errors. The resulting population of consensus sequences directly represents the initial sampled templates. We applied this approach to the HIV-1 protease (pro) gene to view the distribution of sequence variation of a complex viral population within a host. We identified major and minor polymorphisms at coding and noncoding positions. In addition, we observed dynamic genetic changes within the population during intermittent drug exposure, including the emergence of multiple resistant alleles. These results provide an unprecedented view of a complex viral population in the absence of PCR resampling.
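A minimal sketch of the Primer ID consensus step described above, assuming (tag, read) pairs of equal-length aligned reads; the family-size cutoff and names are illustrative, not the authors' pipeline.

```python
# Toy Primer ID collapse: reads sharing a tag derive from one starting
# template, so a per-tag majority consensus removes PCR resampling and
# averages out misincorporation and sequencing errors.
from collections import Counter, defaultdict

def primer_id_consensus(tagged_reads: list[tuple[str, str]],
                        min_family_size: int = 3) -> dict[str, str]:
    families: defaultdict[str, list[str]] = defaultdict(list)
    for tag, read in tagged_reads:
        families[tag].append(read)
    consensus = {}
    for tag, reads in families.items():
        if len(reads) < min_family_size:
            continue  # tiny families often reflect errors in the tag itself
        consensus[tag] = "".join(
            Counter(col).most_common(1)[0][0] for col in zip(*reads))
    return consensus  # one consensus sequence per sampled template
```

Each surviving consensus then counts as exactly one template, which is what makes the resulting allele frequencies directly interpretable.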

475 citations

Journal ArticleDOI
TL;DR: The discovery of novel C > A/G > T transversion artifacts found at low allelic fractions in targeted capture data is described and informatics methods are presented to confidently filter these artifacts from sequencing data sets.
Abstract: As researchers begin probing deep coverage sequencing data for increasingly rare mutations and subclonal events, the fidelity of next generation sequencing (NGS) laboratory methods will become increasingly critical. Although error rates for sequencing and polymerase chain reaction (PCR) are well documented, the effects that DNA extraction and other library preparation steps could have on downstream sequence integrity have not been thoroughly evaluated. Here, we describe the discovery of novel C > A/G > T transversion artifacts found at low allelic fractions in targeted capture data. Characteristics such as sequencer read orientation and presence in both tumor and normal samples strongly indicated a non-biological mechanism. We identified the source as oxidation of DNA during acoustic shearing in samples containing reactive contaminants from the extraction process. We show generation of 8-oxoguanine (8-oxoG) lesions during DNA shearing, present analysis tools to detect oxidation in sequencing data and suggest methods to reduce DNA oxidation through the introduction of antioxidants. Further, informatics methods are presented to confidently filter these artifacts from sequencing data sets. Though only seen in a low percentage of reads in affected samples, such artifacts could have profoundly deleterious effects on the ability to confidently call rare mutations, and eliminating other possible sources of artifacts should become a priority for the research community.
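A minimal sketch of one common way to exploit the read-orientation signature mentioned above: real variants are supported by ALT reads in both read-pair orientations, whereas oxidation artifacts are strongly one-sided. The counts, cutoff, and function below are illustrative assumptions, not the paper's published filter.

```python
# Toy orientation-bias filter for C>A/G>T artifacts: test whether ALT
# support is balanced between the two read-pair orientations (F1R2 vs
# F2R1); heavily one-sided support suggests an 8-oxoG artifact.
from scipy.stats import binomtest

def is_oxog_artifact(alt_f1r2: int, alt_f2r1: int,
                     p_cutoff: float = 0.01) -> bool:
    n = alt_f1r2 + alt_f2r1
    if n == 0:
        return False
    # Under a true variant, each orientation is roughly equally likely.
    return binomtest(alt_f1r2, n, 0.5).pvalue < p_cutoff

print(is_oxog_artifact(40, 0))   # True: all ALT reads in one orientation
print(is_oxog_artifact(22, 18))  # False: balanced support
```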

456 citations

References
Book
28 Jul 2013
TL;DR: In this book, the authors describe the important ideas in these areas in a common conceptual framework; the emphasis is on concepts rather than mathematics, with a liberal use of color graphics.
Abstract: During the past decade there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It is a valuable resource for statisticians and anyone interested in data mining in science or industry. The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting, the first comprehensive treatment of this topic in any book. This major new edition features many topics not covered in the original, including graphical models, random forests, ensemble methods, least angle regression and path algorithms for the lasso, non-negative matrix factorization, and spectral clustering. There is also a chapter on methods for 'wide' data (p bigger than n), including multiple testing and false discovery rates. Trevor Hastie, Robert Tibshirani, and Jerome Friedman are professors of statistics at Stanford University. They are prominent researchers in this area: Hastie and Tibshirani developed generalized additive models and wrote a popular book of that title. Hastie co-developed much of the statistical modeling software and environment in R/S-PLUS and invented principal curves and surfaces. Tibshirani proposed the lasso and is co-author of the very successful An Introduction to the Bootstrap. Friedman is the co-inventor of many data-mining tools including CART, MARS, projection pursuit and gradient boosting.

19,261 citations

Journal ArticleDOI
01 Jan 1949-Nature
TL;DR: In this article, the authors define and examine a measure of concentration in terms of population constants, and examine the relationship between the characteristic and the index of diversity when both are applied to a logarithmic distribution.
Abstract: The 'characteristic' defined by Yule [1] and the 'index of diversity' defined by Fisher [2] are two measures of the degree of concentration or diversity achieved when the individuals of a population are classified into groups. Both are defined as statistics to be calculated from sample data and not in terms of population constants. The index of diversity has so far been used chiefly with the logarithmic distribution. It cannot be used everywhere, as it does not always give values which are independent of sample size; it cannot do so, for example, when applied to an infinite population of individuals classified into a finite number of groups. Williams [3] has pointed out a relationship between the characteristic and the index of diversity when both are applied to a logarithmic distribution. The present purpose is to define and examine a measure of concentration in terms of population constants.
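The measure defined here is what is now called Simpson's index. In standard notation (the usual formulation, reconstructed rather than quoted from this abstract):

```latex
% Simpson's concentration: pi_i is the population proportion of group i
% and lambda the probability that two randomly drawn individuals belong
% to the same group; the second expression is its unbiased estimate
% from sample counts n_i with N = sum_i n_i.
\[
  \lambda = \sum_i \pi_i^{2},
  \qquad
  \hat{\lambda} = \frac{\sum_i n_i (n_i - 1)}{N (N - 1)} .
\]
```

In the quasispecies setting of the citing paper, the groups are haplotypes, so a quantity such as 1 - lambda can serve as a measure of the genetic diversity of the viral population.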

10,077 citations

Journal ArticleDOI
TL;DR: The European Molecular Biology Open Software Suite is a mature package of software tools developed for the molecular biology community; it includes a comprehensive set of applications for molecular sequence analysis and other tasks and integrates popular third-party software packages under a consistent interface.

9,493 citations


"Error correction of next-generation..." refers methods in this paper

  • ...Sequence manipulation was performed using Biopython (30) and EMBOSS (31)....

    [...]
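As a minimal illustration of the kind of sequence manipulation the paper attributes to Biopython and EMBOSS (the input filename below is hypothetical):

```python
# Read a FASTA file with Biopython and print basic per-record info;
# "reads.fasta" is a placeholder filename, not taken from the paper.
from Bio import SeqIO

for record in SeqIO.parse("reads.fasta", "fasta"):
    print(record.id, len(record.seq), record.seq.reverse_complement()[:20])
```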

Journal ArticleDOI
15 Sep 2005-Nature
TL;DR: A scalable, highly parallel sequencing system is described, with raw throughput significantly greater than that of state-of-the-art capillary electrophoresis instruments, achieving 96% coverage at 99.96% accuracy in one run of the machine.
Abstract: The proliferation of large-scale DNA-sequencing projects in recent years has driven a search for alternative methods to reduce time and cost. Here we describe a scalable, highly parallel sequencing system with raw throughput significantly greater than that of state-of-the-art capillary electrophoresis instruments. The apparatus uses a novel fibre-optic slide of individual wells and is able to sequence 25 million bases, at 99% or better accuracy, in one four-hour run. To achieve an approximately 100-fold increase in throughput over current Sanger sequencing technology, we have developed an emulsion method for DNA amplification and an instrument for sequencing by synthesis using a pyrosequencing protocol optimized for solid support and picolitre-scale volumes. Here we show the utility, throughput, accuracy and robustness of this system by shotgun sequencing and de novo assembly of the Mycoplasma genitalium genome with 96% coverage at 99.96% accuracy in one run of the machine.
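A toy sketch of the flow-space base calling this platform relies on: each nucleotide flow produces a light signal roughly proportional to the length of the homopolymer incorporated, so calling bases amounts to rounding flow values. The signal values below are hypothetical, and real 454 base calling is considerably more involved.

```python
# Toy pyrosequencing base caller: flows cycle through T,A,C,G and each
# signal is rounded to an integer homopolymer length; rounding mistakes
# are why homopolymer indels dominate this platform's error profile.
FLOW_ORDER = "TACG"

def call_flowgram(signals: list[float]) -> str:
    bases = []
    for i, s in enumerate(signals):
        bases.append(FLOW_ORDER[i % 4] * round(s))
    return "".join(bases)

print(call_flowgram([1.02, 0.05, 2.10, 0.97]))  # -> "TCCG"
```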

8,434 citations


"Error correction of next-generation..." refers background in this paper

  • ...Recent technological advances have drastically decreased the time and the cost required to obtain DNA sequences (1)....

    [...]

Journal ArticleDOI
TL;DR: A technical review of template preparation, sequencing and imaging, genome alignment and assembly approaches, and recent advances in current and near-term commercially available NGS instruments is presented.
Abstract: Demand has never been greater for revolutionary technologies that deliver fast, inexpensive and accurate genome information. This challenge has catalysed the development of next-generation sequencing (NGS) technologies. The inexpensive production of large volumes of sequence data is the primary advantage over conventional methods. Here, I present a technical review of template preparation, sequencing and imaging, genome alignment and assembly approaches, and recent advances in current and near-term commercially available NGS instruments. I also outline the broad range of applications for NGS technologies, in addition to providing guidelines for platform selection to address biological questions of interest.

7,023 citations
