
Showing papers on "2 base encoding" published in 2012


Journal ArticleDOI
TL;DR: ART is a set of simulation tools that generate synthetic next-generation sequencing reads, which are essential for testing and benchmarking tools for next-generation sequencing data analysis, including read alignment, de novo assembly and genetic variation discovery.
Abstract: Summary: ART is a set of simulation tools that generate synthetic next-generation sequencing reads. This functionality is essential for testing and benchmarking tools for next-generation sequencing data analysis including read alignment, de novo assembly and genetic variation discovery. ART generates simulated sequencing reads by emulating the sequencing process with built-in, technology-specific read error models and base quality value profiles parameterized empirically in large sequencing datasets. We currently support all three major commercial next-generation sequencing platforms: Roche’s 454, Illumina’s Solexa and Applied Biosystems’ SOLiD. ART also allows the flexibility to use customized read error model parameters and quality profiles. Availability: Both source and binary software packages are available at http://www.niehs.nih.gov/research/resources/software/art
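The idea of emulating the sequencing process with a per-position quality profile can be illustrated with a minimal sketch. This is not ART's actual error model: it assumes substitution errors only, a single fixed quality profile, and uniformly random read start positions, and the names `phred_to_error_prob`, `simulate_read` and `simulate_reads` are hypothetical.

```python
import random


def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def simulate_read(reference, start, quality_profile, rng):
    """Copy len(quality_profile) bases from `reference` beginning at `start`.

    Each base is replaced by a different base with the error probability
    implied by the quality score at that read position (assumed toy model:
    substitutions only, no insertions or deletions).
    """
    bases = "ACGT"
    read = []
    for offset, q in enumerate(quality_profile):
        true_base = reference[start + offset]
        if rng.random() < phred_to_error_prob(q):
            # Substitution error: choose one of the three other bases.
            read.append(rng.choice([b for b in bases if b != true_base]))
        else:
            read.append(true_base)
    return "".join(read)


def simulate_reads(reference, n_reads, quality_profile, seed=0):
    """Return a list of (start, read) pairs sampled uniformly along `reference`."""
    rng = random.Random(seed)
    read_len = len(quality_profile)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - read_len + 1)
        reads.append((start, simulate_read(reference, start, quality_profile, rng)))
    return reads
```

Calling `simulate_reads(genome, 1000, [40] * 30 + [20] * 20)` would, under these assumptions, produce 50-base reads whose last 20 positions are more error-prone than the first 30, which is roughly the shape of quality decay that an empirical profile would capture.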

1,285 citations


Journal ArticleDOI
TL;DR: The results indicate that it is possible to map SMS reads with high accuracy and speed, and the inferences made on the mapability of SMS reads using the combinatorial model of sequencing error are in agreement with the mapping accuracy demonstrated on simulated reads.
Abstract: Recent methods have been developed to perform high-throughput sequencing of DNA by Single Molecule Sequencing (SMS). While Next-Generation sequencing methods may produce reads up to several hundred bases long, SMS sequencing produces reads up to tens of kilobases long. Existing alignment methods are either too inefficient for high-throughput datasets, or not sensitive enough to align SMS reads, which have a higher error rate than Next-Generation sequencing. We describe the method BLASR (Basic Local Alignment with Successive Refinement) for mapping Single Molecule Sequencing (SMS) reads that are thousands of bases long, with divergence between the read and genome dominated by insertion and deletion error. The method is benchmarked using both simulated reads and reads from a bacterial sequencing project. We also present a combinatorial model of sequencing error that motivates why our approach is effective. The results indicate that it is possible to map SMS reads with high accuracy and speed. Furthermore, the inferences made on the mapability of SMS reads using our combinatorial model of sequencing error are in agreement with the mapping accuracy demonstrated on simulated reads.

1,085 citations


Journal ArticleDOI
12 Dec 2012-PLOS ONE
TL;DR: The results demonstrate that the Illumina HiSeq 2000 sequencing system, the primary sequencing technology currently used for de novo genome sequencing and assembly at JGI, has various advantages in terms of total sequence throughput and cost, but it also introduces challenges for the downstream analyses.
Abstract: Background: The emergence of next generation sequencing (NGS) has provided the means for rapid and high throughput sequencing and data generation at low cost, while concomitantly creating a new set of challenges. The number of available assembled microbial genomes continues to grow rapidly and their quality reflects the quality of the sequencing technology used, but also of the analysis software employed for assembly and annotation. Methodology/Principal Findings: In this work, we have explored the quality of the microbial draft genomes across various sequencing technologies. We have compared the draft and finished assemblies of 133 microbial genomes sequenced at the Department of Energy-Joint Genome Institute and finished at the Los Alamos National Laboratory using a variety of combinations of sequencing technologies, reflecting the transition of the institute from Sanger-based sequencing platforms to NGS platforms. The quality of the public assemblies and of the associated gene annotations was evaluated using various metrics. Results obtained with the different sequencing technologies, as well as their effects on downstream processes, were analyzed. Our results demonstrate that the Illumina HiSeq 2000 sequencing system, the primary sequencing technology currently used for de novo genome sequencing and assembly at JGI, has various advantages in terms of total sequence throughput and cost, but it also introduces challenges for the downstream analyses. In all cases, assembly results, although on average of high quality, need to be viewed critically, and sources of error in them should be considered prior to analysis. Conclusion: These data follow the evolution of microbial sequencing and downstream processing at the JGI from draft genome sequences with large gaps corresponding to missing genes of significant biological role to assemblies with multiple small gaps (Illumina) and finally to assemblies that generate almost complete genomes (Illumina+PacBio).

138 citations


Journal ArticleDOI
TL;DR: The REvolutionary Approaches and Devices for Nucleic Acid analysis (READNA) consortium funded by the European Commission under FP7 has made great contributions to the development of new nucleic acid analysis methodology.

73 citations


Journal ArticleDOI
TL;DR: A historical perspective on human genome sequencing is provided, current and future sequencing technologies are summarized, issues related to data management and interpretation are highlighted, and research and clinical applications of high-throughput sequencing are considered, with specific emphasis on cardiovascular disease.
Abstract: We are in the midst of a time of great change in genetics that may dramatically impact human biology and medicine. The completion of the human genome project, the development of low cost, high-throughput parallel sequencing technology, and large-scale studies of genetic variation have provided a rich set of techniques and data for the study of genetic disease risk, treatment response, population diversity, and human evolution. Newly-developed sequencing instruments now generate hundreds of millions to billions of short sequences per run, allowing for rapid complete sequencing of human genomes. These technological advances have facilitated a precipitous drop (Figure 1) in the cost per base pair of DNA sequenced. To capitalize on the potential of these technologies for research and clinical applications, translational scientists and clinicians must become familiar with a continuously evolving field. In this review we will provide a historical perspective on human genome sequencing, summarize current and future sequencing technologies, highlight issues related to data management and interpretation, and finally consider research and clinical applications of high-throughput sequencing, with specific emphasis on cardiovascular disease. Figure 1: Sequencing milestones, costs, and output since completion of the human genome project. Note the logarithmic scale for sequencing costs and bases produced per sequence run.

73 citations


Book ChapterDOI
TL;DR: Methods and software algorithms that have been developed to detect SVs and copy number changes using massively parallel sequencing data are presented and visualization and de novo assembly strategies for characterizing SV breakpoints and removing false positives are described.
Abstract: The emergence of next-generation sequencing (NGS) technologies offers an incredible opportunity to comprehensively study DNA sequence variation in human genomes. Commercially available platforms from Roche (454), Illumina (Genome Analyzer and HiSeq 2000), and Applied Biosystems (SOLiD) have the capability to completely sequence individual genomes to high levels of coverage. NGS data are particularly advantageous for the study of structural variation (SV) because they offer the sensitivity to detect variants of various sizes and types, as well as the precision to characterize their breakpoints at base pair resolution. In this chapter, we present methods and software algorithms that have been developed to detect SVs and copy number changes using massively parallel sequencing data. We describe visualization and de novo assembly strategies for characterizing SV breakpoints and removing false positives.

53 citations


01 Jan 2012
TL;DR: Key features regarding different aspects of pyrosequencing technology are considered, including the general principles, enzyme properties, sequencing modes, instrumentation, limitations, potential and future applications.
Abstract: Pyrosequencing is the first alternative to the conventional Sanger method for de novo DNA sequencing. Pyrosequencing is a DNA sequencing technology based on the sequencing-by-synthesis principle. It employs a series of four enzymes to accurately detect nucleic acid sequences during the synthesis. Pyrosequencing has the potential advantages of accuracy, flexibility and parallel processing, and can be easily automated. Furthermore, the technique dispenses with the need for labeled primers, labeled nucleotides, and gel electrophoresis. Pyrosequencing has opened up new possibilities for performing sequence-based DNA analysis. The method has been proven highly suitable for single nucleotide polymorphism analysis and sequencing of short stretches of DNA. Pyrosequencing has been successful for both confirmatory sequencing and de novo sequencing. By increasing the read length and by shortening the sequencing reaction time per base call, pyrosequencing may take over many broad areas of DNA sequencing applications, as the trend is directed to analysis of smaller amounts of specimen and large-scale settings, with higher throughput and lower cost. This article considers key features regarding different aspects of pyrosequencing technology, including the general principles, enzyme properties, sequencing modes, instrumentation, limitations, potential and future applications.

37 citations


Journal ArticleDOI
05 Nov 2012-PLOS ONE
TL;DR: This work presents a cost-effective strategy for simplified library preparation compatible with both whole-genome and targeted sequencing experiments, together with a two-tagging strategy that allows for multiplex sequencing of targeted regions.
Abstract: In recent years, the rapid development of sequencing technologies and a competitive market have enabled researchers to perform massive sequencing projects at a reasonable cost. As the price for the actual sequencing reactions drops, enabling more samples to be sequenced, the relative price for preparing libraries gets larger and the practical laboratory work becomes complex and tedious. We present a cost-effective strategy for simplified library preparation compatible with both whole-genome and targeted sequencing experiments. An optimized enzyme composition and reaction buffer reduce the number of required clean-up steps and allow for the use of bulk enzymes, which makes the whole process cheap, efficient and simple. We also present a two-tagging strategy, which allows for multiplex sequencing of targeted regions. To prove our concept, we have prepared libraries for low-pass sequencing from 100 ng DNA, and performed 2-, 4- and 8-plex exome capture and a 96-plex capture of a 500 kb region. In all samples we see a high concordance (>99.4%) of SNP calls when comparing to commercially available SNP-chip platforms.

35 citations


Book
01 Jan 2012
TL;DR: This research presents a novel approach called “Tag-based next generation sequencing” that addresses the challenge of “what’s next” in the next generation of DNA sequencing.
Abstract: Tag-based next generation sequencing (catalog record, Jundishapur Ahvaz Digital Library)

19 citations


Patent
01 Oct 2012
TL;DR: The patent presents methods and systems for electronic DNA sequencing, single-molecule DNA sequencing, and combinations of the two, providing low-cost and convenient sequencing.
Abstract: The present invention provides for methods and systems for Electronic DNA sequencing, single molecule DNA sequencing, and combinations of the above, providing low cost and convenient sequencing.

18 citations



Journal ArticleDOI
TL;DR: The proposed ParticleCall provides more accurate calls than Illumina's base-calling algorithm, Bustard, and is significantly more computationally efficient than other recent schemes with similar performance, rendering it more feasible for high-throughput sequencing data analysis.
Abstract: Background: Next-generation sequencing systems are capable of rapid and cost-effective DNA sequencing, thus enabling routine sequencing tasks and taking us one step closer to personalized medicine. Accuracy and lengths of their reads, however, are yet to surpass those provided by the conventional Sanger sequencing method. This motivates the search for computationally efficient algorithms capable of reliable and accurate detection of the order of nucleotides in short DNA fragments from the acquired data.

Proceedings ArticleDOI
01 Jul 2012
TL;DR: By drawing an analogy between the DNA sequencing problem and the classic communication problem, an information theoretic notion of sequencing capacity is defined, which is the maximum number of DNA base pairs that can be resolved reliably per read.
Abstract: DNA sequencing is the basic workhorse of modern day biology and medicine. Shotgun sequencing is the dominant technique used: many randomly located short fragments called reads are extracted from the DNA sequence, and these reads are assembled to reconstruct the original sequence. By drawing an analogy between the DNA sequencing problem and the classic communication problem, we define an information theoretic notion of sequencing capacity. This is the maximum number of DNA base pairs that can be resolved reliably per read, and provides a fundamental limit to the performance that can be achieved by any assembly algorithm. We compute the sequencing capacity explicitly for a simple statistical model of the DNA sequence and the read process.

Journal ArticleDOI
15 Feb 2012-Gene
TL;DR: This study demonstrates the power of a low-depth pyrosequencing strategy, which provides a cost-effective way of sequencing whole prokaryote genomes in a short time and enables further studies in microbial population diversity and comparative genomics.

Patent
27 Jan 2012
TL;DR: The patent describes a paired-end sequencing method that enables the sequencing of unique read pairs by co-localizing both 5' ends on a single emulsion polymerase chain reaction bead.
Abstract: The present invention is related to genomic nucleotide sequencing. In particular, the invention describes a paired end sequencing method that enables the sequencing of unique read pairs by co-localizing both 5' ends on a single emulsion polymerase chain reaction bead. The method may use a customized forked adaptor primer pair that is compatible with massively parallel sequencing techniques. The compositions and methods disclosed herein contemplate sequencing complex genomes, amplified genomic regions, as well as detecting chromosomal structural rearrangements.

Proceedings ArticleDOI
01 Nov 2012
TL;DR: This paper extends the approach to utilize additional information from reference genetic variation datasets which provide the correlation structure between genetic variants to significantly increase the efficiency of overlapping pool sequencing.
Abstract: Next generation sequencing technologies are rapidly decreasing the cost of obtaining genetic information. The cost for utilizing one of these technologies consists of a sample preparation step and a sequencing step of the prepared sample. The dramatic increase in the efficiency of the sequencing technology makes the costs of the sequencing step negligible for small target regions. Thus the main remaining cost is the sample preparation step. Using overlapping sequencing pools where samples are mixed together into pools which are prepared and sequenced together has been shown to reduce the cost significantly for collecting information on genetic variants which only occur in a few of the samples. These methods utilize ideas from compressed sensing. In this paper, we extend this approach to utilize additional information from reference genetic variation datasets which provide the correlation structure between genetic variants. Utilizing this information, we can significantly increase the efficiency of overlapping pool sequencing.
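The overlapping-pool idea can be illustrated with a much simpler scheme than the compressed-sensing approach the abstract refers to. The sketch below is a plain binary-code group-testing design, not the paper's method: it assumes at most one carrier of the variant, error-free detection in each pool, and hypothetical function names (`design_pools`, `sequence_pools`, `decode_single_carrier`).

```python
def design_pools(n_samples):
    """Assign sample i (0-based) to pool j when bit j of (i + 1) is set.

    Using the code i + 1 guarantees that every sample lands in at least one
    pool and that every sample has a distinct set of pools.
    """
    n_pools = n_samples.bit_length()
    return [
        [i for i in range(n_samples) if ((i + 1) >> j) & 1]
        for j in range(n_pools)
    ]


def sequence_pools(pools, carriers):
    """Idealized readout: a pool is positive if it contains any carrier."""
    return [any(sample in carriers for sample in pool) for pool in pools]


def decode_single_carrier(positive):
    """Recover the carrier index from the pattern of positive pools.

    Valid only under the assumption of at most one carrier; returns None
    when no pool is positive.
    """
    code = sum(1 << j for j, is_positive in enumerate(positive) if is_positive)
    return None if code == 0 else code - 1
```

Under these assumptions, 96 samples need only 7 prepared pools rather than 96 individual libraries, which is the kind of saving in sample preparation the abstract describes. Handling several carriers or noisy readouts requires the richer designs and decoding the paper builds on.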

Journal Article
TL;DR: High-throughput next-generation sequencing (HT-NGS) technologies, currently the hottest topic in human and animal genomics research, can produce over 100 times more data than the most sophisticated capillary sequencers based on the Sanger method.
Abstract: DNA sequencing is one of the most important platforms for the study of biological systems today. High-throughput next-generation sequencing (HT-NGS) technologies are currently the hottest topic in human and animal genomics research; they can produce over 100 times more data than the most sophisticated capillary sequencers based on the Sanger method. The new generation of sequencing technologies from Illumina/Solexa, ABI/SOLiD, 454/Roche, and Helicos has provided unprecedented opportunities for high-throughput functional genomic research. Next-generation sequencing technologies offer novel and rapid ways for genome-wide characterization and profiling of mRNAs, small RNAs, transcription factor regions, chromatin structure and DNA methylation patterns, as well as for microbiology and metagenomics. However, these methods have lower accuracy and shorter read lengths than traditional Sanger dideoxy sequencing, the gold standard. An astounding potential exists for these technologies to bring enormous change in genetic and biological research and to enhance our fundamental biological knowledge.

Patent
09 Nov 2012
TL;DR: The patent combines a multi-dimensional matrix (with at least three dimensions) with high-throughput sequencing technologies to identify and recover the genomic locations of each insert in thousands of transgenic plants simultaneously.
Abstract: Methods and systems for combining a multi-dimensional matrix (with at least three dimensions) and high-throughput sequencing technologies to identify/recover genomic locations of each insert in thousands of transgenic plants simultaneously. In some embodiments, multiplex sequencing is carried out and sequencing data are imported in parallel into a sequence database for display in the multi-dimensional matrix.

Journal ArticleDOI
20 Dec 2012-PLOS ONE
TL;DR: A universal and cost-effective method for sequencing the ultra-long paired-ends of genomic libraries with parallel pyrosequencing is provided, using a Chinese amphioxus (Branchiostoma belcheri) BAC library as an example.
Abstract: Second generation sequencing has been widely used to sequence whole genomes. Though various paired-end sequencing methods have been developed to construct long scaffolds from contigs derived from shotgun sequencing, the classical paired-end sequencing of Bacterial Artificial Chromosome (BAC) or fosmid libraries by the Sanger method still plays an important role in genome assembly. However, sequencing libraries with the Sanger method is expensive and time-consuming. Here we report a new strategy to sequence the paired-ends of genomic libraries with parallel pyrosequencing, using a Chinese amphioxus (Branchiostoma belcheri) BAC library as an example. In total, approximately 12,670 non-redundant paired-end sequences were generated. Mapping them to the primary scaffolds of Chinese amphioxus, we obtained 413 ultra-scaffolds from 1,182 primary scaffolds, and the N50 scaffold length was increased by approximately 55 kb, which is about a 10% improvement. We provide a universal and cost-effective method for sequencing the ultra-long paired-ends of genomic libraries. This method can be very easily implemented in other second generation sequencing platforms.
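For readers unfamiliar with the N50 statistic reported above, the following minimal sketch computes it under the usual definition: the length L such that scaffolds of length at least L together contain at least half of all assembled bases. The function name `n50` and the example lengths are illustrative and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    Scaffolds are visited from longest to shortest; the N50 is the length
    of the scaffold at which the running total first reaches half of the
    total assembled length.
    """
    if not lengths:
        raise ValueError("need at least one scaffold length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
```

Joining scaffolds with paired-end links raises the N50 without adding sequence: with the made-up lengths `[30, 30, 20, 20]` the N50 is 30, and if the two 20-unit scaffolds are linked to one of the 30-unit scaffolds to give `[60, 20, 20]`, the N50 becomes 60.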

01 Jan 2012
TL;DR: RACER (Rapid Accurate Correction of Errors in Reads) is an error correction program that targets the Illumina genome sequencer, currently the dominant NGS technology; it is implemented in C++ and uses OpenMP for parallelization.
Abstract: Motivation: High throughput Next Generation Sequencing (NGS) technologies can sequence the genome of a species quickly and cheaply. Errors that are introduced by NGS technologies limit the full potential of the applications that rely on their data. Current techniques used to correct these errors are not sufficient due to issues with time, space, or accuracy. A more efficient and accurate program is needed to correct errors from NGS technologies. Results: We have designed and implemented RACER (Rapid Accurate Correction of Errors in Reads), an error correction program that targets the Illumina genome sequencer, which is currently the dominant NGS technology. RACER combines advanced data structures with an intricate analysis of data to achieve high performance. It has been implemented in C++ and OpenMP for parallelization. We have performed extensive testing on a variety of real data sets to compare RACER with the current leading programs. RACER performs better than all the current technologies in time, space, and accuracy. RACER corrects up to twice as many errors as other parallel programs, while being one order of magnitude faster. We hope RACER will become a very useful tool for many applications that use NGS data.

01 Jan 2012
TL;DR: A new generation of sequencers based on ‘next-next’ or third-generation sequencing (TGS) technologies, such as the Single-Molecule Real-Time (SMRT™) Sequencer, the HeliScope™ Single Molecule Sequencer, and the Ion Personal Genome Machine™, is becoming available; these instruments are capable of generating longer sequence reads in a shorter time and at even lower costs per instrument run.
Abstract: A number of next-generation sequencing (NGS) technologies such as Roche/454, Illumina and AB SOLiD have recently become available. These technologies are capable of generating hundreds of thousands or tens of millions of short DNA sequence reads at a relatively low cost. These NGS technologies, now referred to as second-generation sequencing (SGS) technologies, are being utilized for de novo sequencing, genome re-sequencing, and whole genome and transcriptome analysis. Now, a new generation of sequencers based on ‘next-next’ or third-generation sequencing (TGS) technologies, such as the Single-Molecule Real-Time (SMRT™) Sequencer, the HeliScope™ Single Molecule Sequencer, and the Ion Personal Genome Machine™, is becoming available; these instruments are capable of generating longer sequence reads in a shorter time and at even lower costs per instrument run. Ever declining sequencing costs and increased data output and sample throughput for NGS and TGS sequencing technologies enable the plant genomics and breeding community to undertake genotyping-by-sequencing (GBS). Data analysis, storage and management, however, are essential for large-scale SGS or TGS projects. This article provides an overview of different sequencing technologies with an emphasis on forthcoming TGS technologies and bioinformatics tools required for the latest evolution of DNA sequencing platforms.