Showing papers on "Hybrid genome assembly" published in 2008


Journal ArticleDOI
TL;DR: Next-generation DNA sequencing has the potential to dramatically accelerate biological and biomedical research, by enabling the comprehensive analysis of genomes, transcriptomes and interactomes to become inexpensive, routine and widespread, rather than requiring significant production-scale efforts.
Abstract: DNA sequence represents a single format onto which a broad range of biological phenomena can be projected for high-throughput data collection. Over the past three years, massively parallel DNA sequencing platforms have become widely available, reducing the cost of DNA sequencing by over two orders of magnitude, and democratizing the field by putting the sequencing capacity of a major genome center in the hands of individual investigators. These new technologies are rapidly evolving, and near-term challenges include the development of robust protocols for generating sequencing libraries, building effective new approaches to data-analysis, and often a rethinking of experimental design. Next-generation DNA sequencing has the potential to dramatically accelerate biological and biomedical research, by enabling the comprehensive analysis of genomes, transcriptomes and interactomes to become inexpensive, routine and widespread, rather than requiring significant production-scale efforts.

4,458 citations


Journal ArticleDOI
06 Nov 2008-Nature
TL;DR: An approach that generates several billion bases of accurate nucleotide sequence per experiment at low cost is reported, effective for accurate, rapid and economical whole-genome re-sequencing and many other biomedical applications.
Abstract: DNA sequence information underpins genetic research, enabling discoveries of important biological or medical benefit. Sequencing projects have traditionally used long (400-800 base pair) reads, but the existence of reference sequences for the human and many other genomes makes it possible to develop new, fast approaches to re-sequencing, whereby shorter reads are compared to a reference to identify intraspecies genetic variation. Here we report an approach that generates several billion bases of accurate nucleotide sequence per experiment at low cost. Single molecules of DNA are attached to a flat surface, amplified in situ and used as templates for synthetic sequencing with fluorescent reversible terminator deoxyribonucleotides. Images of the surface are analysed to generate high-quality sequence. We demonstrate application of this approach to human genome sequencing on flow-sorted X chromosomes and then scale the approach to determine the genome sequence of a male Yoruba from Ibadan, Nigeria. We build an accurate consensus sequence from >30x average depth of paired 35-base reads. We characterize four million single-nucleotide polymorphisms and four hundred thousand structural variants, many of which were previously unknown. Our approach is effective for accurate, rapid and economical whole-genome re-sequencing and many other biomedical applications.

3,802 citations


Journal ArticleDOI
TL;DR: An astounding potential exists for next-generation DNA sequencing technologies to bring enormous change in genetic and biological research and to enhance the authors' fundamental biological knowledge.
Abstract: Recent scientific discoveries that resulted from the application of next-generation DNA sequencing technologies highlight the striking impact of these massively parallel platforms on genetics. These new methods have expanded previously focused readouts from a variety of DNA preparation protocols to a genome-wide scale and have fine-tuned their resolution to single base precision. The sequencing of RNA also has transitioned and now includes full-length cDNA analyses, serial analysis of gene expression (SAGE)-based methods, and noncoding RNA discovery. Next-generation sequencing has also enabled novel applications such as the sequencing of ancient DNA samples, and has substantially widened the scope of metagenomic analysis of environmentally derived samples. Taken together, an astounding potential exists for these technologies to bring enormous change in genetic and biological research and to enhance our fundamental biological knowledge.

2,354 citations


Journal ArticleDOI
17 Apr 2008-Nature
TL;DR: This sequence, the first genome sequenced by next-generation technologies, was completed in two months at approximately one-hundredth of the cost of traditional capillary electrophoresis methods, and it yielded novel human sequence, including novel genes not previously identified by traditional genomic sequencing.
Abstract: Next-generation sequencing technologies are revolutionizing human genomics, promising to yield draft genomes cheaply and quickly. One such technology has now been used to analyse much of the genetic code of a single individual — who happens to be James D. Watson. The procedure, which involves no cloning of the genomic DNA, makes use of the latest 454 parallel sequencing instrument. The sequence cost less than US$1 million (and a mere two months) to produce, compared to the approximately US$100 million reported for sequencing Craig Venter's genome by traditional methods. Still a major undertaking, but another step towards the goal of 'personalized genomes' and 'personalized medicine'. The DNA sequence of a diploid genome of a single individual, James D. Watson, sequenced to 7.4-fold redundancy in two months using massively parallel sequencing in picolitre-size reaction vessels is reported. The association of genetic variation with disease and drug response, and improvements in nucleic acid technologies, have given great optimism for the impact of ‘genomic medicine’. However, the formidable size of the diploid human genome1, approximately 6 gigabases, has prevented the routine application of sequencing methods to deciphering complete individual human genomes. To realize the full potential of genomics for human health, this limitation must be overcome. Here we report the DNA sequence of a diploid genome of a single individual, James D. Watson, sequenced to 7.4-fold redundancy in two months using massively parallel sequencing in picolitre-size reaction vessels. This sequence was completed in two months at approximately one-hundredth of the cost of traditional capillary electrophoresis methods. Comparison of the sequence to the reference genome led to the identification of 3.3 million single nucleotide polymorphisms, of which 10,654 cause amino-acid substitution within the coding sequence. 
In addition, we accurately identified small-scale (2–40,000 base pair (bp)) insertion and deletion polymorphism as well as copy number variation resulting in the large-scale gain and loss of chromosomal segments ranging from 26,000 to 1.5 million base pairs. Overall, these results agree well with recent results of sequencing of a single individual2 by traditional methods. However, in addition to being faster and significantly less expensive, this sequencing technology avoids the arbitrary loss of genomic sequences inherent in random shotgun sequencing by bacterial cloning because it amplifies DNA in a cell-free system. As a result, we further demonstrate the acquisition of novel human sequence, including novel genes not previously identified by traditional genomic sequencing. This is the first genome sequenced by next-generation technologies. Therefore it is a pilot for the future challenges of ‘personalized genome sequencing’.

1,879 citations


Journal ArticleDOI
TL;DR: A general method for genome assembly that can be applied to all types of DNA sequence data, not only short read data, but also conventional sequence reads is described.
Abstract: New DNA sequencing technologies deliver data at dramatically lower costs but demand new analytical methods to take full advantage of the very short reads that they produce. We provide an initial, theoretical solution to the challenge of de novo assembly from whole-genome shotgun “microreads.” For 11 genomes of sizes up to 39 Mb, we generated high-quality assemblies from 80× coverage by paired 30-base simulated reads modeled after real Illumina-Solexa reads. The bacterial genomes of Campylobacter jejuni and Escherichia coli assemble optimally, yielding single perfect contigs, and larger genomes yield assemblies that are highly connected and accurate. Assemblies are presented in a graph form that retains intrinsic ambiguities such as those arising from polymorphism, thereby providing information that has been absent from previous genome assemblies. For both C. jejuni and E. coli, this assembly graph is a single edge encompassing the entire genome. Larger genomes produce more complicated graphs, but the vast majority of the bases in their assemblies are present in long edges that are nearly always perfect. We describe a general method for genome assembly that can be applied to all types of DNA sequence data, not only short read data, but also conventional sequence reads.

880 citations


Journal ArticleDOI
TL;DR: This study proposes de novo assembly software for the Illumina sequencing platform, which produces millions of very short reads that are 35 bases in length; the software generates a set of accurate contigs of several kilobases that cover most of the bacterial genome.
Abstract: Novel high-throughput DNA sequencing technologies allow researchers to characterize a bacterial genome during a single experiment and at a moderate cost. However, the increase in sequencing throughput that is allowed by using such platforms is obtained at the expense of individual sequence read length, which must be assembled into longer contigs to be exploitable. This study focuses on the Illumina sequencing platform that produces millions of very short sequences that are 35 bases in length. We propose a de novo assembler software that is dedicated to process such data. Based on a classical overlap graph representation and on the detection of potentially spurious reads, our software generates a set of accurate contigs of several kilobases that cover most of the bacterial genome. The assembly results were validated by comparing data sets that were obtained experimentally for Staphylococcus aureus strain MW2 and Helicobacter acinonychis strain Sheeba with that of their published genomes acquired by conventional sequencing of 1.5- to 3.0-kb fragments. We also provide indications that the broad coverage achieved by high-throughput sequencing might allow for the detection of clonal polymorphisms in the set of DNA molecules being sequenced.

642 citations


Journal ArticleDOI
TL;DR: The 454 Sequencer has dramatically increased the volume of sequencing conducted by the scientific community and expanded the range of problems that can be addressed by the direct readouts of DNA sequence, leading to a better understanding of the structure of the human genome and opening up new approaches to identify small RNAs.
Abstract: The 454 Sequencer has dramatically increased the volume of sequencing conducted by the scientific community and expanded the range of problems that can be addressed by the direct readouts of DNA sequence. Key breakthroughs in the development of the 454 sequencing platform included higher throughput, simplified all in vitro sample preparation and the miniaturization of sequencing chemistries, enabling massively parallel sequencing reactions to be carried out at a scale and cost not previously possible. Together with other recently released next-generation technologies, the 454 platform has started to democratize sequencing, providing individual laboratories with access to capacities that rival those previously found only at a handful of large sequencing centers. Over the past 18 months, 454 sequencing has led to a better understanding of the structure of the human genome, allowed the first non-Sanger sequence of an individual human and opened up new approaches to identify small RNAs. To make next-generation technologies more widely accessible, they must become easier to use and less costly. In the longer term, the principles established by 454 sequencing might reduce cost further, potentially enabling personalized genomics.

568 citations


Journal ArticleDOI
TL;DR: A new Eulerian assembler is presented that generates nearly optimal short read assemblies of bacterial genomes and an approach to assemble reads in the case of the popular hybrid protocol when short and long Sanger-based reads are combined.
Abstract: In the last year, high-throughput sequencing technologies have progressed from proof-of-concept to production quality. While these methods produce high-quality reads, they have yet to produce reads comparable in length to Sanger-based sequencing. Current fragment assembly algorithms have been implemented and optimized for mate-paired Sanger-based reads, and thus do not perform well on short reads produced by short read technologies. We present a new Eulerian assembler that generates nearly optimal short read assemblies of bacterial genomes and describe an approach to assemble reads in the case of the popular hybrid protocol when short and long Sanger-based reads are combined.

480 citations
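
The Eulerian approach named in this abstract can be illustrated with a toy example. The sketch below is not the authors' assembler: it is a minimal, error-free illustration that builds a de Bruijn graph from read k-mers and spells a sequence from an Eulerian path (Hierholzer's algorithm). The function names, the choice of k, and the toy genome are assumptions made for illustration, and real data would also need error correction, reverse complements, and repeat handling.

```python
from collections import defaultdict


def read_kmers(reads, k):
    """Collect the distinct k-mers in the reads; each k-mer is one graph edge."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return kmers


def eulerian_path(kmers):
    """Return the node sequence of an Eulerian path, assuming one exists."""
    graph = defaultdict(list)
    indeg = defaultdict(int)
    outdeg = defaultdict(int)
    for kmer in sorted(kmers):
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)
        outdeg[prefix] += 1
        indeg[suffix] += 1
    nodes = set(indeg) | set(outdeg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (n for n in sorted(nodes) if outdeg[n] - indeg[n] == 1), min(nodes)
    )
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path


def spell(path):
    """Spell the sequence: first node plus the last base of each later node."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Hypothetical toy genome, sampled as every overlapping 5-base read.
genome = "ATGGCGTGCA"
reads = [genome[i:i + 5] for i in range(len(genome) - 4)]
assembled = spell(eulerian_path(read_kmers(reads, k=4)))
print(assembled)
```

The toy genome was chosen so that every 3-base node is unique, which makes the Eulerian path unambiguous; repeats longer than k - 1 would create branches that this sketch does not resolve.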


Journal ArticleDOI
TL;DR: This study sequenced a Caenorhabditis elegans N2 Bristol strain isolate and compared the reads to the reference genome to characterize the data and to evaluate coverage and representation, demonstrating the utility of massively parallel short read sequencing for whole genome resequencing and for accurate discovery of genome-wide polymorphisms.
Abstract: Massively parallel sequencing instruments enable rapid and inexpensive DNA sequence data production. Because these instruments are new, their data require characterization with respect to accuracy and utility. To address this, we sequenced a Caenorhabditis elegans N2 Bristol strain isolate using the Solexa Sequence Analyzer, and compared the reads to the reference genome to characterize the data and to evaluate coverage and representation. Massively parallel sequencing facilitates strain-to-reference comparison for genome-wide sequence variant discovery. Owing to the short-read-length sequences produced, we developed a revised approach to determine the regions of the genome to which short reads could be uniquely mapped. We then aligned Solexa reads from C. elegans strain CB4858 to the reference, and screened for single-nucleotide polymorphisms (SNPs) and small indels. This study demonstrates the utility of massively parallel short read sequencing for whole genome resequencing and for accurate discovery of genome-wide polymorphisms.

447 citations


Journal ArticleDOI
TL;DR: The RMAP tool, which can map reads having a wide range of lengths and allows base-call quality scores to determine which positions in each read are more important when mapping, indicates that significant gains in Solexa read mapping performance can be achieved by considering the information in 3' ends of longer reads, and appropriately using the base-call quality scores.
Abstract: Background: Second-generation sequencing has the potential to revolutionize genomics and impact all areas of biomedical science. New technologies will make re-sequencing widely available for such applications as identifying genome variations or interrogating the oligonucleotide content of a large sample (e.g. ChIP-sequencing). The increase in speed, sensitivity and availability of sequencing technology brings demand for advances in computational technology to perform associated analysis tasks. The Solexa/Illumina 1G sequencer can produce tens of millions of reads, ranging in length from ~25–50 nt, in a single experiment. Accurately mapping the reads back to a reference genome is a critical task in almost all applications. Two sources of information that are often ignored when mapping reads from the Solexa technology are the 3' ends of longer reads, which contain a much higher frequency of sequencing errors, and the base-call quality scores.

385 citations


Journal ArticleDOI
TL;DR: The Genome Sequencer FLX System (GS FLX), powered by 454 Sequencing, is a next-generation DNA sequencing technology featuring a unique mix of long reads, exceptional accuracy, and ultra-high throughput.

Journal ArticleDOI
TL;DR: This work compares and contrasts new sequencing platforms in terms of stage of development, instrument configuration, template format, sequencing chemistry, throughput capability, operating cost, data handling issues, and error models, and notes that paired-end sequencing and array-based direct selection extend the utility of these platforms for genome analysis.
Abstract: DNA sequencing is in a period of rapid change, in which capillary sequencing is no longer the technology of choice for most ultra-high-throughput applications. A new generation of instruments that utilize primed synthesis in flow cells to obtain, simultaneously, the sequence of millions of different DNA templates has changed the field. We compare and contrast these new sequencing platforms in terms of stage of development, instrument configuration, template format, sequencing chemistry, throughput capability, operating cost, data handling issues, and error models. While these platforms outperform capillary instruments in terms of bases per day and cost per base, the short length of sequence reads obtained from most instruments and the limited number of samples that can be run simultaneously imposes some practical constraints on sequencing applications. However, recently developed methods for paired-end sequencing and for array-based direct selection of desired templates from complex mixtures extend the utility of these platforms for genome analysis. Given the ever increasing demand for DNA sequence information, we can expect continuous improvement of this new generation of instruments and their eventual replacement by even more powerful technology.

Journal ArticleDOI
TL;DR: A framework for how full sensitivity mapping can be done in the most efficient way, via spaced seeds is presented, and software called ZOOM is developed, which is able to map the Illumina/Solexa reads of 15x coverage of a human genome to the reference human genome in one CPU-day, allowing two mismatches, at full sensitivity.
Abstract: Motivation: The next generation sequencing technologies are generating billions of short reads daily. Resequencing and personalized medicine need much faster software to map these deep sequencing reads to a reference genome, to identify SNPs or rare transcripts. Results: We present a framework for how full sensitivity mapping can be done in the most efficient way, via spaced seeds. Using the framework, we have developed software called ZOOM, which is able to map the Illumina/Solexa reads of 15× coverage of a human genome to the reference human genome in one CPU-day, allowing two mismatches, at full sensitivity. Availability: ZOOM is freely available to non-commercial users at http://www.bioinfor.com/zoom
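
To make the spaced-seed idea concrete, here is a minimal sketch, not ZOOM's implementation. It indexes a reference with a single spaced seed, where "1" marks a position that must match and "0" a position that is ignored, and then looks up candidate mapping positions for a read. The seed pattern, function names, and toy sequences are assumptions made for illustration; guaranteeing full sensitivity for a given mismatch budget requires a carefully designed set of seeds, which this sketch does not attempt.

```python
from collections import defaultdict


def spaced_key(seq, start, seed):
    """Extract the bases at the '1' positions of the seed, starting at `start`."""
    return "".join(seq[start + i] for i, bit in enumerate(seed) if bit == "1")


def build_index(reference, seed):
    """Map each spaced key to the reference positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - len(seed) + 1):
        index[spaced_key(reference, pos, seed)].append(pos)
    return index


def candidate_positions(read, index, seed):
    """Return candidate reference start positions for the read, sorted."""
    hits = set()
    for offset in range(len(read) - len(seed) + 1):
        key = spaced_key(read, offset, seed)
        for ref_pos in index.get(key, []):
            if ref_pos - offset >= 0:
                hits.add(ref_pos - offset)
    return sorted(hits)


# Hypothetical toy data: the read is reference[4:14] with one substitution
# at read position 2, which falls on a "0" (ignored) seed position.
reference = "ACGTTGCAAGCTTAGC"
seed = "1101101"
read = "TGGAAGCTTA"
index = build_index(reference, seed)
candidates = candidate_positions(read, index, seed)
print(candidates)
```

Each candidate would still need to be verified by comparing the full read against the reference at that position and counting mismatches.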

Journal ArticleDOI
TL;DR: Alta-Cyclic substantially improved the number of accurate reads for sequencing runs up to 78 bases and reduced systematic biases, facilitating confident identification of sequence variants.
Abstract: Next-generation sequencing is limited to short read lengths and by high error rates. We systematically analyzed sources of noise in the Illumina Genome Analyzer that contribute to these high error rates and developed a base caller, Alta-Cyclic, that uses machine learning to compensate for noise factors. Alta-Cyclic substantially improved the number of accurate reads for sequencing runs up to 78 bases and reduced systematic biases, facilitating confident identification of sequence variants.

Journal ArticleDOI
TL;DR: The resulting assemblies contain a single scaffold covering a large fraction of the respective genomes, suggesting that the careful use of optical maps can provide a cost-effective framework for the assembly of genomes.
Abstract: Motivation: New, high-throughput sequencing technologies have made it feasible to cheaply generate vast amounts of sequence information from a genome of interest. The computational reconstruction of the complete sequence of a genome is complicated by specific features of these new sequencing technologies, such as the short length of the sequencing reads and absence of mate-pair information. In this article we propose methods to overcome such limitations by incorporating information from optical restriction maps. Results: We demonstrate the robustness of our methods to sequencing and assembly errors using extensive experiments on simulated datasets. We then present the results obtained by applying our algorithms to data generated from two bacterial genomes Yersinia aldovae and Yersinia kristensenii. The resulting assemblies contain a single scaffold covering a large fraction of the respective genomes, suggesting that the careful use of optical maps can provide a cost-effective framework for the assembly of genomes. Availability: The tools described here are available as an open-source package at ftp://ftp.cbcb.umd.edu/pub/software/soma Contact: mpop@umiacs.umd.edu

Journal ArticleDOI
TL;DR: A new data integration and visualization tool EagleView is introduced to facilitate data analyses, visual validation, and hypothesis generation for genome assembly, polymorphism detection, as well as data visualization.
Abstract: The emergence of high-throughput next-generation sequencing technologies (e.g., 454 Life Sciences [Roche], Illumina sequencing [formerly Solexa sequencing]) has dramatically sped up whole-genome de novo sequencing and resequencing. While the low cost of these sequencing technologies provides an unparalleled opportunity for genome-wide polymorphism discovery, the analysis of the new data types and huge data volume poses formidable informatics challenges for base calling, read alignment and genome assembly, polymorphism detection, as well as data visualization. We introduce a new data integration and visualization tool EagleView to facilitate data analyses, visual validation, and hypothesis generation. EagleView can handle a large genome assembly of millions of reads. It supports a compact assembly view, multiple navigation modes, and a pinpoint view of technology-specific trace information. Moreover, EagleView supports viewing coassembly of mixed-type reads from different technologies and supports integrating genome feature annotations into genome assemblies. EagleView has been used in our own lab and by over 100 research labs worldwide for next-generation sequence analyses. The EagleView software is freely available for not-for-profit use at http://bioinformatics.bc.edu/marthlab/EagleView.

Journal ArticleDOI
TL;DR: An MDR index for barley, which was obtained by whole-genome Illumina/Solexa sequencing, proved as efficient in repeat identification as manual expert annotation.
Abstract: Barley has one of the largest and most complex genomes of all economically important food crops. The rise of new short read sequencing technologies such as Illumina/Solexa permits such large genomes to be effectively sampled at relatively low cost. Based on the corresponding sequence reads a Mathematically Defined Repeat (MDR) index can be generated to map repetitive regions in genomic sequences. We have generated 574 Mbp of Illumina/Solexa sequences from barley total genomic DNA, representing about 10% of a genome equivalent. From these sequences we generated an MDR index which was then used to identify and mark repetitive regions in the barley genome. Comparison of the MDR plots with expert repeat annotation drawing on the information already available for known repetitive elements revealed a significant correspondence between the two methods. MDR-based annotation allowed for the identification of dozens of novel repeat sequences, though, which were not recognised by hand-annotation. The MDR data was also used to identify gene-containing regions by masking of repetitive sequences in eight de-novo sequenced bacterial artificial chromosome (BAC) clones. For half of the identified candidate gene islands indeed gene sequences could be identified. MDR data were only of limited use, when mapped on genomic sequences from the closely related species Triticum monococcum as only a fraction of the repetitive sequences was recognised. An MDR index for barley, which was obtained by whole-genome Illumina/Solexa sequencing, proved as efficient in repeat identification as manual expert annotation. Circumventing the labour-intensive step of producing a specific repeat library for expert annotation, an MDR index provides an elegant and efficient resource for the identification of repetitive and low-copy (i.e. potentially gene-containing sequences) regions in uncharacterised genomic sequences. 
The restriction that a particular MDR index can not be used across species is outweighed by the low costs of Illumina/Solexa sequencing which makes any chosen genome accessible for whole-genome sequence sampling.
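
The idea behind a k-mer frequency index of this kind can be sketched as follows. This is an illustration only, not the authors' MDR software: it counts k-mers in a set of sample reads and then reports, for each position of a query sequence, how often the k-mer starting there was seen. The function names, the value of k, and the toy reads are assumptions.

```python
from collections import Counter


def build_kmer_index(reads, k):
    """Count every k-mer occurrence across the sample reads."""
    index = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            index[read[i:i + k]] += 1
    return index


def repeat_profile(sequence, index, k):
    """For each position, return how often its k-mer occurs in the sample."""
    return [index[sequence[i:i + k]] for i in range(len(sequence) - k + 1)]


# Hypothetical toy reads; high counts suggest repetitive sequence,
# low or zero counts suggest low-copy sequence.
reads = ["ACGACGACG", "TTTACG"]
index = build_kmer_index(reads, k=3)
profile = repeat_profile("ACGTTT", index, k=3)
print(profile)  # [4, 0, 0, 1]
```

Regions could then be masked wherever the profile exceeds a chosen threshold; the appropriate threshold depends on sampling depth and is left open here.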

Journal ArticleDOI
TL;DR: This study demonstrates the feasibility of very short read sequencing for the sequencing of bacterial genomes, particularly those for which a related species has been sequenced previously, and expands the potential application of this new technology to most known prokaryotic species.
Abstract: Recent improvements in technology have made DNA sequencing dramatically faster and more efficient than ever before. The new technologies produce highly accurate sequences, but one drawback is that the most efficient technology produces the shortest read lengths. Short-read sequencing has been applied successfully to resequence the human genome and those of other species but not to whole-genome sequencing of novel organisms. Here we describe the sequencing and assembly of a novel clinical isolate of Pseudomonas aeruginosa, strain PAb1, using very short read technology. From 8,627,900 reads, each 33 nucleotides in length, we assembled the genome into one scaffold of 76 ordered contiguous sequences containing 6,290,005 nucleotides, including one contig spanning 512,638 nucleotides, plus an additional 436 unordered contigs containing 416,897 nucleotides. Our method includes a novel gene-boosting algorithm that uses amino acid sequences from predicted proteins to build a better assembly. This study demonstrates the feasibility of very short read sequencing for the sequencing of bacterial genomes, particularly those for which a related species has been sequenced previously, and expands the potential application of this new technology to most known prokaryotic species.
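
As a rough check on the figures quoted in this abstract, mean sequencing depth can be estimated as (number of reads × read length) / genome size. The sketch below uses the total assembled length (scaffold plus unordered contigs) as a stand-in for the true genome size, which is an assumption; the function name is also illustrative.

```python
def mean_coverage(num_reads, read_length, genome_size):
    """Average depth: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


num_reads = 8_627_900             # reads reported in the abstract
read_length = 33                  # nucleotides per read
assembled = 6_290_005 + 416_897   # scaffold plus unordered contigs

depth = mean_coverage(num_reads, read_length, assembled)
print(f"approximate mean depth: {depth:.1f}x")
```

The estimate ignores reads that were discarded or left unassembled, so it is an upper bound on the depth actually used.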

Journal ArticleDOI
TL;DR: These results reveal the surprisingly powerful ability of microchip electrophoresis to provide ultrafast Sanger sequencing, which will translate to increased system throughput and reduced costs.
Abstract: To realize the immense potential of large-scale genomic sequencing after the completion of the second human genome (Venter's), the costs for the complete sequencing of additional genomes must be dramatically reduced. Among the technologies being developed to reduce sequencing costs, microchip electrophoresis is the only new technology ready to produce the long reads most suitable for the de novo sequencing and assembly of large and complex genomes. Compared with the current paradigm of capillary electrophoresis, microchip systems promise to reduce sequencing costs dramatically by increasing throughput, reducing reagent consumption, and integrating the many steps of the sequencing pipeline onto a single platform. Although capillary-based systems require ≈70 min to deliver ≈650 bases of contiguous sequence, we report sequencing up to 600 bases in just 6.5 min by microchip electrophoresis with a unique polymer matrix/adsorbed polymer wall coating combination. This represents a two-thirds reduction in sequencing time over any previously published chip sequencing result, with comparable read length and sequence quality. We hypothesize that these ultrafast long reads on chips can be achieved because the combined polymer system engenders a recently discovered “hybrid” mechanism of DNA electromigration, in which DNA molecules alternate rapidly between reptating through the intact polymer network and disrupting network entanglements to drag polymers through the solution, similar to dsDNA dynamics we observe in single-molecule DNA imaging studies. Most importantly, these results reveal the surprisingly powerful ability of microchip electrophoresis to provide ultrafast Sanger sequencing, which will translate to increased system throughput and reduced costs.

Journal ArticleDOI
26 Jun 2008-Nature
TL;DR: How advances in DNA-sequencing technology can be harnessed to explore transcriptomes in remarkable detail is described, which has already revolutionized the study of chromatin structure, DNA-binding proteins, DNA methylation, genome organization and small RNAs.
Abstract: Advances in DNA-sequencing technology provide unprecedented insight into the entire collection of four genomes' transcribed sequences; they herald a new era in the study of gene regulation and genome function.

Journal ArticleDOI
15 Aug 2008
TL;DR: A novel approach, called QPALMA, is presented which takes advantage of the read's quality information as well as computational splice site predictions to maximize alignment accuracy and facilitate mapping of massive amounts of sequencing data typically generated by the new technologies.
Abstract: Motivation: Next generation sequencing technologies open exciting new possibilities for genome and transcriptome sequencing. While reads produced by these technologies are relatively short and error prone compared to the Sanger method their throughput is several magnitudes higher. To utilize such reads for transcriptome sequencing and gene structure identification, one needs to be able to accurately align the sequence reads over intron boundaries. This represents a significant challenge given their short length and inherent high error rate. Results: We present a novel approach, called QPALMA, for computing accurate spliced alignments which takes advantage of the read's quality information as well as computational splice site predictions. Our method uses a training set of spliced reads with quality information and known alignments. It uses a large margin approach similar to support vector machines to estimate its parameters to maximize alignment accuracy. In computational experiments, we illustrate that the quality information as well as the splice site predictions help to improve the alignment quality. Finally, to facilitate mapping of massive amounts of sequencing data typically generated by the new technologies, we have combined our method with a fast mapping pipeline based on enhanced suffix arrays. Our algorithms were optimized and tested using reads produced with the Illumina Genome Analyzer for the model plant Arabidopsis thaliana. Availability: Datasets for training and evaluation, additional results and a stand-alone alignment tool implemented in C++ and python are available at http://www.fml.mpg.de/raetsch/projects/qpalma. Contact: Gunnar.Raetsch@tuebingen.mpg.de

Book ChapterDOI
15 Sep 2008
TL;DR: The Weighted Sequence Graph (WSG) representation of all optimal and near optimal alignments between the two reads sampled from a piece of DNA is combined with k-mer filtering methods and spaced seeds to quickly generate candidate locations for the reads on the reference genome.
Abstract: Single Molecule Sequencing technologies such as the Heliscope simplify the preparation of DNA for sequencing, while sampling millions of reads in a day. Simultaneously, the technology suffers from a significantly higher error rate, ameliorated by the ability to sample multiple reads from the same location. In this paper we develop novel rapid alignment algorithms for two-pass Single Molecule Sequencing methods. We combine the Weighted Sequence Graph (WSG) representation of all optimal and near optimal alignments between the two reads sampled from a piece of DNA with k-mer filtering methods and spaced seeds to quickly generate candidate locations for the reads on the reference genome. We also propose a fast implementation of the Smith-Waterman algorithm using vectorized instructions that significantly speeds up the matching process. Our method combines these approaches in order to build an algorithm that is both fast and accurate, since it is able to take complete advantage of both of the reads sampled during two pass sequencing.
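
For reference, the Smith-Waterman recurrence mentioned in this abstract can be written as a plain scalar routine. This is a minimal sketch, not the vectorized implementation the authors describe, and the scoring parameters (match, mismatch, linear gap penalty) are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ACGT", "ACGT"))         # 8
print(smith_waterman_score("TTACGTTT", "GGACGGG"))  # 6
```

Only two rows of the dynamic programming table are kept, which is enough for the score; recovering the alignment itself would require the full table or a traceback structure.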

Book ChapterDOI
30 Mar 2008
TL;DR: A novel network flow-based algorithm is given that accurately estimates the copy counts of repeats in a genome by taking advantage of the high coverage provided by NGS and combines the predicted copy-counts with mate-pair data in order to assemble the reads into contigs.
Abstract: Next Generation Sequencing (NGS) technologies are capable of reading millions of short DNA sequences both quickly and cheaply. While these technologies are already being used for resequencing individuals once a reference genome exists, it has not been shown if it is possible to use them for ab initio genome assembly. In this paper, we give a novel network flow-based algorithm that, by taking advantage of the high coverage provided by NGS, accurately estimates the copy counts of repeats in a genome. We also give a second algorithm that combines the predicted copy-counts with mate-pair data in order to assemble the reads into contigs. We run our algorithms on simulated read data from E. coli and predict copy-counts with extremely high accuracy, while assembling long contigs.

Journal ArticleDOI
TL;DR: The development of the DGS (Ditag Genome Scanning) technique for high-resolution analysis of genome structure is reported, showing that DGS provides a kilobase resolution for studying genome structure with high specificity and high genome coverage.
Abstract: Normal genome variation and pathogenic genome alteration frequently affect small regions in the genome. Identifying those genomic changes remains a technical challenge. We report here the development of the DGS (Ditag Genome Scanning) technique for high-resolution analysis of genome structure. The basic features of DGS include (1) use of high-frequent restriction enzymes to fractionate the genome into small fragments; (2) collection of two tags from two ends of a given DNA fragment to form a ditag to represent the fragment; (3) application of the 454 sequencing system to reach a comprehensive ditag sequence collection; (4) determination of the genome origin of ditags by mapping to reference ditags from known genome sequences; (5) use of ditag sequences directly as the sense and antisense PCR primers to amplify the original DNA fragment. To study the relationship between ditags and genome structure, we performed a computational study by using the human genome reference sequences as a model, and analyzed the ditags experimentally collected from the well-characterized normal human DNA GM15510 and the leukemic human DNA of Kasumi-1 cells. Our studies show that DGS provides a kilobase resolution for studying genome structure with high specificity and high genome coverage. DGS can be applied to validate genome assembly, to compare genome similarity and variation in normal populations, and to identify genomic abnormality including insertion, inversion, deletion, translocation, and amplification in pathological genomes such as cancer genomes.
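Features (1), (2) and (4) of DGS listed above can be sketched in silico. The following Python example is only an illustration under stated assumptions: the cut is placed at the start of each recognition site, fragments shorter than two tags are skipped, and the function names and the `tag_len` parameter are hypothetical rather than taken from the paper.

```python
def digest(genome, site):
    """Split genome at each occurrence of the recognition site.

    Assumption: the cut falls immediately before the site, so every
    fragment after the first begins with the site sequence.
    Returns a list of (start_position, fragment_sequence).
    """
    fragments = []
    start = 0
    pos = genome.find(site, 1)
    while pos != -1:
        fragments.append((start, genome[start:pos]))
        start = pos
        pos = genome.find(site, pos + 1)
    fragments.append((start, genome[start:]))
    return fragments

def reference_ditags(genome, site, tag_len):
    """Build a lookup table from ditag to (fragment start, fragment length).

    A ditag is the first tag_len bases joined to the last tag_len bases
    of a fragment; fragments too short to hold two tags are skipped.
    """
    table = {}
    for start, frag in digest(genome, site):
        if len(frag) >= 2 * tag_len:
            ditag = frag[:tag_len] + frag[-tag_len:]
            table.setdefault(ditag, []).append((start, len(frag)))
    return table
```

An experimentally collected ditag can then be looked up in the table: a single entry gives its genomic origin, several entries mark an ambiguous ditag, and a missing entry flags a fragment that differs from the reference.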

Patent
01 Dec 2008
TL;DR: Methods are described for efficient shotgun sequencing that allow selection and sequencing of nucleic acids of interest contained in a library; the nucleic acids of interest can be defined any time before or after preparation of the library.
Abstract: Methods are provided for efficient shotgun sequencing to allow efficient selection and sequencing of nucleic acids of interest contained in a library. The nucleic acids of interest can be defined any time before or after preparation of the library. One example of nucleic acids of interest is missing or low confidence genome sequences resulting from an initial sequencing procedure. Other nucleic acids of interest include subsets of genomic DNA, RNA or cDNAs (exons, genes, gene sets, transcriptomes). By designing an efficient (simple to implement, speedy, high specificity, low cost) selection procedure, a more complete sequence is achieved with less effort than by using highly redundant shotgun sequencing in an initial sequencing procedure.

Journal ArticleDOI
TL;DR: The Genome Sequencer FLX enables long sequence reads separated by kilobase distances of genomic DNA to enable improved de novo assemblies and genomic structural variation studies.
Abstract: The Genome Sequencer FLX System from Roche and 454 Life Sciences™ is a versatile sequencing platform suitable for a wide range of applications, including de novo sequencing and assembly of genomic DNA, transcriptome sequencing, metagenomics analysis and amplicon sequencing. The Genome Sequencer FLX enables long sequence reads separated by kilobase distances of genomic DNA. These Long-Tag Paired End reads enable improved de novo assemblies and genomic structural variation studies.

Journal ArticleDOI
TL;DR: The ReRep approach for identifying repetitive elements in GSS datasets proved straightforward and efficient, and was further validated by the analysis of a more complete genomic dataset from the EMBL and Sanger Centre databases.
Abstract: Genome survey sequences (GSS) offer a preliminary global view of a genome since, unlike ESTs, they cover coding as well as non-coding DNA and include repetitive regions of the genome. A more precise estimation of the nature, quantity and variability of repetitive sequences very early in a genome sequencing project is of considerable importance, as such data strongly influence the estimation of genome coverage, library quality and progress in scaffold construction. Also, the elimination of repetitive sequences from the initial assembly process is important to avoid errors and unnecessary complexity. Repetitive sequences are also of interest in a variety of other studies, for instance as molecular markers.

Book ChapterDOI
TL;DR: A high-resolution RH map of a genome derived from low pass or survey sequencing (coverage from 1 to 2 times) can provide essentially the same comparative data on gene order that is derived from high-coverage (greater than 7x) genome sequencing.
Abstract: Radiation hybrid (RH) mapping has become one of the most well-established techniques for economically and efficiently navigating genomes of interest. The success of the technique relies on random chromosome breakage of a target genome, which is then captured by recipient cells missing a preselected marker. Selection for hybrid cells that have DNA fragments bearing the marker of choice, plus a random set of DNA fragments from the initial irradiation, generates a set of cell lines that recapitulates the genome of the target organism several-fold. Markers or genes of interest are analyzed by PCR using DNA isolated from each cell line. Statistical tools are applied to determine both the linear order of markers on each chromosome, and the confidence of each placement. The resolution of the resulting map relies on many factors, most notably the degree of breakage from the initial radiation as well as the number of hybrid clones and mean retention value. A high-resolution RH map of a genome derived from low pass or survey sequencing (coverage from 1 to 2 times) can provide essentially the same comparative data on gene order that is derived from high-coverage (greater than 7x) genome sequencing. When combined with fluorescence in situ hybridization, RH maps are complete and ordered blueprints for each chromosome. They give information about the relative order and spacing of genes and markers, and allow investigators to move between target and reference genomes, such as those of mouse or human, with ease, although the approach is not limited to mammalian genomes.

Journal ArticleDOI
TL;DR: An optimized approach for hybrid de novo genome assembly using pyrosequencing data and varying amounts of Sanger-type reads is described and an effective method for identifying genomic differences between reference and sample sequences in whole-genome resequencing procedures also is suggested.
Abstract: During the last four years, the pyrosequencing-based 454 platform has rapidly displaced the traditional Sanger sequencing method due to its high throughput and cost effectiveness. Meanwhile, the Sanger sequencing methodology still provides the longest reads, and paired-end sequencing that is based on that chemistry offers an opportunity to ensure accurate assembly results. In this report, we describe an optimized approach for hybrid de novo genome assembly using pyrosequencing data and varying amounts of Sanger-type reads. 454 platform-derived contigs can be used as single non-breakable virtual reads or converted to simpler contigs that consist of editable, overlapping pseudoreads. These modified contigs maintain their integrity at the first jumpstarting assembly stage and are edited by fragmenting and rejoining. Pre-existing assembly software then can be applied for mixed assembly with 454-derived data and Sanger reads. An effective method for identifying genomic differences between reference and sample sequences in whole-genome resequencing procedures also is suggested. Abbreviations: CelAsm (Celera Assembler). Keywords: hybrid assembly, pyrosequencing, resequencing. The 454 sequencing platform (Roche Applied Science GS 20 or GS FLX), which is based on massively parallel sequence determination by pyrosequencing on clonally amplified genome fragments that are captured on microscopic beads, is becoming more and more popular in genome sequencing applications (Margulies et al., 2005). Its characteristics, which are superior to the traditional Sanger method - such as high production rate with an affordable cost, absence of cloning bias, and ability to go beyond strong secondary structure - enlarge its field of application in genome technology.
Although there are several commercial next-generation sequencing technologies that have become available in recent years (Shendure et al., 2004), 454 pyrosequencing is the only one that can be used for de novo genome sequencing among the high-throughput, short-read sequencing technologies due to its long read length (∼250 bp in GS FLX; announced to be extended to 400 bp by the end of 2008). Many sequencing centers, however, may want to mix a limited amount of traditional Sanger-type sequences, usually generated from fosmid libraries, for scaffolding purposes. Also, a few may want to mix a considerable amount of Sanger read data to 454 pyrosequencing data to produce more accurate results. Among the SFF tools that Roche Applied Science provides for the handling of raw data files, SFFINFO can generate FASTA and quality score files from an SFF file. Although the converted files can be assembled using PHRAP (http://www.phrap.org/), it does not ensure correct assembly because the quality scores that are generated from 454 data are not compatible with those from Sanger reads. Further, PHRAP has problems with handling massive reads (usually hundreds of thousands from an SFF file). A recent report has demonstrated that GS assembler programs (gsAssembler for de novo assembly and gsMapping for reference-guided assembly; http://www.454.com/enabling-technology/the-software.asp) that are supplied by Roche Applied Science are ideal for correct assembly of 454 data that are short and inherently error-rich (Chaisson and Pevzner, 2008). Recent versions (1.1.02.15 and later) of GS assembler programs support mixed assembly with Sanger-type reads, but their performance is not well known at present.
Moreover, because pre-existing assembly software such as PHRAP and CelAsm (Huson et al., 2001) do not directly support data that are produced by 454 machines, 454-derived contigs (GS contigs) should be used as if they were individual reads or be shredded to generate many overlapping 'pseudoreads' (Goldberg et al., 2006). Pseudoreads, made from GS contigs to emulate the read size of standard Sanger data (ca. 600 bp), are virtual reads whose stepping between consecutive
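Shredding a contig into overlapping pseudoreads, as described above, can be sketched in a few lines of Python. This is a hedged illustration, not the procedure used in the paper: the 600 bp read length follows the text, but the 300 bp step is an assumed value (the excerpt breaks off before stating the stepping), and the function name `shred` is hypothetical.

```python
def shred(contig, read_len=600, step=300):
    """Cut a contig into overlapping pseudoreads of read_len bases.

    Consecutive pseudoreads start `step` bases apart (an assumed value).
    A final pseudoread anchored at the contig end is added when needed,
    so the last bases of the contig are always covered.
    Contigs no longer than read_len are returned as a single pseudoread.
    """
    if len(contig) <= read_len:
        return [contig]
    last_start = len(contig) - read_len
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)
    return [contig[s:s + read_len] for s in starts]
```

With these defaults, interior bases are covered by at least two pseudoreads, which gives a downstream overlap-based assembler enough shared sequence to rejoin neighbouring pseudoreads. Quality values for the pseudoreads would need to be generated separately.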

Journal ArticleDOI
TL;DR: This paper describes an informatics pipeline called PABS (Platform Assisted BAC-by-BAC Sequencing), developed to provide a tool to optimize the BAC-by-BAC sequencing strategy.
Abstract: Genome sequencing projects are either based on whole genome shotgun (WGS) or on a BAC-by-BAC strategy. Although WGS is in most cases the preferred choice, sometimes the BAC-by-BAC approach may be better because it requires a much simpler assembly process. Furthermore, when the study is limited to specific regions of the genome, the WGS would require an unjustified effort, making the BAC-by-BAC the only feasible strategy. In this paper we describe an informatics pipeline called PABS (Platform Assisted BAC-by-BAC Sequencing) that we developed to provide a tool to optimize the BAC-by-BAC sequencing strategy. PABS has two main functions: (i) PABS-Select, to choose suitable overlapping clones; and (ii) PABS-Validate, to verify whether a BAC under analysis is actually overlapping the neighboring BAC.