Showing papers on "Hybrid genome assembly" published in 2001


Journal Article
Eric S. Lander, Lauren Linton, Bruce W. Birren, Chad Nusbaum, and 245 more authors (29 institutions)
15 Feb 2001-Nature
TL;DR: The results of an international collaboration to produce and make freely available a draft sequence of the human genome are reported and an initial analysis is presented, describing some of the insights that can be gleaned from the sequence.
Abstract: The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.

22,269 citations


Journal Article
J. Craig Venter, Mark Raymond Adams, Eugene W. Myers, Peter W. Li, and 269 more authors (12 institutions)
16 Feb 2001-Science
TL;DR: Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems.
Abstract: A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies (a whole-genome assembly and a regional chromosome assembly) were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional approximately 12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history.
Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.

12,098 citations
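
The coverage figures quoted in the abstract above can be checked with simple arithmetic. The sketch below is only an illustration: the function name `fold_coverage` is hypothetical, the inputs are the rounded values from the abstract, and adding the two coverages is an approximation, since the shredded public data covered only the regions that had already been sequenced.

```python
def fold_coverage(total_bases: float, genome_size: float) -> float:
    """Average number of sequenced bases per genome position."""
    return total_bases / genome_size


# Rounded figures as quoted in the abstract.
celera_bases = 14.8e9   # bp of Celera sequence reads
genome_size = 2.91e9    # bp of euchromatic consensus sequence
shredded_cov = 2.9      # fold coverage contributed by shredded public data

celera_cov = fold_coverage(celera_bases, genome_size)
combined_cov = celera_cov + shredded_cov  # approximate: assumes uniform overlap

print(f"Celera coverage:   {celera_cov:.2f}-fold")
print(f"Combined coverage: {combined_cov:.1f}-fold")
```

With these rounded inputs the Celera coverage comes out slightly below the quoted 5.11-fold, which is what one would expect if the abstract's value was computed from unrounded totals; the combined value is consistent with the quoted eightfold coverage.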


Journal Article
TL;DR: This work abandons the classical “overlap–layout–consensus” approach in favor of a new EULER algorithm that, for the first time, resolves the 20-year-old “repeat problem” in fragment assembly.
Abstract: For the last 20 years, fragment assembly in DNA sequencing followed the "overlap-layout-consensus" paradigm that is used in all currently available assembly tools. Although this approach proved useful in assembling clones, it faces difficulties in genomic shotgun assembly. We abandon the classical "overlap-layout-consensus" approach in favor of a new EULER algorithm that, for the first time, resolves the 20-year-old "repeat problem" in fragment assembly. Our main result is the reduction of the fragment assembly to a variation of the classical Eulerian path problem that allows one to generate accurate solutions of large-scale sequencing problems. EULER, in contrast to the Celera assembler, does not mask repeats but uses them instead as a powerful fragment assembly tool.

1,408 citations
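
The reduction of fragment assembly to an Eulerian path problem, described in the abstract above, can be illustrated with a toy example. This is a minimal sketch and not the EULER program itself: it assumes error-free reads, a single strand, and a standard de Bruijn graph construction in which each distinct k-mer is an edge between its (k-1)-mer prefix and suffix. Error correction and the superpath transformations of the published method are omitted, and all function names are hypothetical.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Collect the distinct k-mers in the reads; each k-mer is one graph edge."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return kmers


def eulerian_path(kmers):
    """Hierholzer's algorithm on the graph whose edges are the given k-mers."""
    graph = defaultdict(list)
    indeg = defaultdict(int)
    outdeg = defaultdict(int)
    for kmer in sorted(kmers):
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)
        outdeg[prefix] += 1
        indeg[suffix] += 1
    nodes = sorted(set(indeg) | set(outdeg))
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((n for n in nodes if outdeg[n] - indeg[n] == 1), nodes[0])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path


def spell(path):
    """Reconstruct the sequence spelled by a path of overlapping (k-1)-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGGCG", "GGCGTG", "CGTGCA"]  # toy reads covering ATGGCGTGCA
print(spell(eulerian_path(de_bruijn_edges(reads, k=4))))
```

In this toy case every (k-1)-mer is unique, so the path is unambiguous. A repeated (k-1)-mer would appear as a single node visited more than once, which is how the graph view keeps repeats in the structure instead of masking them.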


Journal Article
TL;DR: Comparing the Autofinish-Hybrid method of finishing against a human finisher in five different projects shows that the Autofinish-Hybrid method saves many hours over a human finisher alone, while using roughly the same number and type of reads and closing gaps at roughly the same rate.
Abstract: Currently, the genome sequencing community is producing shotgun sequence data at a very high rate, but finishing (collecting additional directed sequence data to close gaps and improve the quality of the data) is not matching that rate. One reason for the difference is that shotgun sequencing is highly automated but finishing is not: Most finishing decisions, such as which directed reads to obtain and which specialized sequencing techniques to use, are made by people. If finishing rates are to increase to match shotgun sequencing rates, most finishing decisions also must be automated. The Autofinish computer program (which is part of the Consed computer software package) does this by automatically choosing finishing reads. Autofinish is able to suggest most finishing reads required for completion of each sequencing project, greatly reducing the amount of human attention needed. Autofinish sometimes completely finishes the project, with no human decisions required. It cannot solve the most complex problems, so we recommend that Autofinish be allowed to suggest reads for the first three rounds of finishing, and if the project still is not finished completely, a human finisher complete the work. We compared this Autofinish-Hybrid method of finishing against a human finisher in five different projects with a variety of shotgun depths by finishing each project twice, once with each method. This comparison shows that the Autofinish-Hybrid method saves many hours over a human finisher alone, while using roughly the same number and type of reads and closing gaps at roughly the same rate. Autofinish currently is in production use at several large sequencing centers. It is designed to be adaptable to the finishing strategy of the lab: it can finish using some or all of the following: resequencing reads, reverses, custom primer walks on either subclone templates or whole clone templates, PCR, or minilibraries. Autofinish has been used for finishing cDNA, genomic clones, and whole bacterial genomes (see http://www.phrap.org).

356 citations


Journal Article
TL;DR: The new EULER-DB algorithm is described that, similarly to the Celera assembler, takes advantage of clone-end sequencing by using the double-barreled data; in contrast to the Celera assembler, EULER-DB does not mask repeats but uses them instead as a powerful tool for contig ordering.
Abstract: For the last twenty years fragment assembly was dominated by the "overlap - layout - consensus" algorithms that are used in all currently available assembly tools. However, the limits of these algorithms are being tested in the era of genomic sequencing and it is not clear whether they are the best choice for large-scale assemblies. Although the "overlap - layout - consensus" approach proved to be useful in assembling clones, it faces difficulties in genomic assemblies: the existing algorithms make assembly errors even in bacterial genomes. We abandoned the "overlap - layout - consensus" approach in favour of a new Eulerian Superpath approach that outperforms the existing algorithms for genomic fragment assembly (Pevzner et al. 2001, in Proceedings of the Fifth Annual International Conference on Computational Molecular Biology (RECOMB-01), 256-26). In this paper we describe our new EULER-DB algorithm that, similarly to the Celera assembler, takes advantage of clone-end sequencing by using the double-barreled data. However, in contrast to the Celera assembler, EULER-DB does not mask repeats but uses them instead as a powerful tool for contig ordering. We also describe a new approach for the Copy Number Problem: "How many times is a given repeat present in the genome?". For long nearly-perfect repeats this question is notoriously difficult and some copies of such repeats may be "lost" in genomic assemblies. We describe our EULER-CN algorithm for the Copy Number Problem that proved to be successful in difficult sequencing projects.

148 citations


Journal Article
TL;DR: This work describes the algorithm used by GigAssembler, which produced the first publicly available assembly of the human genome, a working draft containing roughly 2.7 billion base pairs and covering an estimated 88% of the genome that has been used for several recent studies of the genome.
Abstract: The data for the public working draft of the human genome contains roughly 400,000 initial sequence contigs in ∼30,000 large insert clones. Many of these initial sequence contigs overlap. A program, GigAssembler, was built to merge them and to order and orient the resulting larger sequence contigs based on mRNA, paired plasmid ends, EST, BAC end pairs, and other information. This program produced the first publicly available assembly of the human genome, a working draft containing roughly 2.7 billion base pairs and covering an estimated 88% of the genome that has been used for several recent studies of the genome. Here we describe the algorithm used by GigAssembler.

146 citations


Proceedings Article
22 Apr 2001
TL;DR: The main result is the reduction of the fragment assembly to a variation of the classical Eulerian path problem that opens new possibilities for repeat resolution and allows one to generate error-free solutions of the large-scale fragment assembly problems.
Abstract: For the last twenty years fragment assembly in DNA sequencing followed the “overlap - layout - consensus” paradigm that is used in all currently available assembly tools. Although this approach proved to be useful in assembling clones, it faces difficulties in genomic shotgun assembly: the existing algorithms make assembly errors and are often unable to resolve repeats even in prokaryotic genomes. Biologists are well aware of these errors and are forced to carry out additional experiments to verify the assembled contigs. We abandon the classical “overlap - layout - consensus” approach in favor of a new Eulerian Superpath approach that, for the first time, resolves the problem of repeats in fragment assembly. Our main result is the reduction of the fragment assembly to a variation of the classical Eulerian path problem. This reduction opens new possibilities for repeat resolution and allows one to generate error-free solutions of the large-scale fragment assembly problems. The major improvement of EULER over other algorithms is that it resolves all repeats except long perfect repeats that are theoretically impossible to resolve without additional experiments.

62 citations


Journal Article
TL;DR: Clone-Array Pooled Shotgun Sequencing (CAPSS) is proposed as a simplified strategy for sequencing large genomes that requires relatively few library constructions and only minimal computational power for a complete genome assembly, and that can be managed in a cooperative fashion to take advantage of a distributed international DNA sequencing capacity.
Abstract: A simplified strategy for sequencing large genomes is proposed. Clone-Array Pooled Shotgun Sequencing (CAPSS) is based on pooling rows and columns of arrayed genomic clones, for shotgun library construction. Random sequences are accumulated, and the data are processed by sequential comparison of rows and columns to assemble the sequence of clones at points of intersection. Compared with either a clone-by-clone approach or whole-genome shotgun sequencing, CAPSS requires relatively few library constructions and only minimal computational power for a complete genome assembly. The strategy is suitable for sequencing large genomes for which there are no sequence-ready maps, but for which relatively high resolution STS maps and highly redundant BAC libraries are available. It is immediately applicable to the sequencing of mouse, rat, zebrafish, and other important genomes, and can be managed in a cooperative fashion to take advantage of a distributed international DNA sequencing capacity.

58 citations
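
The row-and-column pooling idea in the abstract above can be sketched in a few lines. This is a toy illustration under strong assumptions, not the published CAPSS pipeline: reads are error-free, clones share no sequence with each other, and "comparison of rows and columns" is reduced to checking whether a row-pool read and a column-pool read share a k-mer. The names `deconvolve` and `libraries_needed` and the artificial clone sequences are hypothetical.

```python
from itertools import product


def kmers(seq, k):
    """Set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def deconvolve(row_pools, col_pools, k=5):
    """Attribute reads to array positions.

    A read from row pool r and a read from column pool c that share a k-mer
    are assumed to come from the clone at the intersection (r, c).
    """
    assigned = {}
    for (r, row_reads), (c, col_reads) in product(row_pools.items(), col_pools.items()):
        hits = set()
        for a in row_reads:
            for b in col_reads:
                if kmers(a, k) & kmers(b, k):
                    hits.update((a, b))
        if hits:
            assigned[(r, c)] = hits
    return assigned


def libraries_needed(rows, cols):
    """Shotgun libraries required: one per clone versus one per row and column."""
    return {"clone_by_clone": rows * cols, "capss": rows + cols}


# Artificial 2 x 2 clone array; sequences are chosen to share no 5-mers.
clones = {(0, 0): "ACACACACACAC", (0, 1): "AGAGAGAGAGAG",
          (1, 0): "CTCTCTCTCTCT", (1, 1): "GTGTGTGTGTGT"}
row_pools = {r: [clones[(r, c)][:8] for c in (0, 1)] for r in (0, 1)}
col_pools = {c: [clones[(r, c)][4:] for r in (0, 1)] for c in (0, 1)}

print(sorted(deconvolve(row_pools, col_pools)))
print(libraries_needed(8, 12))
```

The library count shows where the saving comes from: for an array of R rows and C columns, pooling needs R + C libraries instead of R × C.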


Patent
24 Sep 2001
TL;DR: Clone-Array Pooled Shotgun Sequencing (CAPSS) is a simplified strategy for sequencing large genomes based upon pooling rows and columns of arrayed genomic clones for shotgun library construction.
Abstract: A simplified strategy for sequencing large genomes has been developed. Clone-Array Pooled Shotgun Sequencing (CAPSS) is based upon pooling rows and columns of arrayed genomic clones, for shotgun library construction. Random sequences are accumulated and the data are assembled by sequential comparison of rows and columns, to resolve the sequence of clones at points of intersection. Compared to either a clone-by-clone approach or whole genome shotgun sequencing, CAPSS requires relatively few library constructions and only minimal computational power for a complete genome assembly. The strategy is suitable for sequencing large genomes for which there are no sequence-ready maps, but for which relatively high resolution STS maps and highly redundant BAC libraries are available. It is immediately applicable to the sequencing of mouse, rat, zebra fish and other important genomes, and can be managed in a cooperative fashion to take advantage of the distributed international DNA sequencing capacity.

42 citations


Book Chapter
28 Aug 2001
TL;DR: Simple and fast methods are described for evaluating and comparing different assemblies of the human genome; additional applications include "feature-tracking" from one version of an assembly to the next, comparisons of different chromosomes within the same genome, and comparisons between similar chromosomes from different species.
Abstract: Using current technology, large consecutive stretches of DNA (such as whole chromosomes) are usually assembled from short fragments obtained by shotgun sequencing, or from fragments and mate-pairs, if a "double-barreled" shotgun strategy is employed. The positioning of the fragments (and mate-pairs, if available) in an assembled sequence can be used to evaluate the quality of the assembly and also to compare two different assemblies of the same chromosome, even if they are obtained from two different sequencing projects. This paper describes some simple and fast methods of this type that were developed to evaluate and compare different assemblies of the human genome. Additional applications are in "feature-tracking" from one version of an assembly to the next, comparisons of different chromosomes within the same genome and comparisons between similar chromosomes from different species.

27 citations
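
One common way to use mate-pair positions for assembly evaluation, in the spirit of the abstract above, is to check whether the two reads of each pair are placed with the expected orientation and separation. The sketch below is an assumption-laden illustration, not the method of the paper: the `Placement` type, the category names, and the three-standard-deviation tolerance are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    """Position of one read in an assembled sequence (0-based, end exclusive)."""
    start: int
    end: int
    forward: bool  # True if the read aligns to the forward strand


def classify_mate_pair(a, b, mean_insert, sd_insert, tolerance=3.0):
    """Classify a mate pair by orientation and by the span it implies.

    A consistent pair has the left read on the forward strand, the right read
    on the reverse strand, and a span within tolerance * sd of the library mean.
    """
    left, right = (a, b) if a.start <= b.start else (b, a)
    if not (left.forward and not right.forward):
        return "misoriented"
    span = right.end - left.start
    if span > mean_insert + tolerance * sd_insert:
        return "stretched"
    if span < mean_insert - tolerance * sd_insert:
        return "compressed"
    return "consistent"


pairs = [
    (Placement(100, 600, True), Placement(1600, 2100, False)),  # span 2000
    (Placement(100, 600, True), Placement(4600, 5100, False)),  # span 5000
    (Placement(100, 600, True), Placement(1600, 2100, True)),   # same strand
]
for a, b in pairs:
    print(classify_mate_pair(a, b, mean_insert=2000, sd_insert=200))
```

Counting the pairs in each category along a chromosome gives a simple quality profile, and the same placements can be compared between two assemblies of the same region.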


Proceedings Article
22 Apr 2001
TL;DR: The Bactig Ordering Problem, a key problem that arises in this context, is introduced, and an efficient heuristic called the greedy path-merging algorithm that performs well on real data is presented.
Abstract: Two different approaches to determining the human genome are currently being pursued: one is the “clone-by-clone” approach, employed by the publicly funded Human Genome Project, and the other is the “whole genome shotgun” approach, favored by researchers at Celera Genomics. An interim strategy employed at Celera, called hierarchical assembly, makes use of preliminary data produced by both approaches. This paper introduces the Bactig Ordering Problem, which is a key problem that arises in this context, and presents an efficient heuristic called the greedy path-merging algorithm that performs well on real data.

Journal Article
TL;DR: Methods for using cross-species whole-genome shotgun sequence (WGS) for genome annotation are described; mapping the conserved regions to human chromosomes showed a 23-fold enrichment for coding regions compared with noncoding regions.
Abstract: Multi-species sequence comparisons are a very efficient way to reveal conserved genes. Because sequence finishing is expensive and time consuming, many genome sequences are likely to stay incomplete. A challenge is to use these fragmented data for understanding the human genome. Methods for using cross-species whole-genome shotgun sequence (WGS) for genome annotation are described in this paper. About one-half million high-quality rat WGS reads (covering 7.5% of the rat genome) generated at the Baylor College of Medicine Human Genome Sequencing Center were compared with the human genome. Using computer-generated random reads as a negative control, a set of parameters was determined for reliable interpretation of BLAST search results. About 10% of the rat reads contain regions that are conserved in the human genomic sequence and about one-third of these include known gene-coding regions. Mapping the conserved regions to human chromosomes showed a 23-fold enrichment for coding regions compared with noncoding regions. This approach can also be applied to other mammalian genomes for gene finding. These data predicted ∼42,500 genes in the human, slightly more than reported previously.


Journal Article
TL;DR: The gene density in the gene-rich region of the rice genome is about 6 kb/gene, and this sequence may code for at least 28 putative proteins, as deduced from a computational search for homology with other known coding sequences and ESTs, or predicted using the GenScan package.
Abstract: Rice is a model species for the cereals and a good candidate for genome sequencing due to its relatively small genome (430 Mb), dense physical and genetic maps, and good transgenic systems. As part of an international effort to decode the rice genome, the sequence of a PAC clone localized at 10 cM of chromosome 5 was completely determined using shotgun libraries of its two inserts, 2 kb and 5 kb in length. In total, 2,998 sequencing reads were used for the assembly of the final sequence, covering 175,439 bp. This sequence may code for at least 28 putative proteins, as deduced from a computational search for homology with other known coding sequences and ESTs, or predicted using the GenScan package. Also present in this sequence are simple repeats, palindromes, and retrotransposons. On the basis of these findings, the gene density in the gene-rich region of the rice genome is about 6 kb/gene.
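
The gene density quoted in the abstract above follows directly from the reported sequence length and gene count. The short calculation below uses only figures from the abstract; the variable names are illustrative, and the reads-per-kilobase value is a derived quantity that the abstract does not itself report.

```python
sequence_bp = 175_439   # length of the finished PAC sequence
putative_genes = 28     # "at least 28", so the true spacing may be smaller
reads = 2_998           # sequencing reads used in the assembly

kb_per_gene = sequence_bp / putative_genes / 1000
reads_per_kb = reads / (sequence_bp / 1000)

print(f"{kb_per_gene:.2f} kb per gene")
print(f"{reads_per_kb:.1f} reads per kb of finished sequence")
```

Because 28 is a lower bound on the number of genes, the computed spacing is an upper bound, which is consistent with the rounded figure of about 6 kb/gene.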

Proceedings Article
Gene Myers
22 Apr 2001
TL;DR: The DNA sequence assembler, built for the whole genome shotgun assembly of the human genome, utilizes end-reads of inserts to order and orient assembled contigs into scaffolds for which the distances between consecutive contigs are statistically characterized.
Abstract: The DNA sequence assembler we built for the whole genome shotgun assembly of the human genome utilizes end-reads of inserts to order and orient assembled contigs into scaffolds for which the distances between consecutive contigs are statistically characterized. We consider the problem of comparing two such scaffolds. Applications include comparison of two distinct assemblies for mutual confirmation, and comparison of scaffold assemblies of BACs to determine a whole genome tiling of the BACs. We formalize the problem and develop efficient algorithms for a number of variations of the problem, the essential result being a sparse algorithm that refines gap estimates based on the overlap evidence.