
Showing papers on "Hybrid genome assembly published in 1999"


Journal ArticleDOI
TL;DR: These data provide the first example of whole-genome random BAC fingerprint analysis of a eucaryote, and have provided a model essential to efforts aimed at generating similar databases of fingerprint contigs to support sequencing of other complex genomes, including that of human.
Abstract: Arabidopsis thaliana has emerged as a model system for studies of plant genetics and development, and its genome has been targeted for sequencing [1] by an international consortium (the Arabidopsis Genome Initiative; http://genome-www.stanford.edu/Arabidopsis/agi.html). To support the genome-sequencing effort, we fingerprinted more than 20,000 BACs [2] from two high-quality publicly available libraries [3-5], generating an estimated 17-fold redundant coverage of the genome, and used the fingerprints to nucleate assembly of the data by computer. Subsequent manual revision of the assemblies resulted in the incorporation of 19,661 fingerprinted BACs into 169 ordered sets of overlapping clones ('contigs'), each containing at least 3 clones. These contigs are ideal for parallel selection of BACs for large-scale sequencing and have supported the generation of more than 5.8 Mb of finished genome sequence submitted to GenBank; analysis of the sequence has confirmed the integrity of contigs constructed using this fingerprint data. Placement of contigs onto chromosomes can now be performed, and is being pursued by groups involved in both sequencing and positional cloning studies. To our knowledge, these data provide the first example of whole-genome random BAC fingerprint analysis of a eucaryote, and have provided a model essential to efforts aimed at generating similar databases of fingerprint contigs to support sequencing of other complex genomes, including that of human.
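
Fingerprint-nucleated assembly of this kind rests on a statistical test for band sharing between clones. The sketch below is a generic binomial coincidence model in the spirit of the Sulston score used by fingerprint-contig software, not the study's actual code; the tolerance and gel-bin parameters are invented for illustration.

```
from math import comb

def band_overlap_score(bands_a, bands_b, tolerance=7, gel_bins=3000):
    """P-value-like score: probability that two unrelated clones share
    at least the observed number of fingerprint bands by coincidence.
    Bands are fragment mobilities; two bands 'match' when they differ
    by no more than the measurement tolerance (in gel bins)."""
    # Chance that one random band from clone A coincides with some band
    # of clone B, under a uniform-mobility null model.
    p = min(1.0, len(bands_b) * (2 * tolerance + 1) / gel_bins)
    shared = sum(any(abs(a - b) <= tolerance for b in bands_b)
                 for a in bands_a)
    n = len(bands_a)
    # Tail probability P(X >= shared) for X ~ Binomial(n, p).
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(shared, n + 1))
```

Clone pairs scoring below a stringent cutoff would be nucleated into the same contig, with manual revision (as in the paper) resolving the remainder.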

136 citations


Journal ArticleDOI
TL;DR: This project highlights the utility of the Fugu genome in comparative DNA sequence analyses: it estimates the syntenic relationship between mammals and Fugu and examines the efficacy of ORF prediction from short, unedited Fugu genomic sequences.
Abstract: Despite massive investment in genome mapping and DNA sequencing over the last 10 years, large-scale sequencing of vertebrate genomes has been initiated only very recently. This is partly because the initial emphasis has been on developing mapping, sequencing, and assembly technologies and partly because sequence-ready contigs of large regions of the human genome have not been available. Many valuable lessons have been learned, at no small expense, from the bacterial, yeast, and, in particular, the Caenorhabditis elegans projects. It is also clear, however, that mammalian genomes may present additional problems relating to the generation of cloned DNA from some regions, sequence assembly of highly repetitive DNA, and the large size of the genomes involved.

To interpret much of the data, comparative sequencing of genomic regions from other vertebrates will be necessary. The identification of conserved sequences across species has always been a key technique for identifying genes. In addition, sequence comparison in invertebrate projects has identified many genes by sequence similarity and in many cases has allowed speculation on function. Now that the resolution of genomes is approaching the single base pair, powerful analytical methods need to be used to define the many elements, both coding and noncoding, that are contained within the human genome.

Despite the need for comparison, there is little investment in other vertebrate sequencing projects at this time. Small regions of conserved synteny within the mouse genome have been pinpointed for complete genomic sequencing, and this will provide an opportunity to compare not only precise orders of genes but also regions in and around the coding sequence itself. This should lead to the identification of other conserved elements within the DNA sequence. However, the mouse, as a sequence model, has some disadvantages. The first is cost, as the mouse genome is about the same size as the human genome. Second, where comparative analyses have been performed between mouse and human genomic regions, there are many noncoding regions of similarity. Whereas this kind of comparison identifies a large number of potential regulatory sequences, it is unlikely that all of these have functional significance (for review, see Hardison et al. 1997). For example, comparative sequence analysis of human and murine genomic DNA across the Bruton's tyrosine kinase loci (Oeltjen et al. 1997) revealed 179 conserved elements with >60% identity across 50 bp or more, of which only 34 were coding exons. This represents 25% of the total human DNA analyzed, and it is unlikely that all of these are functional elements. The degree of conservation of noncoding sequence between syntenic regions of human chromosome 12p13 and mouse chromosome 6 is higher still (Ansari-Lari et al. 1998). Furthermore, without comparing these genomic regions in a third vertebrate, it is difficult to determine whether many of these regions are conserved because of differing rates of evolutionary divergence or because of functional significance. Given that many critical control elements are very small (<50 bp), there is clearly a need for direct comparison between genomic DNA from more divergent vertebrates.

Although it is true that the significant differences in biology between fish and mammals mean that a number of regulatory mechanisms will be different between the two (or completely absent in one or the other), preliminary comparisons between genomic regions in man and Fugu rubripes demonstrate just how much more clearly small, conserved elements are detected on such a clean background (Miles et al. 1998). Because the divergence time between fish and mammals is roughly five times that between mouse and man, some comparisons between the former pair will be much more meaningful statistically.

The strategies used in obtaining the complete sequence of large genomes, whether from whole shotgun libraries or clone contigs, are remarkably uniform and involve a high rate of redundancy. This redundancy is necessary for high accuracy and is still more rapid and cost-effective than other procedures. In contrast, the technique of sequence skimming, or scanning, has become popular when looking for specific genes through smaller contigs, the premise being that exons are hit by chance rather than through a directed effort. The success of this technique is related to the density of identifiable sequence elements (usually exons) and does not rely on a 100% accurate sequence. Recent studies suggest that twofold redundancy sequencing of the human genome is highly informative in terms of gene and EST identification (Bouck et al. 1998).

The genome of the pufferfish F. rubripes has been presented as a model vertebrate genome (Brenner et al. 1993). As a vertebrate, Fugu contains a gene set similar to that of man, but in a genome eight times smaller. Fugu genes have the same structure as their mammalian counterparts but are generally much smaller and more densely packed throughout the genome. With an estimated density of one gene every 6-7 kb, the sequence scanning approach becomes highly successful and allows the identification of many genes within even moderate-sized clones such as cosmids. The identification of two or more known genes on a cosmid clone allows comparisons of synteny with mammalian genomes.

We present data on the sequence scanning of over 1000 cosmid clones from a publicly available and well-characterized 7× coverage whole-genome Fugu cosmid library. Over 50,000 sequences have been generated from the inserts of these cosmids, representing an essentially random set of genomic subclones or STSs. Because these sequences can be grouped together according to their parent cosmid clones, close-range physical linkage data are available. By using a combination of similarity searches against existing DNA databases and coding sequence prediction packages, we estimate that >40% of these sequences contain coding exons. This supports the expected figures for gene density in the Fugu genome. We have developed a rapid and economical approach to sequence analysis of vertebrate genomes and generated a publicly accessible framework within which we have deposited the sequences as well as all related data. This technology puts genomic sequencing within the reach of more modestly funded labs and does not involve any form of complex automation. The 1059 cosmids scanned represent 24.4 Mb (6%) of the Fugu genome. This is equivalent to 180 Mb of the human genome.
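
The arithmetic behind the scanning strategy is easy to reproduce. The sketch below is illustrative only: the gene length, read count, and read length are assumed values, and only the one-gene-per-6-7-kb density comes from the text. It estimates by Monte Carlo how many of a cosmid's genes a fixed number of random scan reads will touch.

```
import random

def expected_genes_hit(cosmid_kb=40.0, gene_every_kb=6.5, gene_kb=2.0,
                       reads=50, read_kb=0.4, trials=10000):
    """Monte Carlo estimate of how many of a cosmid's genes are touched
    by a fixed number of random scan reads (sequence skimming)."""
    n_genes = int(cosmid_kb // gene_every_kb)      # ~6 genes per cosmid
    genes = [(i * gene_every_kb, i * gene_every_kb + gene_kb)
             for i in range(n_genes)]
    total = 0
    for _ in range(trials):
        hit = set()
        for _ in range(reads):
            start = random.uniform(0, cosmid_kb - read_kb)
            end = start + read_kb
            for idx, (g0, g1) in enumerate(genes):
                if start < g1 and end > g0:        # read overlaps gene
                    hit.add(idx)
        total += len(hit)
    return total / trials, n_genes

mean_hit, n = expected_genes_hit()
print(f"{mean_hit:.1f} of {n} genes touched on average")
```

With these assumed parameters nearly every gene on the cosmid is hit at least once, which is why skimming works so well at Fugu's gene density.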

98 citations


Journal ArticleDOI
TL;DR: It is estimated that establishing the resource of STCs as a means of identifying minimally overlapping clones represents only 1%-3% of the total cost of sequencing the human genome, and, up to a point of diminishing returns, a larger STC resource is associated with a smaller total sequencing cost.
Abstract: The BAC-end, or sequence-tagged-connector (STC), approach was proposed in 1996 as a new strategy for large-scale sequencing (Venter et al. 1996). The BAC-end sequencing strategy offers a potential solution to the growing disparity between the availability of sequence-ready maps of minimally overlapping clones and the community’s high-throughput sequencing capacity. In this scheme (Fig. 1), 400–500 bp of sequence is determined at the ends of the inserts of a large collection of BAC clones. As a result, sequence tags are randomly scattered across the genome. These BAC-end sequences, referred to as STCs, can be used to identify a minimum tiling path of BACs by computational procedures. Any “nucleation” sequence (the sequence of an entire BAC) can be compared electronically to the database of STCs to identify the next clones to be sequenced to maximally extend the contig. Groups at The Institute for Genomic Research (TIGR) and the University of Washington are currently collecting 900,000 STCs, representing the ends of >450,000 BACs. These clones are also being fingerprinted with a single restriction-enzyme digestion. To increase the likelihood that all regions of the genome are represented in this collection, the BACs are derived from at least two libraries created by digestion of the genome with different restriction enzymes.

[Figure 1: The STC resource and the STC sequencing strategy, as proposed by Venter et al. (1996).]

At its completion, this effort will characterize sequences at an unprecedented density (a sequence read every 3.3 kb on average) across the human genome. At this STC density, any nucleation BAC (with an average insert size of 150 kb) will contain ∼45 STCs. The 45 BACs containing these STCs will be oriented 5′ to 3′ and aligned across the nucleation sequence by computer. The BACs minimally overlapping the 5′ and 3′ ends of the nucleation sequence are candidates for the next sequence extension. On average, 22 clones extend 3′ of the nucleation sequence and 22 clones extend 5′; hence, the average overlap in a particular direction will be ∼7 kb. Because the pairs of end sequences are used to identify and orient overlapping BACs, the STC strategy is akin to other implementations of double-ended strategies for sequence assembly and scaffolding (e.g., Edwards et al. 1990; Edwards and Caskey 1991; Chen et al. 1993; Smith et al. 1994; Richards et al. 1994; Roach et al. 1995; Weber and Myers 1997).

The STC strategy combines a means to identify a BAC tiling path with a random shotgun approach to sequencing. The BAC tiling path permits compartmentalization of the sequencing and thus overcomes the drawbacks of whole-genome random shotgunning described by Green (1997). Thus, by using this centralized STC resource, contiguous human sequence could be obtained in distributed laboratories without the need for separate dedicated “mapping factories” (e.g., Wong et al. 1997). One issue regarding this strategy is its cost. Intuitively, with more STCs scattered in the genome, one can identify clones with smaller average overlap with previously sequenced regions. This minimization of overlap leads to fewer clones completely sequenced and, hence, lower overall cost. But how do we trade off the cost of establishing an STC resource of increasing density against the cost of sequencing an entire clone? A second issue is the impact of diverse repeats in the genome on the success of the strategy.

The human genome is riddled with repeated sequence elements, ranging from the small, but very frequent, Alu elements to a variety of large, low-copy duplications. Clearly, STCs entirely contained within a repeat element can connect noncontiguous regions of the genome. Additional precautions, such as monitoring discrepancies among the restriction-digest fingerprints of overlapping BACs or checking chromosomal locations by fluorescence in situ hybridization (FISH), must be implemented to break these false connections.

In this paper we develop a tractable mathematical model to derive properties of the STC sequencing strategy. This work complements analytical and simulation models presented previously on related problems of large-scale mapping and sequencing (Roach et al. 1995; Myers and Weber 1997; Siegel et al. 1998b). Using this model, various parameters (such as the number of end sequences determined, the average insert size, and the costs of various steps in the procedure) can be varied to assess the effect on overall costs or success of the approach. Success can be measured in terms of problem clones, from which continuation is not possible, owing either to the lack of identified matches in the database or to the possibility that a falsely matching STC will lead to sequencing of a nonoverlapping clone.

Here is an outline of the model. The target, for example, a model of the human genome, consists of bases independently chosen from a given nucleotide distribution. Clones of fixed length are assumed to be located independently and uniformly at random over the target. STCs at each clone end are sequenced with a specified error rate. A different, much lower error rate is used for completely sequenced clones, because this sequence is typically assembled from 8- to 10-fold overlapping sequence reads (i.e., a redundant shotgun approach), and each base is determined as the highest quality read from these overlapping sequences. A variety of repeat families is defined in the model; each family is defined by its number of copies, the segment size, and percent similarity. We use a tractable, conservative decision rule for declaring a match between an STC and a sequenced clone. This decision rule is based on the number of matching bases found when comparing an STC aligned to a subregion of a sequenced clone. The problem-clone rate is derived, and the overall sequencing cost is estimated. Library parameters may then be changed to study their effect on the problem rate and on the overall cost of sequencing the target. The model can identify the optimum-cost parameters for which the incremental cost of increasing the STC resource reaches the point of diminishing returns, matching the corresponding decrease in the average cost of clone sequencing owing to smaller overlaps.

Box 1 presents our notation and assumptions in detail, followed by an outline of the calculations for the expected number of problem clones (indicating the extent to which difficulties will be encountered while selecting successive clones to be sequenced in their entirety) and for the expected overlap for true STC extensions (indicating the overall extra cost owing to redundant sequencing, because with larger expected overlap, more clones will need to be sequenced to cover the target). Results for sequencing on the scale of the human genome then follow, using a variety of library sizes, clone lengths, and decision-rule criteria to identify low-cost strategies.
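
The density figures quoted above follow from simple division. A quick check (assuming a 3-Gb haploid genome, which is not stated in the text) reproduces them:

```
genome_bp = 3.0e9            # assumed haploid human genome size
n_stcs = 900_000             # STCs being collected (from the text)
insert_bp = 150_000          # average BAC insert (from the text)

spacing = genome_bp / n_stcs                 # ~3.3 kb between STCs
stcs_per_bac = insert_bp / spacing           # ~45 STCs per nucleation BAC
per_direction = stcs_per_bac / 2             # ~22 clones extend each way
avg_min_overlap = insert_bp / per_direction  # ~6.7 kb expected overlap

print(f"{spacing:.0f} bp spacing, {stcs_per_bac:.0f} STCs/BAC, "
      f"{avg_min_overlap / 1000:.1f} kb minimal overlap")
```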
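
The decision rule at the heart of the model can be illustrated with a direct binomial tail computation. This is a sketch only: the 400-bp STC length matches the text, but the match threshold and the i.i.d. base model are simplifying assumptions.

```
from math import comb

def false_match_prob(stc_len=400, min_match=360, p_chance=0.25):
    """Probability that an unrelated STC-to-clone alignment satisfies the
    rule 'at least min_match of stc_len bases agree', with bases modeled
    as i.i.d. and matching by chance with probability p_chance."""
    return sum(comb(stc_len, k) * p_chance**k
               * (1 - p_chance)**(stc_len - k)
               for k in range(min_match, stc_len + 1))

# A true overlap sequenced at a few percent error matches well over 90%
# of bases, so a threshold near 90% identity separates true matches from
# chance matches by many orders of magnitude; as the text notes, repeats
# rather than chance coincidences are the real danger.
print(false_match_prob())  # astronomically small under the null model
```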

26 citations


Proceedings ArticleDOI
01 Apr 1999
TL;DR: An inter-marker assembly algorithm that determines the unique sequence segments between a marker pair is presented. Both algorithms are evaluated in a simulation that can model various types of repeats, where the only information about the presence of a repeat is excessive coverage and the ability to detect its boundaries.
Abstract: A monumental achievement in the history of science, the sequencing of the entire human genome, will soon be reached. The Human Genome Project (HGP) has been working toward this goal since 1990 using a two-tiered strategy. Recently it was proposed that using a whole-genome shotgun approach to sequence the genome would be faster and less costly. This thesis expands on that proposal by presenting two algorithms that can be used in whole-genome shotgun sequencing. These algorithms were implemented and tested on simulated data. Essential to this approach is the availability of pairs of short, unique sequence markers at a roughly estimated distance from each other. Determining the sequence of the genome can then be broken into a series of inter-marker assembly problems that determine the sequence between a pair of markers. Unfortunately, marker pairs are not always correct, and repeats can greatly confound the assembly. This motivates the first problem: rapidly finding a set of linked contigs, called a scaffold, between a pair of markers, which confirms the marker pair and the ability to traverse the region between them (a simplified search of this kind is sketched below). An inter-marker assembly algorithm that determines the unique sequence segments between a marker pair is then presented. Both algorithms are evaluated with respect to a simulation that can model various types of repeats, and for which our only information about the presence of repeats is excessive coverage and the ability to detect their boundaries. Simulation results show that at 10× coverage one can find and assemble the unique sequence between markers more than 99.9% of the time for many of the repeat models. Events in this field have been moving rapidly. Recently a new company called Celera Genomics announced its intention to sequence the human genome before the HGP by using the whole-genome shotgun approach. We end this thesis by briefly discussing Celera's approach and relating it to the algorithms presented here.
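
As a rough illustration of the scaffold-finding step (a simplified sketch: the thesis's algorithm also uses mate-pair distance estimates and repeat-boundary detection, which are omitted here), a breadth-first search over a contig link graph suffices to confirm that a marker pair is connected:

```
from collections import deque

def find_scaffold(links, start, goal, max_hops=50):
    """Breadth-first search for a chain of linked contigs joining the
    contigs that carry two markers. `links` maps a contig id to the
    contig ids connected to it by mate pairs. Returns one path (a
    candidate scaffold) or None; a path both confirms the marker pair
    and shows the region between the markers is traversable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        if len(path) >= max_hops:
            continue
        for nxt in links.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Example: contigs A..E linked by mate pairs; the markers sit on A and E.
links = {"A": ["B"], "B": ["A", "C", "D"],
         "C": ["B"], "D": ["B", "E"], "E": ["D"]}
print(find_scaffold(links, "A", "E"))  # ['A', 'B', 'D', 'E']
```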

12 citations


Journal ArticleDOI
01 Apr 1999
TL;DR: The most commonly used methods for generating appropriately sized DNA fragments for dideoxy and chemical sequencing are discussed in this unit, and the biochemistry underlying these procedures, as well as how to choose between these and alternative sequencing methods, are discussed.
Abstract: This unit contains a general discussion of factors that should be considered before embarking on a DNA sequencing project. In general, any sequencing strategy should include plans for sequencing both strands of the DNA fragment. Complementary-strand confirmation leads to higher accuracy, especially when sequencing regions where artifacts such as “compressions” are a problem; sequencing the opposite strand is often required to obtain accurate data for such regions. The most commonly used methods for generating appropriately sized DNA fragments for dideoxy and chemical sequencing are discussed in this unit; the biochemistry underlying these procedures, as well as how to choose between these and alternative sequencing methods, is covered in the introduction to this chapter.
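
A minimal sketch of what complementary-strand confirmation amounts to computationally (hypothetical helper names; real projects align reads with base-quality values rather than comparing strings directly):

```
COMP = str.maketrans("ACGTacgt", "TGCAtgca")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMP)[::-1]

def confirm_strands(top, bottom):
    """Compare a top-strand read with the reverse complement of the
    bottom-strand read covering the same region. Positions where the
    two disagree are candidates for artifacts such as compressions
    and should be resolved by re-sequencing."""
    other = revcomp(bottom)
    return [i for i, (a, b) in enumerate(zip(top, other)) if a != b]

print(confirm_strands("ACGGT", "AACGT"))  # -> [3], one disputed base
```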

7 citations


Book ChapterDOI
Christoph Wilhelm Sensen
01 Jan 1999
TL;DR: This chapter gives an overview of the strategies developed to sequence entire microbial genomes, and discusses the advantages and disadvantages of various approaches.
Abstract: This chapter gives an overview of the strategies developed to sequence entire microbial genomes and discusses the advantages and disadvantages of various approaches. For total-genome shotgun sequencing, the genomic DNA is fragmented into random pieces and subcloned directly into pUC, M13, or other vectors that accept insert sizes of 1 to 5 kbp. Typically, 6 to 10 genome equivalents are sequenced to cover the DNA molecule completely, using standard primers that prime at the end of the cloning vector. The primer-walking strategy has been tried primarily in the context of the yeast sequencing project. The method requires an ordered library of clones, either an overlapping set of large clones (e.g., a cosmid library) or an ordered set of discrete subclones (e.g., two 6-base-cutter restriction digest libraries from a cosmid). Regardless of the sequencing strategy chosen in a particular project, there are four general phases of the sequencing process: the primary sequencing phase, the linking phase, the polishing phase, and the finished sequence. Only one genome project, the Escherichia coli effort at the University of Wisconsin, made substantial progress with radioactive sequencing before changing to automated sequencing strategies. There are two different kinds of sequencing laboratories that produce genomic sequence: sequencing factories and smaller laboratories with an output of 2 to 5 Mbp of genomic sequence per year. With increasing levels of automation, sequence production costs will fall, and in the future it may be possible to reach 10 cents per finished base pair.
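
The 6-to-10-genome-equivalent figure follows from Lander-Waterman statistics. Here is a quick sketch (idealized: it assumes random cloning and ignores unclonable regions, which is exactly why a linking phase is still needed; the 2-Mb genome and 500-bp read length are assumed example values):

```
from math import exp

def shotgun_stats(genome_bp=2_000_000, read_bp=500, coverage=8.0):
    """Lander-Waterman expectations for a shotgun project: at coverage c,
    a base is missed with probability e^-c, and the expected number of
    gaps is roughly N * e^-c for N reads."""
    n_reads = coverage * genome_bp / read_bp
    missed_bp = genome_bp * exp(-coverage)
    expected_gaps = n_reads * exp(-coverage)
    return n_reads, missed_bp, expected_gaps

reads, missed, gaps = shotgun_stats()
print(f"{reads:.0f} reads, ~{missed:.0f} bp uncovered, ~{gaps:.0f} gaps")
```

At 8x coverage of a 2-Mb genome this predicts roughly a dozen gaps, which matches the need for a dedicated linking and polishing phase after the primary sequencing phase.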

5 citations


Journal ArticleDOI
TL;DR: It was estimated that even a vast genome such as the human genome can be sequenced at a moderate redundancy (∼7) with satisfactory accuracy (10⁻⁴ error rate), resulting in high sequencing speed and much lower cost.
Abstract: In order to quantitatively comprehend the essence of whole genome shotgun sequencing, a Monte Carlo simulation was carried out. It was estimated that even a vast genome such as the human genome can be sequenced at a moderate redundancy (∼7) with satisfactory accuracy (10⁻⁴ error rate), resulting in high sequencing speed and much lower cost. Switching from a random process (i.e., shotgun) to a directed process such as PCR-relay was shown to be critically important if whole genome shotgun sequencing is not to inflate its cost. An equation for evaluating the optimum switching point was introduced as a function of coverage; it also depends on the costs of the shotgun and directed processes per unit length of sequence. Moderate redundancy was shown to offer greater merits in speed and accuracy than the demerit of being redundant. Our simulation for estimating redundancy was basically consistent with the results of current whole genome shotgun sequencing. In conclusion, whole genome shotgun sequencing applied to a vast genome is estimated to be effective.
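
The switching-point idea can be illustrated with a toy derivation (a sketch of the general argument, not the paper's actual equation: assume a fixed cost per shotgun base, and that at coverage c only a fraction e^-c of further shotgun bases land on unfinished sequence):

```
from math import log

def switch_coverage(cost_directed_per_bp, cost_shotgun_per_bp=1.0):
    """Coverage at which shotgun reads stop paying for themselves.
    At coverage c a shotgun base is new with probability e^-c, so the
    effective cost per *new* base is cost_shotgun * e^c. Equating that
    with the directed per-base cost gives c* = ln(directed / shotgun)."""
    return log(cost_directed_per_bp / cost_shotgun_per_bp)

for ratio in (10, 100, 1000):
    print(f"directed/shotgun cost ratio {ratio:4d}: switch at "
          f"~{switch_coverage(ratio):.1f}x coverage")
```

Under this toy model the switch point grows only logarithmically with the cost ratio, which is one intuition for why a moderate redundancy near 7 can remain economical before handing the remaining gaps to a directed process.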

1 citation