Showing papers on "Hybrid genome assembly" published in 2006


Journal ArticleDOI
TL;DR: Of the amplification methodologies examined in this paper, the multiple displacement amplification products generated the least bias, and produced significantly higher yields of amplified DNA.
Abstract: Whole genome amplification is an increasingly common technique through which minute amounts of DNA can be multiplied to generate quantities suitable for genetic testing and analysis. Questions of amplification-induced error and template bias generated by these methods have previously been addressed through either small scale (SNPs) or large scale (CGH array, FISH) methodologies. Here we utilized whole genome sequencing to assess amplification-induced bias in both coding and non-coding regions of two bacterial genomes. Halobacterium species NRC-1 DNA and Campylobacter jejuni were amplified by several common, commercially available protocols: multiple displacement amplification, primer extension pre-amplification and degenerate oligonucleotide primed PCR. The amplification-induced bias of each method was assessed by sequencing both genomes in their entirety using the 454 Sequencing System technology and comparing the results with those obtained from unamplified controls. All amplification methodologies induced statistically significant bias relative to the unamplified control. For the Halobacterium species NRC-1 genome, assessed at 100 base resolution, the D-statistics from GenomiPhi-amplified material were 119 times greater than those from unamplified material, 164.0 times greater for Repli-G, 165.0 times greater for PEP-PCR and 252.0 times greater than the unamplified controls for DOP-PCR. For Campylobacter jejuni, also analyzed at 100 base resolution, the D-statistics from GenomiPhi-amplified material were 15 times greater than those from unamplified material, 19.8 times greater for Repli-G, 61.8 times greater for PEP-PCR and 220.5 times greater than the unamplified controls for DOP-PCR. Of the amplification methodologies examined in this paper, the multiple displacement amplification products generated the least bias, and produced significantly higher yields of amplified DNA.
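The abstract above does not define its D-statistic. As a rough illustration only, the sketch below assumes a Kolmogorov-Smirnov-style distance: the largest gap between the cumulative distribution of reads over fixed-width bins (100 bases here, matching the stated resolution) and the cumulative distribution expected under uniform coverage. The function names `bin_counts` and `d_statistic` are hypothetical and are not taken from the paper.

```python
from itertools import accumulate


def bin_counts(read_starts, genome_length, bin_size=100):
    """Count read start positions falling in each fixed-width bin."""
    n_bins = (genome_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for pos in read_starts:
        counts[pos // bin_size] += 1
    return counts


def d_statistic(counts):
    """Largest absolute gap between the observed cumulative fraction of
    reads and the fraction expected if reads were spread evenly over bins.

    Assumption: this KS-style definition stands in for the paper's
    D-statistic, which the abstract does not spell out. A shorter final
    bin is treated like any other bin for simplicity.
    """
    total = sum(counts)
    n_bins = len(counts)
    return max(
        abs(cum / total - (i + 1) / n_bins)
        for i, cum in enumerate(accumulate(counts))
    )
```

Under this reading, a fold change such as "119 times greater" would correspond to the ratio of `d_statistic` for amplified material to `d_statistic` for the unamplified control.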

338 citations


Journal ArticleDOI
TL;DR: The data indicate that 454 pyrosequencing allows rapid and cost-effective sequencing of the gene-containing portions of large and complex genomes and that its combination with ABI-Sanger sequencing and targeted sequence analysis can result in large regions of high-quality finished genomic sequences.
Abstract: During the past decade, Sanger sequencing has been used to completely sequence hundreds of microbial and a few higher eukaryote genomes. In recent years, a number of alternative technologies became available, among them adaptations of the pyrosequencing procedure (i.e. "454 sequencing"), promising a ~100-fold increase in throughput over Sanger technology – an advancement which is needed to make large and complex genomes more amenable to full genome sequencing at affordable costs. Although several studies have demonstrated its potential usefulness for sequencing small and compact microbial genomes, it was unclear how the new technology would perform in large and highly repetitive genomes such as those of wheat or barley. To study its performance in complex genomes, we used 454 technology to sequence four barley Bacterial Artificial Chromosome (BAC) clones and compared the results to those from ABI-Sanger sequencing. All gene-containing regions were covered efficiently and at high quality with 454 sequencing whereas repetitive sequences were more problematic with 454 sequencing than with ABI-Sanger sequencing. 454 sequencing provided a much more even coverage of the BAC clones than ABI-Sanger sequencing, resulting in almost complete assembly of all genic sequences even at only 9 to 10-fold coverage. To obtain highly advanced working draft sequences for the BACs, we developed a strategy to assemble large parts of the BAC sequences by combining comparative genomics, detailed repeat analysis and use of low-quality reads from 454 sequencing. Additionally, we describe an approach of including small numbers of ABI-Sanger sequences to produce hybrid assemblies to partly compensate for the short read length of 454 sequences. Our data indicate that 454 pyrosequencing allows rapid and cost-effective sequencing of the gene-containing portions of large and complex genomes and that its combination with ABI-Sanger sequencing and targeted sequence analysis can result in large regions of high-quality finished genomic sequences.

254 citations


Journal ArticleDOI
TL;DR: While the cost of Sanger sequencing has dropped dramatically over the past two decades, it is unlikely that the $100,000 genome will be achieved by this means, and the best bets for ultrarapid, low-cost sequencing are single-molecule approaches.

193 citations


Journal ArticleDOI
TL;DR: M-GCAT is an interactive comparative genomics tool well suited for quickly generating multiple genome comparison frameworks and alignments among closely related species.
Abstract: Due to recent advances in whole genome shotgun sequencing and assembly technologies, the financial cost of decoding an organism's DNA has been drastically reduced, resulting in a recent explosion of genomic sequencing projects. This increase in related genomic data will allow for in-depth studies of evolution in closely related species through multiple whole genome comparisons. To facilitate such comparisons, we present an interactive multiple genome comparison and alignment tool, M-GCAT, that can efficiently construct multiple genome comparison frameworks in closely related species. M-GCAT is able to compare and identify highly conserved regions in up to 20 closely related bacterial species in minutes on a standard computer, and as many as 90 (containing 75 cloned genomes from a set of 15 published enterobacterial genomes) in an hour. M-GCAT also incorporates a novel comparative genomics data visualization interface allowing the user to globally and locally examine and inspect the conserved regions and gene annotations. M-GCAT is an interactive comparative genomics tool well suited for quickly generating multiple genome comparison frameworks and alignments among closely related species. M-GCAT is freely available for download for academic and non-commercial use at: http://alggen.lsi.upc.es/recerca/align/mgcat/intro-mgcat.html .

92 citations


Journal ArticleDOI
TL;DR: Pilot studies on maize indicate that the new gene-enrichment, gene-finishing and gene-orientation technologies are efficient, robust and comprehensive in sequencing the gene-space of large genome plants, and in locating all of these genes and adjacent sequences on the genetic and physical maps.

70 citations


Journal ArticleDOI
TL;DR: Polony DNA sequencing provides an inexpensive, accurate, high-throughput way to resequence genomes of interest: reads are aligned to a reference genome, allowing differences between sequences to be identified.
Abstract: Polony DNA sequencing provides an inexpensive, accurate, high-throughput way to resequence genomes of interest by comparison to a reference genome. Mate-paired in vitro shotgun genomic libraries are produced and clonally amplified on microbeads by emulsion PCR. These serve as templates for sequencing by fluorescent nonamer ligation reactions on a microscope slide. Each sequencing run results in millions of 26-bp reads that can be aligned to the reference genome, allowing the identification of differences between sequences.

31 citations


Journal ArticleDOI
TL;DR: The model is used to analyze coverage performance over a range of small to moderately-sized genomic targets and finds that the read pairing effect and the edge effect interact in a non-trivial fashion.
Abstract: The classical theory of shotgun DNA sequencing accounts for neither the placement dependencies that are a fundamental consequence of the forward-reverse sequencing strategy, nor the edge effect that arises for small to moderate-sized genomic targets. These phenomena are relevant to a number of sequencing scenarios, including large-insert BAC and fosmid clones, filtered genomic libraries, and macro-nuclear chromosomes. Here, we report a model that considers these two effects and provides both the expected value of coverage and its variance. Comparison to methyl-filtered maize data shows significant improvement over classical theory. The model is used to analyze coverage performance over a range of small to moderately-sized genomic targets. We find that the read pairing effect and the edge effect interact in a non-trivial fashion. Shorter reads give superior coverage per unit sequence depth relative to longer ones. In principle, end-sequences can be optimized with respect to template insert length; however,...
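To make the edge effect mentioned above concrete, here is a minimal sketch that compares the classical expectation (fraction covered = 1 - e^(-c), with redundancy c = N·L/G) against a small simulation on a linear target whose reads must lie entirely inside it. This is not the model from the paper: read pairing is not simulated, and the function names and parameter values are illustrative assumptions.

```python
import math
import random


def classical_coverage(n_reads, read_len, target_len):
    """Classical expectation: covered fraction = 1 - exp(-N * L / G)."""
    redundancy = n_reads * read_len / target_len
    return 1.0 - math.exp(-redundancy)


def coverage_profile(n_reads, read_len, target_len, trials=200, seed=0):
    """Estimate, per position, how often that position is covered.

    Assumption: read starts are uniform over the positions where a whole
    read still fits, so positions near either end can be reached by fewer
    reads (a simple form of edge effect). Read pairing is ignored.
    """
    rng = random.Random(seed)
    hits = [0] * target_len
    for _ in range(trials):
        covered = [False] * target_len
        for _ in range(n_reads):
            start = rng.randint(0, target_len - read_len)  # inclusive bounds
            for i in range(start, start + read_len):
                covered[i] = True
        for i, flag in enumerate(covered):
            if flag:
                hits[i] += 1
    return [h / trials for h in hits]
```

Comparing `coverage_profile(10, 100, 1000)` at the first position and at the midpoint shows the ends being covered far less often than the interior, which a single classical number such as `classical_coverage(10, 100, 1000)` cannot express.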

22 citations


Journal ArticleDOI
TL;DR: It is found that the mathematical notion of occupancy serves as a good model for evolution of the coverage distribution function and reveals new insights related to sequence redundancy.

15 citations


Reference EntryDOI
TL;DR: SGS has already demonstrated its tremendous power in sequencing not only microbial genomes but also large eukaryotic genomes, such as those of the cultivated rice and the laboratory mouse, and is expected to play a central role in the field of genomics, especially when the latter has to constantly face some major challenges.
Abstract: Shotgun sequencing (SGS) is primarily a large-scale sequencing (LSS) technique that does not rely on precise guiding information about the target DNA, which includes large-insert clones and single genomes, ranging from thousands of basepairs (Kb) to billions of basepairs. A mixture of smaller genomes can also be sequenced in a similar way when retrospective means are available to assemble and distinguish them. It provides a fast and cost-effective way of sequencing large genomes regardless of whether the project as a whole takes a “whole-genome” (WG) or “clone-by-clone” (CBC) approach. The success of SGS essentially depends on random sampling, high-quality data, sufficient sequence coverage, effective assembly, and gap closing procedures. SGS has already demonstrated its tremendous power in sequencing not only microbial genomes but also large eukaryotic genomes, such as those of the cultivated rice and the laboratory mouse. Together with improved sequencing technologies and computing tools, SGS is expected to play a central role in the field of genomics, especially when the latter has to constantly face some major challenges, scientific, managerial, and political. Some of the scientific challenges relate to the effective sequencing of large, polyploid, and mixed genomes, as well as those with highly repetitive sequence contents. A key managerial challenge is for the operators to consistently produce high-quality data while increasing throughput and reducing cost. It is always a tough decision for a steering committee organizing a genome project to choose between WG and CBC, but SGS is always the basic technique of choice. Keywords: Large-insert Clones; Minimal-tiling-path Clones; Physical Gap; Physical Map; Scaffold; Sequence Contig; Sequence Gap; Shotgun Sequencing

5 citations


Journal Article
TL;DR: A probabilistic model for biased sampling distribution was developed by using an experimental data set derived from a microbial genome project and it is proposed that an optimum sequencing strategy employing different insert lengths and redundancy can be established by performing a variety of simulations.
Abstract: We have developed a program for generating shotgun data sets from known genome sequences. Synthetic data sets generated by a computer program are a useful alternative to real data, to which students and researchers have limited access. The uniformly distributed sampling of clones adopted by previous programs cannot account for the real situation, in which sampled reads tend to come from particular regions of the target genome. To reflect this situation, a probabilistic model for biased sampling distribution was developed by using an experimental data set derived from a microbial genome project. Among the experimental parameters tested (varied fragment or read lengths, chimerism, and sequencing error), the extent of sequencing error was the most critical factor that hampered sequence assembly. We propose that an optimum sequencing strategy employing different insert lengths and redundancy can be established by performing a variety of simulations.
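The idea of drawing synthetic shotgun reads from a known genome, with either uniform or biased start positions, can be sketched as follows. This is only an illustration, not the program described in the abstract: the function `simulate_reads`, its parameters, and the per-position `weights` used to express sampling bias are all assumptions, and chimerism is not modeled.

```python
import random


def simulate_reads(genome, n_reads, read_len, weights=None,
                   error_rate=0.0, seed=0):
    """Draw reads from `genome` and return (start, sequence) pairs.

    weights: optional per-start-position weights (hypothetical stand-in
             for a biased sampling model); None means uniform sampling.
    error_rate: per-base probability of a substitution error.
    """
    rng = random.Random(seed)
    n_starts = len(genome) - read_len + 1
    starts = rng.choices(range(n_starts), weights=weights, k=n_reads)
    reads = []
    for start in starts:
        bases = list(genome[start:start + read_len])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                # Substitute with one of the other three bases.
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append((start, "".join(bases)))
    return reads
```

For example, passing `weights` that are nonzero only for the first few start positions concentrates all reads at one end of the genome, while raising `error_rate` lets one explore how sequencing error affects downstream assembly.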

5 citations


Journal ArticleDOI
TL;DR: ‘GenomeMark,’ a computer program that detects and statistically analyzes candidate repeats, identified novel sequence words present in tandem throughout genomes that have remarkable spacer sequence distributions and many were genome specific, validating the genome signature theory.
Abstract: Identifying and predicting the structural characteristics of novel repeats throughout the genome can lend insight into biological function. Specific repeats are believed to have biological significance as a function of their distribution patterns. We have developed ‘GenomeMark,’ a computer program that detects and statistically analyzes candidate repeats. Specifically, ‘GenomeMark’ identifies the periodic distribution of unique words, calculating their χ² and Z-score values. Using ‘GenomeMark,’ we identified novel sequence words present in tandem throughout genomes. We found that these sequences have remarkable spacer sequence distributions and many were genome specific, validating the genome signature theory. Further analysis confirmed that many of these sequences have a specific biological function. The program is available from the authors upon request and is freely available for non-commercial and academic entities.

Reference EntryDOI
TL;DR: This article will review both the technical aspects of genome sequencing as well as the various strategies employed for sequencing large, eukaryotic genomes.
Abstract: Beginning with the landmark work of Watson and Crick and continuing with the technological breakthroughs of the 1970s, 1980s, and the 1990s, molecular genetics and genomics has revolutionized all aspects of biology. Today, scientists are able to decipher the precise order of every nucleotide of an organism's complete genome, using a conceptually simple approach called shotgun sequencing. Briefly described, shotgun sequencing involves randomly breaking the target genome into smaller pieces, which are sequenced and reassembled into a complete sequence using computer software. Genome sequences provide a basis for further research into gene regulation, pathology, evolution, and metabolism. In order to appreciate a complete genome sequence for the experimental data that it is, it is necessary to review the steps taken to generate genomic sequence. This article will review both the technical aspects of genome sequencing as well as the various strategies employed for sequencing large, eukaryotic genomes. Keywords: Annotation; Assembly; Bacterial Artificial Chromosome (BAC); Base-calling; Fingerprinting; Genetic Map; Genome; Physical Map; Production or Shotgun Sequencing

01 Jan 2006
TL;DR: This work presents an efficient algorithm for detecting approximate tandem repeats in genomic sequences that may contain the symbol 'N' and is incorporated in a new tool called REPEATSHUNTER that enables searching for perfect as well as approximate tandem repeats of different kinds, and then visualizing them via the UCSC Genome Browser.
Abstract: Tandem Repeats (TRs) are head-to-tail perfect or approximate duplications that occur abundantly in genomic sequences. They tend to be highly polymorphic due to large variation in the number of repeats. These repeats are known to be the cause of several diseases as well as useful markers for genetic studies. In recent years several algorithms for detecting approximate tandem repeats were suggested. However, often due to technological limitations, sequencers designate a base 'N', meaning "no call", when they are unable to call a base at a specific position, and many genomic sequences contain the letter 'N' in addition to the four letters of DNA. Current algorithms for detecting approximate tandem repeats are designed to process sequences with known symbols, and therefore do not correctly detect TRs in sequences that contain the symbol 'N'. Here, we present an efficient algorithm for detecting approximate tandem repeats in genomic sequences that may contain the symbol 'N'. The ideas and methods underlying the algorithm are described and its effectiveness on genomic data is demonstrated. This algorithm is incorporated in a new tool called REPEATSHUNTER that enables searching for perfect as well as approximate tandem repeats of different kinds, and then visualizing them via the UCSC Genome Browser. Availability: Executables of a server based version and a local version of REPEATSHUNTER are available at
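As a rough illustration of why 'N' needs special handling, the sketch below scans for tandem repeats of a fixed period while treating 'N' as compatible with any base. It is a naive scan written for this listing, not the REPEATSHUNTER algorithm, and it handles only exact repeats apart from 'N' positions (no mismatches or indels). The function names are hypothetical. A running consensus unit is kept so that an 'N' cannot bridge two copies that disagree on a called base.

```python
def tandem_repeat_at(seq, start, period):
    """Count consecutive whole copies of a `period`-long unit from `start`.

    'N' matches any base. The consensus unit starts as the first copy and
    has its 'N' positions filled in by later copies, so every accepted
    copy agrees with all earlier ones on called bases.
    """
    consensus = list(seq[start:start + period])
    if len(consensus) < period:
        return 0
    copies = 1
    pos = start + period
    while pos + period <= len(seq):
        trial = consensus[:]
        compatible = True
        for k, base in enumerate(seq[pos:pos + period]):
            if base == "N":
                continue              # no call: compatible with anything
            if trial[k] == "N":
                trial[k] = base       # first called base at this offset
            elif trial[k] != base:
                compatible = False
                break
        if not compatible:
            break
        consensus = trial
        copies += 1
        pos += period
    return copies


def find_tandem_repeats(seq, period, min_copies=2):
    """Return (start, copies) for each repeat found, scanning left to right."""
    hits = []
    start = 0
    while start + period <= len(seq):
        copies = tandem_repeat_at(seq, start, period)
        if copies >= min_copies:
            hits.append((start, copies))
            start += copies * period  # skip past the reported repeat
        else:
            start += 1
    return hits
```

For instance, in `"GGACTACNACTTT"` with period 3, the no-call in the middle copy does not interrupt the three-copy `ACT` repeat starting at index 2, whereas a scan that compared symbols literally would split it.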