CAP3: A DNA Sequence Assembly Program

doi:10.1101/GR.9.9.868

Xiaoqiu Huang, +1 more

Genome Research, Vol. 9, Iss. 9, pp. 868-877 (01 Sep 1999)

TLDR

The third generation of the CAP sequence assembly program, CAP3, is described. It clips 5′ and 3′ low-quality regions of reads and uses forward–reverse constraints to correct assembly errors and link contigs.

Abstract:

The shotgun sequencing strategy has been used widely in genome sequencing projects. A major phase in this strategy is to assemble short reads into long sequences. A number of DNA sequence assembly programs have been developed (Staden 1980; Peltola et al. 1984; Huang 1992; Smith et al. 1993; Gleizes and Henaut 1994; Lawrence et al. 1994; Kececioglu and Myers 1995; Sutton et al. 1995; Green 1996).

The FAKII program provides a library of routines for each phase of the assembly process (Larson et al. 1996). The GAP4 program has a number of useful interactive features (Bonfield et al. 1995). The PHRAP program clips 5′ and 3′ low-quality regions of reads and uses base quality values in evaluation of overlaps and generation of contig sequences (Green 1996). TIGR Assembler has been used in a number of megabase microbial genome projects (Sutton et al. 1995). Continued development and improvement of sequence assembly programs are required to meet the challenges of the human, mouse, and maize genome projects.

We have developed the third generation of the CAP sequence assembly program (Huang 1992). The CAP3 program includes a number of improvements and new features. A capability to clip 5′ and 3′ low-quality regions of reads is included in the CAP3 program. Base quality values produced by PHRED (Ewing et al. 1998) are used in computation of overlaps between reads, construction of multiple sequence alignments of reads, and generation of consensus sequences. Efficient algorithms are employed to identify and compute overlaps between reads. Forward–reverse constraints are used to correct assembly errors and link contigs.

Results of CAP3 on four BAC data sets are presented. The performance of CAP3 was compared with that of PHRAP on a number of BAC data sets. PHRAP often produces longer contigs than CAP3 whereas CAP3 often produces fewer errors in consensus sequences than PHRAP.
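The idea of clipping 5′ and 3′ low-quality regions can be illustrated with a minimal sketch. This is not CAP3's actual clipping procedure, which the abstract does not specify; it is a simple sliding-window heuristic chosen for illustration. The function name `clip_low_quality`, the `cutoff` of 20, and the `window` of 10 are all assumptions.

```python
from typing import Sequence, Tuple


def clip_low_quality(quals: Sequence[int], cutoff: int = 20,
                     window: int = 10) -> Tuple[int, int]:
    """Return (start, end) of the region to keep, with end exclusive.

    Hypothetical heuristic, NOT CAP3's method: keep the span from the
    first to the last window in which every base quality is >= cutoff.
    Returns (0, 0) when no such window exists.
    """
    n = len(quals)
    if window <= 0 or n < window:
        return (0, 0)

    # good[i] is True when all bases in quals[i:i + window] pass the cutoff.
    good = [min(quals[i:i + window]) >= cutoff for i in range(n - window + 1)]
    if not any(good):
        return (0, 0)

    first = good.index(True)                      # leftmost passing window
    last = len(good) - 1 - good[::-1].index(True)  # rightmost passing window
    return (first, last + window)


# Example read: poor quality at both ends, good quality in the middle.
example_quals = [5] * 5 + [30] * 20 + [5] * 5
kept_start, kept_end = clip_low_quality(example_quals, cutoff=20, window=5)
```

With PHRED-style quality values, the bases outside `[kept_start, kept_end)` would be treated as the clipped 5′ and 3′ regions. Interior low-quality bases between the first and last passing windows are kept by this heuristic.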
It is easier to construct scaffolds with CAP3 than with PHRAP on low-pass data with forward–reverse constraints.

An unusual feature of CAP3 is the use of forward–reverse constraints in the construction of contigs. A forward–reverse constraint is often produced by sequencing both ends of a subclone. It specifies that the two reads should lie on opposite strands of the DNA molecule, within a specified range of distance. By sequencing both ends of each subclone, a large number of forward–reverse constraints are produced for a cosmid or BAC data set. A difficulty with the use of forward–reverse constraints in assembly is that some of them are incorrect because of errors in lane tracking and cloning. Our strategy for dealing with this difficulty is based on the observation that a majority of the constraints are correct and wrong constraints usually occur randomly. Thus, a few unsatisfied constraints in a contig may not be sufficient to indicate an assembly error in the contig. However, if a sufficient number of constraints are all inconsistent with a join in a contig and all support an alternative join, it is likely that the current join is an error, and the alternative join should be made.
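The constraint test and the "sufficient number of constraints" rule described above can be sketched as follows. This is an illustration under stated assumptions, not CAP3's implementation: the `Placement` and `Constraint` types, the way distance is measured (the span covered by both reads), the treatment of reads in different contigs as unsatisfied, and the `min_support` threshold of 3 are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Placement:
    contig: str
    start: int    # leftmost coordinate of the read in the contig
    end: int      # rightmost coordinate, exclusive
    strand: str   # "+" or "-"


@dataclass(frozen=True)
class Constraint:
    read_a: str
    read_b: str
    min_dist: int
    max_dist: int


def is_satisfied(c: Constraint, layout: Dict[str, Placement]) -> bool:
    """True if both reads sit in one contig, on opposite strands,
    and the span they cover falls within [min_dist, max_dist]."""
    a = layout.get(c.read_a)
    b = layout.get(c.read_b)
    if a is None or b is None or a.contig != b.contig:
        return False  # assumption: unplaced or split pairs count as unsatisfied
    if a.strand == b.strand:
        return False
    span = max(a.end, b.end) - min(a.start, b.start)
    return c.min_dist <= span <= c.max_dist


def prefer_alternative(constraints: Iterable[Constraint],
                       current: Dict[str, Placement],
                       alternative: Dict[str, Placement],
                       min_support: int = 3) -> bool:
    """Vote for the alternative join when at least min_support constraints
    fail under the current layout but hold under the alternative one."""
    votes = sum(
        1 for c in constraints
        if not is_satisfied(c, current) and is_satisfied(c, alternative)
    )
    return votes >= min_support
```

Requiring several agreeing constraints, rather than reacting to any single violation, reflects the observation that wrong constraints occur more or less at random while a genuine misjoin tends to violate many constraints in the same way.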

References

Basic Local Alignment Search Tool

Improved tools for biological sequence comparison

Identification of common molecular subsequences

Base-calling of automated sequencer traces using Phred. I. Accuracy assessment

Base-calling of automated sequencer traces using Phred. II. Error probabilities
