scispace - formally typeset
Search or ask a question
Journal ArticleDOI

Highly improved homopolymer aware nucleotide-protein alignments with 454 data

12 Sep 2012-BMC Bioinformatics (BioMed Central)-Vol. 13, Iss: 1, pp 230-230
TL;DR: Increased accuracy provided by HAXAT does not only result in improved homologue estimations, but also provides un-interrupted reading-frames, which greatly facilitate further analysis of protein space, for example phylogenetic analysis.
Abstract: Roche 454 sequencing is the leading sequencing technology for producing long read high throughput sequence data. Unlike most methods where sequencing errors translate to base uncertainties, 454 sequencing inaccuracies create nucleotide gaps. These gaps are particularly troublesome for translated search tools such as BLASTx where they introduce frame-shifts and result in regions of decreased identity and/or terminated alignments, which affect further analysis. To address this issue, the Homopolymer Aware Cross Alignment Tool (HAXAT) was developed. HAXAT uses a novel dynamic programming algorithm for solving the optimal local alignment between a 454 nucleotide and a protein sequence by allowing frame-shifts, guided by 454 flowpeak values. The algorithm is an efficient minimal extension of the Smith-Waterman-Gotoh algorithm that easily fits in into other tools. Experiments using HAXAT demonstrate, through the introduction of 454 specific frame-shift penalties, significantly increased accuracy of alignments spanning homopolymer sequence errors. The full effect of the new parameters introduced with this novel alignment model is explored. Experimental results evaluating homopolymer inaccuracy through alignments show a two to five-fold increase in Matthews Correlation Coefficient over previous algorithms, for 454-derived data. This increased accuracy provided by HAXAT does not only result in improved homologue estimations, but also provides un-interrupted reading-frames, which greatly facilitate further analysis of protein space, for example phylogenetic analysis. The alignment tool is available at http://bioinfo.ifm.liu.se/454tools/haxat .

Content maybe subject to copyright    Report

Citations
More filters
Journal ArticleDOI
16 Jun 2015-PeerJ
TL;DR: Deep sequencing of the viral phoH gene, a host-derived auxiliary metabolic gene, was used to track viral diversity throughout the water column at the Bermuda Atlantic Time-series Study site in the summer and winter of three years, revealing differences in the viral communities throughout a depth profile and between seasons in the same year.
Abstract: Deep sequencing of the viral phoH gene, a host-derived auxiliary metabolic gene, was used to track viral diversity throughout the water column at the Bermuda Atlantic Time-series Study (BATS) site in the summer (September) and winter (March) of three years. Viral phoH sequences reveal differences in the viral communities throughout a depth profile and between seasons in the same year. Variation was also detected between the same seasons in subsequent years, though these differences were not as great as the summer/winter distinctions. Over 3,600 phoH operational taxonomic units (OTUs; 97% sequence identity) were identified. Despite high richness, most phoH sequences belong to a few large, common OTUs whereas the majority of the OTUs are small and rare. While many OTUs make sporadic appearances at just a few times or depths, a small number of OTUs dominate the community throughout the seasons, depths, and years.

41 citations


Cites methods from "Highly improved homopolymer aware n..."

  • ...Next, the HAXAT program (Lysholm, 2012) was applied to the sequences (against a custom-built database of viral phoH sequences) in order to correct homopolymer sequence errors (using default parameters, except that both strands were queried and a minimum score of 200 was used)....

    [...]

Journal ArticleDOI
TL;DR: A method to estimate statistical significance of frameshift alignments, similar to classic BLAST statistics is described, suggesting that metagenomic analysis needs to use frameshIFT alignment to derive accurate results.
Abstract: Motivation: The alignment of DNA sequences to proteins, allowing for frameshifts, is a classic method in sequence analysis. It can help identify pseudogenes (which accumulate mutations), analyze raw DNA and RNA sequence data (which may have frameshift sequencing errors), investigate ribosomal frameshifts, etc. Often, however, only ad hoc approximations or simulations are available to provide the statistical significance of a frameshift alignment score. Results: We describe a method to estimate statistical significance of frameshift alignments, similar to classic BLAST statistics. (BLAST presently does not permit its alignments to include frameshifts.) We also illustrate the continuing usefulness of frameshift alignment with two ‘post-genomic’ applications: (i) when finding pseudogenes within the human genome, frameshift alignments show that most anciently conserved non-coding human elements are recent pseudogenes with conserved ancestral genes; and (ii) when analyzing metagenomic DNA reads from polluted soil, frameshift alignments show that most alignable metagenomic reads contain frameshifts, suggesting that metagenomic analysis needs to use frameshift alignment to derive accurate results. Availability and implementation: The statistical calculation is available in FALP (http://www.ncbi.nlm.nih.gov/CBBresearch/Spouge/html_ncbi/html/index/software.html), and giga-scale frameshift alignment is available in LAST (http://last.cbrc.jp/falp). Contact: vog.hin.mln.ibcn@eguops or pj.crbc@nitram Supplementary information: Supplementary data are available at Bioinformatics online.

32 citations


Cites background from "Highly improved homopolymer aware n..."

  • ...Received on May 12, 2014; revised on July 24, 2014; accepted on August 20, 2014...

    [...]

  • ...We should also mention HAXAT, which accurately aligns Roche 454 DNA sequences to proteins allowing for frameshifts, but is not designed for largescale searches (Lysholm, 2012)....

    [...]

Journal ArticleDOI
TL;DR: A hidden Markov model (HMM) is proposed to statistically and explicitly formulate homopolymer sequencing errors by the overcall, undercall, insertion and deletion and a realignment-based SNP-calling program, termed PyroHMMsnp, is developed, which realigns read sequences around homopolymers according to the error model and then infers the underlying genotype by using a Bayesian approach.
Abstract: Both 454 and Ion Torrent sequencers are capable of producing large amounts of long high-quality sequencing reads. However, as both methods sequence homopolymers in one cycle, they both suffer from homopolymer uncertainty and incorporation asynchronization. In mapping, such sequencing errors could shift alignments around homopolymers and thus induce incorrect mismatches, which have become a critical barrier against the accurate detection of single nucleotide polymorphisms (SNPs). In this article, we propose a hidden Markov model (HMM) to statistically and explicitly formulate homopolymer sequencing errors by the overcall, undercall, insertion and deletion. We use a hierarchical model to describe the sequencing and base-calling processes, and we estimate parameters of the HMM from resequencing data by an expectation-maximization algorithm. Based on the HMM, we develop a realignment-based SNP-calling program, termed PyroHMMsnp, which realigns read sequences around homopolymers according to the error model and then infers the underlying genotype by using a Bayesian approach. Simulation experiments show that the performance of PyroHMMsnp is exceptional across various sequencing coverages in terms of sensitivity, specificity and F1 measure, compared with other tools. Analysis of the human resequencing data shows that PyroHMMsnp predicts 12.9% more SNPs than Samtools while achieving a higher specificity. (http://code.google.com/p/pyrohmmsnp/).

23 citations

Journal ArticleDOI
TL;DR: This paper proposes an algorithm which solves the problem of input data wild cards, offers a highly flexible set of parameters and displays a detailed alignment output and a compact representation of the mutated positions of the alignment.
Abstract: Optimal string alignment is used to discover evolutionary relationships or mutations in DNA/RNA or protein sequences. Errors, missing parts or uncertainty in such a sequence can be covered with wild cards, so-called wild bases. This makes an alignment possible even when the data are corrupted or incomplete. The extended pairwise local alignment of wild card DNA/RNA sequences requires additional calculations in the dynamic programming algorithm and necessitates a subsequent best- and worst-case analysis for the wild card positions. In this paper, we propose an algorithm which solves the problem of input data wild cards, offers a highly flexible set of parameters and displays a detailed alignment output and a compact representation of the mutated positions of the alignment. An implementation of the algorithm can be obtained at https://github.com/sysbio-bioinf/swat+ and http://sysbio.uni-ulm.de/?Software:Swat+.

5 citations


Cites methods from "Highly improved homopolymer aware n..."

  • ...[26] Later algorithms by Huang and others improved the space complexity and also covered more possible mutations like intra codon frameshifts.[27,28] The base for all these dynamic programming algorithms is the recursive formula introduced by Smith and Waterman [24]:...

    [...]

Journal ArticleDOI
TL;DR: In this article , a 64×21 substitution matrix is fitted to sequence data, automatically learning the genetic code and detecting subtly homologous regions by considering alternative possible alignments between them, and calculate significance (probability of occurring by chance between random sequences).
Abstract: Protein fossils, i.e. noncoding DNA descended from coding DNA, arise frequently from transposable elements (TEs), decayed genes, and viral integrations. They can reveal, and mislead about, evolutionary history and relationships. They have been detected by comparing DNA to protein sequences, but current methods are not optimized for this task. We describe a powerful DNA-protein homology search method. We use a 64×21 substitution matrix, which is fitted to sequence data, automatically learning the genetic code. We detect subtly homologous regions by considering alternative possible alignments between them, and calculate significance (probability of occurring by chance between random sequences). Our method detects TE protein fossils much more sensitively than blastx, and > 10× faster. Of the ~7 major categories of eukaryotic TE, three were long thought absent in mammals: we find two of them in the human genome, polinton and DIRS/Ngaro. This method increases our power to find ancient fossils, and perhaps to detect non-standard genetic codes. The alternative-alignments and significance paradigm is not specific to DNA-protein comparison, and could benefit homology search generally. This is an extended version of a conference paper [1].

2 citations

References
More filters
Journal ArticleDOI
15 Sep 2005-Nature
TL;DR: A scalable, highly parallel sequencing system with raw throughput significantly greater than that of state-of-the-art capillary electrophoresis instruments with 96% coverage at 99.96% accuracy in one run of the machine is described.
Abstract: The proliferation of large-scale DNA-sequencing projects in recent years has driven a search for alternative methods to reduce time and cost. Here we describe a scalable, highly parallel sequencing system with raw throughput significantly greater than that of state-of-the-art capillary electrophoresis instruments. The apparatus uses a novel fibre-optic slide of individual wells and is able to sequence 25 million bases, at 99% or better accuracy, in one four-hour run. To achieve an approximately 100-fold increase in throughput over current Sanger sequencing technology, we have developed an emulsion method for DNA amplification and an instrument for sequencing by synthesis using a pyrosequencing protocol optimized for solid support and picolitre-scale volumes. Here we show the utility, throughput, accuracy and robustness of this system by shotgun sequencing and de novo assembly of the Mycoplasma genitalium genome with 96% coverage at 99.96% accuracy in one run of the machine.

8,434 citations


"Highly improved homopolymer aware n..." refers background or methods in this paper

  • ...The intensity of each flow of nucleotide reagents is recorded, as a so-called flowpeak and collected in a flowgram [1]....

    [...]

  • ...One of the early next-generation techniques is Roche 454 sequencing [1], currently extensively used as it produces longer sequence reads compared to other platforms....

    [...]

  • ...The GS20 brought a huge leap in performance over present Sanger based techniques and produced around 20 million bases (Mb) per run [1]....

    [...]

  • ...As the homopolymer length is estimated from peak intensity, occasional uncertainties occur [1]....

    [...]

Journal ArticleDOI
TL;DR: How BLAT was optimized is described, which is more accurate and 500 times faster than popular existing tools for mRNA/DNA alignments and 50 times faster for protein alignments at sensitivity settings typically used when comparing vertebrate sequences.
Abstract: Analyzing vertebrate genomes requires rapid mRNA/DNA and cross-species protein alignments A new tool, BLAT, is more accurate and 500 times faster than popular existing tools for mRNA/DNA alignments and 50 times faster for protein alignments at sensitivity settings typically used when comparing vertebrate sequences BLAT's speed stems from an index of all nonoverlapping K-mers in the genome This index fits inside the RAM of inexpensive computers, and need only be computed once for each genome assembly BLAT has several major stages It uses the index to find regions in the genome likely to be homologous to the query sequence It performs an alignment between homologous regions It stitches together these aligned regions (often exons) into larger alignments (typically genes) Finally, BLAT revisits small internal exons possibly missed at the first stage and adjusts large gap boundaries that have canonical splice sites where feasible This paper describes how BLAT was optimized Effects on speed and sensitivity are explored for various K-mer sizes, mismatch schemes, and number of required index matches BLAT is compared with other alignment programs on various test sets and then used in several genome-wide applications http://genomeucscedu hosts a web-based BLAT server for the human genome

8,326 citations


"Highly improved homopolymer aware n..." refers methods in this paper

  • ...Kent WJ: BLAT–the BLAST-like alignment tool....

    [...]

  • ...Later, methods using improved heuristics, and/or innovative implementations to more efficiently utilise computer memory and processors have been proposed, e.g. MegaBLAST [10], SSAHA2 [11], BLAT [12] and Smith-Waterman-Gotoh alignments utilising SIMD instructions [13]....

    [...]

  • ...MegaBLAST [10], SSAHA2 [11], BLAT [12] and Smith-Waterman-Gotoh alignments utilising SIMD instructions [13]....

    [...]

Journal ArticleDOI
TL;DR: This work has derived substitution matrices from about 2000 blocks of aligned sequence segments characterizing more than 500 groups of related proteins, leading to marked improvements in alignments and in searches using queries from each of the groups.
Abstract: Methods for alignment of protein sequences typically measure similarity by using a substitution matrix with scores for all possible exchanges of one amino acid with another. The most widely used matrices are based on the Dayhoff model of evolutionary rates. Using a different approach, we have derived substitution matrices from about 2000 blocks of aligned sequence segments characterizing more than 500 groups of related proteins. This led to marked improvements in alignments and in searches using queries from each of the groups.

6,553 citations


"Highly improved homopolymer aware n..." refers methods in this paper

  • ...Each mutation position was randomly chosen, while the substitution frequencies were proportional to the corresponding BLOSUM(X) substitution probabilities [23], where X denotes the clustering identity threshold used....

    [...]

Journal ArticleDOI
TL;DR: A new greedy alignment algorithm is introduced with particularly good performance and it is shown that it computes the same alignment as does a certain dynamic programming algorithm, while executing over 10 times faster on appropriate data.
Abstract: For aligning DNA sequences that differ only by sequencing errors, or by equivalent errors from other sources, a greedy algorithm can be much faster than traditional dynamic programming approaches and yet produce an alignment that is guaranteed to be theoretically optimal. We introduce a new greedy alignment algorithm with particularly good performance and show that it computes the same alignment as does a certain dynamic programming algorithm, while executing over 10 times faster on appropriate data. An implementation of this algorithm is currently used in a program that assembles the UniGene database at the National Center for Biotechnology Information.

4,628 citations


"Highly improved homopolymer aware n..." refers methods in this paper

  • ...MegaBLAST [10], SSAHA2 [11], BLAT [12] and Smith-Waterman-Gotoh alignments utilising SIMD instructions [13]....

    [...]

Journal ArticleDOI
TL;DR: Although empirical predictions based on larger numbers of known protein structure tend to be more accurate than those based on a limited sample, the improvement in accuracy is not dramatic, suggesting that the accuracy of current empirical predictive methods will not be substantially increased simply by the inclusion of more data from additional protein structure determinations.

4,522 citations


"Highly improved homopolymer aware n..." refers methods in this paper

  • ...The parameter set which results in the best mean MCC is highlighted in bold. low identity targets....

    [...]

  • ...For example, at 40% identity with flowpeak information, a MCC-value of 0.68 was obtained, compared to 0.41 without the aid of flowpeak information....

    [...]

  • ...True/false positives/negatives were defined as before and the algorithm efficiency were evaluated by MCC, see Figure 5....

    [...]

  • ...This is clear from Table 3 where, even at 100% protein identity and using flowpeak information, the best MCC-value was 0.862....

    [...]

  • ...As the parameter combinations evaluated did not span all possible values of all parameters, some combinations received lower MCC-values than expected....

    [...]