Journal ArticleDOI

Highly improved homopolymer aware nucleotide-protein alignments with 454 data

12 Sep 2012 - BMC Bioinformatics (BioMed Central) - Vol. 13, Iss. 1, p. 230
TL;DR: Increased accuracy provided by HAXAT does not only result in improved homologue estimations, but also provides un-interrupted reading-frames, which greatly facilitate further analysis of protein space, for example phylogenetic analysis.
Abstract: Roche 454 sequencing is the leading sequencing technology for producing long read high throughput sequence data. Unlike most methods where sequencing errors translate to base uncertainties, 454 sequencing inaccuracies create nucleotide gaps. These gaps are particularly troublesome for translated search tools such as BLASTx where they introduce frame-shifts and result in regions of decreased identity and/or terminated alignments, which affect further analysis. To address this issue, the Homopolymer Aware Cross Alignment Tool (HAXAT) was developed. HAXAT uses a novel dynamic programming algorithm for solving the optimal local alignment between a 454 nucleotide and a protein sequence by allowing frame-shifts, guided by 454 flowpeak values. The algorithm is an efficient minimal extension of the Smith-Waterman-Gotoh algorithm that easily fits into other tools. Experiments using HAXAT demonstrate, through the introduction of 454 specific frame-shift penalties, significantly increased accuracy of alignments spanning homopolymer sequence errors. The full effect of the new parameters introduced with this novel alignment model is explored. Experimental results evaluating homopolymer inaccuracy through alignments show a two to five-fold increase in Matthews Correlation Coefficient over previous algorithms, for 454-derived data. This increased accuracy provided by HAXAT does not only result in improved homologue estimations, but also provides un-interrupted reading-frames, which greatly facilitate further analysis of protein space, for example phylogenetic analysis. The alignment tool is available at http://bioinfo.ifm.liu.se/454tools/haxat .
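
The abstract describes HAXAT as a minimal extension of the Smith-Waterman-Gotoh algorithm. As a point of orientation only, the sketch below shows that baseline: local alignment scoring with affine gap penalties for two plain strings. It is not HAXAT. It does no nucleotide-to-protein translation, has no frame-shift states, and uses no flowpeak values. The function name `local_affine_score` and the scoring values (match 2, mismatch -1, gap open 3, gap extend 1) are assumptions chosen for illustration.

```python
NEG_INF = float("-inf")


def local_affine_score(a: str, b: str,
                       match: int = 2, mismatch: int = -1,
                       gap_open: int = 3, gap_extend: int = 1) -> int:
    """Best local alignment score with affine gaps (Smith-Waterman-Gotoh).

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    Only the score is returned; no traceback is kept.
    """
    m = len(b)
    h_prev = [0] * (m + 1)          # H: best score of an alignment ending at (i-1, j)
    f_prev = [NEG_INF] * (m + 1)    # F: best score ending with a gap that consumes a[i-1]
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * (m + 1)
        f_cur = [NEG_INF] * (m + 1)
        e = NEG_INF                 # E: best score ending with a gap that consumes b[j-1]
        for j in range(1, m + 1):
            # Either open a new gap from H or extend the running one.
            e = max(h_cur[j - 1] - gap_open, e - gap_extend)
            f_cur[j] = max(h_prev[j] - gap_open, f_prev[j] - gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 option restarts the alignment, which makes it local.
            h_cur[j] = max(0, h_prev[j - 1] + s, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


if __name__ == "__main__":
    # The second string carries one extra T inside the first string's sequence.
    print(local_affine_score("ACGTACGT", "ACGTTACGT"))
```

Keeping only two rows of H and F gives memory linear in the length of `b`, at the cost of not being able to recover the alignment itself.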


Citations
Journal ArticleDOI
16 Jun 2015 - PeerJ
TL;DR: Deep sequencing of the viral phoH gene, a host-derived auxiliary metabolic gene, was used to track viral diversity throughout the water column at the Bermuda Atlantic Time-series Study site in the summer and winter of three years, revealing differences in the viral communities throughout a depth profile and between seasons in the same year.
Abstract: Deep sequencing of the viral phoH gene, a host-derived auxiliary metabolic gene, was used to track viral diversity throughout the water column at the Bermuda Atlantic Time-series Study (BATS) site in the summer (September) and winter (March) of three years. Viral phoH sequences reveal differences in the viral communities throughout a depth profile and between seasons in the same year. Variation was also detected between the same seasons in subsequent years, though these differences were not as great as the summer/winter distinctions. Over 3,600 phoH operational taxonomic units (OTUs; 97% sequence identity) were identified. Despite high richness, most phoH sequences belong to a few large, common OTUs whereas the majority of the OTUs are small and rare. While many OTUs make sporadic appearances at just a few times or depths, a small number of OTUs dominate the community throughout the seasons, depths, and years.

41 citations


Cites methods from "Highly improved homopolymer aware n..."

  • ...Next, the HAXAT program (Lysholm, 2012) was applied to the sequences (against a custom-built database of viral phoH sequences) in order to correct homopolymer sequence errors (using default parameters, except that both strands were queried and a minimum score of 200 was used)....

    [...]

Journal ArticleDOI
TL;DR: A method to estimate statistical significance of frameshift alignments, similar to classic BLAST statistics is described, suggesting that metagenomic analysis needs to use frameshift alignment to derive accurate results.
Abstract: Motivation: The alignment of DNA sequences to proteins, allowing for frameshifts, is a classic method in sequence analysis. It can help identify pseudogenes (which accumulate mutations), analyze raw DNA and RNA sequence data (which may have frameshift sequencing errors), investigate ribosomal frameshifts, etc. Often, however, only ad hoc approximations or simulations are available to provide the statistical significance of a frameshift alignment score. Results: We describe a method to estimate statistical significance of frameshift alignments, similar to classic BLAST statistics. (BLAST presently does not permit its alignments to include frameshifts.) We also illustrate the continuing usefulness of frameshift alignment with two ‘post-genomic’ applications: (i) when finding pseudogenes within the human genome, frameshift alignments show that most anciently conserved non-coding human elements are recent pseudogenes with conserved ancestral genes; and (ii) when analyzing metagenomic DNA reads from polluted soil, frameshift alignments show that most alignable metagenomic reads contain frameshifts, suggesting that metagenomic analysis needs to use frameshift alignment to derive accurate results. Availability and implementation: The statistical calculation is available in FALP (http://www.ncbi.nlm.nih.gov/CBBresearch/Spouge/html_ncbi/html/index/software.html), and giga-scale frameshift alignment is available in LAST (http://last.cbrc.jp/falp). Contact: spouge@ncbi.nlm.nih.gov or martin@cbrc.jp Supplementary information: Supplementary data are available at Bioinformatics online.

32 citations


Cites background from "Highly improved homopolymer aware n..."

  • ...Received on May 12, 2014; revised on July 24, 2014; accepted on August 20, 2014...

    [...]

  • ...We should also mention HAXAT, which accurately aligns Roche 454 DNA sequences to proteins allowing for frameshifts, but is not designed for large-scale searches (Lysholm, 2012)....

    [...]

Journal ArticleDOI
TL;DR: A hidden Markov model (HMM) is proposed to statistically and explicitly formulate homopolymer sequencing errors by the overcall, undercall, insertion and deletion and a realignment-based SNP-calling program, termed PyroHMMsnp, is developed, which realigns read sequences around homopolymers according to the error model and then infers the underlying genotype by using a Bayesian approach.
Abstract: Both 454 and Ion Torrent sequencers are capable of producing large amounts of long high-quality sequencing reads. However, as both methods sequence homopolymers in one cycle, they both suffer from homopolymer uncertainty and incorporation asynchronization. In mapping, such sequencing errors could shift alignments around homopolymers and thus induce incorrect mismatches, which have become a critical barrier against the accurate detection of single nucleotide polymorphisms (SNPs). In this article, we propose a hidden Markov model (HMM) to statistically and explicitly formulate homopolymer sequencing errors by the overcall, undercall, insertion and deletion. We use a hierarchical model to describe the sequencing and base-calling processes, and we estimate parameters of the HMM from resequencing data by an expectation-maximization algorithm. Based on the HMM, we develop a realignment-based SNP-calling program, termed PyroHMMsnp, which realigns read sequences around homopolymers according to the error model and then infers the underlying genotype by using a Bayesian approach. Simulation experiments show that the performance of PyroHMMsnp is exceptional across various sequencing coverages in terms of sensitivity, specificity and F1 measure, compared with other tools. Analysis of the human resequencing data shows that PyroHMMsnp predicts 12.9% more SNPs than Samtools while achieving a higher specificity. (http://code.google.com/p/pyrohmmsnp/).

23 citations

Journal ArticleDOI
TL;DR: This paper proposes an algorithm which solves the problem of input data wild cards, offers a highly flexible set of parameters and displays a detailed alignment output and a compact representation of the mutated positions of the alignment.
Abstract: Optimal string alignment is used to discover evolutionary relationships or mutations in DNA/RNA or protein sequences. Errors, missing parts or uncertainty in such a sequence can be covered with wild cards, so-called wild bases. This makes an alignment possible even when the data are corrupted or incomplete. The extended pairwise local alignment of wild card DNA/RNA sequences requires additional calculations in the dynamic programming algorithm and necessitates a subsequent best- and worst-case analysis for the wild card positions. In this paper, we propose an algorithm which solves the problem of input data wild cards, offers a highly flexible set of parameters and displays a detailed alignment output and a compact representation of the mutated positions of the alignment. An implementation of the algorithm can be obtained at https://github.com/sysbio-bioinf/swat+ and http://sysbio.uni-ulm.de/?Software:Swat+.

5 citations


Cites methods from "Highly improved homopolymer aware n..."

  • ...[26] Later algorithms by Huang and others improved the space complexity and also covered more possible mutations like intra codon frameshifts.[27,28] The base for all these dynamic programming algorithms is the recursive formula introduced by Smith and Waterman [24]:...

    [...]

Journal ArticleDOI
TL;DR: In this article, a 64×21 substitution matrix is fitted to sequence data, automatically learning the genetic code; subtly homologous regions are detected by considering alternative possible alignments between them, and their significance (the probability of occurring by chance between random sequences) is calculated.
Abstract: Protein fossils, i.e. noncoding DNA descended from coding DNA, arise frequently from transposable elements (TEs), decayed genes, and viral integrations. They can reveal, and mislead about, evolutionary history and relationships. They have been detected by comparing DNA to protein sequences, but current methods are not optimized for this task. We describe a powerful DNA-protein homology search method. We use a 64×21 substitution matrix, which is fitted to sequence data, automatically learning the genetic code. We detect subtly homologous regions by considering alternative possible alignments between them, and calculate significance (probability of occurring by chance between random sequences). Our method detects TE protein fossils much more sensitively than blastx, and > 10× faster. Of the ~7 major categories of eukaryotic TE, three were long thought absent in mammals: we find two of them in the human genome, polinton and DIRS/Ngaro. This method increases our power to find ancient fossils, and perhaps to detect non-standard genetic codes. The alternative-alignments and significance paradigm is not specific to DNA-protein comparison, and could benefit homology search generally. This is an extended version of a conference paper [1].

2 citations

References
Journal ArticleDOI
TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.

88,255 citations


"Highly improved homopolymer aware n..." refers methods in this paper

  • ...Due to this limitation of nucleotide alignments, many metagenomic studies [17-19] perform translated homology searches (against a protein database), using translated BLAST, e.g. tBLASTx or BLASTx [9]....

    [...]

  • ...Furthermore, many tools implement excellent heuristics, for example alignment search tools like TFASTA [7], BLASTx [9], and these can be used to reduce the number of alignments computed by HAXAT....

    [...]

  • ...These gaps are particularly troublesome for translated search tools such as BLASTx where they introduce frame-shifts and result in regions of decreased identity and/or terminated alignments, which affect further analysis....

    [...]

  • ...An example on how to achieve HAXAT results with the aid of BLASTx heuristics (BLAST+ package), using four simple commands, is available at the webpage (http://www.bioinfo.ifm.liu.se/454tools/haxat)....

    [...]

  • ...The webpage also provides a web-version of HAXAT which can either align two sequences using HAXAT alone or search a query sequence against a database using BLASTx heuristics....

    [...]
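
The frame-shift problem mentioned in the excerpts above can be made concrete with a small sketch. It translates a short coding sequence, then removes one base from a homopolymer run (a 454-style undercall) and translates again. The sequences and the helper `translate` are made up for illustration; the codon table is the standard genetic code, and nothing here reproduces how HAXAT or BLASTx work internally.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate DNA in reading frame 0, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


# Hypothetical reference with a run of five A's: ATG GCA AAA AGT CTG GAA GAT
reference = "ATGGCAAAAAGTCTGGAAGAT"

# Simulated homopolymer undercall: the run of five A's is read as four.
read = reference.replace("AAAAA", "AAAA", 1)

if __name__ == "__main__":
    print(translate(reference))   # MAKSLED
    print(translate(read))        # MAKVWK: residues after the run no longer match
    # Re-reading the tail of the read in the shifted frame recovers the
    # original downstream residues, which is what a frame-shift-aware
    # aligner can exploit.
    print(translate(read[11:]))   # LED
```

A translated search that stays in one frame sees only the mismatching tail, which is why such alignments lose identity or terminate at the homopolymer.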

Journal ArticleDOI
TL;DR: A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original.
Abstract: The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSIBLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.

70,111 citations

Journal ArticleDOI
TL;DR: Three computer programs for comparisons of protein and DNA sequences can be used to search sequence data bases, evaluate similarity scores, and identify periodic structures based on local sequence similarity.
Abstract: We have developed three computer programs for comparisons of protein and DNA sequences. They can be used to search sequence data bases, evaluate similarity scores, and identify periodic structures based on local sequence similarity. The FASTA program is a more sensitive derivative of the FASTP program, which can be used to search protein or DNA sequence data bases and can compare a protein sequence to a DNA sequence data base by translating the DNA data base as it is searched. FASTA includes an additional step in the calculation of the initial pairwise similarity score that allows multiple regions of similarity to be joined to increase the score of related sequences. The RDF2 program can be used to evaluate the significance of similarity scores using a shuffling method that preserves local sequence composition. The LFASTA program can display all the regions of local similarity between two sequences with scores greater than a threshold, using the same scoring parameters and a similar alignment algorithm; these local similarities can be displayed as a "graphic matrix" plot or as individual alignments. In addition, these programs have been generalized to allow comparison of DNA or protein sequences based on a variety of alternative scoring matrices.

12,432 citations


"Highly improved homopolymer aware n..." refers methods in this paper

  • ...Furthermore, many tools implement excellent heuristics, for example alignment search tools like TFASTA [7], BLASTx [9], and these can be used to reduce the number of alignments computed by HAXAT....

    [...]

  • ...Through this novel algorithm, HAXAT produces more sensitive results with 454 data, even in the absence of flowpeak information (FASTA input)....

    [...]

  • ...The second mode uses a 454 aware alignment model without flowpeak information, i.e. running HAXAT with FASTA input (e.g. homopolymer aware alignment)....

    [...]

  • ...The models evaluated are: a 454 model using flowpeak information with and without validation (denoted +V) as well as without flowpeak information (FASTA input) with and without validation and finally a neutral (non-454-aware) model (FASTA input)....

    [...]

Journal ArticleDOI
TL;DR: A computer adaptable method for finding similarities in the amino acid sequences of two proteins has been developed and it is possible to determine whether significant homology exists between the proteins to trace their possible evolutionary development.

11,844 citations

Journal ArticleDOI
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).

10,262 citations


"Highly improved homopolymer aware n..." refers background in this paper

  • ...In 1981, Smith and Waterman defined the local alignment and proposed a slightly modified algorithm to solve it [4]....

    [...]
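
For readers who want the recurrence behind the local alignment mentioned in the excerpt above, the following is the commonly cited textbook form of the Smith-Waterman recursion. The notation is an assumption and is not taken from this page: $a$ and $b$ are the two sequences, $s(a_i, b_j)$ is the similarity score of two residues, $W_k$ is the penalty for a gap of length $k$, and $H_{i,j}$ is the best score of a local alignment ending at positions $i$ and $j$.

```latex
\begin{aligned}
H_{i,0} &= H_{0,j} = 0, \\
H_{i,j} &= \max
\begin{cases}
0, \\
H_{i-1,j-1} + s(a_i, b_j), \\
\max_{k \ge 1} \bigl( H_{i-k,j} - W_k \bigr), \\
\max_{l \ge 1} \bigl( H_{i,j-l} - W_l \bigr).
\end{cases}
\end{aligned}
```

The optimal local alignment score is the largest entry of $H$. When $W_k$ is affine in $k$, the two inner maxima can be maintained in constant time per cell, which is the Gotoh refinement referred to in the abstract.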