
Journal Article

Highly improved homopolymer aware nucleotide-protein alignments with 454 data

12 Sep 2012 - BMC Bioinformatics (BioMed Central) - Vol. 13, Iss: 1, pp 230-230

TL;DR: Increased accuracy provided by HAXAT does not only result in improved homologue estimations, but also provides un-interrupted reading-frames, which greatly facilitate further analysis of protein space, for example phylogenetic analysis.

Abstract: Roche 454 sequencing is the leading sequencing technology for producing long read high throughput sequence data. Unlike most methods where sequencing errors translate to base uncertainties, 454 sequencing inaccuracies create nucleotide gaps. These gaps are particularly troublesome for translated search tools such as BLASTx where they introduce frame-shifts and result in regions of decreased identity and/or terminated alignments, which affect further analysis. To address this issue, the Homopolymer Aware Cross Alignment Tool (HAXAT) was developed. HAXAT uses a novel dynamic programming algorithm for solving the optimal local alignment between a 454 nucleotide and a protein sequence by allowing frame-shifts, guided by 454 flowpeak values. The algorithm is an efficient minimal extension of the Smith-Waterman-Gotoh algorithm that easily fits into other tools. Experiments using HAXAT demonstrate, through the introduction of 454 specific frame-shift penalties, significantly increased accuracy of alignments spanning homopolymer sequence errors. The full effect of the new parameters introduced with this novel alignment model is explored. Experimental results evaluating homopolymer inaccuracy through alignments show a two to five-fold increase in Matthews Correlation Coefficient over previous algorithms, for 454-derived data. This increased accuracy provided by HAXAT not only results in improved homologue estimations, but also provides un-interrupted reading-frames, which greatly facilitate further analysis of protein space, for example phylogenetic analysis. The alignment tool is available at http://bioinfo.ifm.liu.se/454tools/haxat.
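
To make the frame-shift problem described in the abstract concrete, the minimal Python sketch below translates a short coding sequence before and after a single-base homopolymer undercall. The two sequences and the `translate` helper are illustrative assumptions, not HAXAT code or data from the paper; only the codon table (the standard genetic code) is a fixed fact.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA in frame 0, ignoring a trailing partial codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


# Hypothetical example: the read has lost one A from the AAA homopolymer.
reference = "ATGGCCAAAGATCTGTGGTAA"
read = "ATGGCCAAGATCTGTGGTAA"

print(translate(reference))  # MAKDLW*
print(translate(read))       # MAKICG
```

Every residue downstream of the shortened homopolymer changes, which is why a translated search that cannot model frame-shifts loses identity or terminates the alignment at that point.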


Citations

Journal Article
16 Jun 2015 - PeerJ
TL;DR: Deep sequencing of the viral phoH gene, a host-derived auxiliary metabolic gene, was used to track viral diversity throughout the water column at the Bermuda Atlantic Time-series Study site in the summer and winter of three years, revealing differences in the viral communities throughout a depth profile and between seasons in the same year.
Abstract: Deep sequencing of the viral phoH gene, a host-derived auxiliary metabolic gene, was used to track viral diversity throughout the water column at the Bermuda Atlantic Time-series Study (BATS) site in the summer (September) and winter (March) of three years. Viral phoH sequences reveal differences in the viral communities throughout a depth profile and between seasons in the same year. Variation was also detected between the same seasons in subsequent years, though these differences were not as great as the summer/winter distinctions. Over 3,600 phoH operational taxonomic units (OTUs; 97% sequence identity) were identified. Despite high richness, most phoH sequences belong to a few large, common OTUs whereas the majority of the OTUs are small and rare. While many OTUs make sporadic appearances at just a few times or depths, a small number of OTUs dominate the community throughout the seasons, depths, and years.

30 citations


Cites methods from "Highly improved homopolymer aware n..."

  • ...Next, the HAXAT program (Lysholm, 2012) was applied to the sequences (against a custom-built database of viral phoH sequences) in order to correct homopolymer sequence errors (using default parameters, except that both strands were queried and a minimum score of 200 was used)....

    [...]


Journal Article
TL;DR: A method to estimate statistical significance of frameshift alignments, similar to classic BLAST statistics is described, suggesting that metagenomic analysis needs to use frameshift alignment to derive accurate results.
Abstract: Motivation: The alignment of DNA sequences to proteins, allowing for frameshifts, is a classic method in sequence analysis. It can help identify pseudogenes (which accumulate mutations), analyze raw DNA and RNA sequence data (which may have frameshift sequencing errors), investigate ribosomal frameshifts, etc. Often, however, only ad hoc approximations or simulations are available to provide the statistical significance of a frameshift alignment score. Results: We describe a method to estimate statistical significance of frameshift alignments, similar to classic BLAST statistics. (BLAST presently does not permit its alignments to include frameshifts.) We also illustrate the continuing usefulness of frameshift alignment with two ‘post-genomic’ applications: (i) when finding pseudogenes within the human genome, frameshift alignments show that most anciently conserved non-coding human elements are recent pseudogenes with conserved ancestral genes; and (ii) when analyzing metagenomic DNA reads from polluted soil, frameshift alignments show that most alignable metagenomic reads contain frameshifts, suggesting that metagenomic analysis needs to use frameshift alignment to derive accurate results. Availability and implementation: The statistical calculation is available in FALP (http://www.ncbi.nlm.nih.gov/CBBresearch/Spouge/html_ncbi/html/index/software.html), and giga-scale frameshift alignment is available in LAST (http://last.cbrc.jp/falp). Contact: spouge@ncbi.nlm.nih.gov or martin@cbrc.jp Supplementary information: Supplementary data are available at Bioinformatics online.

26 citations


Cites background from "Highly improved homopolymer aware n..."

  • ...Received on May 12, 2014; revised on July 24, 2014; accepted on August 20, 2014...

    [...]

  • ...We should also mention HAXAT, which accurately aligns Roche 454 DNA sequences to proteins allowing for frameshifts, but is not designed for largescale searches (Lysholm, 2012)....

    [...]


Journal Article
TL;DR: A hidden Markov model (HMM) is proposed to statistically and explicitly formulate homopolymer sequencing errors by the overcall, undercall, insertion and deletion and a realignment-based SNP-calling program, termed PyroHMMsnp, is developed, which realigns read sequences around homopolymers according to the error model and then infers the underlying genotype by using a Bayesian approach.
Abstract: Both 454 and Ion Torrent sequencers are capable of producing large amounts of long high-quality sequencing reads. However, as both methods sequence homopolymers in one cycle, they both suffer from homopolymer uncertainty and incorporation asynchronization. In mapping, such sequencing errors could shift alignments around homopolymers and thus induce incorrect mismatches, which have become a critical barrier against the accurate detection of single nucleotide polymorphisms (SNPs). In this article, we propose a hidden Markov model (HMM) to statistically and explicitly formulate homopolymer sequencing errors by the overcall, undercall, insertion and deletion. We use a hierarchical model to describe the sequencing and base-calling processes, and we estimate parameters of the HMM from resequencing data by an expectation-maximization algorithm. Based on the HMM, we develop a realignment-based SNP-calling program, termed PyroHMMsnp, which realigns read sequences around homopolymers according to the error model and then infers the underlying genotype by using a Bayesian approach. Simulation experiments show that the performance of PyroHMMsnp is exceptional across various sequencing coverages in terms of sensitivity, specificity and F1 measure, compared with other tools. Analysis of the human resequencing data shows that PyroHMMsnp predicts 12.9% more SNPs than Samtools while achieving a higher specificity. (http://code.google.com/p/pyrohmmsnp/).
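
As a rough illustration of the homopolymer overcalls and undercalls that this abstract refers to, the Python sketch below run-length encodes two sequences and reports the homopolymer runs whose lengths differ. It is a toy comparison under the assumption that the read and the reference differ only in run lengths; it is not the hidden Markov model used by PyroHMMsnp, and the function names are hypothetical.

```python
from itertools import groupby


def homopolymer_runs(seq):
    """Run-length encode a sequence as (base, run_length) pairs."""
    return [(base, len(list(group))) for base, group in groupby(seq)]


def run_length_differences(reference, read):
    """List (base, reference_length, read_length) for runs that differ.

    Assumes both sequences consist of the same runs in the same order,
    so that only the run lengths can disagree.
    """
    ref_runs = homopolymer_runs(reference)
    read_runs = homopolymer_runs(read)
    if [b for b, _ in ref_runs] != [b for b, _ in read_runs]:
        raise ValueError("sequences differ by more than homopolymer lengths")
    return [
        (base, n_ref, n_read)
        for (base, n_ref), (_, n_read) in zip(ref_runs, read_runs)
        if n_ref != n_read
    ]


# Hypothetical read with a G undercall (3 -> 2) and a T overcall (2 -> 3).
print(run_length_differences("ACGGGTTA", "ACGGTTTA"))
# [('G', 3, 2), ('T', 2, 3)]
```

In a base-by-base alignment the same two errors would appear as a deletion and an insertion, or as a spurious mismatch, which is the shifted-alignment effect the abstract describes.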

22 citations


Journal Article
TL;DR: This paper proposes an algorithm which solves the problem of input data wild cards, offers a highly flexible set of parameters and displays a detailed alignment output and a compact representation of the mutated positions of the alignment.
Abstract: Optimal string alignment is used to discover evolutionary relationships or mutations in DNA/RNA or protein sequences. Errors, missing parts or uncertainty in such a sequence can be covered with wild cards, so-called wild bases. This makes an alignment possible even when the data are corrupted or incomplete. The extended pairwise local alignment of wild card DNA/RNA sequences requires additional calculations in the dynamic programming algorithm and necessitates a subsequent best- and worst-case analysis for the wild card positions. In this paper, we propose an algorithm which solves the problem of input data wild cards, offers a highly flexible set of parameters and displays a detailed alignment output and a compact representation of the mutated positions of the alignment. An implementation of the algorithm can be obtained at https://github.com/sysbio-bioinf/swat+ and http://sysbio.uni-ulm.de/?Software:Swat+.

5 citations


Cites methods from "Highly improved homopolymer aware n..."

  • ...[26] Later algorithms by Huang and others improved the space complexity and also covered more possible mutations like intra codon frameshifts.[27,28] The base for all these dynamic programming algorithms is the recursive formula introduced by Smith and Waterman [24]:...

    [...]
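
The snippet above is cut off before the formula it introduces. For orientation only, the Smith-Waterman recursion is commonly written as below; the symbols $H$, $s$ and $W_k$ follow the usual textbook notation and are assumptions here, not necessarily the notation of the citing paper.

```latex
% H_{i,j}: best score of a local alignment ending at a_i and b_j
% s(a_i, b_j): substitution score; W_k: penalty for a gap of length k
H_{i,0} = H_{0,j} = 0, \qquad
H_{i,j} = \max
\begin{cases}
  H_{i-1,j-1} + s(a_i, b_j), \\
  \max_{k \ge 1} \bigl( H_{i-k,j} - W_k \bigr), \\
  \max_{l \ge 1} \bigl( H_{i,j-l} - W_l \bigr), \\
  0
\end{cases}
```

The zero in the maximum is what makes the alignment local: a path whose score would drop below zero is abandoned and a new one can start at any cell.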


Dissertation
04 Dec 2013
TL;DR: This study has demonstrated significant novel linkages between the transcriptional TRF and post-transcriptional microRNA-mediated regulatory layers and contributes to the characterization of both natural and pathogenic SIV infections, with longer term implications for HIV therapeutics.
Abstract: This thesis was presented by Aaron Webber on the 4th December 2013 for the degree of Doctor of Philosophy from the University of Manchester. The title of this thesis is "Transcriptional co-regulation of microRNAs and protein-coding genes". The thesis relates to gene expression regulation within humans and closely related primate species. We have investigated the binding site distributions from publicly available ChIP-seq data of 117 transcription regulatory factors (TRFs) within the human genome. These were mapped to cis-regulatory regions of two major classes of genes, ~20,000 genes encoding proteins and ~1500 genes encoding microRNAs. MicroRNAs are short 20-24 nt noncoding RNAs which bind complementary regions within target mRNAs to repress translation. The complete collection of ChIP-seq binding site data is related to genomic associations between protein-coding and microRNA genes, and to the expression patterns and functions of both gene types across human tissues. We show that microRNA genes are associated with highly regulated protein-coding gene regions, and show rigorously that transcriptional regulation is greater than expected, given properties of these protein-coding genes. We find enrichment in developmental proteins among protein-coding genes hosting microRNA sequences. Novel subclasses of microRNAs are identified that lie outside of protein-coding genes yet may still be expressed from a shared promoter region with their protein-coding neighbours. We show that such microRNAs are more likely to form regulatory feedback loops with the transcriptional regulators lying in the upstream protein-coding promoter region. We show that when a microRNA and a TRF regulate one another, the TRF is more likely to sometimes function as a repressor. As in many studies, the data show that microRNAs lying downstream of particular TRFs target significantly many genes in common with these TRFs.
We then demonstrate that the prevalence of such TRF/microRNA regulatory partnerships relates directly to the variation in mRNA expression across human tissues, with the least variable mRNAs having the most significant enrichment in such partnerships. This result is connected to theory describing the buffering of gene expression variation by microRNAs. Taken together, our study has demonstrated significant novel linkages between the transcriptional TRF and post-transcriptional microRNA-mediated regulatory layers. We finally consider transcriptional regulators alone, by mapping these to genes clustered on the basis of their expression patterns through time, within the context of CD4+ T cells from African green monkeys and Rhesus macaques infected with Simian immunodeficiency virus (SIV). African green monkeys maintain a functioning immune system despite never clearing the virus, while in rhesus macaques, the immune system becomes chronically stimulated leading to pathogenesis. Gene expression clusters were identified characterizing the natural and pathogenic host systems. We map transcriptional regulators to these expression clusters and demonstrate significant yet unexpected co-binding by two heterodimers (STAT1:STAT2 and BATF:IRF4) over key viral response genes. From 34 structural families of TRFs, we demonstrate that bZIPs, STATs and IRFs are the most frequently perturbed upon SIV infection. Our work therefore contributes to the characterization of both natural and pathogenic SIV infections, with longer-term implications for HIV therapeutics.

2 citations


Cites background from "Highly improved homopolymer aware n..."

  • ...The method is fast, relatively inexpensive, produces reads of sufficient length for de novo genome assembly (up to 700 bp) and has high accuracy in general though performs poorly for homopolymeric sequences (Lysholm 2012)....

    [...]


References

Journal Article
TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.
Abstract: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score. Recent mathematical results on the stochastic properties of MSP scores allow an analysis of the performance of this method as well as the statistical significance of alignments it generates. The basic algorithm is simple and robust; it can be implemented in a number of ways and applied in a variety of contexts including straight-forward DNA and protein sequence database searches, motif searches, gene identification searches, and in the analysis of multiple regions of similarity in long DNA sequences. In addition to its flexibility and tractability to mathematical analysis, BLAST is an order of magnitude faster than existing sequence comparison tools of comparable sensitivity.

81,150 citations


"Highly improved homopolymer aware n..." refers methods in this paper

  • ...Due to this limitation of nucleotide alignments, many metagenomic studies [17-19] perform translated homology searches (against a protein database), using translated BLAST, e.g. tBLASTx or BLASTx [9]....

    [...]

  • ...Furthermore, many tools implement excellent heuristics, for example alignment search tools like TFASTA [7], BLASTx [9], and these can be used to reduce the number of alignments computed by HAXAT....

    [...]

  • ...These gaps are particularly troublesome for translated search tools such as BLASTx where they introduce frame-shifts and result in regions of decreased identity and/or terminated alignments, which affect further analysis....

    [...]

  • ...An example on how to achieve HAXAT results with the aid of BLASTx heuristics (BLAST+ package), using four simple commands, is available at the webpage (http://www.bioinfo.ifm.liu.se/454tools/haxat)....

    [...]

  • ...The webpage also provides a web-version of HAXAT which can either align two sequences using HAXAT alone or search a query sequence against a database using BLASTx heuristics....

    [...]


Journal Article
TL;DR: A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original.
Abstract: The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSIBLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.

66,744 citations


Journal Article
TL;DR: Three computer programs for comparisons of protein and DNA sequences can be used to search sequence data bases, evaluate similarity scores, and identify periodic structures based on local sequence similarity.
Abstract: We have developed three computer programs for comparisons of protein and DNA sequences. They can be used to search sequence data bases, evaluate similarity scores, and identify periodic structures based on local sequence similarity. The FASTA program is a more sensitive derivative of the FASTP program, which can be used to search protein or DNA sequence data bases and can compare a protein sequence to a DNA sequence data base by translating the DNA data base as it is searched. FASTA includes an additional step in the calculation of the initial pairwise similarity score that allows multiple regions of similarity to be joined to increase the score of related sequences. The RDF2 program can be used to evaluate the significance of similarity scores using a shuffling method that preserves local sequence composition. The LFASTA program can display all the regions of local similarity between two sequences with scores greater than a threshold, using the same scoring parameters and a similar alignment algorithm; these local similarities can be displayed as a "graphic matrix" plot or as individual alignments. In addition, these programs have been generalized to allow comparison of DNA or protein sequences based on a variety of alternative scoring matrices.

12,324 citations


"Highly improved homopolymer aware n..." refers methods in this paper

  • ...Furthermore, many tools implement excellent heuristics, for example alignment search tools like TFASTA [7], BLASTx [9], and these can be used to reduce the number of alignments computed by HAXAT....

    [...]

  • ...Through this novel algorithm, HAXAT produces more sensitive results with 454 data, even in the absence of flowpeak information (FASTA input)....

    [...]

  • ...The second mode uses a 454 aware alignment model without flowpeak information, i.e. running HAXAT with FASTA input (e.g. homopolymer aware alignment)....

    [...]

  • ...The models evaluated are: a 454 model using flowpeak information with and without validation (denoted +V) as well as without flowpeak information (FASTA input) with and without validation and finally a neutral (non-454-aware) model (FASTA input)....

    [...]


Journal Article
TL;DR: A computer adaptable method for finding similarities in the amino acid sequences of two proteins has been developed and it is possible to determine whether significant homology exists between the proteins to trace their possible evolutionary development.
Abstract: A computer adaptable method for finding similarities in the amino acid sequences of two proteins has been developed. From these findings it is possible to determine whether significant homology exists between the proteins. This information is used to trace their possible evolutionary development. The maximum match is a number dependent upon the similarity of the sequences. One of its definitions is the largest number of amino acids of one protein that can be matched with those of a second protein allowing for all possible interruptions in either of the sequences. While the interruptions give rise to a very large number of comparisons, the method efficiently excludes from consideration those comparisons that cannot contribute to the maximum match. Comparisons are made from the smallest unit of significance, a pair of amino acids, one from each protein. All possible pairs are represented by a two-dimensional array, and all possible comparisons are represented by pathways through the array. For this maximum match only certain of the possible pathways must be evaluated. A numerical value, one in this case, is assigned to every cell in the array representing like amino acids. The maximum match is the largest number that would result from summing the cell values of every

11,308 citations


Journal Article
TL;DR: This letter extends the heuristic homology algorithm of Needleman & Wunsch (1970) to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology).
Abstract: The identification of maximally homologous subsequences among sets of long sequences is an important problem in molecular sequence analysis. The problem is straightforward only if one restricts consideration to contiguous subsequences (segments) containing no internal deletions or insertions. The more general problem has its solution in an extension of sequence metrics (Sellers 1974; Waterman et al., 1976) developed to measure the minimum number of “events” required to convert one sequence into another. These developments in the modern sequence analysis began with the heuristic homology algorithm of Needleman & Wunsch (1970) which first introduced an iterative matrix method of calculation. Numerous other heuristic algorithms have been suggested including those of Fitch (1966) and Dayhoff (1969). More mathematically rigorous algorithms were suggested by Sankoff (1972), Reichert et al. (1973) and Beyer et al. (1979) but these were generally not biologically satisfying or interpretable. Success came with Sellers (1974) development of a true metric measure of the distance between sequences. This metric was later generalized by Waterman et al. (1976) to include deletions/insertions of arbitrary length. This metric represents the minimum number of “mutational events” required to convert one sequence into another. It is of interest to note that Smith et al. (1980) have recently shown that under some conditions the generalized Sellers metric is equivalent to the original homology algorithm of Needleman & Wunsch (1970). In this letter we extend the above ideas to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology). The similarity measure used here allows for arbitrary length deletions and insertions.
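
The segment-pair search described in this abstract can be sketched in a few lines of Python. The version below is a simplified illustration, not the formulation from the letter: it assumes a linear gap penalty (each gapped position costs the same) instead of arbitrary length-dependent penalties, and the scoring values are arbitrary defaults chosen for the example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Simplifying assumption: linear gap penalty. Only two rows of the
    dynamic programming matrix are kept, so no traceback is possible.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets a new local alignment start at any cell.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))  # 6, from the shared "ACG"
```

Recovering the aligned segments themselves would require keeping the full matrix and tracing back from the best-scoring cell; the two-row form above trades that away for lower memory use.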

9,761 citations


"Highly improved homopolymer aware n..." refers background in this paper

  • ...In 1981, Smith and Waterman defined the local alignment and proposed a slightly modified algorithm to solve it [4]....

    [...]