Journal ArticleDOI

DIALIGN-T: an improved algorithm for segment-based multiple sequence alignment.

TL;DR: A complete re-implementation of the segment-based approach to multiple protein alignment that contains a number of improvements compared to the previous version 2.2 of DIALIGN and is comparable to the standard global aligner CLUSTAL W, though it is outperformed by some newly developed programs that focus on global alignment.
Abstract: Background: We present a complete re-implementation of the segment-based approach to multiple protein alignment that contains a number of improvements compared to the previous version 2.2 of DIALIGN. This previous version is superior to Needleman-Wunsch-based multi-alignment programs on locally related sequence sets. However, it is often outperformed by these methods on data sets with global but weak similarity at the primary-sequence level.


Citations
Journal ArticleDOI
TL;DR: The initial version of the MAFFT program was developed in 2002 and was updated in 2007 with two new techniques: the PartTree algorithm and the Four-way consistency objective function, which improved the scalability of progressive alignment and the accuracy of ncRNA alignment.
Abstract: The accuracy and scalability of multiple sequence alignment (MSA) of DNAs and proteins have long been and are still important issues in bioinformatics. To rapidly construct a reasonable MSA, we developed the initial version of the MAFFT program in 2002. MSA software is now facing greater challenges in both scalability and accuracy than those of 5 years ago. As increasing amounts of sequence data are being generated by large-scale sequencing projects, scalability is now critical in many situations. The requirement of accuracy has also entered a new stage since the discovery of functional noncoding RNAs (ncRNAs); the secondary structure should be considered for constructing a high-quality alignment of distantly related ncRNAs. To deal with these problems, in 2007, we updated MAFFT to Version 6 with two new techniques: the PartTree algorithm and the Four-way consistency objective function. The former improved the scalability of progressive alignment and the latter improved the accuracy of ncRNA alignment. We review these and other techniques that MAFFT uses and suggest possible future directions of MSA software as a basis of comparative analyses. MAFFT is available at http://align.bmr.kyushu-u.ac.jp/mafft/software/.

3,278 citations


Cites methods from "DIALIGN-T: an improved algorithm fo..."

  • ...Some MSA methods, such as DIALIGN [37–39] and T-Coffee, have a facility to incorporate a local alignment algorithm to detect short patches of strong sequence similarity....

    [...]

Journal ArticleDOI
TL;DR: M-Coffee is a meta-method for assembling multiple sequence alignments (MSA) by combining the output of several individual methods into one single MSA that is robust to variations in the choice of constituent methods and reasonably tolerant to duplicate MSAs.
Abstract: We introduce M-Coffee, a meta-method for assembling multiple sequence alignments (MSA) by combining the output of several individual methods into one single MSA. M-Coffee is an extension of T-Coffee and uses consistency to estimate a consensus alignment. We show that the procedure is robust to variations in the choice of constituent methods and reasonably tolerant to duplicate MSAs. We also show that performance can be improved by carefully selecting the constituent methods. M-Coffee outperforms all the individual methods on three major reference datasets: HOMSTRAD, Prefab and Balibase. We also show that on a case-by-case basis, M-Coffee is twice as likely to deliver the best alignment as any individual method. Given a collection of pre-computed MSAs, M-Coffee has similar CPU requirements to the original T-Coffee. M-Coffee is a freeware open-source package available from http://www.tcoffee.org/.

566 citations

Journal ArticleDOI
TL;DR: Although CLUSTALW is still the most popular alignment tool to date, recent methods offer significantly better alignment quality and, in some cases, reduced computational cost.

530 citations


Cites background or methods from "DIALIGN-T: an improved algorithm fo..."

  • ...New multiple alignment benchmark databases include PREFAB, SABMARK, OXBENCH and IRMBASE....

    [...]

  • ...Subramanian AR, Weyer-Menkhoff J, Kaufmann M, Morgenstern B: DIALIGN-T: an improved algorithm for segment-based multiple sequence alignment....

    [...]

  • ...The recently published methods ALIGN-M [23], DIALIGN [8,14,24], POA [25,26] and SATCHMO [27] have relaxed the requirement for global alignability by allowing both alignable and non-alignable regions....

    [...]

  • ...The most accurate programs. Table 1: Summary of MSA programs that we consider to be the best currently. Program: Advantages. CLUSTALW: Uses less memory than other programs. DIALIGN: Attempts to distinguish between alignable and non-alignable regions. MAFFT, MUSCLE: Faster and more accurate than CLUSTALW; good trade-off of accuracy and computational cost....

    [...]

  • ...Recently, several new benchmarks have appeared, including OXBENCH [11], PREFAB [12], SABmark [13], IRMBASE [14] and a new, extended version of BALIBASE (http://www-bio3digbmc.u-strasbg.fr/balibase/)....

    [...]

Journal ArticleDOI
TL;DR: DIALIGN-TX is presented, a substantial improvement of DIALIGN-T that combines the previous greedy algorithm with a progressive alignment approach and produces significantly better alignments, especially on globally related sequences, without excessively increasing CPU time and memory consumption.
Abstract: DIALIGN-T is a reimplementation of the multiple-alignment program DIALIGN. Due to several algorithmic improvements, it produces significantly better alignments on locally and globally related sequence sets than previous versions of DIALIGN. However, like the original implementation of the program, DIALIGN-T uses a straightforward greedy approach to assemble multiple alignments from local pairwise sequence similarities. Such greedy approaches may be vulnerable to spurious random similarities and can therefore lead to suboptimal results. In this paper, we present DIALIGN-TX, a substantial improvement of DIALIGN-T that combines our previous greedy algorithm with a progressive alignment approach. Our new heuristic produces significantly better alignments, especially on globally related sequences, without excessively increasing CPU time and memory consumption. The new method is based on a guide tree; to detect possible spurious sequence similarities, it employs a vertex-cover approximation on a conflict graph. We performed benchmarking tests on a large set of nucleic acid and protein sequences. For protein benchmarks we used the benchmark database BALIBASE 3 and an updated release of the database IRMBASE 2 for assessing the quality on globally and locally related sequences, respectively. For alignment of nucleic acid sequences, we used BRAliBase II for global alignment and a newly developed database of locally related sequences called DIRMBASE 1. IRMBASE 2 and DIRMBASE 1 are constructed by implanting highly conserved motifs at random positions in long unalignable sequences. On BALIBASE 3, our new program performs significantly better than the previous program DIALIGN-T and outperforms the popular global aligner CLUSTAL W, though it is still outperformed by programs that focus on global alignment like MAFFT, MUSCLE and T-COFFEE. On the locally related test sets in IRMBASE 2 and DIRMBASE 1, our method outperforms all other programs, while MAFFT E-INSi is the only method that comes close to the performance of DIALIGN-TX.
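The abstract above mentions a vertex-cover approximation on a conflict graph. As a rough illustration only, the following Python sketch shows the textbook maximal-matching 2-approximation for unweighted vertex cover. It is not necessarily the variant used in DIALIGN-TX; the function name `approx_vertex_cover`, the fragment identifiers, and the edge-list representation of the conflict graph are all assumptions made for this example.

```python
def approx_vertex_cover(edges):
    """Textbook 2-approximation for vertex cover (maximal-matching heuristic).

    `edges` is an iterable of (u, v) pairs. The returned set touches every
    edge and is at most twice as large as a minimum vertex cover.
    """
    cover = set()
    for u, v in edges:
        # An edge with neither endpoint selected is still uncovered:
        # take both of its endpoints.
        if u not in cover and v not in cover:
            cover.add(u)
            cover.add(v)
    return cover


# Hypothetical conflict graph: vertices are fragment ids, and an edge joins
# two fragments that cannot both belong to one consistent alignment.
conflicts = [("f1", "f2"), ("f2", "f3"), ("f3", "f4")]
cover = approx_vertex_cover(conflicts)
print(sorted(cover))  # ['f1', 'f2', 'f3', 'f4']
```

In this toy input the heuristic selects all four fragments, while the optimum cover is {f2, f3}; that is the worst-case factor of two the approximation guarantee allows.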

261 citations


Cites background or methods from "DIALIGN-T: an improved algorithm fo..."

  • ...The rationale behind this approach is that a fragment from a sequence pair with high overall similarity is less likely to be a random artefact than a fragment from an otherwise non-related sequence pair, see [16] for details....

    [...]

  • ...The benchmarks on locally related sequence sets were run on IRMBASE 2 for proteins and DIRMBASE 1 for DNA sequences, which have been constructed in a very similar way as IRMBASE 1 [16] by implanting highly conserved motifs generated by ROSE [39] in long random sequences....

    [...]

  • ...Some modifications have been introduced, such as overlap weights [1] and a more context-sensitive approach that takes into account the overall significance of the pairwise alignment to which a fragment belongs [16]....

    [...]

  • ...For details on these subroutines see also [16]....

    [...]

  • ...Fq,r using our 'direct' greedy alignment as described in [16]....

    [...]

Journal ArticleDOI
TL;DR: This review focuses on recent trends in multiple sequence alignment tools and describes the latest algorithmic improvements, including the extension of consistency-based methods to the problem of template-based multiple sequence alignments; some results suggest that template-based methods are significantly more accurate than simpler alternative methods.
Abstract: This review focuses on recent trends in multiple sequence alignment tools. It describes the latest algorithmic improvements including the extension of consistency-based methods to the problem of template-based multiple sequence alignments. Some results are presented suggesting that template-based methods are significantly more accurate than simpler alternative methods. The validation of existing methods is also discussed at length with the detailed description of recent results and some suggestions for future validation strategies. The last part of the review addresses future challenges for multiple sequence alignment methods in the genomic era, most notably the need to cope with very large sequences, the need to integrate large amounts of experimental data, the need to accurately align non-coding and non-transcribed sequences and finally, the need to integrate many alternative methods and approaches. Contact: cedric.notredame@crg.es

226 citations


Cites result from "DIALIGN-T: an improved algorithm fo..."

  • ...The main trend uncovered by this analysis is that all the empirical reference datasets tend to yield similar results, quite significantly distinct from those measured on artificial datasets such as IRMbase (Subramanian et al., 2005), a collection of artificially generated alignments with local similarity....

    [...]

References
Journal ArticleDOI
TL;DR: The sensitivity of the commonly used progressive multiple sequence alignment method has been greatly improved and modifications are incorporated into a new program, CLUSTAL W, which is freely available.
Abstract: The sensitivity of the commonly used progressive multiple sequence alignment method has been greatly improved for the alignment of divergent protein sequences. Firstly, individual weights are assigned to each sequence in a partial alignment in order to down-weight near-duplicate sequences and up-weight the most divergent ones. Secondly, amino acid substitution matrices are varied at different alignment stages according to the divergence of the sequences to be aligned. Thirdly, residue-specific gap penalties and locally reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than regular secondary structure. Fourthly, positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage the opening up of new gaps at these positions. These modifications are incorporated into a new program, CLUSTAL W, which is freely available.

63,427 citations


"DIALIGN-T: an improved algorithm fo..." refers methods in this paper

  • ...On sequences with weak but global homology, however, the previous implementation of the program is often out-performed by purely global methods such as CLUSTAL W [24], by hybrid methods like T-COFFEE [19] or POA [13], or by the recently developed programs MUSCLE [8] and PROBCONS [6] that are currently the best-performing methods for global multiple protein alignment....

    [...]

  • ...Global methods align sequences from the beginning to the end [4,24,9]....

    [...]

Journal ArticleDOI
TL;DR: MUSCLE is a new computer program for creating multiple alignments of protein sequences that includes fast distance estimation using kmer counting, progressive alignment using a new profile function the authors call the log-expectation score, and refinement using tree-dependent restricted partitioning.
Abstract: We describe MUSCLE, a new computer program for creating multiple alignments of protein sequences. Elements of the algorithm include fast distance estimation using kmer counting, progressive alignment using a new profile function we call the log-expectation score, and refinement using tree-dependent restricted partitioning. The speed and accuracy of MUSCLE are compared with T-Coffee, MAFFT and CLUSTALW on four test sets of reference alignments: BAliBASE, SABmark, SMART and a new benchmark, PREFAB. MUSCLE achieves the highest, or joint highest, rank in accuracy on each of these sets. Without refinement, MUSCLE achieves average accuracy statistically indistinguishable from T-Coffee and MAFFT, and is the fastest of the tested methods for large numbers of sequences, aligning 5000 sequences of average length 350 in 7 min on a current desktop computer. The MUSCLE program, source code and PREFAB test data are freely available at http://www.drive5.com/muscle.
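The "fast distance estimation using kmer counting" mentioned in this abstract can be illustrated with a small Python sketch. This is a simplified stand-in, not MUSCLE's actual distance measure: the function names, the default k = 3, and the normalisation by the shorter sequence's k-mer count are assumptions chosen for the example.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=3):
    """Alignment-free distance: 1 minus the fraction of shared k-mers.

    Shared k-mers are counted with multiplicity (multiset intersection) and
    normalised by the number of k-mers in the shorter sequence.
    """
    denom = min(len(a), len(b)) - k + 1
    if denom <= 0:
        raise ValueError("both sequences must contain at least k residues")
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    return 1.0 - shared / denom


# Two hypothetical peptides that differ at one position share 1 of 4 3-mers.
print(kmer_distance("ACDEFG", "ACDXFG"))  # 0.75
```

Because no alignment is computed, a measure of this kind costs time roughly linear in sequence length per pair, which is why k-mer counting is attractive for building an initial guide tree over many sequences.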

37,524 citations


"DIALIGN-T: an improved algorithm fo..." refers methods in this paper

  • ...On sequences with weak but global homology, however, the previous implementation of the program is often out-performed by purely global methods such as CLUSTAL W [24], by hybrid methods like T-COFFEE [19] or POA [13], or by the recently developed programs MUSCLE [8] and PROBCONS [6] that are currently the best-performing methods for global multiple protein alignment....

    [...]

  • ...Further systematic studies should be carried out to evaluate the performance of multiple-protein aligners under varying conditions using, for example, the full-length BAliBASE sequences or newly developed benchmark databases such as SABmark [27,28], Prefab [8] or Oxbench [21]....

    [...]

  • ...During the last years, a number of hybrid methods have been developed that combine global and local alignment features [17,19,2,8]....

    [...]

Journal ArticleDOI
TL;DR: A computer adaptable method for finding similarities in the amino acid sequences of two proteins has been developed and it is possible to determine whether significant homology exists between the proteins to trace their possible evolutionary development.

11,844 citations

Journal ArticleDOI
TL;DR: A new method for multiple sequence alignment that provides a dramatic improvement in accuracy with a modest sacrifice in speed compared to the most commonly used alternatives; it builds on the progressive approach but avoids the most serious pitfalls caused by the greedy nature of that algorithm.

6,727 citations

Journal ArticleDOI
TL;DR: An algorithm is presented for the multiple alignment of sequences, either proteins or nucleic acids, that is both accurate and easy to use on microcomputers, based on the conventional dynamic-programming method of pairwise alignment.
Abstract: An algorithm is presented for the multiple alignment of sequences, either proteins or nucleic acids, that is both accurate and easy to use on microcomputers. The approach is based on the conventional dynamic-programming method of pairwise alignment. Initially, a hierarchical clustering of the sequences is performed using the matrix of the pairwise alignment scores. The closest sequences are aligned creating groups of aligned sequences. Then close groups are aligned until all sequences are aligned in one group. The pairwise alignments included in the multiple alignment form a new matrix that is used to produce a hierarchical clustering. If it is different from the first one, iteration of the process can be performed. The method is illustrated by an example: a global alignment of 39 sequences of cytochrome c.
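The first two steps described in this abstract (a matrix of pairwise alignment scores, then a hierarchical clustering that determines which sequences are aligned first) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: the match/mismatch/gap values, the average-linkage rule, and the names `global_score` and `merge_order` are all chosen for illustration, and the group-to-group alignment step itself is omitted.

```python
from itertools import combinations


def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch-style) alignment score with a linear gap cost."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def merge_order(seqs):
    """Return the order in which groups of sequence indices would be joined.

    Repeatedly merges the two groups with the highest average pairwise score
    (average linkage is an assumption made for this sketch).
    """
    score = {frozenset(p): global_score(seqs[p[0]], seqs[p[1]])
             for p in combinations(range(len(seqs)), 2)}

    def avg(g, h):
        return sum(score[frozenset((x, y))] for x in g for y in h) / (len(g) * len(h))

    groups = [(i,) for i in range(len(seqs))]
    order = []
    while len(groups) > 1:
        g, h = max(combinations(groups, 2), key=lambda pair: avg(*pair))
        groups.remove(g)
        groups.remove(h)
        groups.append(g + h)
        order.append((g, h))
    return order


# Hypothetical toy input: the two similar sequences are joined first.
print(merge_order(["ACGT", "ACGA", "TTTT"]))
```

A full progressive aligner would align the two groups at each merge step and, as the abstract notes, could re-derive the clustering from the resulting multiple alignment and iterate if it differs from the first one.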

5,208 citations


"DIALIGN-T: an improved algorithm fo..." refers methods in this paper

  • ...Global methods align sequences from the beginning to the end [4,24,9]....

    [...]