Kalign – an accurate and fast multiple sequence alignment algorithm
TLDR
Kalign, a method employing the Wu-Manber string-matching algorithm, was developed to improve both the accuracy and speed of multiple sequence alignment and is especially well suited for the increasingly important task of aligning large numbers of sequences.
Abstract:
The alignment of multiple protein sequences is a fundamental step in the analysis of biological data. It has traditionally been applied to analyzing protein families for conserved motifs, phylogeny, structural properties, and to improve sensitivity in homology searching. The availability of complete genome sequences has increased the demands on multiple sequence alignment (MSA) programs. Current MSA methods suffer from being either too inaccurate or too computationally expensive to be applied effectively in large-scale comparative genomics. We developed Kalign, a method employing the Wu-Manber string-matching algorithm, to improve both the accuracy and speed of multiple sequence alignment. We compared the speed and accuracy of Kalign to other popular methods using Balibase, Prefab, and a new large test set. Kalign was as accurate as the best other methods on small alignments, but significantly more accurate when aligning large and distantly related sets of sequences. In our comparisons, Kalign was about 10 times faster than ClustalW and, depending on the alignment size, up to 50 times faster than popular iterative methods. Kalign is a fast and robust alignment method. It is especially well suited for the increasingly important task of aligning large numbers of sequences.
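The abstract names the Wu-Manber string-matching algorithm but does not describe how Kalign applies it. As an illustration only, the sketch below shows the bit-parallel approximate matching scheme (shift-and with up to k errors) commonly attributed to Wu and Manber. The function name `approx_find` and its interface are hypothetical and are not taken from Kalign; this is a minimal sketch under those assumptions, not Kalign's implementation.

```python
def approx_find(text: str, pattern: str, k: int) -> list[int]:
    """Return 0-based end positions in `text` where `pattern` matches
    with at most k edits (substitution, insertion, deletion).

    Illustrative sketch of bit-parallel approximate matching; the name
    and signature are assumptions, not part of any Kalign API.
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")

    # Bit i of mask[c] is set when pattern[i] == c.
    mask = {}
    for i, c in enumerate(pattern):
        mask[c] = mask.get(c, 0) | (1 << i)

    accept = 1 << (m - 1)   # bit that signals a full-pattern match
    full = (1 << m) - 1     # keeps the state vectors m bits wide

    # Bit i of state[d] is set when pattern[:i+1] matches a suffix of the
    # text read so far with at most d errors. With d errors allowed, the
    # first d pattern characters can be "matched" by deleting them.
    state = [(1 << d) - 1 for d in range(k + 1)]

    hits = []
    for pos, c in enumerate(text):
        cm = mask.get(c, 0)
        old_prev = state[0]                      # state[d-1] before this step
        state[0] = ((state[0] << 1) | 1) & cm    # exact shift-and update
        new_prev = state[0]                      # state[d-1] after this step
        for d in range(1, k + 1):
            old_cur = state[d]
            state[d] = (
                (((old_cur << 1) | 1) & cm)  # next character matches
                | old_prev                   # extra character in the text
                | (old_prev << 1)            # substituted character
                | (new_prev << 1)            # pattern character skipped
                | 1
            ) & full
            old_prev = old_cur
            new_prev = state[d]
        if state[k] & accept:
            hits.append(pos)
    return hits
```

For example, `approx_find("abxdef", "bcd", 1)` reports a match ending at index 3, because "bxd" differs from "bcd" by one substitution. The whole state for each error level fits in a single integer, which is what makes this family of methods fast for short patterns.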
Citations
Fast, scalable generation of high‐quality protein multiple sequence alignments using Clustal Omega
Fabian Sievers, Andreas Wilm, David Dineen, Toby J. Gibson, Kevin Karplus, Weizhong Li, Rodrigo Lopez, Hamish McWilliam, Michael Remmert, Johannes Söding, Julie D. Thompson, Desmond G. Higgins
TL;DR: A new program called Clustal Omega is described, which can align virtually any number of protein sequences quickly, delivers accurate alignments, and outperforms other packages in terms of execution time and quality.
Database resources of the National Center for Biotechnology Information
David L. Wheeler, Deanna M. Church, Ron Edgar, Scott Federhen, Wolfgang Helmberg, Thomas L. Madden, Joan Pontius, Gregory D. Schuler, Lynn M. Schriml, Edwin Sequeira, Tugba O. Suzek, Tatiana Tatusova, Lukas Wagner
TL;DR: In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides data analysis and retrieval resources for the data in GenBank and other biological data made available through NCBI's website.
GENCODE: The reference human genome annotation for The ENCODE Project
Jennifer Harrow, Adam Frankish, José M. González, Electra Tapanari, Mark Diekhans, Felix Kokocinski, Bronwen Aken, Daniel Barrell, Amonida Zadissa, Stephen M. J. Searle, If H. A. Barnes, Alexandra Bignell, Veronika Boychenko, Toby Hunt, M. Kay, Gaurab Mukherjee, Jeena Rajan, Gloria Despacio-Reyes, Gary Saunders, Charles A. Steward, Rachel A. Harte, Michael F. Lin, Cédric Howald, Andrea Tanzer, Thomas Derrien, Jacqueline Chrast, Nathalie Walters, Suganthi Balasubramanian, Baikang Pei, Michael L. Tress, Jose Manuel Rodriguez, Iakes Ezkurdia, Jeltje Van Baren, Michael R. Brent, David Haussler, Manolis Kellis, Alfonso Valencia, Alexandre Reymond, Mark Gerstein, Roderic Guigó, Tim Hubbard
TL;DR: This work has examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters, 62% of protein-coding genes have annotated polyA sites, and over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to PeptideAtlas.
Improvement of Phylogenies after Removing Divergent and Ambiguously Aligned Blocks from Protein Sequence Alignments
Gerard Talavera, Jose Castresana
TL;DR: This work examines whether phylogenetic reconstruction improves after alignment cleaning. Cleaned alignments produce better topologies although, paradoxically, with lower bootstrap support, which indicates that divergent and problematic alignment regions may lead, when present, to apparently better supported but in fact more biased topologies.
Recent developments in the MAFFT multiple sequence alignment program
Kazutaka Katoh, Hiroyuki Toh
TL;DR: The initial version of the MAFFT program was developed in 2002 and was updated in 2007 with two new techniques: the PartTree algorithm and the Four-way consistency objective function, which improved the scalability of progressive alignment and the accuracy of ncRNA alignment.
References
Basic Local Alignment Search Tool
TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, David J. Lipman
TL;DR: A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original.
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice
TL;DR: The sensitivity of the commonly used progressive multiple sequence alignment method has been greatly improved and modifications are incorporated into a new program, CLUSTAL W, which is freely available.
The neighbor-joining method: a new method for reconstructing phylogenetic trees.
Naruya Saitou, Masatoshi Nei
TL;DR: The neighbor-joining method and Sattath and Tversky's method are shown to be generally better than the other methods for reconstructing phylogenetic trees from evolutionary distance data.
MUSCLE: multiple sequence alignment with high accuracy and high throughput
TL;DR: MUSCLE is a new computer program for creating multiple alignments of protein sequences. It includes fast distance estimation using k-mer counting, progressive alignment using a new profile function the authors call the log-expectation score, and refinement using tree-dependent restricted partitioning.