MUSCLE: multiple sequence alignment with high accuracy and high throughput

doi:10.1093/NAR/GKH340

Journal Article•DOI•

01 Mar 2004-Nucleic Acids Research (Oxford University Press)-Vol. 32, Iss: 5, pp 1792-1797

TL;DR: MUSCLE is a new computer program for creating multiple alignments of protein sequences that includes fast distance estimation using kmer counting, progressive alignment using a new profile function the authors call the log-expectation score, and refinement using tree-dependent restricted partitioning.

Abstract: We describe MUSCLE, a new computer program for creating multiple alignments of protein sequences. Elements of the algorithm include fast distance estimation using kmer counting, progressive alignment using a new profile function we call the log-expectation score, and refinement using tree-dependent restricted partitioning. The speed and accuracy of MUSCLE are compared with T-Coffee, MAFFT and CLUSTALW on four test sets of reference alignments: BAliBASE, SABmark, SMART and a new benchmark, PREFAB. MUSCLE achieves the highest, or joint highest, rank in accuracy on each of these sets. Without refinement, MUSCLE achieves average accuracy statistically indistinguishable from T-Coffee and MAFFT, and is the fastest of the tested methods for large numbers of sequences, aligning 5000 sequences of average length 350 in 7 min on a current desktop computer. The MUSCLE program, source code and PREFAB test data are freely available at http://www.drive5.com/muscle.
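
As a rough illustration of the kmer-counting distance estimation mentioned in the abstract, the following Python sketch scores two sequences by the fraction of kmers they share, without aligning them. The function names, the default k=3 and the normalization by the shorter sequence are assumptions made for illustration; the abstract does not specify them, and the released program may differ (for example in the alphabet or scaling it uses).

```python
from collections import Counter


def kmer_counts(seq, k=3):
    """Count the overlapping kmers of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=3):
    """Return 1 minus the fraction of kmers shared by a and b.

    Illustrative definition (an assumption, not taken from the abstract):
    shared kmers are counted with multiplicity and normalized by the
    number of kmers in the shorter sequence.
    """
    n_shortest = min(len(a), len(b)) - k + 1
    if n_shortest <= 0:
        raise ValueError("both sequences must contain at least k residues")
    # Counter '&' keeps the minimum count of each kmer (multiset intersection).
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    return 1.0 - shared / n_shortest
```

Because no alignment is computed, each pairwise estimate takes time roughly linear in the sequence lengths, which is what makes this kind of estimate attractive for large inputs.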

Citations

Journal Article•DOI•

Clustal W and Clustal X version 2.0

Mark A. Larkin¹, Gordon Blackshields², Nigel P. Brown², R. Chenna², Paul A. McGettigan², Hamish McWilliam², Franck Valentin², Iain M. Wallace², Andreas Wilm², Rodrigo Lopez², J.D. Thompson², Toby J. Gibson², Desmond G. Higgins²

University College Dublin¹, European Bioinformatics Institute²

01 Nov 2007-Bioinformatics

TL;DR: The Clustal W and Clustal X multiple sequence alignment programs have been completely rewritten in C++, which facilitates further development of the alignment algorithms and has allowed proper porting of the programs to the latest versions of Linux, Macintosh and Windows operating systems.

Abstract: Summary: The Clustal W and Clustal X multiple sequence alignment programs have been completely rewritten in C++. This will facilitate the further development of the alignment algorithms in the future and has allowed proper porting of the programs to the latest versions of Linux, Macintosh and Windows operating systems. Availability: The programs can be run on-line from the EBI web server: http://www.ebi.ac.uk/tools/clustalw2. The source code and executables for Windows, Linux and Macintosh computers are available from the EBI ftp site ftp://ftp.ebi.ac.uk/pub/software/clustalw2/ Contact: clustalw@ucd.ie

25,325 citations

Cites background or methods from "MUSCLE: multiple sequence alignment..."

...They are needed routinely as parts of more complicated analyses or analysis pipelines and there are several very widely used packages, e.g. Clustal W (Thompson et al., 1994), Clustal X (Thompson et al., 1997), T-Coffee (Notredame et al., 2000), MAFFT (Katoh et al., 2002) and MUSCLE (Edgar, 2004)....
...Availability: The programs can be run on-line from the EBI web server: http://www.ebi.ac.uk/tools/clustalw2....
...More recently, MAFFT and MUSCLE appeared; which were, initially, at least as accurate as Clustal, in terms of alignment accuracy, but which were also extremely fast; and able to align many thousands of sequences....

Journal Article•DOI•

Search and clustering orders of magnitude faster than BLAST

Robert C. Edgar

01 Oct 2010-Bioinformatics

TL;DR: UCLUST is a new clustering method that exploits USEARCH to assign sequences to clusters and offers several advantages over the widely used program CD-HIT, including higher speed, lower memory use, improved sensitivity, clustering at lower identities and classification of much larger datasets.

Abstract: Motivation: Biological sequence data is accumulating rapidly, motivating the development of improved high-throughput methods for sequence classification. Results: UBLAST and USEARCH are new algorithms enabling sensitive local and global search of large sequence databases at exceptionally high speeds. They are often orders of magnitude faster than BLAST in practical applications, though sensitivity to distant protein relationships is lower. UCLUST is a new clustering method that exploits USEARCH to assign sequences to clusters. UCLUST offers several advantages over the widely used program CD-HIT, including higher speed, lower memory use, improved sensitivity, clustering at lower identities and classification of much larger datasets. Availability: Binaries are available at no charge for non-commercial use at http://www.drive5.com/usearch Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

17,301 citations

Journal Article•DOI•

Fast, scalable generation of high‐quality protein multiple sequence alignments using Clustal Omega

Fabian Sievers¹, Andreas Wilm², David Dineen¹, Toby J. Gibson, Kevin Karplus³, Weizhong Li⁴, Rodrigo Lopez⁴, Hamish McWilliam⁴, Michael Remmert⁵, Johannes Söding⁵, Julie D. Thompson⁶, Desmond G. Higgins¹

University College Dublin¹, Genome Institute of Singapore², University of California, Santa Cruz³, European Bioinformatics Institute⁴, Ludwig Maximilian University of Munich⁵, University of Strasbourg⁶

01 Jan 2011-Molecular Systems Biology

TL;DR: A new program called Clustal Omega is described, which can align virtually any number of protein sequences quickly and that delivers accurate alignments, and which outperforms other packages in terms of execution time and quality.

Abstract: Multiple sequence alignments are fundamental to many sequence analysis methods. Most alignments are computed using the progressive alignment heuristic. These methods are starting to become a bottleneck in some analysis pipelines when faced with data sets of the size of many thousands of sequences. Some methods allow computation of larger data sets while sacrificing quality, and others produce high-quality alignments, but scale badly with the number of sequences. In this paper, we describe a new program called Clustal Omega, which can align virtually any number of protein sequences quickly and that delivers accurate alignments. The accuracy of the package on smaller test cases is similar to that of the high-quality aligners. On larger data sets, Clustal Omega outperforms other packages in terms of execution time and quality. Clustal Omega also has powerful features for adding sequences to and exploiting information in existing alignments, making use of the vast amount of precomputed information in public databases like Pfam.

12,489 citations

Cites methods from "MUSCLE: multiple sequence alignment..."

...Code for fast UPGMA and guide tree handling routines was adopted from MUSCLE (Edgar, 2004)....
...Here, we present results from a range of packages tested on three benchmarks: BAliBASE (Thompson et al, 2005), Prefab (Edgar, 2004) and an extended version of HomFam (Blackshields et al, 2010)....
...For these tests, we just report results using the default settings for all programs but with two exceptions, which were needed to allow MUSCLE (Edgar, 2004) and MAFFT to align the biggest test cases in HomFam....

Journal Article•DOI•

A new coronavirus associated with human respiratory disease in China.

Fan Wu¹, Su Zhao², Bin Yu³, Yan-Mei Chen¹, Wen Wang³, Zhi gang Song¹, Yi Hu², Zhao Wu Tao², Jun Hua Tian³, Yuan Yuan Pei¹, Ming Li Yuan², Yu Ling Zhang¹, Fa Hui Dai¹, Yi Liu¹, Qi Min Wang¹, Jiao Jiao Zheng¹, Lin Xu¹, Edward C. Holmes¹,⁴, Yong-Zhen Zhang¹,³

Fudan University¹, Huazhong University of Science and Technology², Centers for Disease Control and Prevention³, University of Sydney⁴

03 Feb 2020-Nature

TL;DR: Phylogenetic and metagenomic analyses of the complete viral genome of a new coronavirus from the family Coronaviridae reveal that the virus is closely related to a group of SARS-like coronaviruses found in bats in China.

Abstract: Emerging infectious diseases, such as severe acute respiratory syndrome (SARS) and Zika virus disease, present a major threat to public health¹⁻³. Despite intense research efforts, how, when and where new diseases appear are still a source of considerable uncertainty. A severe respiratory disease was recently reported in Wuhan, Hubei province, China. As of 25 January 2020, at least 1,975 cases had been reported since the first patient was hospitalized on 12 December 2019. Epidemiological investigations have suggested that the outbreak was associated with a seafood market in Wuhan. Here we study a single patient who was a worker at the market and who was admitted to the Central Hospital of Wuhan on 26 December 2019 while experiencing a severe respiratory syndrome that included fever, dizziness and a cough. Metagenomic RNA sequencing⁴ of a sample of bronchoalveolar lavage fluid from the patient identified a new RNA virus strain from the family Coronaviridae, which is designated here ‘WH-Human 1’ coronavirus (and has also been referred to as ‘2019-nCoV’). Phylogenetic analysis of the complete viral genome (29,903 nucleotides) revealed that the virus was most closely related (89.1% nucleotide similarity) to a group of SARS-like coronaviruses (genus Betacoronavirus, subgenus Sarbecovirus) that had previously been found in bats in China⁵. This outbreak highlights the ongoing ability of viral spill-over from animals to cause severe disease in humans.

9,231 citations

Journal Article•DOI•

Jalview Version 2--a multiple sequence alignment editor and analysis workbench.

Andrew M. Waterhouse¹, James B. Procter¹, David M. A. Martin¹, Michele Clamp¹, Geoffrey J. Barton¹

University of Dundee¹

01 May 2009-Bioinformatics

TL;DR: Jalview 2 is a system for interactive WYSIWYG editing, analysis and annotation of multiple sequence alignments that employs web services for sequence alignment, secondary structure prediction and the retrieval of alignments, sequences, annotation and structures from public databases and any DAS 1.53 compliant sequence or annotation server.

Abstract: Summary: Jalview Version 2 is a system for interactive WYSIWYG editing, analysis and annotation of multiple sequence alignments. Core features include keyboard and mouse-based editing, multiple views and alignment overviews, and linked structure display with Jmol. Jalview 2 is available in two forms: a lightweight Java applet for use in web applications, and a powerful desktop application that employs web services for sequence alignment, secondary structure prediction and the retrieval of alignments, sequences, annotation and structures from public databases and any DAS 1.53 compliant sequence or annotation server. Availability: The Jalview 2 Desktop application and JalviewLite applet are made freely available under the GPL, and can be downloaded from www.jalview.org Contact: g.j.barton@dundee.ac.uk

7,926 citations

Cites methods from "MUSCLE: multiple sequence alignment..."

...Menus on the JVD interface enable the researcher to gather sequence and annotation data from external databases, and utilize Jalview’s own dedicated SOAP web services for sequence alignment with ClustalW (Thompson et al., 1994), Muscle (Edgar, 2004) and MAFFT (Katoh et al., 2005) and secondary structure prediction with Jpred3 (Cole et al., 2008)....

References

Journal Article•DOI•

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Stephen F. Altschul¹, Thomas L. Madden, Alejandro A. Schäffer¹, Jinghui Zhang, Zheng Zhang², Webb Miller², David J. Lipman

National Institutes of Health¹, Pennsylvania State University²

01 Sep 1997-Nucleic Acids Research

TL;DR: A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original.

Abstract: The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.

70,111 citations

"MUSCLE: multiple sequence alignment..." refers to methods in this paper

...We used the full-chain sequence of each structure to make a PSI-BLAST (37,38) search of the NCBI non-redundant protein sequence database (39), keeping locally aligned regions of hits with e-values below 0.01....

Journal Article•DOI•

CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice

Julie D. Thompson, Desmond G. Higgins, Toby J. Gibson

11 Nov 1994-Nucleic Acids Research

TL;DR: The sensitivity of the commonly used progressive multiple sequence alignment method has been greatly improved and modifications are incorporated into a new program, CLUSTAL W, which is freely available.

Abstract: The sensitivity of the commonly used progressive multiple sequence alignment method has been greatly improved for the alignment of divergent protein sequences. Firstly, individual weights are assigned to each sequence in a partial alignment in order to down-weight near-duplicate sequences and up-weight the most divergent ones. Secondly, amino acid substitution matrices are varied at different alignment stages according to the divergence of the sequences to be aligned. Thirdly, residue-specific gap penalties and locally reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than regular secondary structure. Fourthly, positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage the opening up of new gaps at these positions. These modifications are incorporated into a new program, CLUSTAL W which is freely available.

63,427 citations

"MUSCLE: multiple sequence alignment..." refers to methods in this paper

...We compared these with four other methods: CLUSTALW (25), probably the most widely used program at the time of writing; T-Coffee, which has the best BAliBASE score reported to date; and two MAFFT scripts: FFTNS1, the fastest previously published method known to the author (in which diagonal finding by fast Fourier transform is enabled and a progressive alignment constructed), and NWNSI, the slowest but most accurate of the MAFFT methods ......

Journal Article•DOI•

The neighbor-joining method: a new method for reconstructing phylogenetic trees.

Naruya Saitou¹, Masatoshi Nei

University of Texas Health Science Center at Houston¹

01 Jul 1987-Molecular Biology and Evolution

TL;DR: The neighbor-joining method and Sattath and Tversky's method are shown to be generally better than the other methods for reconstructing phylogenetic trees from evolutionary distance data.

Abstract: A new method called the neighbor-joining method is proposed for reconstructing phylogenetic trees from evolutionary distance data. The principle of this method is to find pairs of operational taxonomic units (OTUs [= neighbors]) that minimize the total branch length at each stage of clustering of OTUs starting with a starlike tree. The branch lengths as well as the topology of a parsimonious tree can quickly be obtained by using this method. Using computer simulation, we studied the efficiency of this method in obtaining the correct unrooted tree in comparison with that of five other tree-making methods: the unweighted pair group method of analysis, Farris's method, Sattath and Tversky's method, Li's method, and Tateno et al.'s modified Farris method. The new, neighbor-joining method and Sattath and Tversky's method are shown to be generally better than the other methods.
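
The pair-selection step described in this abstract can be sketched in Python as follows. This uses the Q-criterion formulation that is commonly given for neighbor-joining, Q(i, j) = (n - 2) d(i, j) - Σ_k d(i, k) - Σ_k d(j, k), which is an assumption here because the abstract itself gives no formula. The function name `nj_pick_pair` and the `frozenset`-keyed distance table are hypothetical choices for illustration.

```python
from itertools import combinations


def nj_pick_pair(names, dist):
    """Return the pair of taxa that minimizes the neighbor-joining Q-criterion.

    dist maps frozenset({a, b}) to the distance between taxa a and b.
    """
    n = len(names)
    # Sum of distances from each taxon to every other taxon.
    totals = {
        i: sum(dist[frozenset((i, j))] for j in names if j != i)
        for i in names
    }

    def q(pair):
        i, j = pair
        return (n - 2) * dist[frozenset(pair)] - totals[i] - totals[j]

    return min(combinations(names, 2), key=q)


# Small five-taxon example (distances chosen for illustration).
names = ["a", "b", "c", "d", "e"]
pairs = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
dist = {frozenset(p): v for p, v in pairs.items()}
print(nj_pick_pair(names, dist))  # ('a', 'b')
```

In this example the smallest raw distance is between d and e, yet the criterion selects a and b, which shows how the method differs from simply joining the closest pair.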

57,055 citations

"MUSCLE: multiple sequence alignment..." refers to methods in this paper

...Distance matrices are clustered using UPGMA (11), which we find to give slightly improved results over neighbor-joining (12), despite the expectation that neighbor-joining will give a more reliable estimate of the evolutionary tree....
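
For readers unfamiliar with the UPGMA clustering named in the excerpt above, here is a minimal Python sketch of the textbook procedure: repeatedly merge the two closest clusters and set the distance from the merged cluster to each remaining cluster to the size-weighted average of the two old distances. It is not claimed to match MUSCLE's own implementation; the function name `upgma`, the nested-tuple tree representation and the `frozenset`-keyed distance table are assumptions made for illustration.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster names with textbook UPGMA and return the tree as nested tuples.

    dist maps frozenset({a, b}) to the distance between a and b.
    """
    sizes = {name: 1 for name in names}  # cluster label -> number of leaves
    d = dict(dist)                       # working copy; the input is not modified
    while len(sizes) > 1:
        # Choose the closest pair among the current clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        na, nb = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # Size-weighted average distance to every remaining cluster.
        for c in sizes:
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        sizes[merged] = na + nb
    return next(iter(sizes))
```

A call such as `upgma(["A", "B", "C", "D"], dist)` returns a nested tuple whose nesting mirrors the order in which clusters were joined, which is the information a progressive aligner needs from a guide tree.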

Journal Article•DOI•

MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform

Kazutaka Katoh¹, Kazuharu Misawa, Kei-ichi Kuma¹, Takashi Miyata¹

Kyoto University¹

15 Jul 2002-Nucleic Acids Research

TL;DR: A simplified scoring system is proposed that performs well for reducing CPU time and increasing the accuracy of alignments even for sequences having large insertions or extensions as well as distantly related sequences of similar length.

Abstract: A multiple sequence alignment program, MAFFT, has been developed. The CPU time is drastically reduced as compared with existing methods. MAFFT includes two novel techniques. (i) Homologous regions are rapidly identified by the fast Fourier transform (FFT), in which an amino acid sequence is converted to a sequence composed of volume and polarity values of each amino acid residue. (ii) We propose a simplified scoring system that performs well for reducing CPU time and increasing the accuracy of alignments even for sequences having large insertions or extensions as well as distantly related sequences of similar length. Two different heuristics, the progressive method (FFT-NS-2) and the iterative refinement method (FFT-NS-i), are implemented in MAFFT. The performances of FFT-NS-2 and FFT-NS-i were compared with other methods by computer simulations and benchmark tests; the CPU time of FFT-NS-2 is drastically reduced as compared with CLUSTALW with comparable accuracy. FFT-NS-i is over 100 times faster than T-COFFEE, when the number of input sequences exceeds 60, without sacrificing the accuracy.

12,003 citations

"MUSCLE: multiple sequence alignment..." refers to background or methods in this paper

...for alignment accuracy discrimination (5,7,8) as fewer assumptions are made about the population distribution....
...Position-specific gap penalties are used, employing heuristics similar to those found in MAFFT and LAGAN (17)....
...This is similar to the strategies used by PRRP (7) and MAFFT (8)....
...Tested versions were MUSCLE 3.2, CLUSTALW 1.82, T-Coffee 1.37 and MAFFT 3.82....
...We compared these with four other methods: CLUSTALW (25), probably the most widely used program at the time of writing; T-Coffee, which has the best BAliBASE score reported to date; and two MAFFT scripts: FFTNS1, the fastest previously published method known to the author (in which diagonal finding by fast Fourier transform is enabled and a progressive alignment constructed), and NWNSI, the slowest but most accurate of the MAFFT methods (in which fast Fourier transform is disabled and refinement is enabled)....

Book•

The Neutral Theory of Molecular Evolution

Motoo Kimura

01 Jan 1983

TL;DR: The neutral theory asserts that the great majority of evolutionary changes at the molecular level are caused not by Darwinian selection but by random drift of selectively neutral mutants; it has caused controversy ever since it was proposed.

Abstract: Motoo Kimura, as founder of the neutral theory, is uniquely placed to write this book. He first proposed the theory in 1968 to explain the unexpectedly high rate of evolutionary change and very large amount of intraspecific variability at the molecular level that had been uncovered by new techniques in molecular biology. The theory - which asserts that the great majority of evolutionary changes at the molecular level are caused not by Darwinian selection but by random drift of selectively neutral mutants - has caused controversy ever since. This book is the first comprehensive treatment of this subject and the author synthesises a wealth of material - ranging from a historical perspective, through recent molecular discoveries, to sophisticated mathematical arguments - all presented in a most lucid manner.

7,874 citations