Journal Article

Fast, scalable generation of high‐quality protein multiple sequence alignments using Clustal Omega

TL;DR: Describes Clustal Omega, a new program that aligns virtually any number of protein sequences quickly and accurately, and that outperforms other packages on larger data sets in both execution time and alignment quality.
Abstract: Multiple sequence alignments are fundamental to many sequence analysis methods. Most alignments are computed using the progressive alignment heuristic. These methods are starting to become a bottleneck in some analysis pipelines when faced with data sets of the size of many thousands of sequences. Some methods allow computation of larger data sets while sacrificing quality, and others produce high-quality alignments, but scale badly with the number of sequences. In this paper, we describe a new program called Clustal Omega, which can align virtually any number of protein sequences quickly and that delivers accurate alignments. The accuracy of the package on smaller test cases is similar to that of the high-quality aligners. On larger data sets, Clustal Omega outperforms other packages in terms of execution time and quality. Clustal Omega also has powerful features for adding sequences to and exploiting information in existing alignments, making use of the vast amount of precomputed information in public databases like Pfam.


Citations
Journal Article
TL;DR: This version of MAFFT has several new features, including options for adding unaligned sequences into an existing alignment, adjustment of direction in nucleotide alignment, constrained alignment and parallel processing, which were implemented after the previous major update.
Abstract: We report a major update of the MAFFT multiple sequence alignment program. This version has several new features, including options for adding unaligned sequences into an existing alignment, adjustment of direction in nucleotide alignment, constrained alignment and parallel processing, which were implemented after the previous major update. This report shows actual examples to explain how these features work, alone and in combination. Some examples incorrectly aligned by MAFFT are also shown to clarify its limitations. We discuss how to avoid misalignments, and our ongoing efforts to overcome such limitations.

27,771 citations

Journal Article
TL;DR: ENDscript 2 has been fully re-engineered to enhance speed, accuracy and usability, adds interactive 3D visualization, and builds on ESPript 3, which has been improved to handle large amounts of data with reduced computation time.
Abstract: ENDscript 2 is a friendly Web server for extracting and rendering a comprehensive analysis of primary to quaternary protein structure information in an automated way. This major upgrade has been fully re-engineered to enhance speed, accuracy and usability with interactive 3D visualization. It takes advantage of the new version 3 of ESPript, our well-known sequence alignment renderer, improved to handle a large number of data with reduced computation time. From a single PDB entry or file, ENDscript produces high quality figures displaying multiple sequence alignment of proteins homologous to the query, colored according to residue conservation. Furthermore, the experimental secondary structure elements and a detailed set of relevant biophysical and structural data are depicted. All this information and more are now mapped on interactive 3D PyMOL representations. Thanks to its adaptive and rigorous algorithm, beginner to expert users can modify settings to fine-tune ENDscript to their needs. ENDscript has also been upgraded as an open platform for the visualization of multiple biochemical and structural data coming from external biotool Web servers, with both 2D and 3D representations. ENDscript 2 and ESPript 3 are freely available at http://endscript.ibcp.fr and http://espript.ibcp.fr, respectively.

4,722 citations


Cites methods from "Fast, scalable generation of high‐q..."

  • ...Hence, we replaced the programs BLAST and ClustalW2 by their latest revisions (BLAST+ (17) and Clustal Omega (18), respectively) to gain in scalability, accuracy and performance....

    [...]

Journal Article
TL;DR: Explains the Web interface of the MAFFT online service for recently developed large-data options, and interactive usage to refine sequence data sets and MSAs.
Abstract: This article describes several features in the MAFFT online service for multiple sequence alignment (MSA). As a result of recent advances in sequencing technologies, huge numbers of biological sequences are available and the need for MSAs with large numbers of sequences is increasing. To extract biologically relevant information from such data, sophistication of algorithms is necessary but not sufficient. Intuitive and interactive tools for experimental biologists to semiautomatically handle large data are becoming important. We are working on development of MAFFT toward these two directions. Here, we explain (i) the Web interface for recently developed options for large data and (ii) interactive usage to refine sequence data sets and MSAs.

4,135 citations


Cites background or methods from "Fast, scalable generation of high‐q..."

  • ...[4], for some MAFFT options available on our online server...

    [...]

  • ...However, there are also other popular programs, such as Clustal Omega [4] and UPP [22], for this purpose....

    [...]

  • ...Clustal Omega uses the mBed algorithm [23] to build a guide tree with a time complexity of O(N log N)....

    [...]
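
The embedding idea behind the guide-tree step quoted above can be illustrated with a minimal Python sketch. It is not the mBed or Clustal Omega implementation: the distance function `mismatch_fraction`, the seed count of roughly (log2 N)^2 and the length-sorted, evenly spaced seed selection are all assumptions made for illustration. The point it shows is that each sequence is compared only with a small set of t seed sequences, so N × t distance evaluations are needed rather than the N(N−1)/2 of a full distance matrix.

```python
import math


def mismatch_fraction(a, b):
    """Hypothetical stand-in distance: fraction of differing positions."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    # Positions that differ, plus the length difference counted as mismatches.
    diff = sum(1 for x, y in zip(a, b) if x != y) + abs(len(a) - len(b))
    return diff / longest


def embed(seqs, n_seeds=None, dist=mismatch_fraction):
    """Map each sequence to a vector of distances to a few seed sequences."""
    n = len(seqs)
    if n == 0:
        return []
    if n_seeds is None:
        # Assumed seed count of about (log2 N)^2, kept between 1 and N.
        n_seeds = min(n, max(1, round(math.log2(n) ** 2)))
    # Assumed seed choice: sort by length, then pick at a constant stride.
    order = sorted(range(n), key=lambda i: len(seqs[i]))
    step = n / n_seeds
    seeds = [seqs[order[int(j * step)]] for j in range(n_seeds)]
    return [[dist(s, seed) for seed in seeds] for s in seqs]


seqs = ["MKVLA", "MKVLG", "PPQRS", "MKV"]
vectors = embed(seqs, n_seeds=2)
```

The resulting vectors can then be clustered with any vector-based method to obtain a guide tree; that step is left out here.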

Journal Article
TL;DR: These analyses provide insights into the receptor usage, cell entry, host cell infectivity and animal origin of 2019-nCoV and may help epidemic surveillance and preventive measures against 2019-nCoV.
Abstract: Recently, a novel coronavirus (2019-nCoV) has emerged from Wuhan, China, causing symptoms in humans similar to those caused by severe acute respiratory syndrome coronavirus (SARS-CoV). Since the SARS-CoV outbreak in 2002, extensive structural analyses have revealed key atomic-level interactions between the SARS-CoV spike protein receptor-binding domain (RBD) and its host receptor angiotensin-converting enzyme 2 (ACE2), which regulate both the cross-species and human-to-human transmissions of SARS-CoV. Here, we analyzed the potential receptor usage by 2019-nCoV, based on the rich knowledge about SARS-CoV and the newly released sequence of 2019-nCoV. First, the sequence of 2019-nCoV RBD, including its receptor-binding motif (RBM) that directly contacts ACE2, is similar to that of SARS-CoV, strongly suggesting that 2019-nCoV uses ACE2 as its receptor. Second, several critical residues in 2019-nCoV RBM (particularly Gln493) provide favorable interactions with human ACE2, consistent with 2019-nCoV's capacity for human cell infection. Third, several other critical residues in 2019-nCoV RBM (particularly Asn501) are compatible with, but not ideal for, binding human ACE2, suggesting that 2019-nCoV has acquired some capacity for human-to-human transmission. Last, while phylogenetic analysis indicates a bat origin of 2019-nCoV, 2019-nCoV also potentially recognizes ACE2 from a diversity of animal species (except mice and rats), implicating these animal species as possible intermediate hosts or animal models for 2019-nCoV infections. These analyses provide insights into the receptor usage, cell entry, host cell infectivity and animal origin of 2019-nCoV and may help epidemic surveillance and preventive measures against 2019-nCoV.
IMPORTANCE: The recent emergence of Wuhan coronavirus (2019-nCoV) puts the world on alert. 2019-nCoV is reminiscent of the SARS-CoV outbreak in 2002 to 2003. Our decade-long structural studies on the receptor recognition by SARS-CoV have identified key interactions between SARS-CoV spike protein and its host receptor angiotensin-converting enzyme 2 (ACE2), which regulate both the cross-species and human-to-human transmissions of SARS-CoV. One of the goals of SARS-CoV research was to build an atomic-level iterative framework of virus-receptor interactions to facilitate epidemic surveillance, predict species-specific receptor usage, and identify potential animal hosts and animal models of viruses. Based on the sequence of 2019-nCoV spike protein, we apply this predictive framework to provide novel insights into the receptor usage and likely host range of 2019-nCoV. This study provides a robust test of this reiterative framework, providing the basic, translational, and public health research communities with predictive insights that may help study and battle this novel 2019-nCoV.

3,527 citations


Cites methods from "Fast, scalable generation of high‐q..."

  • ...Protein sequence alignments were done using Clustal Omega (33)....

    [...]

Journal Article
David E. Gordon, Gwendolyn M. Jang, Mehdi Bouhaddou, Jiewei Xu, Kirsten Obernier, Kris M. White, Matthew J. O’Meara, Veronica V. Rezelj, Jeffrey Z. Guo, Danielle L. Swaney, Tia A. Tummino, Ruth Hüttenhain, Robyn M. Kaake, Alicia L. Richards, Beril Tutuncuoglu, Helene Foussard, Jyoti Batra, Kelsey M. Haas, Maya Modak, Minkyu Kim, Paige Haas, Benjamin J. Polacco, Hannes Braberg, Jacqueline M. Fabius, Manon Eckhardt, Margaret Soucheray, Melanie J. Bennett, Merve Cakir, Michael McGregor, Qiongyu Li, Bjoern Meyer, Ferdinand Roesch, Thomas Vallet, Alice Mac Kain, Lisa Miorin, Elena Moreno, Zun Zar Chi Naing, Yuan Zhou, Shiming Peng, Ying Shi, Ziyang Zhang, Wenqi Shen, Ilsa T Kirby, James E. Melnyk, John S. Chorba, Kevin Lou, Shizhong Dai, Inigo Barrio-Hernandez, Danish Memon, Claudia Hernandez-Armenta, Jiankun Lyu, Christopher J.P. Mathy, Tina Perica, Kala Bharath Pilla, Sai J. Ganesan, Daniel J. Saltzberg, Rakesh Ramachandran, Xi Liu, Sara Brin Rosenthal, Lorenzo Calviello, Srivats Venkataramanan, Jose Liboy-Lugo, Yizhu Lin, Xi Ping Huang, Yongfeng Liu, Stephanie A. Wankowicz, Markus Bohn, Maliheh Safari, Fatima S. Ugur, Cassandra Koh, Nastaran Sadat Savar, Quang Dinh Tran, Djoshkun Shengjuler, Sabrina J. Fletcher, Michael C. O’Neal, Yiming Cai, Jason C.J. Chang, David J. Broadhurst, Saker Klippsten, Phillip P. Sharp, Nicole A. Wenzell, Duygu Kuzuoğlu-Öztürk, Hao-Yuan Wang, Raphael Trenker, Janet M. Young, Devin A. Cavero, Joseph Hiatt, Theodore L. Roth, Ujjwal Rathore, Advait Subramanian, Julia Noack, Mathieu Hubert, Robert M. Stroud, Alan D. Frankel, Oren S. Rosenberg, Kliment A. Verba, David A. Agard, Melanie Ott, Michael Emerman, Natalia Jura, Mark von Zastrow, Eric Verdin, Alan Ashworth, Olivier Schwartz, Christophe d'Enfert, Shaeri Mukherjee, Matthew P. Jacobson, Harmit S. Malik, Danica Galonić Fujimori, Trey Ideker, Charles S. Craik, Stephen N. Floor, James S. Fraser, John D. Gross, Andrej Sali, Bryan L. Roth, Davide Ruggero, Jack Taunton, Tanja Kortemme, Pedro Beltrao, Marco Vignuzzi, Adolfo García-Sastre, Kevan M. Shokat, Brian K. Shoichet, Nevan J. Krogan
30 Apr 2020, Nature
TL;DR: A human–SARS-CoV-2 protein interaction map highlights cellular processes that are hijacked by the virus and that can be targeted by existing drugs, including inhibitors of mRNA translation and predicted regulators of the sigma receptors.
Abstract: A newly described coronavirus named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which is the causative agent of coronavirus disease 2019 (COVID-19), has infected over 2.3 million people, led to the death of more than 160,000 individuals and caused worldwide social and economic disruption [1,2]. There are no antiviral drugs with proven clinical efficacy for the treatment of COVID-19, nor are there any vaccines that prevent infection with SARS-CoV-2, and efforts to develop drugs and vaccines are hampered by the limited knowledge of the molecular details of how SARS-CoV-2 infects cells. Here we cloned, tagged and expressed 26 of the 29 SARS-CoV-2 proteins in human cells and identified the human proteins that physically associated with each of the SARS-CoV-2 proteins using affinity-purification mass spectrometry, identifying 332 high-confidence protein–protein interactions between SARS-CoV-2 and human proteins. Among these, we identify 66 druggable human proteins or host factors targeted by 69 compounds (of which, 29 drugs are approved by the US Food and Drug Administration, 12 are in clinical trials and 28 are preclinical compounds). We screened a subset of these in multiple viral assays and found two sets of pharmacological agents that displayed antiviral activity: inhibitors of mRNA translation and predicted regulators of the sigma-1 and sigma-2 receptors. Further studies of these host-factor-targeting agents, including their combination with drugs that directly target viral enzymes, could lead to a therapeutic regimen to treat COVID-19.

3,319 citations

References
Journal Article
TL;DR: MUSCLE is a new computer program for creating multiple alignments of protein sequences that includes fast distance estimation using kmer counting, progressive alignment using a new profile function the authors call the log-expectation score, and refinement using tree-dependent restricted partitioning.
Abstract: We describe MUSCLE, a new computer program for creating multiple alignments of protein sequences. Elements of the algorithm include fast distance estimation using k-mer counting, progressive alignment using a new profile function we call the log-expectation score, and refinement using tree-dependent restricted partitioning. The speed and accuracy of MUSCLE are compared with T-Coffee, MAFFT and CLUSTALW on four test sets of reference alignments: BAliBASE, SABmark, SMART and a new benchmark, PREFAB. MUSCLE achieves the highest, or joint highest, rank in accuracy on each of these sets. Without refinement, MUSCLE achieves average accuracy statistically indistinguishable from T-Coffee and MAFFT, and is the fastest of the tested methods for large numbers of sequences, aligning 5000 sequences of average length 350 in 7 min on a current desktop computer. The MUSCLE program, source code and PREFAB test data are freely available at http://www.drive5.com/muscle.
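
As a rough illustration of distance estimation by k-mer counting, the Python sketch below scores two unaligned sequences by the fraction of k-mers they share. It is a simplified stand-in, not MUSCLE's implementation: the function names, the choice of k = 3 and the normalisation by the shorter sequence are assumptions made for this example.

```python
from collections import Counter


def kmer_counts(seq, k=3):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=3):
    """Return 1 - (shared k-mers / k-mers in the shorter sequence)."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    shared = sum((ca & cb).values())  # multiset intersection of k-mers
    shortest = min(len(a), len(b)) - k + 1
    if shortest <= 0:
        return 1.0  # too short to contain any k-mer
    return 1.0 - shared / shortest


seqs = ["MKVLAAGIVG", "MKVLAAGLVG", "PPQRSTWYHH"]
# Pairwise distance matrix computed without aligning any pair.
matrix = [[kmer_distance(x, y) for y in seqs] for x in seqs]
```

Because no pairwise alignment is computed, each distance costs time roughly linear in sequence length, which is what makes this kind of estimate attractive for building a first guide tree.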

37,524 citations


"Fast, scalable generation of high‐q..." refers methods in this paper

  • ...Code for fast UPGMA and guide tree handling routines was adopted from MUSCLE (Edgar, 2004)....

    [...]

  • ...Here, we present results from a range of packages tested on three benchmarks: BAliBASE (Thompson et al, 2005), Prefab (Edgar, 2004) and an extended version of HomFam (Blackshields et al, 2010)....

    [...]

  • ...For these tests, we just report results using the default settings for all programs but with two exceptions, which were needed to allow MUSCLE (Edgar, 2004) and MAFFT to align the biggest test cases in HomFam....

    [...]

Journal Article
TL;DR: The Clustal W and Clustal X multiple sequence alignment programs have been completely rewritten in C++, which will facilitate the further development of the alignment algorithms and has allowed proper porting of the programs to the latest versions of Linux, Macintosh and Windows operating systems.
Abstract: Summary: The Clustal W and Clustal X multiple sequence alignment programs have been completely rewritten in C++. This will facilitate the further development of the alignment algorithms in the future and has allowed proper porting of the programs to the latest versions of Linux, Macintosh and Windows operating systems. Availability: The programs can be run on-line from the EBI web server: http://www.ebi.ac.uk/tools/clustalw2. The source code and executables for Windows, Linux and Macintosh computers are available from the EBI ftp site ftp://ftp.ebi.ac.uk/pub/software/clustalw2/ Contact: clustalw@ucd.ie

25,325 citations

Journal Article
TL;DR: The definition and use of family-specific, manually curated gathering thresholds are explained and some of the features of domains of unknown function (also known as DUFs) are discussed, which constitute a rapidly growing class of families within Pfam.
Abstract: Pfam is a widely used database of protein families and domains. This article describes a set of major updates that we have implemented in the latest release (version 24.0). The most important change is that we now use HMMER3, the latest version of the popular profile hidden Markov model package. This software is approximately 100 times faster than HMMER2 and is more sensitive due to the routine use of the forward algorithm. The move to HMMER3 has necessitated numerous changes to Pfam that are described in detail. Pfam release 24.0 contains 11,912 families, of which a large number have been significantly updated during the past two years. Pfam is available via servers in the UK (http://pfam.sanger.ac.uk/), the USA (http://pfam.janelia.org/) and Sweden (http://pfam.sbc.su.se/).

14,075 citations


"Fast, scalable generation of high‐q..." refers background in this paper

  • ...There are already widely available collections of HMMs from many sources such as Pfam...

    [...]

Journal Article
TL;DR: A simplified scoring system is proposed that performs well for reducing CPU time and increasing the accuracy of alignments even for sequences having large insertions or extensions as well as distantly related sequences of similar length.
Abstract: A multiple sequence alignment program, MAFFT, has been developed. The CPU time is drastically reduced as compared with existing methods. MAFFT includes two novel techniques. (i) Homologous regions are rapidly identified by the fast Fourier transform (FFT), in which an amino acid sequence is converted to a sequence composed of volume and polarity values of each amino acid residue. (ii) We propose a simplified scoring system that performs well for reducing CPU time and increasing the accuracy of alignments even for sequences having large insertions or extensions as well as distantly related sequences of similar length. Two different heuristics, the progressive method (FFT-NS-2) and the iterative refinement method (FFT-NS-i), are implemented in MAFFT. The performances of FFT-NS-2 and FFT-NS-i were compared with other methods by computer simulations and benchmark tests; the CPU time of FFT-NS-2 is drastically reduced as compared with CLUSTALW with comparable accuracy. FFT-NS-i is over 100 times faster than T-COFFEE, when the number of input sequences exceeds 60, without sacrificing the accuracy.

12,003 citations


"Fast, scalable generation of high‐q..." refers background in this paper

  • ...1 (http://www.clustal.org) DIALIGN 2.2.1 (http://dialign.gobics.de/) FSA 1.15.5 (http://sourceforge.net/projects/fsa/) Kalign 2.04 (http://msa.sbc.su.se/cgi-bin/msa.cgi) MAFFT 6.857 (http://mafft.cbrc.jp/alignment/software/source.html) MSAProbs 0.9.4 (http://sourceforge.net/projects/msaprobs/files/) MUSCLE version 3.8.31 posted 1 May 2010 (http://www.drive5.com/muscle/downloads.htm) PRANK v.100802, 2 August 2010 (http://www.ebi.ac.uk/goldman-srv/prank/src/prank/) Probalign v1....

    [...]

  • ...The consistency-based programs MSAprobs, MAFFT L-INS-i, Probalign, Probcons and T-Coffee, are again the most accurate but with long run times....

    [...]

  • ...For these tests, we just report results using the default settings for all programs but with two exceptions, which were needed to allow MUSCLE (Edgar, 2004) and MAFFT to align the biggest test cases in HomFam....

    [...]

  • ...There is then a gap to the faster progressive based programs of MUSCLE, MAFFT, Kalign (Lassmann and Sonnhammer, 2005) and Clustal W. Results from testing large alignments with up to 50000 sequences are given in Table III using HomFam....

    [...]

  • ...MAFFT with default settings, has a limit of 20000 sequences and we only use MAFFT with --parttree for the last section of Table III....

    [...]

Journal Article
TL;DR: Pfam is a widely used database of protein families, containing 14 831 manually curated entries in the current release, version 27.0; this article describes the changes to Pfam content and features since the 2012 article.
Abstract: Pfam, available via servers in the UK (http://pfam.sanger.ac.uk/) and the USA (http://pfam.janelia.org/), is a widely used database of protein families, containing 14 831 manually curated entries in the current release, version 27.0. Since the last update article 2 years ago, we have generated 1182 new families and maintained sequence coverage of the UniProt Knowledgebase (UniProtKB) at nearly 80%, despite a 50% increase in the size of the underlying sequence database. Since our 2012 article describing Pfam, we have also undertaken a comprehensive review of the features that are provided by Pfam over and above the basic family data. For each feature, we determined the relevance, computational burden, usage statistics and the functionality of the feature in a website context. As a consequence of this review, we have removed some features, enhanced others and developed new ones to meet the changing demands of computational biology. Here, we describe the changes to Pfam content. Notably, we now provide family alignments based on four different representative proteome sequence data sets and a new interactive DNA search interface. We also discuss the mapping between Pfam and known 3D structures.

9,415 citations