Identification of Oxa1 Homologs Operating in the Eukaryotic Endoplasmic Reticulum

doi:10.1016/J.CELREP.2017.12.006

Journal Article•DOI•

S. Andrei Anghel¹, Philip T. McGilvray¹, Ramanujan S. Hegde², Robert J. Keenan¹

University of Chicago¹, Laboratory of Molecular Biology²

26 Dec 2017-Cell Reports (Elsevier)-Vol. 21, Iss: 13, pp 3708-3716

TL;DR: Findings suggest a specific biochemical function for TMCO1 and define a superfamily of proteins—the “Oxa1 superfamily”—whose shared function is to facilitate membrane protein biogenesis.

About: This article was published in Cell Reports on 2017-12-26 and is currently open access. It has received 91 citations to date. The article focuses on the topics: ER membrane protein complex & Chloroplast thylakoid membrane.

Citations

Genome-wide association study identifies susceptibility loci for open angle glaucoma at TMCO1 and CDKN2B-AS1

Kathryn P. Burdon, Stuart MacGregor, Alex W. Hewitt, Shiwani Sharma, Glyn Chidlow, Richard A. Mills, Patrick Danoy, Robert J Casson, Ananth C. Viswanathan, Jimmy Z. Liu, John Landers, Anjali K. Henders, John P. M. Wood, Emmanuelle Souzeau, April Crawford, Paul Leo, Jie Jin Wang, Elena Rochtchina, Dale R. Nyholt, Nicholas G. Martin, Grant W. Montgomery, Paul Mitchell, Brown, David A. Mackey, Jamie E Craig

01 Jan 2011

TL;DR: This paper reported a genome-wide association study for open-angle glaucoma (OAG) blindness using a discovery cohort of 590 individuals with severe visual field loss (cases) and 3,956 controls.

Abstract: We report a genome-wide association study for open-angle glaucoma (OAG) blindness using a discovery cohort of 590 individuals with severe visual field loss (cases) and 3,956 controls. We identified associated loci at TMCO1 (rs4656461[G] odds ratio (OR) = 1.68, P = 6.1 × 10⁻¹⁰) and CDKN2B-AS1 (rs4977756[A] OR = 1.50, P = 4.7 × 10⁻⁹). We replicated these associations in an independent cohort of cases with advanced OAG (rs4656461 P = 0.010; rs4977756 P = 0.042) and two additional cohorts of less severe OAG (rs4656461 combined discovery and replication P = 6.00 × 10⁻¹⁴, OR = 1.51, 95% CI 1.35-1.68; rs4977756 combined P = 1.35 × 10⁻¹⁴, OR = 1.39, 95% CI 1.28-1.51). We show retinal expression of genes at both loci in human ocular tissues. We also show that CDKN2A and CDKN2B are upregulated in the retina of a rat model of glaucoma. © 2011 Nature America, Inc. All rights reserved.

347 citations

Journal Article•DOI•

The ER membrane protein complex is a transmembrane domain insertase

Alina Guna¹, Norbert Volkmar², John C. Christianson², Ramanujan S. Hegde¹

Laboratory of Molecular Biology¹, Ludwig Institute for Cancer Research²

26 Jan 2018-Science

TL;DR: It is found that known membrane insertion pathways fail to effectively engage tail-anchored membrane proteins with moderately hydrophobic transmembrane domains, and these proteins are instead shielded in the cytosol by calmodulin.

Abstract: Insertion of proteins into membranes is an essential cellular process. The extensive biophysical and topological diversity of membrane proteins necessitates multiple insertion pathways that remain incompletely defined. Here we found that known membrane insertion pathways fail to effectively engage tail-anchored membrane proteins with moderately hydrophobic transmembrane domains. These proteins are instead shielded in the cytosol by calmodulin. Dynamic release from calmodulin allowed sampling of the endoplasmic reticulum (ER), where the conserved ER membrane protein complex (EMC) was shown to be essential for efficient insertion in vitro and in cells. Purified EMC in synthetic liposomes catalyzed the insertion of its substrates in a reconstituted system. Thus, EMC is a transmembrane domain insertase, a function that may explain its widely pleiotropic membrane-associated phenotypes across organisms.

204 citations

Journal Article•DOI•

Biochemistry and Molecular Biology of Flaviviruses

Nicholas J. Barrows¹,², Rafael K. Campos¹,², Kuo-Chieh Liao³, K. Reddisiva Prasanth¹, Ruben Soto-Acosta¹, Shih Chia Yeh³, Geraldine Schott-Lerner¹, Julien Pompon³,⁴, October M. Sessions³, Shelton S. Bradrick¹, Mariano A. Garcia-Blanco¹,³

University of Texas Medical Branch¹, Duke University², National University of Singapore³, University of Montpellier⁴

13 Apr 2018-Chemical Reviews

TL;DR: This review examines the molecular biology of flaviviruses touching on the structure and function of viral components and how these interact with host factors, and highlights the role of a noncoding RNA produced by flaviviruses to impair antiviral host immune responses.

Abstract: Flaviviruses, such as dengue, Japanese encephalitis, tick-borne encephalitis, West Nile, yellow fever, and Zika viruses, are critically important human pathogens that sicken a staggeringly high number of humans every year. Most of these pathogens are transmitted by mosquitos, and not surprisingly, as the earth warms and human populations grow and move, their geographic reach is increasing. Flaviviruses are simple RNA–protein machines that carry out protein synthesis, genome replication, and virion packaging in close association with cellular lipid membranes. In this review, we examine the molecular biology of flaviviruses touching on the structure and function of viral components and how these interact with host factors. The latter are functionally divided into pro-viral and antiviral factors, both of which, not surprisingly, include many RNA binding proteins. In the interface between the virus and the hosts we highlight the role of a noncoding RNA produced by flaviviruses to impair antiviral host immune ...

184 citations

Journal Article•DOI•

EMC Is Required to Initiate Accurate Membrane Protein Topogenesis

Patrick J. Chitwood¹, Szymon Juszkiewicz¹, Alina Guna¹, Sichen Shao², Ramanujan S. Hegde¹

Laboratory of Molecular Biology¹, Harvard University²

29 Nov 2018-Cell

TL;DR: It is found that efficient biogenesis of β1-adrenergic receptor (β1AR) and other G protein-coupled receptors (GPCRs) requires the conserved ER membrane protein complex (EMC), which inserts TMDs co-translationally and cooperates with the Sec61 translocon to ensure accurate topogenesis of many membrane proteins.

145 citations

Journal Article•DOI•

The ER membrane protein complex interacts cotranslationally to enable biogenesis of multipass membrane proteins.

Matthew J Shurtleff¹, Daniel N. Itzhak², Jeffrey A. Hussmann¹, Nicole T. Schirle Oakdale¹, Elizabeth A. Costa¹, Martin C. Jonikas¹, Jimena Weibezahn¹, Katerina D Popova¹, Calvin H. Jan¹, Pavel Sinitcyn², Shruthi S. Vembar³, Hilda Hernandez¹, Jürgen Cox², Alma L. Burlingame¹, Jeffrey L. Brodsky³, Adam Frost¹, Georg H. H. Borner², Jonathan S. Weissman¹

University of California, San Francisco¹, Max Planck Society², University of Pittsburgh³

29 May 2018-eLife

TL;DR: The systematic proteomic approaches revealed that the ER membrane protein complex (EMC) binds to and promotes the biogenesis of a range of multipass transmembrane proteins, with a particular enrichment for transporters.

Abstract: The endoplasmic reticulum (ER) supports biosynthesis of proteins with diverse transmembrane domain (TMD) lengths and hydrophobicity. Features in transmembrane domains such as charged residues in ion channels are often functionally important, but could pose a challenge during cotranslational membrane insertion and folding. Our systematic proteomic approaches in both yeast and human cells revealed that the ER membrane protein complex (EMC) binds to and promotes the biogenesis of a range of multipass transmembrane proteins, with a particular enrichment for transporters. Proximity-specific ribosome profiling demonstrates that the EMC engages clients cotranslationally and immediately following clusters of TMDs enriched for charged residues. The EMC can remain associated after completion of translation, which both protects clients from premature degradation and allows recruitment of substrate-specific and general chaperones. Thus, the EMC broadly enables the biogenesis of multipass transmembrane proteins containing destabilizing features, thereby mitigating the trade-off between function and stability.

145 citations

Cites background from "Identification of Oxa1 Homologs Ope..."

...Interestingly, EMC3 may share a common ancestry with the universally conserved YidC/Oxa1/Alb3 protein family in bacteria and mitochondria (38)....
[...]
...…wide diversity of membrane spanning sequences by directly interacting with select membrane proteins with destabilizing features in TMDs. Interestingly, EMC3 may share a common ancestry with the universally conserved YidC/Oxa1/Alb3 protein family in bacteria and mitochondria (Anghel et al., 2017)....
[...]

References

Journal Article•DOI•

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Stephen F. Altschul¹, Thomas L. Madden, Alejandro A. Schäffer¹, Jinghui Zhang, Zheng Zhang², Webb Miller², David J. Lipman

National Institutes of Health¹, Pennsylvania State University²

01 Sep 1997-Nucleic Acids Research

TL;DR: A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original.

Abstract: The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.
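
The idea of a position-specific score matrix can be illustrated with a minimal Python sketch. This is a deliberately simplified assumption, not PSI-BLAST's actual procedure: it uses a uniform background frequency, a flat pseudocount, and no sequence weighting, and the names `build_pssm` and `score` are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, pseudocount=1.0):
    """Build a toy log-odds matrix from equal-length aligned sequences.

    Assumptions (not taken from PSI-BLAST): uniform background
    frequencies, a flat pseudocount, and no sequence weighting.
    """
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for i in range(length):
        # Ignore gaps and non-standard residues in this column.
        column = [s[i] for s in aligned if s[i] in AMINO_ACIDS]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((column.count(aa) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score(pssm, seq):
    """Sum the per-position scores of an ungapped sequence against the matrix."""
    return sum(col[aa] for col, aa in zip(pssm, seq))
```

In this sketch, residues that dominate a column receive positive scores and residues never seen there receive negative scores, which is what lets a profile search pick up weak similarities that a single-sequence query would miss.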

70,111 citations

"Identification of Oxa1 Homologs Ope..." refers methods in this paper

...For each of these protein families, homologs were retrieved using PSI-Blast (Altschul et al., 1997) with an expected threshold cutoff of 10⁻¹....
[...]

Journal Article•DOI•

MUSCLE: multiple sequence alignment with high accuracy and high throughput

Robert C. Edgar

01 Mar 2004-Nucleic Acids Research

TL;DR: MUSCLE is a new computer program for creating multiple alignments of protein sequences that includes fast distance estimation using kmer counting, progressive alignment using a new profile function the authors call the log-expectation score, and refinement using tree-dependent restricted partitioning.

Abstract: We describe MUSCLE, a new computer program for creating multiple alignments of protein sequences. Elements of the algorithm include fast distance estimation using kmer counting, progressive alignment using a new profile function we call the log-expectation score, and refinement using tree-dependent restricted partitioning. The speed and accuracy of MUSCLE are compared with T-Coffee, MAFFT and CLUSTALW on four test sets of reference alignments: BAliBASE, SABmark, SMART and a new benchmark, PREFAB. MUSCLE achieves the highest, or joint highest, rank in accuracy on each of these sets. Without refinement, MUSCLE achieves average accuracy statistically indistinguishable from T-Coffee and MAFFT, and is the fastest of the tested methods for large numbers of sequences, aligning 5000 sequences of average length 350 in 7 min on a current desktop computer. The MUSCLE program, source code and PREFAB test data are freely available at http://www.drive5.com/muscle.
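
The k-mer counting step named in the abstract can be sketched in a few lines of Python. The function names, the default k = 3, and the distance formula (one minus the fraction of shared k-mers, normalized by the shorter sequence) are assumptions chosen for illustration; MUSCLE's own distance measure and alphabet handling may differ.

```python
from collections import Counter

def kmer_counts(seq, k=3):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_distance(a, b, k=3):
    """Alignment-free distance: 1 - (shared k-mers / k-mers in the shorter sequence).

    Assumption: this normalization is illustrative, not MUSCLE's exact formula.
    """
    denom = min(len(a), len(b)) - k + 1
    if denom <= 0:
        raise ValueError("both sequences must be at least k residues long")
    # Counter intersection keeps the minimum count of each shared k-mer.
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    return 1.0 - shared / denom
```

Because no alignment is computed, a distance like this can be evaluated for every pair of sequences quickly, which is why it is suited to building a first guide tree.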

37,524 citations

"Identification of Oxa1 Homologs Ope..." refers methods in this paper

...[PubMed: 9184221] Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput....
[...]
...Proteins in this list were then aligned using MUSCLE (Edgar, 2004)....
[...]

Journal Article•DOI•

A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood.

Stéphane Guindon¹, Olivier Gascuel¹

Centre national de la recherche scientifique¹

01 Oct 2003-Systematic Biology

TL;DR: This work has used extensive and realistic computer simulations to show that the topological accuracy of this new method is at least as high as that of the existing maximum-likelihood programs and much higher than the performance of distance-based and parsimony approaches.

Abstract: The increase in the number of large data sets and the complexity of current probabilistic sequence evolution models necessitates fast and reliable phylogeny reconstruction methods. We describe a new approach, based on the maximum-likelihood principle, which clearly satisfies these requirements. The core of this method is a simple hill-climbing algorithm that adjusts tree topology and branch lengths simultaneously. This algorithm starts from an initial tree built by a fast distance-based method and modifies this tree to improve its likelihood at each iteration. Due to this simultaneous adjustment of the topology and branch lengths, only a few iterations are sufficient to reach an optimum. We used extensive and realistic computer simulations to show that the topological accuracy of this new method is at least as high as that of the existing maximum-likelihood programs and much higher than the performance of distance-based and parsimony approaches. The reduction of computing time is dramatic in comparison with other maximum-likelihood packages, while the likelihood maximization ability tends to be higher. For example, only 12 min were required on a standard personal computer to analyze a data set consisting of 500 rbcL sequences with 1,428 base pairs from plant plastids, thus reaching a speed of the same order as some popular distance-based and parsimony algorithms. This new method is implemented in the PHYML program, which is freely available on our web page: http://www.lirmm.fr/w3ifa/MAAS/. (Algorithm; computer simulations; maximum likelihood; phylogeny; rbcL; RDPII project.)

The size of homologous sequence data sets has increased dramatically in recent years, and many of these data sets now involve several hundreds of taxa. Moreover, current probabilistic sequence evolution models (Swofford et al., 1996; Page and Holmes, 1998), notably those including rate variation among sites (Uzzell and Corbin, 1971; Jin and Nei, 1990; Yang, 1996), require an increasing number of calculations. Therefore, the speed of phylogeny reconstruction methods is becoming a significant requirement and good compromises between speed and accuracy must be found.

The maximum likelihood (ML) approach is especially accurate for building molecular phylogenies. Felsenstein (1981) brought this framework to nucleotide-based phylogenetic inference, and it was later also applied to amino acid sequences (Kishino et al., 1990). Several variants were proposed, most notably the Bayesian methods (Rannala and Yang 1996; and see below), and the discrete Fourier analysis of Hendy et al. (1994), for example. Numerous computer studies (Huelsenbeck and Hillis, 1993; Kuhner and Felsenstein, 1994; Huelsenbeck, 1995; Rosenberg and Kumar, 2001; Ranwez and Gascuel, 2002) have shown that ML programs can recover the correct tree from simulated data sets more frequently than other methods can. Another important advantage of the ML approach is the ability to compare different trees and evolutionary models within a statistical framework (see Whelan et al., 2001, for a review). However, like all optimality criterion-based phylogenetic reconstruction approaches, ML is hampered by computational difficulties, making it impossible to obtain the optimal tree with certainty from even moderate data sets (Swofford et al., 1996). Therefore, all practical methods rely on heuristics that obtain near-optimal trees in reasonable computing time. Moreover, the computation problem is especially difficult with ML, because the tree likelihood not only depends on the tree topology but also on numerical parameters, including branch lengths. Even computing the optimal values of these parameters on a single tree is not an easy task, particularly because of possible local optima (Chor et al., 2000).

The usual heuristic method, implemented in the popular PHYLIP (Felsenstein, 1993) and PAUP* (Swofford, 1999) packages, is based on hill climbing. It combines stepwise insertion of taxa in a growing tree and topological rearrangement. For each possible insertion position and rearrangement, the branch lengths of the resulting tree are optimized and the tree likelihood is computed. When the rearrangement improves the current tree or when the position insertion is the best among all possible positions, the corresponding tree becomes the new current tree. Simple rearrangements are used during tree growing, namely "nearest neighbor interchanges" (see below), while more intense rearrangements can be used once all taxa have been inserted. The procedure stops when no rearrangement improves the current best tree. Despite significant decreases in computing times, notably in fastDNAml (Olsen et al., 1994), this heuristic becomes impracticable with several hundreds of taxa. This is mainly due to the two-level strategy, which separates branch lengths and tree topology optimization. Indeed, most calculations are done to optimize the branch lengths and evaluate the likelihood of trees that are finally rejected.

New methods have thus been proposed. Strimmer and von Haeseler (1996) and others have assembled four-taxon (quartet) trees inferred by ML, in order to reconstruct a complete tree. However, the results of this approach have not been very satisfactory to date (Ranwez and Gascuel, 2001). Ota and Li (2000, 2001) described
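
The nearest neighbor interchange and hill-climbing ideas described above can be sketched in Python for a single internal branch. This is an illustration under stated assumptions, not the PHYML algorithm: it handles topology only (no branch lengths), the tree is reduced to the four subtrees around one branch, and `score` is a hypothetical stand-in for a tree likelihood.

```python
def nni_neighbors(quartet):
    """Return the two topologies reachable by one nearest neighbor interchange.

    `quartet` is ((a, b), (c, d)): the four subtrees hanging off one
    internal branch, grouped by the side of the branch they attach to.
    """
    (a, b), (c, d) = quartet
    return [((a, c), (b, d)), ((a, d), (b, c))]

def hill_climb_step(quartet, score):
    """Keep the current topology unless an NNI neighbor scores strictly higher.

    `score` is a hypothetical callable standing in for a likelihood function.
    """
    candidates = [quartet] + nni_neighbors(quartet)
    # max() returns the first maximal item, so ties keep the current topology.
    return max(candidates, key=score)
```

Repeating such a step over every internal branch until no rearrangement improves the score gives the basic hill-climbing loop; the approach described in the abstract additionally adjusts branch lengths at the same time as the topology.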

16,261 citations

"Identification of Oxa1 Homologs Ope..." refers methods in this paper

...A maximum-likelihood phylogenetic tree was built using PhyML-SMS (Guindon and Gascuel, 2003) using nearest-neighbor interchange (NNI) and the Akaike information criterion....
[...]

Journal Article•DOI•

Crystallography & NMR System: A New Software Suite for Macromolecular Structure Determination

Axel T. Brunger¹,², Paul D. Adams¹, G M Clore³, W. L. DeLano⁴, Piet Gros⁵, R.W. Grosse-Kunstleve¹,², Jiansheng Jiang⁶, J. Kuszewski³, Michael Nilges, Navraj S. Pannu⁷, Randy J. Read⁷, Luke M. Rice¹, Thomas Simonson⁸, Gregory L. Warren¹

Yale University¹, Howard Hughes Medical Institute², National Institutes of Health³, University of California, San Francisco⁴, Utrecht University⁵, Brookhaven National Laboratory⁶, University of Alberta⁷, Centre national de la recherche scientifique⁸

01 Sep 1998-Acta Crystallographica Section D-biological Crystallography

TL;DR: The Crystallography & NMR System (CNS) is a software suite for macromolecular structure determination by X-ray crystallography or solution nuclear magnetic resonance (NMR) spectroscopy.

Abstract: A new software suite, called Crystallography & NMR System (CNS), has been developed for macromolecular structure determination by X-ray crystallography or solution nuclear magnetic resonance (NMR) spectroscopy. In contrast to existing structure-determination programs the architecture of CNS is highly flexible, allowing for extension to other structure-determination methods, such as electron microscopy and solid-state NMR spectroscopy. CNS has a hierarchical structure: a high-level hypertext markup language (HTML) user interface, task-oriented user input files, module files, a symbolic structure-determination language (CNS language), and low-level source code. Each layer is accessible to the user. The novice user may just use the HTML interface, while the more advanced user may use any of the other layers. The source code will be distributed, thus source-code modification is possible. The CNS language is sufficiently powerful and flexible that many new algorithms can be easily implemented in the CNS language without changes to the source code. The CNS language allows the user to perform operations on data structures, such as structure factors, electron-density maps, and atomic properties. The power of the CNS language has been demonstrated by the implementation of a comprehensive set of crystallographic procedures for phasing, density modification and refinement. User-friendly task-oriented input files are available for nearly all aspects of macromolecular structure determination by X-ray crystallography and solution NMR.

15,182 citations

Journal Article•DOI•

trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses

Salvador Capella-Gutierrez, José M. Silla-Martínez, Toni Gabaldón

01 Aug 2009-Bioinformatics

TL;DR: TrimAl is a tool for automated alignment trimming, which is especially suited for large-scale phylogenetic analyses and can automatically select the parameters to be used in each specific alignment so that the signal-to-noise ratio is optimized.

Abstract: Multiple sequence alignments (MSA) are central to many areas of bioinformatics, including phylogenetics, homology modeling, database searches and motif finding. Recently, such MSA-based techniques have been incorporated in high-throughput pipelines such as genome annotation and phylogenomics analyses. In all these applications, the reliability and accuracy of the analyses depend critically on the quality of the underlying alignments. A plethora of computer programs and algorithms for MSA are currently available (Notredame, 2007), which implement different heuristics to find mathematically optimal solutions to the MSA problem. Accuracies of 80–90% have been reported for the best algorithms, but even the best scoring alignment algorithms may fail with certain protein families or at specific regions in the alignment. The situation worsens in large-scale analyses, where faster but less reliable algorithms and large numbers of automatically selected sequences are used. It is therefore generally assumed that trimming the alignment, so that poorly aligned regions are eliminated, increases the accuracy of the resulting MSA-based applications (Talavera and Castresana, 2007). Some programs such as G-blocks (Castresana, 2000) have been developed to assist in the MSA trimming phase by selecting blocks of conserved regions. They have become very popular and are extensively used, with good performance, in small-to-medium scale datasets, where several parameters can be tested manually (Talavera and Castresana, 2007). However, their use over larger datasets is hampered by the need for defining, prior to the analysis, the set of parameters that will be used for all sequence families. Here, we present trimAl, a tool for automated alignment trimming. 
Its speed and the possibility for automatically adjusting the parameters to improve the phylogenetic signal-to-noise ratio, makes trimAl especially suited for large-scale phylogenomic analyses, involving thousands of large alignments. trimAl has been developed in a GNU/Linux environment using C++ programming language and has been tested on various UNIX, Mac and Windows platforms. Moreover, we have developed a web server to run trimAl online (http://phylemon2.bioinfo.cipf.es/), which has been included in the Phylemon suite for phylogenetic and phylogenomic tools (Tarraga et al., 2007). The documentation, source files and additional information for trimAl are available through a wiki page (http://trimal.cgenomics.org). trimAl reads and renders protein or nucleotide alignments in several standard formats. trimAl starts by reading all columns in an alignment and computes a score (Sx) for each of them. This score can be a gap score (Sg), a similarity score (Ss) or a consistency score (Sc). The score for each column can be computed based only on the information from that column or, if a window size of w is specified, it corresponds to the average value of w columns around the position considered. The gap score (Sg) for a column is the fraction of sequences without a gap in that position. The residue similarity score (Ss) consists of mean distance (MD) scores as described in Thompson et al. (2001) and Supplementary Material. This score uses the MD between pairs of residues, as defined by a given scoring matrix. Finally, the consistency score (Sc) can only be computed when more than one alignment for the same set of sequences is provided. Details on how these scores are computed are provided in the Supplementary Material. In brief, Sc measures the level of consistency of all the residue pairs found in a column as compared with the other alignments. 
The alignment with the highest consistency is chosen and then trimmed to remove the columns that are less conserved, according to Sc or other thresholds set by the user. Once all column scores have been computed trimAl can proceed in two ways. If both a score and a minimum conservation threshold are provided, trimAl renders a trimmed alignment in which only the columns with scores above the score threshold are included, as far as the number of selected columns is above a conservation threshold defined by the user. If this number is below the conservation threshold, trimAl will add more columns to the trimmed alignment in a decreasing order of scores until the conservation threshold is reached. The conservation threshold corresponds to the minimum percentage of columns, from the original alignment, which the user wants to include in the trimmed alignment. Alternatively, if the automatic selection of parameters options is selected, trimAl will compute specific score thresholds depending on the inherent characteristics of each alignment. So far, trimAl incorporates three modes for the automated selection of parameters, gappyout, strict and strictplus, which are based on the different use of gap and similarity scores. Moreover, the option automated1 implements a heuristic to decide the most appropriate mode depending on the alignment characteristics. The heuristics to define such parameters have been designed based on the results of a benchmark. Details on the heuristics and the benchmark can be found in the online documentation of the program. In brief, the automatic selection of parameters approximate optimal cutoffs by plotting, internally, the cumulative graphs of gap and similarity scores of the columns in the alignment (see online documentation). 
We expanded, using ROSE simulations (Stoye et al., 1998) a benchmark set that has been used previously to test the improvement in phylogenetic performance after an alignment trimming phase (Talavera and Castresana, 2007). This dataset simulates several evolutionary scenarios varying in the number and length of the sequences, the topology of the underlying tree and the level of sequence divergence considered. We compared the results obtained from MUSCLE alignments before and after trimming with trimAl using automated selection of parameters. The accuracy of the resulting trees was measured by comparing them with the original trees used to generate the sequence sets, and measuring the Robinson Foulds distance (Robinson and Foulds, 1981). We observed an overall improvement of the phylogenetic accuracy after trimming. Using -automated1 option of trimAl, the trimmed alignment always produced Maximum Likelihood trees that were of equal (36%) or significantly better (64%) quality as compared with the tree derived from the complete alignment. For Neighbor Joining reconstruction the -strictplus option of trimAl worked best, improving the phylogenetic accuracy in 89% of the scenarios. In most scenarios (90%), trimAl outperformed Gblocks v0.91b with default parameters. Most importantly, the use of Gblocks default parameters diminished the accuracy of the subsequent tree reconstruction in half of the scenarios considered. In contrast, the use of trimAl automated methods rarely (1.5%) undermined the topological accuracy of the resulting phylogenetic tree (see Supplementary Material for more details). To test the applicability of trimAl on real datasets as well as its suitability for large-scale phylogenetic datasets, we ran trimAl on the complete set of MUSCLE alignments generated for the Human Phylome project (Huerta-Cepas et al., 2007). This includes a total of 31 182 alignments, containing, on average, 67 sequences of 1472 positions of length. 
Trimming these alignments using the -gappyout and -automated1 options took 5 min 45 s and 125 min 2 s, respectively, on a computer with an Intel QuadCore XEON E5410 processor and 8 GB of RAM. trimAl has been used previously in a pipeline to reconstruct complete collections of gene trees. In this case, the parameter sets used were a minimum conservation threshold of 60% and a gap threshold of 90% (-cons 60 -gt 0.9). Complete and trimmed alignments used to generate the phylomes included in PhylomeDB (Huerta-Cepas et al., 2008) can be viewed through this database.
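
The gap score and the two-threshold trimming rule described in this abstract can be sketched in Python. This follows the prose only and is not trimAl's source code: the function names, the use of `>=` for the score threshold, rounding the minimum column count up, and breaking ties by leftmost column are all assumptions. The defaults mirror the `-gt 0.9 -cons 60` setting quoted above.

```python
import math

def gap_scores(alignment):
    """Per-column gap score: the fraction of sequences without a gap."""
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    length = len(alignment[0])
    if any(len(s) != length for s in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n = len(alignment)
    return [sum(s[i] != "-" for s in alignment) / n for i in range(length)]

def trim(alignment, gap_threshold=0.9, min_conserved_pct=60):
    """Keep columns passing the gap threshold, then top up to the minimum.

    If fewer than `min_conserved_pct` percent of the original columns
    pass, the best-scoring remaining columns are added until that
    minimum is reached (ties broken by leftmost column: an assumption).
    """
    scores = gap_scores(alignment)
    keep = {i for i, s in enumerate(scores) if s >= gap_threshold}
    min_cols = math.ceil(len(scores) * min_conserved_pct / 100)
    if len(keep) < min_cols:
        # sorted() is stable, so equal scores stay in left-to-right order.
        for i in sorted(range(len(scores)), key=lambda j: -scores[j]):
            if len(keep) >= min_cols:
                break
            keep.add(i)
    cols = sorted(keep)
    return ["".join(s[i] for i in cols) for s in alignment]
```

For example, with the three sequences `ACD-F`, `AC--F` and `A-DEF`, only the first and last columns pass a 0.9 gap threshold; the 60% minimum then pulls in one more column, so the trimmed alignment has three columns.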

6,807 citations