jModelTest 2: more models, new heuristics and parallel computing.

doi:10.1038/NMETH.2109

Journal Article•DOI•

Diego Darriba¹,², Guillermo L. Taboada², Ramón Doallo², David Posada¹•Institutions (2)

University of Vigo¹, University of A Coruña²

01 Aug 2012-Nature Methods (Nature Research)-Vol. 9, Iss: 8, pp 772-772

TL;DR: jModelTest 2: more models, new heuristics and parallel computing Diego Darriba, Guillermo L. Taboada, Ramón Doallo and David Posada.

Abstract: jModelTest 2: more models, new heuristics and parallel computing. Diego Darriba, Guillermo L. Taboada, Ramón Doallo and David Posada.

Supplementary Table 1. New features in jModelTest 2
Supplementary Table 2. Model selection accuracy
Supplementary Table 3. Mean square errors for model averaged estimates
Supplementary Note 1. Hill-climbing hierarchical clustering algorithm
Supplementary Note 2. Heuristic filtering
Supplementary Note 3. Simulations from prior distributions
Supplementary Note 4. Speed-up benchmark on real and simulated datasets

Citations

Journal Article•DOI•

ModelFinder: fast model selection for accurate phylogenetic estimates

Subha Kalyaanamoorthy¹,², Bui Quang Minh³, Thomas K. F. Wong¹,⁴, Arndt von Haeseler⁵,⁶, Lars S. Jermiin¹,⁴•Institutions (6)

Commonwealth Scientific and Industrial Research Organisation¹, University of Alberta², Max F. Perutz Laboratories³, Australian National University⁴, Medical University of Vienna⁵, University of Vienna⁶

01 Jun 2017-Nature Methods

TL;DR: ModelFinder is presented, a fast model-selection method that greatly improves the accuracy of phylogenetic estimates by incorporating a model of rate heterogeneity across sites not previously considered in this context and by allowing concurrent searches of model space and tree space.

Abstract: Model-based molecular phylogenetics plays an important role in comparisons of genomic data, and model selection is a key step in all such analyses. We present ModelFinder, a fast model-selection method that greatly improves the accuracy of phylogenetic estimates by incorporating a model of rate heterogeneity across sites not previously considered in this context and by allowing concurrent searches of model space and tree space.

7,425 citations

Journal Article•DOI•

W-IQ-TREE: a fast online phylogenetic tool for maximum likelihood analysis.

Jana Trifinopoulos¹, Lam Tung Nguyen¹, Arndt von Haeseler¹, Bui Quang Minh¹•Institutions (1)

Medical University of Vienna¹

08 Jul 2016-Nucleic Acids Research

TL;DR: W-IQ-TREE supports multiple sequence types in common alignment formats and a wide range of evolutionary models including mixture and partition models, performing fast model selection, partition scheme finding, efficient tree reconstruction, ultrafast bootstrapping, branch tests, and tree topology tests.

Abstract: This article presents W-IQ-TREE, an intuitive and user-friendly web interface and server for IQ-TREE, an efficient phylogenetic software for maximum likelihood analysis. W-IQ-TREE supports multiple sequence types (DNA, protein, codon, binary and morphology) in common alignment formats and a wide range of evolutionary models including mixture and partition models. W-IQ-TREE performs fast model selection, partition scheme finding, efficient tree reconstruction, ultrafast bootstrapping, branch tests, and tree topology tests. All computations are conducted on a dedicated computer cluster and the users receive the results via URL or email. W-IQ-TREE is available at http://iqtree.cibiv.univie.ac.at It is free and open to all users and there is no login requirement.

2,488 citations

Cites methods from "jModelTest 2: more models, new heur..."

...W-IQ-TREE supports a ‘standard’ model selection procedure like jModelTest (18) and ProtTest (19) as well as an extended procedure (i....

Journal Article•DOI•

ConSurf 2016: an improved methodology to estimate and visualize evolutionary conservation in macromolecules.

Haim Ashkenazy¹, Shiran Abadi¹, Eric Martz², Ofer Chay¹, Itay Mayrose¹, Tal Pupko¹, Nir Ben-Tal¹•Institutions (2)

Tel Aviv University¹, University of Massachusetts Amherst²

08 Jul 2016-Nucleic Acids Research

TL;DR: Several new features are introduced into ConSurf, including automatic selection of the best evolutionary model used to infer the rates, the ability to homology-model query proteins, prediction of the secondary structure of query RNA molecules from sequence, the ability to view the biological assembly of a query (in addition to the single chain), mapping of the conservation grades onto 2D RNA models and an advanced view of the phylogenetic tree.

Abstract: The degree of evolutionary conservation of an amino acid in a protein or a nucleic acid in DNA/RNA reflects a balance between its natural tendency to mutate and the overall need to retain the structural integrity and function of the macromolecule. The ConSurf web server (http://consurf.tau.ac.il), established over 15 years ago, analyses the evolutionary pattern of the amino/nucleic acids of the macromolecule to reveal regions that are important for structure and/or function. Starting from a query sequence or structure, the server automatically collects homologues, infers their multiple sequence alignment and reconstructs a phylogenetic tree that reflects their evolutionary relations. These data are then used, within a probabilistic framework, to estimate the evolutionary rates of each sequence position. Here we introduce several new features into ConSurf, including automatic selection of the best evolutionary model used to infer the rates, the ability to homology-model query proteins, prediction of the secondary structure of query RNA molecules from sequence, the ability to view the biological assembly of a query (in addition to the single chain), mapping of the conservation grades onto 2D RNA models and an advanced view of the phylogenetic tree that enables interactively rerunning ConSurf with the taxa of a sub-tree.

2,159 citations

Journal Article•DOI•

A DNA-Based Registry for All Animal Species: The Barcode Index Number (BIN) System

Sujeevan Ratnasingham¹, Paul D. N. Hebert¹•Institutions (1)

University of Guelph¹

08 Jul 2013-PLOS ONE

TL;DR: A persistent, species-level taxonomic registry for the animal kingdom is developed based on the analysis of patterns of nucleotide variation in the barcode region of the cytochrome c oxidase I (COI) gene.

Abstract: Because many animal species are undescribed, and because the identification of known species is often difficult, interim taxonomic nomenclature has often been used in biodiversity analysis. By assigning individuals to presumptive species, called operational taxonomic units (OTUs), these systems speed investigations into the patterning of biodiversity and enable studies that would otherwise be impossible. Although OTUs have conventionally been separated through their morphological divergence, DNA-based delineations are not only feasible, but have important advantages. OTU designation can be automated, data can be readily archived, and results can be easily compared among investigations. This study exploits these attributes to develop a persistent, species-level taxonomic registry for the animal kingdom based on the analysis of patterns of nucleotide variation in the barcode region of the cytochrome c oxidase I (COI) gene. It begins by examining the correspondence between groups of specimens identified to a species through prior taxonomic work and those inferred from the analysis of COI sequence variation using one new (RESL) and four established (ABGD, CROP, GMYC, jMOTU) algorithms. It subsequently describes the implementation, and structural attributes of the Barcode Index Number (BIN) system. Aside from a pragmatic role in biodiversity assessments, BINs will aid revisionary taxonomy by flagging possible cases of synonymy, and by collating geographical information, descriptive metadata, and images for specimens that are likely to belong to the same species, even if it is undescribed. More than 274,000 BIN web pages are now available, creating a biodiversity resource that is positioned for rapid growth.

1,571 citations

Cites methods from "jModelTest 2: more models, new heur..."

...Prior to phylogeny reconstruction, the most appropriate model of evolution was separately estimated for each dataset from alignments using jModelTest [50]....

Journal Article•DOI•

SMS: Smart Model Selection in PhyML.

Vincent Lefort¹, Jean-Emmanuel Longueville¹, Olivier Gascuel¹,²•Institutions (2)

University of Montpellier¹, Pasteur Institute²

01 Sep 2017-Molecular Biology and Evolution

TL;DR: The software, “Smart Model Selection” (SMS), is implemented in the PhyML environment and available using two interfaces: command-line (to be integrated in pipelines) and a web server (http://www.atgc-montpellier.fr/phyml-sms/).

Abstract: Model selection using likelihood-based criteria (e.g., AIC) is one of the first steps in phylogenetic analysis. One must select both a substitution matrix and a model for rates across sites. A simple method is to test all combinations and select the best one. We describe heuristics to avoid these extensive calculations. Runtime is divided by ~2 with results remaining nearly the same, and the method performs well compared with ProtTest and jModelTest2. Our software, "Smart Model Selection" (SMS), is implemented in the PhyML environment and available using two interfaces: command-line (to be integrated in pipelines) and a web server (http://www.atgc-montpellier.fr/phyml-sms/).

1,323 citations

Cites background or methods from "jModelTest 2: more models, new heur..."

...Below, we summarize the main features of SMS and its performance compared with the exhaustive approach, as well as to jModelTest2 (Darriba et al. 2012) and ProtTest....
[...]
...These strategies are partly inspired by Posada and Crandall (1998) and Darriba et al. (2012)....

References

Journal Article•DOI•

A new look at the statistical model identification

Hirotugu Akaike

01 Dec 1974-IEEE Transactions on Automatic Control

TL;DR: In this article, a new estimate, the minimum information theoretical criterion (AIC) estimate (MAICE), is introduced for the purpose of statistical identification; it is free from the ambiguities inherent in the application of conventional hypothesis testing procedure.

Abstract: The history of the development of statistical hypothesis testing in time series analysis is reviewed briefly and it is pointed out that the hypothesis testing procedure is not adequately defined as the procedure for statistical model identification. The classical maximum likelihood estimation procedure is reviewed and a new estimate minimum information theoretical criterion (AIC) estimate (MAICE) which is designed for the purpose of statistical identification is introduced. When there are several competing models the MAICE is defined by the model and the maximum likelihood estimates of the parameters which give the minimum of AIC defined by AIC = (-2) log(maximum likelihood) + 2 (number of independently adjusted parameters within the model). MAICE provides a versatile procedure for statistical model identification which is free from the ambiguities inherent in the application of conventional hypothesis testing procedure. The practical utility of MAICE in time series analysis is demonstrated with some numerical examples.
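
The AIC definition quoted in the abstract above can be illustrated with a minimal Python sketch. The model names, log-likelihoods and parameter counts below are hypothetical values chosen for illustration only; they are not taken from any of the papers listed here.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """AIC = (-2) log(maximum likelihood) + 2 (number of adjusted parameters)."""
    return -2.0 * log_likelihood + 2.0 * n_params


# Hypothetical candidates: name -> (maximized log-likelihood, free parameters).
# The numbers are made up for illustration.
candidates = {
    "JC": (-3100.0, 0),
    "HKY": (-3050.0, 4),
    "GTR": (-3047.0, 8),
}

scores = {name: aic(lnl, k) for name, (lnl, k) in candidates.items()}

# The MAICE is the candidate that gives the minimum AIC.
best = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: AIC = {score:.1f}")
print("Selected model:", best)
```

With these made-up inputs, the extra parameters of the largest model do not buy enough likelihood to offset the penalty, so the intermediate model is selected.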

47,133 citations

Journal Article•DOI•

Estimating the Dimension of a Model

Gideon Schwarz

01 Mar 1978-Annals of Statistics

TL;DR: In this paper, the problem of selecting one of a number of models of different dimensions is treated by finding its Bayes solution, and evaluating the leading terms of its asymptotic expansion.

Abstract: The problem of selecting one of a number of models of different dimensions is treated by finding its Bayes solution, and evaluating the leading terms of its asymptotic expansion. These terms are a valid large-sample criterion beyond the Bayesian context, since they do not depend on the a priori distribution.
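
The abstract above does not spell out the resulting criterion. In the form in which it is commonly written (an assumption here, not a quotation from the paper), the Schwarz or Bayesian information criterion is BIC = -2 log(maximum likelihood) + k log(n), where k is the number of free parameters and n the sample size. A minimal Python sketch with hypothetical inputs:

```python
import math


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Commonly used form of the Schwarz criterion: -2 lnL + k ln(n)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


# Hypothetical inputs: two candidate models fitted to the same 1000 observations.
n_obs = 1000
candidates = {
    "simple": (-3050.0, 4),
    "complex": (-3047.0, 8),
}

scores = {name: bic(lnl, k, n_obs) for name, (lnl, k) in candidates.items()}

# The model with the smallest BIC is preferred.
best = min(scores, key=scores.get)
print("Selected model:", best)
```

Because the penalty per parameter grows with log(n), this criterion tends to favor smaller models than a fixed per-parameter penalty would once n is moderately large.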

38,681 citations

Estimating the dimension of a model

Gideon Schwarz

01 Jan 2005

36,760 citations

Journal Article•DOI•

MODELTEST: testing the model of DNA substitution.

David Posada¹, Keith A. Crandall•Institutions (1)

Brigham Young University¹

01 Jan 1998-Bioinformatics

TL;DR: The program MODELTEST uses log likelihood scores to establish the model of DNA evolution that best fits the data.

Abstract: Summary: The program MODELTEST uses log likelihood scores to establish the model of DNA evolution that best fits the data. Availability: The MODELTEST package, including the source code and some documentation is available at http://bioag.byu.edu/zoology/crandall―lab/modeltest.html. Contact: dp47@email.byu.edu.

20,105 citations

Journal Article•DOI•

A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood.

Stéphane Guindon¹, Olivier Gascuel¹•Institutions (1)

Centre national de la recherche scientifique¹

01 Oct 2003-Systematic Biology

TL;DR: This work has used extensive and realistic computer simulations to show that the topological accuracy of this new method is at least as high as that of the existing maximum-likelihood programs and much higher than the performance of distance-based and parsimony approaches.

Abstract: The increase in the number of large data sets and the complexity of current probabilistic sequence evolution models necessitates fast and reliable phylogeny reconstruction methods. We describe a new approach, based on the maximum-likelihood principle, which clearly satisfies these requirements. The core of this method is a simple hill-climbing algorithm that adjusts tree topology and branch lengths simultaneously. This algorithm starts from an initial tree built by a fast distance-based method and modifies this tree to improve its likelihood at each iteration. Due to this simultaneous adjustment of the topology and branch lengths, only a few iterations are sufficient to reach an optimum. We used extensive and realistic computer simulations to show that the topological accuracy of this new method is at least as high as that of the existing maximum-likelihood programs and much higher than the performance of distance-based and parsimony approaches. The reduction of computing time is dramatic in comparison with other maximum-likelihood packages, while the likelihood maximization ability tends to be higher. For example, only 12 min were required on a standard personal computer to analyze a data set consisting of 500 rbcL sequences with 1,428 base pairs from plant plastids, thus reaching a speed of the same order as some popular distance-based and parsimony algorithms. This new method is implemented in the PHYML program, which is freely available on our web page: http://www.lirmm.fr/w3ifa/MAAS/. (Algorithm; computer simulations; maximum likelihood; phylogeny; rbcL; RDPII project.)

The size of homologous sequence data sets has increased dramatically in recent years, and many of these data sets now involve several hundreds of taxa.
Moreover, current probabilistic sequence evolution models (Swofford et al., 1996; Page and Holmes, 1998), notably those including rate variation among sites (Uzzell and Corbin, 1971; Jin and Nei, 1990; Yang, 1996), require an increasing number of calculations. Therefore, the speed of phylogeny reconstruction methods is becoming a significant requirement and good compromises between speed and accuracy must be found. The maximum likelihood (ML) approach is especially accurate for building molecular phylogenies. Felsenstein (1981) brought this framework to nucleotide-based phylogenetic inference, and it was later also applied to amino acid sequences (Kishino et al., 1990). Several variants were proposed, most notably the Bayesian methods (Rannala and Yang 1996; and see below), and the discrete Fourier analysis of Hendy et al. (1994), for example. Numerous computer studies (Huelsenbeck and Hillis, 1993; Kuhner and Felsenstein, 1994; Huelsenbeck, 1995; Rosenberg and Kumar, 2001; Ranwez and Gascuel, 2002) have shown that ML programs can recover the correct tree from simulated data sets more frequently than other methods can. Another important advantage of the ML approach is the ability to compare different trees and evolutionary models within a statistical framework (see Whelan et al., 2001, for a review). However, like all optimality criterion-based phylogenetic reconstruction approaches, ML is hampered by computational difficulties, making it impossible to obtain the optimal tree with certainty from even moderate data sets (Swofford et al., 1996). Therefore, all practical methods rely on heuristics that obtain near-optimal trees in reasonable computing time. Moreover, the computation problem is especially difficult with ML, because the tree likelihood not only depends on the tree topology but also on numerical parameters, including branch lengths.
Even computing the optimal values of these parameters on a single tree is not an easy task, particularly because of possible local optima (Chor et al., 2000). The usual heuristic method, implemented in the popular PHYLIP (Felsenstein, 1993) and PAUP* (Swofford, 1999) packages, is based on hill climbing. It combines stepwise insertion of taxa in a growing tree and topological rearrangement. For each possible insertion position and rearrangement, the branch lengths of the resulting tree are optimized and the tree likelihood is computed. When the rearrangement improves the current tree or when the position insertion is the best among all possible positions, the corresponding tree becomes the new current tree. Simple rearrangements are used during tree growing, namely "nearest neighbor interchanges" (see below), while more intense rearrangements can be used once all taxa have been inserted. The procedure stops when no rearrangement improves the current best tree. Despite significant decreases in computing times, notably in fastDNAml (Olsen et al., 1994), this heuristic becomes impracticable with several hundreds of taxa. This is mainly due to the two-level strategy, which separates branch lengths and tree topology optimization. Indeed, most calculations are done to optimize the branch lengths and evaluate the likelihood of trees that are finally rejected. New methods have thus been proposed. Strimmer and von Haeseler (1996) and others have assembled four-taxon (quartet) trees inferred by ML, in order to reconstruct a complete tree. However, the results of this approach have not been very satisfactory to date (Ranwez and Gascuel, 2001). Ota and Li (2000, 2001) described

16,261 citations

"jModelTest 2: more models, new heur..." refers background in this paper

...…
B | HIV-1 | whole genome | 138 | 10693 | http://www.hiv.lanl.gov/
C | Yeast | 106 genes | 8 | 127060 | Rokas et al. (2003)
D | simulated | -- | 40 | 500 | Guindon and Gascuel (2003)
E | simulated | -- | 100 | 500 | Guindon and Gascuel (2003)
Datasets A and B are trimmed alignments initially downloaded from http://www.hiv.lanl.gov/....