
Showing papers in "Molecular Biology and Evolution in 2010"


Journal ArticleDOI
TL;DR: SeaView version 4 combines all the functions of the widely used programs SeaView and Phylo_win, and expands them by adding network access to sequence databases, alignment with arbitrary algorithm, maximum-likelihood tree building with PhyML, and display, printing, and copy-to-clipboard of rooted or unrooted, binary or multifurcating phylogenetic trees.
Abstract: We present SeaView version 4, a multiplatform program designed to facilitate multiple alignment and phylogenetic tree building from molecular sequence data through the use of a graphical user interface. SeaView version 4 combines all the functions of the widely used programs SeaView (in its previous versions) and Phylo_win, and expands them by adding network access to sequence databases, alignment with arbitrary algorithm, maximum-likelihood tree building with PhyML, and display, printing, and copy-to-clipboard of rooted or unrooted, binary or multifurcating phylogenetic trees. In relation to the wide present offer of tools and algorithms for phylogenetic analyses, SeaView is especially useful for teaching and for occasional users of such software. SeaView is freely available at http://pbil.univ-lyon1.fr/software/seaview.

5,074 citations


Journal ArticleDOI
TL;DR: It is demonstrated that both BEST and the new Bayesian Markov chain Monte Carlo method for the multispecies coalescent have much better estimation accuracy for species tree topology than concatenation, and the method outperforms BEST in divergence time and population size estimation.
Abstract: Until recently, it has been common practice for a phylogenetic analysis to use a single gene sequence from a single individual organism as a proxy for an entire species. With technological advances, it is now becoming more common to collect data sets containing multiple gene loci and multiple individuals per species. These data sets often reveal the need to directly model intraspecies polymorphism and incomplete lineage sorting in phylogenetic estimation procedures. For a single species, coalescent theory is widely used in contemporary population genetics to model intraspecific gene trees. Here, we present a Bayesian Markov chain Monte Carlo method for the multispecies coalescent. Our method coestimates multiple gene trees embedded in a shared species tree along with the effective population size of both extant and ancestral species. The inference is made possible by multilocus data from multiple individuals per species. Using a multiindividual data set and a series of simulations of rapid species radiations, we demonstrate the efficacy of our new method. These simulations give some insight into the behavior of the method as a function of sampled individuals, sampled loci, and sequence length. Finally, we compare our new method to both an existing method (BEST 2.2) with similar goals and the supermatrix (concatenation) method. We demonstrate that both BEST and our method have much better estimation accuracy for species tree topology than concatenation, and our method outperforms BEST in divergence time and population size estimation.

2,401 citations


Journal ArticleDOI
Jody Hey
TL;DR: A method for studying the divergence of multiple closely related populations is described and assessed and analysis of simulated data sets reveals the kinds of history that are accessible with a multipopulation analysis.
Abstract: A method for studying the divergence of multiple closely related populations is described and assessed. The approach of Hey and Nielsen (2007, Integration within the Felsenstein equation for improved Markov chain Monte Carlo methods in population genetics. Proc Natl Acad Sci USA. 104:2785-2790) for fitting an isolation-with-migration model was extended to the case of multiple populations with a known phylogeny. Analysis of simulated data sets reveals the kinds of history that are accessible with a multipopulation analysis. Necessarily, processes associated with older time periods in a phylogeny are more difficult to estimate; and histories with high levels of gene flow are particularly difficult with more than two populations. However, for histories with modest levels of gene flow, or for very large data sets, it is possible to study large complex divergence problems that involve multiple closely related populations or species.

819 citations


Journal ArticleDOI
TL;DR: A well-established biogeographic barrier is used, the mid-Aegean trench separating the western and eastern Aegean archipelago, to estimate substitution rates in tenebrionid beetles, and a divergence rate of 3.54% My(-1) for the cox1 gene was obtained under the preferred partitioning scheme and substitution model selected using Bayes factors.
Abstract: Phylogenetic trees in insects are frequently dated by applying a "standard" mitochondrial DNA (mtDNA) clock estimated at 2.3% My(-1), but despite its wide use, reliable calibration points have been lacking. Here, we used a well-established biogeographic barrier, the mid-Aegean trench separating the western and eastern Aegean archipelago, to estimate substitution rates in tenebrionid beetles. Cytochrome oxidase I (cox1) for six codistributed genera across 28 islands (444 individuals) on both sides of the mid-Aegean trench revealed 60 independently coalescing entities delimited with a mixed Yule-coalescent model. One representative per entity was used for phylogenetic analysis of mitochondrial (cox1, 16S rRNA) and nuclear (Mp20, 28S rRNA) genes. Six nodes marked geographically congruent east–west splits whose separation was largely contemporaneous and likely to reflect the formation of the mid-Aegean trench at 9–12 Mya. Based on these "known" dates, a divergence rate of 3.54% My(-1) for the cox1 gene (2.69% when combined with the 16S rRNA gene) was obtained under the preferred partitioning scheme and substitution model selected using Bayes factors. An extensive survey suggests that discrepancies in mtDNA substitution rates in the entomological literature can be attributed to the use of different substitution models, the use of different mitochondrial gene regions, mixing of intraspecific with interspecific data, and not accounting for variance in coalescent times or postseparation gene flow. Different treatments of these factors in the literature confound estimates of mtDNA substitution rates in opposing directions and obscure lineage-specific differences in rates when comparing data from various sources.

723 citations


Journal ArticleDOI
TL;DR: In POPTREE2 genetic distances, average heterozygosities and G(ST) are computed through a simple and intuitive Windows interface, which will facilitate statistical analyses of polymorphism data for researchers in many different fields.
Abstract: Currently, there is a demand in many fields of research for software with an easily accessible interface to analyze polymorphism data such as microsatellite DNA and single nucleotide polymorphisms. In this article, we announce POPTREE2, a computer program package that can perform evolutionary analyses of allele frequency data. The original version (POPTREE) was a command-line program that runs on the Command Prompt of Windows and Unix. In POPTREE2, genetic distances (measures of the extent of genetic differentiation between populations) for constructing phylogenetic trees, average heterozygosities (H) (a measure of genetic variation within populations), and GST (a measure of genetic differentiation of subdivided populations) are computed through a simple and intuitive Windows interface. It will facilitate statistical analyses of polymorphism data for researchers in many different fields. POPTREE2 is available at http://www.med.kagawa-u.ac.jp/~genomelb/takezaki/poptree2/index.html.

601 citations


Journal ArticleDOI
TL;DR: A Bayesian statistical approach is presented to infer continuous phylogeographic diffusion using random walk models while simultaneously reconstructing the evolutionary history in time from molecular sequence data and demonstrates increased statistical efficiency in spatial reconstructions of overdispersed random walks.
Abstract: Research aimed at understanding the geographic context of evolutionary histories is burgeoning across biological disciplines. Recent endeavors attempt to interpret contemporaneous genetic variation in the light of increasingly detailed geographical and environmental observations. Such interest has promoted the development of phylogeographic inference techniques that explicitly aim to integrate such heterogeneous data. One promising development involves reconstructing phylogeographic history on a continuous landscape. Here, we present a Bayesian statistical approach to infer continuous phylogeographic diffusion using random walk models while simultaneously reconstructing the evolutionary history in time from molecular sequence data. Moreover, by accommodating branch-specific variation in dispersal rates, we relax the most restrictive assumption of the standard Brownian diffusion process and demonstrate increased statistical efficiency in spatial reconstructions of overdispersed random walks by analyzing both simulated and real viral genetic data. We further illustrate how drawing inference about summary statistics from a fully specified stochastic process over both sequence evolution and spatial movement reveals important characteristics of a rabies epidemic. Together with recent advances in discrete phylogeographic inference, the continuous model developments furnish a flexible statistical framework for biogeographical reconstructions that is easily expanded upon to accommodate various landscape genetic features.

594 citations


Journal ArticleDOI
TL;DR: Much of the diversity of plant bHLH proteins was established in early land plants, over 440 million years ago, according to whole-genome sequences from nine species of land plants and algae.
Abstract: Basic helix-loop-helix (bHLH) proteins are a class of transcription factors found throughout eukaryotic organisms. Classification of the complete sets of bHLH proteins in the sequenced genomes of Arabidopsis thaliana and Oryza sativa (rice) has defined the diversity of these proteins among flowering plants. However, the evolutionary relationships of different plant bHLH groups and the diversity of bHLH proteins in more ancestral groups of plants are currently unknown. In this study, we use whole-genome sequences from nine species of land plants and algae to define the relationships between these proteins in plants. We show that few (less than 5) bHLH proteins are encoded in the genomes of chlorophytes and red algae. In contrast, many bHLH proteins (100-170) are encoded in the genomes of land plants (embryophytes). Phylogenetic analyses suggest that plant bHLH proteins are monophyletic and constitute 26 subfamilies. Twenty of these subfamilies existed in the common ancestors of extant mosses and vascular plants, whereas six further subfamilies evolved among the vascular plants. In addition to the conserved bHLH domains, most subfamilies are characterized by the presence of highly conserved short amino acid motifs. We conclude that much of the diversity of plant bHLH proteins was established in early land plants, over 440 million years ago.

457 citations


Journal ArticleDOI
TL;DR: The mitochondrial genomes of Citrullus lanatus and Cucurbita pepo are sequenced--the two smallest characterized cucurbit mitochondrial genomes--and their RNA editing content is determined, and it is found that Cucurbita has a significantly higher synonymous substitution rate (and presumably mutation rate) than Citrullus but comparable levels of RNA editing.
Abstract: The mitochondrial genomes of seed plants are unusually large and vary in size by at least an order of magnitude. Much of this variation occurs within a single family, the Cucurbitaceae, whose genomes range from an estimated 390 to 2,900 kb in size. We sequenced the mitochondrial genomes of Citrullus lanatus (watermelon: 379,236 nt) and Cucurbita pepo (zucchini: 982,833 nt)--the two smallest characterized cucurbit mitochondrial genomes--and determined their RNA editing content. The relatively compact Citrullus mitochondrial genome actually contains more and longer genes and introns, longer segmental duplications, and more discernibly nuclear-derived DNA. The large size of the Cucurbita mitochondrial genome reflects the accumulation of unprecedented amounts of both chloroplast sequences (>113 kb) and short repeated sequences (>370 kb). A low mutation rate has been hypothesized to underlie increases in both genome size and RNA editing frequency in plant mitochondria. However, despite its much larger genome, Cucurbita has a significantly higher synonymous substitution rate (and presumably mutation rate) than Citrullus but comparable levels of RNA editing. The evolution of mutation rate, genome size, and RNA editing are apparently decoupled in Cucurbitaceae, reflecting either simple stochastic variation or governance by different factors.

381 citations


Journal ArticleDOI
TL;DR: It is shown that uncertainties in the guide tree used by progressive alignment methods are a major source of alignment uncertainty, and a novel method for quantifying the robustness of each alignment column to guide tree uncertainty is developed, based on the widely used bootstrap method for perturbing the phylogenetic tree.
Abstract: Multiple sequence alignment (MSA) is the basis for a wide range of comparative sequence analyses from molecular phylogenetics to 3D structure prediction. Sophisticated algorithms have been developed for sequence alignment, but in practice, many errors can be expected and extensive portions of the MSA are unreliable. Hence, it is imperative to understand and characterize the various sources of errors in MSAs and to quantify site-specific alignment confidence. In this paper, we show that uncertainties in the guide tree used by progressive alignment methods are a major source of alignment uncertainty. We use this insight to develop a novel method for quantifying the robustness of each alignment column to guide tree uncertainty. We build on the widely used bootstrap method for perturbing the phylogenetic tree. Specifically, we generate a collection of trees and use each as a guide tree in the alignment algorithm, thus producing a set of MSAs. We next test the consistency of every column of the MSA obtained from the unperturbed guide tree with respect to the set of MSAs. We name this measure the "GUIDe tree based AligNment ConfidencE" (GUIDANCE) score. Using the Benchmark Alignment data BASE benchmark as well as simulation studies, we show that GUIDANCE scores accurately identify errors in MSAs. Additionally, we compare our results with the previously published Heads-or-Tails score and show that the GUIDANCE score is a better predictor of unreliably aligned regions.

328 citations
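
The column-consistency idea in the abstract above can be illustrated with a small sketch. This is not the published GUIDANCE scoring; it is a simplified stand-in that assumes the alternative MSAs (one per perturbed guide tree) have already been produced by some aligner, and it scores each column of the base MSA by the fraction of alternative MSAs containing an identical column. The function names `column_signatures` and `column_consistency` and the toy alignments are hypothetical.

```python
from typing import FrozenSet, List, Tuple

Signature = FrozenSet[Tuple[int, int]]


def column_signatures(msa: List[str]) -> List[Signature]:
    """Describe each column as the set of (sequence index, residue index) pairs it aligns."""
    counters = [0] * len(msa)  # next ungapped residue index for each sequence
    signatures = []
    for col in range(len(msa[0])):
        members = set()
        for seq_idx, row in enumerate(msa):
            if row[col] != "-":
                members.add((seq_idx, counters[seq_idx]))
                counters[seq_idx] += 1
        signatures.append(frozenset(members))
    return signatures


def column_consistency(base_msa: List[str], alternative_msas: List[List[str]]) -> List[float]:
    """Fraction of alternative MSAs that reproduce each column of the base MSA exactly."""
    if not alternative_msas:
        raise ValueError("need at least one alternative MSA")
    alt_sets = [set(column_signatures(m)) for m in alternative_msas]
    return [
        sum(sig in alt for alt in alt_sets) / len(alt_sets)
        for sig in column_signatures(base_msa)
    ]


# Toy data (hypothetical): two sequences, ACGT and ACCGT, aligned two ways.
base = ["AC-GT", "ACCGT"]
alternatives = [["AC-GT", "ACCGT"], ["A-CGT", "ACCGT"]]
scores = column_consistency(base, alternatives)
```

In this toy case the columns whose gap placement differs between the two alternative alignments receive lower scores than the columns on which both agree, which is the qualitative behaviour the abstract describes for unreliable regions.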


Journal ArticleDOI
TL;DR: This study investigates the effect of the ascertainment bias on inferences regarding genetic differentiation among populations in one of the common genome-wide genotyping platforms and presents a correction of the spectrum for the widely used Affymetrix SNP chips.
Abstract: Chip-based high-throughput genotyping has facilitated genome-wide studies of genetic diversity. Many studies have utilized these large data sets to make inferences about the demographic history of human populations using measures of genetic differentiation such as FST or principal component analyses. However, the single nucleotide polymorphism (SNP) chip data suffer from ascertainment biases caused by the SNP discovery process in which a small number of individuals from selected populations are used as discovery panels. In this study, we investigate the effect of the ascertainment bias on inferences regarding genetic differentiation among populations in one of the common genome-wide genotyping platforms. We generate SNP genotyping data for individuals that previously have been subject to partial genome-wide Sanger sequencing and compare inferences based on genotyping data to inferences based on direct sequencing. In addition, we also analyze publicly available genome-wide data. We demonstrate that the ascertainment biases will distort measures of human diversity and possibly change conclusions drawn from these measures in sometimes unexpected ways. We also show that details of the genotyping calling algorithms can have a surprisingly large effect on population genetic inferences. We not only present a correction of the spectrum for the widely used Affymetrix SNP chips but also show that such corrections are difficult to generalize among studies.

323 citations


Journal ArticleDOI
TL;DR: This work recovers monophyletic Porifera as the sister group to all other Metazoa and suggests that the basal position of the fast-evolving Ctenophora proposed by Dunn et al. was due to LBA and that broad taxon sampling is of fundamental importance to metazoan phylogenomic analyses.
Abstract: Despite expanding data sets and advances in phylogenomic methods, deep-level metazoan relationships remain highly controversial. Recent phylogenomic analyses depart from classical concepts in recovering ctenophores as the earliest branching metazoan taxon and propose a sister-group relationship between sponges and cnidarians (e.g., Dunn CW, Hejnol A, Matus DQ, et al. (18 co-authors). 2008. Broad phylogenomic sampling improves resolution of the animal tree of life. Nature 452:745–749). Here, we argue that these results are artifacts stemming from insufficient taxon sampling and long-branch attraction (LBA). By increasing taxon sampling from previously unsampled nonbilaterians and using an identical gene set to that reported by Dunn et al., we recover monophyletic Porifera as the sister group to all other Metazoa. This suggests that the basal position of the fast-evolving Ctenophora proposed by Dunn et al. was due to LBA and that broad taxon sampling is of fundamental importance to metazoan phylogenomic analyses. Additionally, saturation in the Dunn et al. character set is comparatively high, possibly contributing to the poor support for some nonbilaterian nodes.

Journal ArticleDOI
TL;DR: The resulting tree, the largest in number of genera and markers sampled to date and covering the whole family in a representative way, provides important insights into the evolution of the family on a broad scale.
Abstract: Brassicaceae is an important family at both the agronomic and scientific level. The family not only includes several model species, but it is also becoming an evolutionary model at the family level. However, resolving the phylogenetic relationships within the family has been problematic, and a large-scale molecular phylogeny in terms of generic sampling and number of genes is still lacking. In particular, the deeper relationships within the family, for example between the three major recognized lineages, prove particularly hard to resolve. Using a slow-evolving mitochondrial marker (nad4 intron 1), we reconstructed a comprehensive phylogeny in generic representation for the family. In addition, and because resolution was very low in previous single marker phylogenies, we adopted a supermatrix approach by concatenating all checked and reliable sequences available on GenBank as well as new sequences for a total of 207 currently recognized genera and eight molecular markers representing a comprehensive coverage of all three genomes. The supermatrix was dated under an uncorrelated relaxed molecular clock using a direct fossil calibration approach. Finally, a lineage-through-time-plot and rates of diversification for the family were generated. The resulting tree, the largest in number of genera and markers sampled to date and covering the whole family in a representative way, provides important insights into the evolution of the family on a broad scale. The backbone of the tree remained largely unresolved and is interpreted as the consequence of early rapid radiation within the family. The age of the family was inferred to be 37.6 (24.2–49.4) Ma, which largely agrees with previous studies. The ages of all major lineages and tribes are also reported. Analysis of diversification suggests that Brassicaceae underwent a rapid period of diversification, after the split with the early diverging tribe Aethionemeae.
Given the dates found here, the family appears to have originated under a warm and humid climate approximately 37 Ma. We suggest that the rapid radiation detected was caused by a global cooling during the Oligocene coupled with a genome duplication event. This duplication could have allowed the family to rapidly adapt to the changing climate.

Journal ArticleDOI
TL;DR: It is shown that phylogenomic data can substantially advance the understanding of arthropod evolution and resolve several conflicts among existing hypotheses.
Abstract: Arthropods were the first animals to conquer land and air. They encompass more than three quarters of all described living species. This extraordinary evolutionary success is based on an astoundingly wide array of highly adaptive body organizations. A lack of robustly resolved phylogenetic relationships, however, currently impedes the reliable reconstruction of the underlying evolutionary processes. Here, we show that phylogenomic data can substantially advance our understanding of arthropod evolution and resolve several conflicts among existing hypotheses. We assembled a data set of 233 taxa and 775 genes from which an optimally informative data set of 117 taxa and 129 genes was finally selected using new heuristics and compared with the unreduced data set. We included novel expressed sequence tag (EST) data for 11 species and all published phylogenomic data augmented by recently published EST data on taxonomically important arthropod taxa. This thorough sampling reduces the chance of obtaining spurious results due to stochastic effects of undersampling taxa and genes. Orthology prediction of genes, alignment masking tools, and selection of most informative genes due to a balanced taxa–gene ratio using new heuristics were established. Our optimized data set robustly resolves major arthropod relationships. We received strong support for a sister group relationship of onychophorans and euarthropods and strong support for a close association of tardigrades and cycloneuralia. Within pancrustaceans, our analyses yielded paraphyletic crustaceans and monophyletic hexapods and robustly resolved monophyletic endopterygote insects. However, our analyses also showed for few deep splits that were recently thought to be resolved, for example, the position of myriapods, a remarkable sensitivity to methods of analyses.

Journal ArticleDOI
TL;DR: The use of temporally structured sequence data within a Bayesian framework is explored to estimate the evolutionary rates for seven human dsDNA viruses, including variola virus (VARV) (the causative agent of smallpox) and herpes simplex virus-1, revealing that some dsDNA viruses may evolve at rates approaching those of RNA viruses.
Abstract: Double-stranded (ds) DNA viruses are often described as evolving through long-term codivergent associations with their hosts, a pattern that is expected to be associated with low rates of nucleotide substitution. However, the hypothesis of codivergence between dsDNA viruses and their hosts has rarely been rigorously tested, even though the vast majority of nucleotide substitution rate estimates for dsDNA viruses are based upon this assumption. It is therefore important to estimate the evolutionary rates of dsDNA viruses independent of the assumption of host-virus codivergence. Here, we explore the use of temporally structured sequence data within a Bayesian framework to estimate the evolutionary rates for seven human dsDNA viruses, including variola virus (VARV) (the causative agent of smallpox) and herpes simplex virus-1. Our analyses reveal that although the VARV genome is likely to evolve at a rate of approximately 1 × 10(-5) substitutions/site/year and hence approaching that of many RNA viruses, the evolutionary rates of many other dsDNA viruses remain problematic to estimate. Synthetic data sets were constructed to inform our interpretation of the substitution rates estimated for these dsDNA viruses and the analysis of these demonstrated that given a sequence data set of appropriate length and sampling depth, it is possible to use time-structured analyses to estimate the substitution rates of many dsDNA viruses independently from the assumption of host-virus codivergence. Finally, the discovery that some dsDNA viruses may evolve at rates approaching those of RNA viruses has important implications for our understanding of the long-term evolutionary history and emergence potential of this major group of viruses.

Journal ArticleDOI
TL;DR: Estimates suggest that a single novel RV (e.g., a vaccine escape mutant) can spread worldwide in little more than a decade, and re-emphasize the need for thorough and continued RV surveillance in order to detect such potential spreading events at an early stage.
Abstract: Rotaviruses (RVs) are responsible for more than 600,000 child deaths each year. The worldwide introduction of two live oral vaccines, RotaTeq and Rotarix, is believed to reduce this number significantly. Before the licensing of both vaccines, two new genotypes, G9 and G12, emerged in the human population and were able to spread across the entire globe in a very short time span. To quantify the VP7 mutation rates of these G9 and G12 genotypes and to estimate their most recent common ancestors, we used a Bayesian Markov chain Monte Carlo framework. Based on 356 sequences for G9 and 140 sequences for G12, we estimated mutation rates (nt substitutions/site/year) of 1.87 × 10(-3) (1.45-2.27 × 10(-3)) for G9 and 1.66 × 10(-3) (1.13-2.32 × 10(-3)) for G12. For both the G9 and G12 strains, one particular (sub) lineage was able to disseminate and cause disease across the world. The most recent common ancestors of these particular lineages were dated back to 1989 (1986-1992) and 1995 (1992-1998) for the G9 and G12 genotypes, respectively. These estimates suggest that a single novel RV (e.g., a vaccine escape mutant) can spread worldwide in little more than a decade. These results re-emphasize the need for thorough and continued RV surveillance in order to detect such potential spreading events at an early stage.

Journal ArticleDOI
TL;DR: It is found that insertions and deletions do not cause excessive false positives if the alignment is correct, but alignment errors can lead to unacceptably high false positives, and it is important to use reliable alignment methods.
Abstract: The detection of positive Darwinian selection affecting protein-coding genes remains a topic of great interest and importance. The "branch-site" test is designed to detect localized episodic bouts of positive selection that affect only a few amino acid residues on particular lineages and has been shown to have reasonable power and low false-positive rates for a wide range of selection schemes. Previous simulations examining the performance of the test, however, were conducted under idealized conditions without insertions, deletions, or alignment errors. As the test is sometimes used to analyze divergent sequences, the impact of indels and alignment errors is a major concern. Here, we used a recently developed indel-simulation program to examine the false-positive rate and power of the branch-site test. We find that insertions and deletions do not cause excessive false positives if the alignment is correct, but alignment errors can lead to unacceptably high false positives. Of the alignment methods evaluated, PRANK consistently outperformed MUSCLE, MAFFT, and ClustalW, mostly because the latter programs tend to place nonhomologous codons (or amino acids) into the same column, producing shorter and less accurate alignments and giving the false impression that many amino acid substitutions have occurred at those sites. Our examination of two previous studies suggests that alignment errors may impact the analysis of mammalian and vertebrate genes by the branch-site test, and it is important to use reliable alignment methods.

Journal ArticleDOI
TL;DR: This work investigates whether plants show evidence of adaptive evolution using an extension of the McDonald-Kreitman test that explicitly models slightly deleterious mutations by estimating the distribution of fitness effects of new mutations.
Abstract: The relative contribution of advantageous and neutral mutations to the evolutionary process is a central problem in evolutionary biology. Current estimates suggest that whereas Drosophila, mice, and bacteria have undergone extensive adaptive evolution, hominids show little or no evidence of adaptive evolution in protein-coding sequences. This may be a consequence of differences in effective population size. To study the matter further, we have investigated whether plants show evidence of adaptive evolution using an extension of the McDonald-Kreitman test that explicitly models slightly deleterious mutations by estimating the distribution of fitness effects of new mutations. We apply this method to data from nine pairs of species. Altogether more than 2,400 loci with an average length of approximately 280 nucleotides were analyzed. We observe very similar results in all species; we find little evidence of adaptive amino acid substitution in any comparison except sunflowers. This may be because many plant species have modest effective population sizes.
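
As background to the test named in the abstract above, the basic (unextended) McDonald-Kreitman estimate of the proportion of adaptive amino acid substitutions can be sketched as follows. This is only the classic estimator, alpha = 1 - (Ds × Pn) / (Dn × Ps); it does not model slightly deleterious mutations or the distribution of fitness effects, which is what the paper's extension adds. The function name `mk_alpha` and the counts are hypothetical and chosen purely for illustration.

```python
def mk_alpha(dn: int, ds: int, pn: int, ps: int) -> float:
    """Classic McDonald-Kreitman estimate of the fraction of adaptive substitutions.

    dn, ds: nonsynonymous and synonymous fixed differences between species.
    pn, ps: nonsynonymous and synonymous polymorphisms within species.
    """
    if dn == 0 or ps == 0:
        raise ValueError("dn and ps must be nonzero for alpha to be defined")
    return 1.0 - (ds * pn) / (dn * ps)


# Hypothetical counts, not taken from the paper.
alpha = mk_alpha(dn=20, ds=40, pn=10, ps=40)
```

With these illustrative counts the ratio of nonsynonymous to synonymous changes is twice as high for divergence as for polymorphism, so half of the nonsynonymous substitutions are attributed to adaptation under the assumptions of the basic test.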

Journal ArticleDOI
TL;DR: The effects of a number of violations of the "Isolation with Migration" (IM) model, including intralocus recombination, population structure, gene flow from an unsampled species, linkage among loci, and divergent selection, on demographic parameter estimates made using the program IMA are examined.
Abstract: Methods developed over the past decade have made it possible to estimate molecular demographic parameters such as effective population size, divergence time, and gene flow with unprecedented accuracy and precision. However, they make simplifying assumptions about certain aspects of the species’ histories and the nature of the genetic data, and it is not clear how robust they are to violations of these assumptions. Here, we use simulated data sets to examine the effects of a number of violations of the “Isolation with Migration” (IM) model, including intralocus recombination, population structure, gene flow from an unsampled species, linkage among loci, and divergent selection, on demographic parameter estimates made using the program IMA. We also examine the effect of having data that fit a nucleotide substitution model other than the two relatively simple models available in IMA. We find that IMA estimates are generally quite robust to small to moderate violations of the IM model assumptions, comparable with what is often encountered in real-world scenarios. In particular, population structure within species, a condition encountered to some degree in virtually all species, has little effect on parameter estimates even for fairly high levels of structure. Likewise, most parameter estimates are robust to significant levels of recombination when data sets are pared down to apparently nonrecombining blocks, although substantial bias is introduced to several estimates when the entire data set with recombination is included. In contrast, a poor fit to the nucleotide substitution model can result in an increased error rate, in some cases due to a predictable bias and in other cases due to an increase in variance in parameter estimates among data sets simulated under the same conditions.

Journal ArticleDOI
TL;DR: A phylogenomic falsification of the chromalveolate hypothesis that estimates signal strength across the three genomic compartments is devised, and the hypothesis is rejected as falsified in favor of more complex evolutionary scenarios involving multiple higher order eukaryote-eukaryote endosymbioses.
Abstract: According to the chromalveolate hypothesis (Cavalier-Smith T. 1999. Principles of protein and lipid targeting in secondary symbiogenesis: euglenoid, dinoflagellate, and sporozoan plastid origins and the eukaryote family tree. J Eukaryot Microbiol 46:347-366), the four eukaryotic groups with chlorophyll c-containing plastids originate from a single photosynthetic ancestor, which acquired its plastids by secondary endosymbiosis with a red alga. So far, molecular phylogenies have failed to either support or disprove this view. Here, we devise a phylogenomic falsification of the chromalveolate hypothesis that estimates signal strength across the three genomic compartments: If the four chlorophyll c-containing lineages indeed derive from a single photosynthetic ancestor, then similar amounts of plastid, mitochondrial, and nuclear sequences should allow to recover their monophyly. Our results refute this prediction, with statistical support levels too different to be explained by evolutionary rate variation, phylogenetic artifacts, or endosymbiotic gene transfer. Therefore, we reject the chromalveolate hypothesis as falsified in favor of more complex evolutionary scenarios involving multiple higher order eukaryote-eukaryote endosymbioses.

Journal ArticleDOI
TL;DR: Large-scale differences in genes expressed in nacre-forming cells of Pinctada and Haliotis are compatible with the hypothesis that gastropod and bivalve nacre is the result of convergent evolution.
Abstract: The capacity to biomineralize is closely linked to the rapid expansion of animal life during the early Cambrian, with many skeletonized phyla first appearing in the fossil record at this time. The appearance of disparate molluscan forms during this period leaves open the possibility that shells evolved independently and in parallel in at least some groups. To test this proposition and gain insight into the evolution of structural genes that contribute to shell fabrication, we compared genes expressed in nacre (mother-of-pearl) forming cells in the mantle of the bivalve Pinctada maxima and the gastropod Haliotis asinina. Despite both species having highly lustrous nacre, we find extensive differences in these expressed gene sets. Following the removal of housekeeping genes, less than 10% of all gene clusters are shared between these molluscs, with some being conserved biomineralization genes that are also found in deuterostomes. These differences extend to secreted proteins that may localize to the organic shell matrix, with less than 15% of this secretome being shared. Despite these differences, H. asinina and P. maxima both secrete proteins with repetitive low-complexity domains (RLCDs). Pinctada maxima RLCD proteins, for example the shematrins, are predominated by silk/fibroin-like domains, which are absent from the H. asinina data set. Comparisons of shematrin genes across three species of Pinctada indicate that this gene family has undergone extensive divergent evolution within pearl oysters. We also detect fundamental bivalve-gastropod differences in extracellular matrix proteins involved in mollusc-shell formation. Pinctada maxima expresses a chitin synthase at high levels and several chitin deacetylation genes, whereas only one protein involved in chitin interactions is present in the H. asinina data set, suggesting that the organic matrix on which calcification proceeds differs fundamentally between these species. Large-scale differences in genes expressed in nacre-forming cells of Pinctada and Haliotis are compatible with the hypothesis that gastropod and bivalve nacre is the result of convergent evolution. The expression of novel biomineralizing RLCD proteins in each of these two molluscs and, interestingly, sea urchins suggests that the evolution of such structural proteins has occurred independently multiple times in the Metazoa.

Journal ArticleDOI
Jody Hey
TL;DR: The divergence of bonobos and three subspecies of the common chimpanzee was examined under a multipopulation isolation-with-migration (IM) model with data from 73 loci drawn from the literature, and gene flow was indicated between the western common chimpanzee subspecies and the ancestor of the central and eastern common chimpanzee subspecies.
Abstract: The divergence of bonobos and three subspecies of the common chimpanzee was examined under a multipopulation isolation-with-migration (IM) model with data from 73 loci drawn from the literature. A benefit of having a full multipopulation model, relative to conducting multiple pairwise analyses between sampled populations, is that a full model can reveal historical gene flow involving ancestral populations. An example of this was found in which gene flow is indicated between the western common chimpanzee subspecies and the ancestor of the central and the eastern common chimpanzee subspecies. The results of a full analysis on all four populations are strongly consistent with analyses on pairs of populations and generally similar to results from previous studies. The basal split between bonobos and common chimpanzees was estimated at 0.93 Ma (0.68–1.54 Ma, 95% highest posterior density interval), with the split among the ancestor of three common chimpanzee populations at 0.46 Ma (0.35–0.65), and the most recent split between central and eastern common chimpanzee populations at 0.093 Ma (0.041–0.157). Population size estimates mostly fell in the range from 5,000 to 10,000 individuals. The exceptions are the size of the ancestor of the common chimpanzee and the bonobo, at 17,000 (8,000–28,000) individuals, and the central common chimpanzee and its immediate ancestor with the eastern common chimpanzee, which have effective size estimates at 27,000 (16,000–44,000) and 32,000 (19,000–54,000) individuals, respectively.

Journal ArticleDOI
TL;DR: Significant evidence is found that rates of molecular evolution are correlated with generation time in invertebrates and that this effect applies consistently across genes and taxonomic groups.
Abstract: The rate of genome evolution varies significantly between species. Evidence is growing that at least some of this variation is associated with species characteristics, such as body size, diversification rate, or population size. One of the strongest correlates of the rate of molecular evolution in vertebrates is generation time (GT): Species with faster generation turnover tend to have higher rates of molecular evolution, presumably because their genomes are copied more frequently and therefore collect more DNA replication errors per unit time. But the GT effect has never been tested for nonvertebrate animals. Here, we present the first general test of the GT effect in invertebrates, using 15 genes from 143 species spread across the major eumetazoan superphyla (including arthropods, nematodes, molluscs, annelids, platyhelminthes, cnidarians, echinoderms, and urochordates). We find significant evidence that rates of molecular evolution are correlated with GT in invertebrates and that this effect applies consistently across genes and taxonomic groups. Furthermore, the GT effect is evident in nonsynonymous substitutions, whereas theory predicts (and most previous evidence has supported) a relationship only in synonymous changes. We discuss both the practical and theoretical implications of these findings.

Journal ArticleDOI
TL;DR: The timetree derived from a relaxed molecular clock Bayesian method suggests that the holocephalans originated in the Silurian about 420 Ma, survived the end-Permian mass extinction, and underwent familial diversifications during the late Jurassic to early Cretaceous (170-120 Ma).
Abstract: With our increasing ability for generating whole-genome sequences, comparative analysis of whole genomes has become a powerful tool for understanding the structure, function, and evolutionary history of human and other vertebrate genomes. By virtue of their position basal to bony vertebrates, cartilaginous fishes (class Chondrichthyes) are a valuable outgroup in comparative studies of vertebrates. Recently, a holocephalan cartilaginous fish, the elephant shark, Callorhinchus milii (Subclass Holocephali: Order Chimaeriformes), has been proposed as a model genome, and low-coverage sequence of its genome has been generated. Despite such an increasing interest, the evolutionary history of the modern holocephalans (a previously successful and diverse group but represented by only 39 extant species) and their relationship with elasmobranchs and other jawed vertebrates has been poorly documented largely owing to a lack of well-preserved fossil materials after the end-Permian about 250 Ma. In this study, we assembled the whole mitogenome sequences for eight representatives from all the three families of the modern holocephalans and investigated their phylogenetic relationships and evolutionary history. Unambiguously aligned sequences from these holocephalans together with 17 other vertebrates (9,409 nt positions excluding entire third codon positions) were subjected to partitioned maximum likelihood analysis. The resulting tree strongly supported a single origin of the modern holocephalans and their sister-group relationship with elasmobranchs. The mitogenomic tree recovered the most basal callorhinchids within the chimaeriforms, which is sister to a clade comprising the remaining two families (rhinochimaerids and chimaerids). The timetree derived from a relaxed molecular clock Bayesian method suggests that the holocephalans originated in the Silurian about 420 Ma, having survived from the end-Permian (250 Ma) mass extinction and undergoing familial diversifications during the late Jurassic to early Cretaceous (170-120 Ma). This postulated evolutionary scenario agrees well with that based on the paleontological observations.

Journal ArticleDOI
TL;DR: Evidence from 844 zebu mitochondrial DNA sequences surveyed from 19 Asiatic countries comprising 8 regional groups identifies 2 distinct mitochondrial haplogroups, termed I1 and I2, and supports the Indus Valley as the most likely center of origin for the I1 haplogroup and a primary center of zebu domestication.
Abstract: Animal domestication was a major step forward in human prehistory, contributing to the emergence of more complex societies. At the time of the Neolithic transition, zebu cattle (Bos indicus) were probably the most abundant and important domestic livestock species in Southern Asia. Although archaeological evidence points toward the domestication of zebu cattle within the Indian subcontinent, the exact geographic origins and phylogenetic history of zebu cattle remain uncertain. Here, we report evidence from 844 zebu mitochondrial DNA (mtDNA) sequences surveyed from 19 Asiatic countries comprising 8 regional groups, which identify 2 distinct mitochondrial haplogroups, termed I1 and I2. The marked increase in nucleotide diversity (P < 0.001) for both the I1 and I2 haplogroups within the northern part of the Indian subcontinent is consistent with an origin for all domestic zebu in this area. For haplogroup I1, genetic diversity was highest within the Indus Valley among the three hypothesized domestication centers (Indus Valley, Ganges, and South India). These data support the Indus Valley as the most likely center of origin for the I1 haplogroup and a primary center of zebu domestication. However, for the I2 haplogroup, a complex pattern of diversity is detected, preventing the unambiguous pinpointing of the exact place of origin for this zebu maternal lineage. Our findings are discussed with respect to the archaeological record for zebu domestication within the Indian subcontinent.

Journal ArticleDOI
TL;DR: It is implied that both positive and purifying selection are more effective in C. grandiflora than in A. thaliana, consistent with the contrasting demographic history and effective population sizes of these species.
Abstract: Recent studies comparing genome-wide polymorphism and divergence in Drosophila have found evidence for a surprisingly high proportion of adaptive amino acid fixations, but results for other taxa are mixed. In particular, few studies have found convincing evidence for adaptive amino acid substitution in plants. To assess the generality of this finding, we have sequenced 257 loci in the outcrossing crucifer Capsella grandiflora, which has a large effective population size and low population structure. Using a new method that jointly infers selective and demographic effects, we estimate that 40% of amino acid substitutions were fixed by positive selection in this species, and we also infer a low proportion of slightly deleterious amino acid mutations. We contrast these estimates with those for a similar data set from the closely related Arabidopsis thaliana and find significantly higher rates of adaptive evolution and fewer nearly neutral mutations in C. grandiflora. In agreement with results for other taxa, genes involved in reproduction show the strongest evidence for positive selection in C. grandiflora. Taken together, these results imply that both positive and purifying selection are more effective in C. grandiflora than in A. thaliana, consistent with the contrasting demographic history and effective population sizes of these species.

Journal ArticleDOI
TL;DR: Adaptive evolution at TLRs does not appear to reflect a constant turnover of alleles and instead might be more episodic in nature, which is consistent with more ephemeral pathogen-host associations rather than with long-term coevolution.
Abstract: Frequent positive selection is a hallmark of genes involved in the adaptive immune system of vertebrates, but the incidence of positive selection for genes underlying innate immunity in vertebrates has not been well studied. The toll-like receptors (TLRs) of the innate immune system represent the first line of defense against pathogens. TLRs lie directly at the host‐environment interface, and they target microbial molecules. Because of this, they might be subject to frequent positive selection due to coevolutionary dynamics with their microbial counterparts. However, they also recognize conserved molecular motifs, and this might constrain their evolution. Here, we investigate the evolution of the ten human TLRs in the framework of these competing ideas. We studied rates of protein evolution among primate species and we analyzed patterns of polymorphism in humans and chimpanzees. This provides a window into TLR evolution at both long and short timescales. We found a clear signature of positive selection in the rates of substitution across primates in most TLRs. Some of the implicated sites fall in structurally important protein domains, involve radical amino acid changes, or overlap with polymorphisms with known clinical associations in humans. However, within species, patterns of nucleotide variation were generally compatible with purifying selection, and these patterns differed between humans and chimpanzees and between viral and nonviral TLRs. Thus, adaptive evolution at TLRs does not appear to reflect a constant turnover of alleles and instead might be more episodic in nature. This pattern is consistent with more ephemeral pathogen‐host associations rather than with long-term coevolution.

Journal ArticleDOI
TL;DR: This paper studied natural epigenetic variation in three allotetraploid sibling orchid species (Dactylorhiza majalis s.str, D. traunsteineri s.l., and D. ebudensis) that differ radically in geography/ecology.
Abstract: Epigenetic information includes heritable signals that modulate gene expression but are not encoded in the primary nucleotide sequence. We have studied natural epigenetic variation in three allotetraploid sibling orchid species (Dactylorhiza majalis s.str, D. traunsteineri s.l., and D. ebudensis) that differ radically in geography/ecology. The epigenetic variation released by genome doubling has been restructured in species-specific patterns that reflect their recent evolutionary history and have an impact on their ecology and evolution, hundreds of generations after their formation. Using two contrasting approaches that yielded largely congruent results, epigenome scans pinpointed epiloci under divergent selection that correlate with eco-environmental variables, mainly related to water availability and temperature. The stable epigenetic divergence in this group is largely responsible for persistent ecological differences, which then set the stage for species-specific genetic patterns to accumulate in response to further selection and/or drift. Our results strongly suggest a need to expand our current evolutionary framework to encompass a complementary epigenetic dimension when seeking to understand population processes that drive phenotypic evolution and adaptation.

Journal ArticleDOI
TL;DR: It is shown that incorporating phylogenetic uncertainty by integrating over topologies very rarely changes the inferred ancestral state and does not improve the accuracy of the reconstructed ancestral sequence, suggesting that ML can produce accurate ASRs, even in the face of phylogenetic uncertainty.
Abstract: Ancestral sequence reconstruction (ASR) is widely used to formulate and test hypotheses about the sequences, functions, and structures of ancient genes. Ancestral sequences are usually inferred from an alignment of extant sequences using a maximum likelihood (ML) phylogenetic algorithm, which calculates the most likely ancestral sequence assuming a probabilistic model of sequence evolution and a specific phylogeny—typically the tree with the ML. The true phylogeny is seldom known with certainty, however. ML methods ignore this uncertainty, whereas Bayesian methods incorporate it by integrating the likelihood of each ancestral state over a distribution of possible trees. It is not known whether Bayesian approaches to phylogenetic uncertainty improve the accuracy of inferred ancestral sequences. Here, we use simulation-based experiments under both simplified and empirically derived conditions to compare the accuracy of ASR carried out using ML and Bayesian approaches. We show that incorporating phylogenetic uncertainty by integrating over topologies very rarely changes the inferred ancestral state and does not improve the accuracy of the reconstructed ancestral sequence. Ancestral state reconstructions are robust to uncertainty about the underlying tree because the conditions that produce phylogenetic uncertainty also make the ancestral state identical across plausible trees; conversely, the conditions under which different phylogenies yield different inferred ancestral states produce little or no ambiguity about the true phylogeny. Our results suggest that ML can produce accurate ASRs, even in the face of phylogenetic uncertainty. Using Bayesian integration to incorporate this uncertainty is neither necessary nor beneficial.

Journal ArticleDOI
TL;DR: It is found that humans retain larger numbers of ancestral OR genes that were in the common ancestor of NWMs/OWMs/hominoids than orangutans and macaques and that the OR gene repertoire in humans is more similar to that of marmosets than those of orangutans and macaques.
Abstract: Odor molecules in the environment are detected by olfactory receptors (ORs), which are encoded by a large multigene family in mammalian genomes. It is generally thought that primates are vision oriented and weakly dependent on olfaction. Previous studies suggested that Old World monkeys (OWMs) and hominoids lost many functional OR genes after the divergence from New World monkeys (NWMs) due to the acquisition of well-developed trichromatic vision. To examine this hypothesis, here we analyzed OR gene repertoires of five primate species including NWMs, OWMs, and hominoids for which high-coverage genome sequences are available, together with two prosimians and tree shrews with low-coverage genomes. The results showed no significant differences in the number of functional OR genes between NWMs (marmosets) and OWMs/hominoids. Two independent analyses, identification of orthologous genes among the five primates and estimation of the numbers of ancestral genes by the reconciled tree method, did not support a sudden loss of OR genes at the branch of the OWMs/hominoids ancestor but suggested a gradual loss in every lineage. Moreover, we found that humans retain larger numbers of ancestral OR genes that were in the common ancestor of NWMs/OWMs/hominoids than orangutans and macaques and that the OR gene repertoire in humans is more similar to that of marmosets than those of orangutans and macaques. These results suggest that the degeneration of OR genes in primates cannot simply be explained by the acquisition of trichromatic vision, and our sense of smell may not be inferior to that of other primate species.

Journal ArticleDOI
TL;DR: The highest correlations were observed between the amino acid frequencies in disordered proteins and the solvent-exposed loops and turns of ordered proteins, supporting an emerging structural model for disordered proteins.
Abstract: Most models of protein evolution are based upon proteins that form relatively rigid 3D structures. A significant fraction of proteins, the so-called disordered proteins, do not form rigid 3D structures and sample a broad conformational ensemble. Disordered proteins do not typically maintain long-range interactions, so the constraints on their evolution should be different than ordered proteins. To test this hypothesis, we developed and compared models of evolution for disordered and ordered proteins. Substitution matrices were constructed using the sequences of putative homologs for sets of experimentally characterized disordered and ordered proteins. Separate matrices, at three levels of sequence similarity (>85%, 85–60%, and 60–40%), were inferred for each type of protein structure. The substitution matrices for disordered and ordered proteins differed significantly at each level of sequence similarity. The disordered matrices reflected a greater likelihood of evolutionary changes, relative to the ordered matrices, and these changes involved nonconservative substitutions. Glutamic acid and asparagine were interesting exceptions to this result. Important differences between the substitutions that are accepted in disordered proteins relative to ordered proteins were also identified. In general, disordered proteins have fewer evolutionary constraints than ordered proteins. However, some residues like tryptophan and tyrosine are highly conserved in disordered proteins. This is due to their important role in forming protein–protein interfaces. Finally, the amino acid frequencies for disordered proteins, computed during the development of the matrices, were compared with amino acid frequencies for different categories of secondary structure in ordered proteins. The highest correlations were observed between the amino acid frequencies in disordered proteins and the solvent-exposed loops and turns of ordered proteins, supporting an emerging structural model for disordered proteins.