
Showing papers in "Molecular Biology and Evolution in 2011"


Journal ArticleDOI
TL;DR: The newest addition in MEGA5 is a collection of maximum likelihood (ML) analyses for inferring evolutionary trees, selecting best-fit substitution models, inferring ancestral states and sequences, and estimating evolutionary rates site-by-site.
Abstract: Comparative analysis of molecular sequence data is essential for reconstructing the evolutionary histories of species and inferring the nature and extent of selective forces shaping the evolution of genes and species. Here, we announce the release of Molecular Evolutionary Genetics Analysis version 5 (MEGA5), which is a user-friendly software for mining online databases, building sequence alignments and phylogenetic trees, and using methods of evolutionary bioinformatics in basic biology, biomedicine, and evolution. The newest addition in MEGA5 is a collection of maximum likelihood (ML) analyses for inferring evolutionary trees, selecting best-fit substitution models (nucleotide or amino acid), inferring ancestral states and sequences (along with probabilities), and estimating evolutionary rates site-by-site. In computer simulation analyses, ML tree inference algorithms in MEGA5 compared favorably with other software packages in terms of computational efficiency and the accuracy of the estimates of phylogenetic trees, substitution parameters, and rate variation among sites. The MEGA user interface has now been enhanced to be activity driven to make it easier for the use of both beginners and experienced scientists. This version of MEGA is intended for the Windows platform, and it has been configured for effective use on Mac OS X and Linux desktops. It is available free of charge from http://www.megasoftware.net.

39,110 citations


Journal ArticleDOI
TL;DR: This paper presents a test for ancient admixture that exploits the asymmetry in the frequencies of the two nonconcordant gene trees in a three-population tree, and derives the analytic expectation of a test statistic, called the D statistic, which is sensitive to asymmetry under alternative demographic scenarios.
Abstract: One enduring question in evolutionary biology is the extent of archaic admixture in the genomes of present-day populations. In this paper, we present a test for ancient admixture that exploits the asymmetry in the frequencies of the two nonconcordant gene trees in a three-population tree. This test was first applied to detect interbreeding between Neandertals and modern humans. We derive the analytic expectation of a test statistic, called the D statistic, which is sensitive to asymmetry under alternative demographic scenarios. We show that the D statistic is insensitive to some demographic assumptions such as ancestral population sizes and requires only the assumption that the ancestral populations were randomly mating. An important aspect of D statistics is that they can be used to detect archaic admixture even when no archaic sample is available. We explore the effect of sequencing error on the false-positive rate of the test for admixture, and we show how to estimate the proportion of archaic ancestry in the genomes of present-day populations. We also investigate a model of subdivision in ancestral populations that can result in D statistics that indicate recent admixture.

1,047 citations
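
The D statistic described in the abstract above can be illustrated with a short sketch. It assumes the commonly used form D = (nABBA − nBABA) / (nABBA + nBABA), computed over four aligned haploid sequences in which an outgroup defines the ancestral allele. The function names, the toy sequences, and the use of a single sequence per population are illustrative assumptions, not details taken from the paper.

```python
def count_patterns(p1, p2, p3, outgroup):
    """Count ABBA and BABA sites in four aligned haploid sequences.

    Assumption: the outgroup carries the ancestral allele (A) and
    P3 is the candidate source of admixture into P1 or P2.
    """
    n_abba = n_baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        if c == o:
            continue  # P3 is ancestral, so the site is uninformative
        if len({a, b, c, o}) != 2:
            continue  # keep biallelic sites only
        if a == o and b == c:
            n_abba += 1  # P2 shares the derived allele with P3
        elif b == o and a == c:
            n_baba += 1  # P1 shares the derived allele with P3
    return n_abba, n_baba


def d_statistic(n_abba, n_baba):
    """Normalized excess of ABBA over BABA sites."""
    total = n_abba + n_baba
    if total == 0:
        raise ValueError("no ABBA or BABA sites to compare")
    return (n_abba - n_baba) / total


# Toy alignment: two ABBA sites, one BABA site, one invariant site.
abba, baba = count_patterns("AGAA", "GAGA", "GGGA", "AAAA")
print(abba, baba, d_statistic(abba, baba))
```

With no admixture the two nonconcordant patterns are expected to be equally frequent, so D should be near zero; in practice significance is usually judged with a resampling scheme such as a block jackknife, which this sketch omits.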


Journal ArticleDOI
TL;DR: Felsenstein's pruning algorithm is extended to allow efficient likelihood computations for models in which variation over branches (and not just sites) is described in the random effects likelihood framework, and this model treats the selective class of every branch at a particular site as an unobserved state that is chosen independently of that at any other branch.
Abstract: Adaptive evolution frequently occurs in episodic bursts, localized to a few sites in a gene, and to a small number of lineages in a phylogenetic tree. A popular class of “branch-site” evolutionary models provides a statistical framework to search for evidence of such episodic selection. For computational tractability, current branch-site models unrealistically assume that all branches in the tree can be partitioned a priori into two rigid classes—“foreground” branches that are allowed to undergo diversifying selective bursts and “background” branches that are negatively selected or neutral. We demonstrate that this assumption leads to unacceptably high rates of false positives or false negatives when the evolutionary process along background branches strongly deviates from modeling assumptions. To address this problem, we extend Felsenstein's pruning algorithm to allow efficient likelihood computations for models in which variation over branches (and not just sites) is described in the random effects likelihood framework. This enables us to model the process at every branch-site combination as a mixture of three Markov substitution models—our model treats the selective class of every branch at a particular site as an unobserved state that is chosen independently of that at any other branch. When benchmarked on a previously published set of simulated sequences, our method consistently matched or outperformed existing branch-site tests in terms of power and error rates. Using three empirical data sets, previously analyzed for episodic selection, we discuss how modeling assumptions can influence inference in practical situations.

420 citations


Journal ArticleDOI
TL;DR: Analysis of genome-wide sequence variations in Tibetans indicates strong signals of selective sweep in two hypoxia-related genes, EPAS1 and EGLN1, and suggests that during the long-term occupation of high-altitude areas, the functional sequence variations for acquiring biological adaptation to high-altitude hypoxia have been enriched in Tibetan populations.
Abstract: Modern humans have occupied almost all possible environments globally since exiting Africa about 100,000 years ago. Both behavioral and biological adaptations have contributed to their success in surviving the rigors of climatic extremes, including cold, strong ultraviolet radiation, and high altitude. Among these environmental stresses, high-altitude hypoxia is the only condition in which traditional technology is incapable of mediating its effects. Inhabiting a plateau more than 3,000 m high, the Tibetan population provides a widely studied example of high-altitude adaptation. Yet, the genetic mechanisms underpinning long-term survival in this environmental extreme remain unknown. We performed an analysis of genome-wide sequence variations in Tibetans. In combination with the reported data, we identified strong signals of selective sweep in two hypoxia-related genes, EPAS1 and EGLN1. For these two genes, Tibetans show unusually high divergence from the non-Tibetan lowlanders (Han Chinese and Japanese) and possess high frequencies of many linked sequence variations as reflected by the Tibetan-specific haplotypes. Further analysis in seven Tibetan populations (1,334 individuals) indicates the prevalence of selective sweep across the Himalayan region. The observed indicators of natural selection on EPAS1 and EGLN1 suggest that during the long-term occupation of high-altitude areas, the functional sequence variations for acquiring biological adaptation to high-altitude hypoxia have been enriched in Tibetan populations.

337 citations


Journal ArticleDOI
TL;DR: Geraniaceae plastid genomes were sequenced and compared with other rosids and the previously published Pelargonium hortorum plastome to propose that increases in genomic rearrangements, repetitive DNA, nucleotide substitutions, and GC content may be caused by relaxed selection resulting from improper DNA repair.
Abstract: Geraniaceae plastid genomes (plastomes) have experienced a remarkable number of genomic changes. The plastomes of Erodium texanum, Geranium palmatum, and Monsonia speciosa were sequenced and compared with other rosids and the previously published Pelargonium hortorum plastome. Geraniaceae plastomes were found to be highly variable in size, gene content and order, repetitive DNA, and codon usage. Several unique plastome rearrangements include the disruption of two highly conserved operons (S10 and rps2-atpA), and the inverted repeat (IR) region in M. speciosa does not contain all genes in the ribosomal RNA operon. The sequence of M. speciosa is unusually small (128,787 bp); among angiosperm plastomes sequenced to date, only those of nonphotosynthetic species and those that have lost one IR copy are smaller. In contrast, the plastome of P. hortorum is the largest, at 217,942 bp. These genomes have experienced numerous gene and intron losses and partial and complete gene duplications. Some of the losses are shared throughout the family (e.g., trnT-GGU and the introns of rps16 and rpl16); however, other losses are homoplasious (e.g., trnG-UCC intron in G. palmatum and M. speciosa). IR length is also highly variable. The IR in P. hortorum was previously shown to be greatly expanded to 76 kb, and the IR is lost in E. texanum and reduced in G. palmatum (11 kb) and M. speciosa (7 kb). Geraniaceae plastomes contain a high frequency of large repeats (>100 bp) relative to other rosids. Within each plastome, repeats are often located at rearrangement end points and many repeats shared among the four Geraniaceae flank rearrangement end points. GC content is elevated in the genomes and also in coding regions relative to other rosids. Codon usage per amino acid and GC content at third position sites are significantly different for Geraniaceae protein-coding sequences relative to other rosids. Our findings suggest that relaxed selection and/or mutational biases lead to increased GC content, and this in turn altered codon usage. We propose that increases in genomic rearrangements, repetitive DNA, nucleotide substitutions, and GC content may be caused by relaxed selection resulting from improper DNA repair.

310 citations


Journal ArticleDOI
TL;DR: The null distribution of the branch-site test is examined and the results suggest that the asymptotic theory is reliable for typical data sets, and indeed in the authors' simulations, the large-sample null distribution was reliable with as few as 20-50 codons in the alignment.
Abstract: The branch-site test is a likelihood ratio test to detect positive selection along prespecified lineages on a phylogeny that affects only a subset of codons in a protein-coding gene, with positive selection indicated by accelerated nonsynonymous substitutions (with ω = d(N)/d(S) > 1). This test may have more power than earlier methods, which average nucleotide substitution rates over sites in the protein and/or over branches on the tree. However, a few recent studies questioned the statistical basis of the test and claimed that the test generated too many false positives. In this paper, we examine the null distribution of the test and conduct a computer simulation to examine the false-positive rate and the power of the test. The results suggest that the asymptotic theory is reliable for typical data sets, and indeed in our simulations, the large-sample null distribution was reliable with as few as 20-50 codons in the alignment. We examined the impact of sequence length, the strength of positive selection, and the proportion of sites under positive selection on the power of the branch-site test. We found that the test was far more powerful in detecting episodic positive selection than branch-based tests, which average substitution rates over all codons in the gene and thus miss the signal when most codons are under strong selective constraint. Recent claims of statistical problems with the branch-site test are due to misinterpretations of simulation results. Our results, as well as previous simulation studies that have demonstrated the robustness of the test, suggest that the branch-site test may be a useful tool for detecting episodic positive selection and for generating biological hypotheses for mutation studies and functional analyses. The test is sensitive to sequence and alignment errors and caution should be exercised concerning its use when data quality is in doubt.

302 citations
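
The likelihood ratio test mentioned in the abstract above can be sketched as follows. The abstract does not state the form of the null distribution; this sketch assumes the 50:50 mixture of a point mass at zero and a chi-square distribution with one degree of freedom that is usually quoted for the branch-site test. The function name and the log-likelihood values are hypothetical.

```python
import math


def branch_site_lrt_pvalue(lnl_null, lnl_alt):
    """P-value for a likelihood ratio test under an assumed 50:50
    mixture of a point mass at 0 and chi-square with 1 d.f."""
    stat = 2.0 * (lnl_alt - lnl_null)
    if stat <= 0.0:
        return 1.0  # alternative fits no better than the null
    # Survival function of chi-square with 1 d.f.: P(Z^2 > x) = erfc(sqrt(x / 2))
    chi2_tail = math.erfc(math.sqrt(stat / 2.0))
    return 0.5 * chi2_tail


# Hypothetical log-likelihoods from fitting the null and alternative models.
print(branch_site_lrt_pvalue(-1000.0, -998.645))  # statistic of about 2.71
```

Comparing the statistic against a plain chi-square with one degree of freedom instead (that is, dropping the factor 0.5) doubles the p-value and is therefore the more conservative choice.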


Journal ArticleDOI
TL;DR: The MVA pathway is likely an ancestral metabolic route in all the three domains of life, and hence, it was probably present in the last common ancestor of all organisms (the cenancestor), which opens the possibility that the cenancestor had membranes containing isoprenoids.
Abstract: Isoprenoids are a very diverse family of organic compounds widespread in the three domains of life. Although they are produced from the condensation of the same precursors in all organisms (isopentenyl pyrophosphate and dimethylallyl diphosphate), the evolutionary origin of their biosynthesis remains controversial. Two independent nonhomologous metabolic pathways are known: the mevalonate (MVA) pathway in eukaryotes and archaea and the methylerythritol phosphate (MEP) pathway in bacteria and several photosynthetic eukaryotes. The MVA pathway is also found in a few bacteria, which previous work has explained by recent acquisition through horizontal gene transfer (HGT) from eukaryotic or archaeal donors. To reconsider the question of the evolutionary origin of the MVA pathway, we have studied the origin and the evolution of the enzymes of this pathway using phylogenomic analyses upon a taxon-rich sequence database. On the one hand, our results confirm the conservation in archaea of an MVA pathway partially different from eukaryotes. This implies that each domain of life possesses a characteristic major isoprenoid biosynthesis pathway: the classical MVA pathway in eukaryotes, a modified MVA pathway in archaea, and the MEP pathway in bacteria. On the other hand, despite the identification of several HGT events, our analyses support that the MVA pathway was ancestral not only in archaea and eukaryotes but also in bacteria, in contradiction with previous claims that the presence of this pathway in bacteria was due to HGT from other domains. Therefore, the MVA pathway is likely an ancestral metabolic route in all the three domains of life, and hence, it was probably present in the last common ancestor of all organisms (the cenancestor). These findings open the possibility that the cenancestor had membranes containing isoprenoids.

279 citations


Journal ArticleDOI
TL;DR: In this article, the authors conducted a genome-wide study of 1,000,000 genetic variants in 46 Tibetans and 92 Han Chinese (HAN) for identifying the signals of high-altitude adaptations (HAAs) in Tibetan genomes.
Abstract: Genetic studies of Tibetans, an ethnic group with a long-lasting presence on the Tibetan Plateau, which is known as the highest plateau in the world, may offer a unique opportunity to understand the biological adaptations of human beings to high-altitude environments. We conducted a genome-wide study of 1,000,000 genetic variants in 46 Tibetans (TBN) and 92 Han Chinese (HAN) for identifying the signals of high-altitude adaptations (HAAs) in Tibetan genomes. We discovered the most differentiated variants between TBN and HAN at chromosome 1q42.2 and 2p21. EGLN1 (or HIFPH2, MIM 606425) and EPAS1 (or HIF2A, MIM 603349), both related to hypoxia-inducible factor, were found most differentiated in the two regions, respectively. Strong positive correlations were also observed between the frequency of TBN-dominant haplotypes in the two gene regions and altitude in East Asian populations. Linkage disequilibrium and further haplotype network analyses of world-wide populations suggested the antiquity of the TBN-dominant haplotypes and long-term persistence of the natural selection. Finally, a "dominant haplotype carrier" hypothesis could describe the role of the two genes in HAA. All of our population genomic and statistical analyses indicate that EPAS1 and EGLN1 are most likely responsible for HAA of Tibetans. Interestingly, three recent studies each identified one, but not both, of the two genes. We reanalyzed the available data and found that the missed top signal (EPAS1) could be recaptured with data quality control and our approaches. Based on this experience, we call for more attention to be paid to controlling data quality and batch effects introduced in public data integration. Our results also suggest limitations of the extended haplotype homozygosity-based method due to its compromised power when natural selection began a long time ago, particularly in genomic regions with recombination hotspots.

277 citations


Journal ArticleDOI
TL;DR: The use of Taylor expansion to approximate the likelihood during Markov chain Monte Carlo iteration is explored, and the results suggest that the approximate method may be useful for Bayesian dating analysis using large data sets.
Abstract: The molecular clock provides a powerful way to estimate species divergence times. If information on some species divergence times is available from the fossil or geological record, it can be used to calibrate a phylogeny and estimate divergence times for all nodes in the tree. The Bayesian method provides a natural framework to incorporate different sources of information concerning divergence times, such as information in the fossil and molecular data. Current models of sequence evolution are intractable in a Bayesian setting, and Markov chain Monte Carlo (MCMC) is used to generate the posterior distribution of divergence times and evolutionary rates. This method is computationally expensive, as it involves the repeated calculation of the likelihood function. Here, we explore the use of Taylor expansion to approximate the likelihood during MCMC iteration. The approximation is much faster than conventional likelihood calculation. However, the approximation is expected to be poor when the proposed parameters are far from the likelihood peak. We explore the use of parameter transforms (square root, logarithm, and arcsine) to improve the approximation to the likelihood curve. We found that the new methods, particularly the arcsine-based transform, provided very good approximations under relaxed clock models and also under the global clock model when the global clock is not seriously violated. The approximation is poorer for analysis under the global clock when the global clock is seriously wrong and should thus not be used. The results suggest that the approximate method may be useful for Bayesian dating analysis using large data sets.

247 citations
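
The idea of replacing an exact likelihood with a second-order Taylor expansion around its peak can be shown with a toy example. This sketch does not use a phylogenetic likelihood or the parameter transforms from the paper: it assumes a one-parameter Poisson-style log-likelihood, l(t) = k log t − t, chosen only because its maximum and curvature have closed forms. All names are hypothetical.

```python
import math


def exact_loglik(t, k):
    """Toy log-likelihood (up to a constant) for a count k with rate t."""
    return k * math.log(t) - t


def taylor_loglik(t, k):
    """Second-order Taylor expansion of exact_loglik around its maximum.

    The maximum is at t_hat = k, where the gradient is zero and the
    second derivative is -k / t_hat**2.
    """
    t_hat = float(k)
    peak = exact_loglik(t_hat, k)
    hessian = -k / t_hat ** 2
    return peak + 0.5 * hessian * (t - t_hat) ** 2


k = 20
for t in (20.0, 21.0, 40.0):
    error = taylor_loglik(t, k) - exact_loglik(t, k)
    print(t, round(error, 4))
```

The error is negligible close to the peak and grows quickly far from it, which matches the caveat in the abstract about proposals far from the likelihood peak.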


Journal ArticleDOI
TL;DR: It is shown that species, which show a pattern of adaptive evolution, can apparently be subject to weak purifying selection and vice versa, and that this bias can be removed by using a variant of the Cochran-Mantel-Haenszel procedure for estimating a weighted average OR.
Abstract: The McDonald–Kreitman (MK) test is a simple and widely used test of selection in which the numbers of nonsilent and silent substitutions (Dn and Ds) are compared with the numbers of nonsilent and silent polymorphisms (Pn and Ps). The neutrality index (NI = DsPn/DnPs), the odds ratio (OR) of the MK table, measures the direction and degree of departure from neutral evolution. The mean of NI values across genes is often taken to summarize patterns of selection in a species. Here, we show that this leads to statistical bias in both simulated and real data to the extent that species, which show a pattern of adaptive evolution, can apparently be subject to weak purifying selection and vice versa. We show that this bias can be removed by using a variant of the Cochran–Mantel–Haenszel procedure for estimating a weighted average OR. We also show that several point estimators of NI are statistically biased even when cutoff values are employed. We therefore suggest that a new statistic be used to study patterns of selection when data are sparse, the direction of selection: DoS = Dn/(Dn + Ds) − Pn/(Pn + Ps).

240 citations
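
The two statistics defined in the abstract above, NI = DsPn/DnPs and DoS = Dn/(Dn + Ds) − Pn/(Pn + Ps), are simple enough to compute directly. The sketch below is a minimal illustration with made-up counts; the function names and the choice to return None for an undefined value are assumptions of this example.

```python
def neutrality_index(dn, ds, pn, ps):
    """NI = (Ds * Pn) / (Dn * Ps); None when the ratio is undefined."""
    if dn == 0 or ps == 0:
        return None
    return (ds * pn) / (dn * ps)


def direction_of_selection(dn, ds, pn, ps):
    """DoS = Dn / (Dn + Ds) - Pn / (Pn + Ps); None when undefined."""
    if dn + ds == 0 or pn + ps == 0:
        return None
    return dn / (dn + ds) - pn / (pn + ps)


# Illustrative counts: an excess of nonsilent substitutions.
print(neutrality_index(7, 17, 2, 42))        # below 1
print(direction_of_selection(7, 17, 2, 42))  # positive

# Sparse gene with no nonsilent substitutions: NI is undefined, DoS is not.
print(neutrality_index(0, 3, 1, 4))
print(direction_of_selection(0, 3, 1, 4))
```

The second pair of calls shows one reason DoS is convenient when data are sparse: a zero cell leaves NI undefined, whereas DoS only requires that each row of the MK table contain at least one count.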


Journal ArticleDOI
TL;DR: Investigating elements of B(12) metabolism in the sequenced genomes of 15 different algal species, with representatives of the red, green, and brown algae, diatoms, and coccolithophores, including both macro- and microalgae, and from marine and freshwater environments, found a strong correlation between the absence of a functional METE gene and B(12) auxotrophy.
Abstract: Vitamin B(12) (cobalamin) is a dietary requirement for humans because it is an essential cofactor for two enzymes, methylmalonyl-CoA mutase and methionine synthase (METH). Land plants and fungi neither synthesize nor require cobalamin because they do not contain methylmalonyl-CoA mutase, and have an alternative B(12)-independent methionine synthase (METE). Within the algal kingdom, approximately half of all microalgal species need the vitamin as a growth supplement, but there is no phylogenetic relationship between these species, suggesting that the auxotrophy arose multiple times through evolution. We set out to determine the underlying cellular mechanisms for this observation by investigating elements of B(12) metabolism in the sequenced genomes of 15 different algal species, with representatives of the red, green, and brown algae, diatoms, and coccolithophores, including both macro- and microalgae, and from marine and freshwater environments. From this analysis, together with growth assays, we found a strong correlation between the absence of a functional METE gene and B(12) auxotrophy. The presence of a METE unitary pseudogene in the B(12)-dependent green algae Volvox carteri and Gonium pectorale, relatives of the B(12)-independent Chlamydomonas reinhardtii, suggests that B(12) dependence evolved recently in these lineages. In both C. reinhardtii and the diatom Phaeodactylum tricornutum, growth in the presence of cobalamin leads to repression of METE transcription, providing a mechanism for gene loss. Thus varying environmental conditions are likely to have been the reason for the multiple independent origins of B(12) auxotrophy in these organisms. Because the ultimate source of cobalamin is from prokaryotes, the selective loss of METE in different algal lineages will have had important physiological and ecological consequences for these organisms in terms of their dependence on bacteria.

Journal ArticleDOI
TL;DR: These analyses indicate that by improving the taxonomic sampling, complete mt genomes can solve the evolutionary relationships among major bird groups and support the choice of COX1 among mt genes as target for developing DNA barcoding approaches in birds.
Abstract: Mitochondrial (mt) genes and genomes are among the major sources of data for evolutionary studies in birds. This places mitogenomic studies in birds at the core of intense debates in avian evolutionary biology. Indeed, complete mt genomes are actively being used to unveil the phylogenetic relationships among major orders, whereas single genes (e.g., cytochrome c oxidase I [COX1]) are considered standard for species identification and defining species boundaries (DNA barcoding). In this investigation, we study the time of origin and evolutionary relationships among Neoaves orders using complete mt genomes. First, we were able to solve polytomies previously observed at the deep nodes of the Neoaves phylogeny by analyzing 80 mt genomes, including 17 new sequences reported in this investigation. As an example, we found evidence indicating that columbiforms and charadriforms are sister groups. Overall, our analyses indicate that by improving the taxonomic sampling, complete mt genomes can solve the evolutionary relationships among major bird groups. Second, we used our phylogenetic hypotheses to estimate the time of origin of major avian orders as a way to test if their diversification took place prior to the Cretaceous/Tertiary (K/T) boundary. Such timetrees were estimated using several molecular dating approaches and conservative calibration points. Whereas we found time estimates slightly younger than those reported by others, most of the major orders originated prior to the K/T boundary. Finally, we used our timetrees to estimate the rate of evolution of each mt gene. We found great variation in the mutation rates among mt genes and within different bird groups. COX1 was the gene with the least variation among Neoaves orders and the one with the least amount of rate heterogeneity across lineages. Such findings support the choice of COX1 among mt genes as target for developing DNA barcoding approaches in birds.

Journal ArticleDOI
TL;DR: The complete plastid genome of the underground orchid, Rhizanthella gardneri, is sequenced, showing that plastids have other essential functions besides photosynthesis.
Abstract: Since the endosymbiotic origin of chloroplasts from cyanobacteria 2 billion years ago, the evolution of plastids has been characterized by massive loss of genes. Most plants and algae depend on photosynthesis for energy and have retained ∼110 genes in their chloroplast genome that encode components of the gene expression machinery and subunits of the photosystems. However, nonphotosynthetic parasitic plants have retained a reduced plastid genome, showing that plastids have other essential functions besides photosynthesis. We sequenced the complete plastid genome of the underground orchid, Rhizanthella gardneri. This remarkable parasitic subterranean orchid possesses the smallest organelle genome yet described in land plants. With only 20 proteins, 4 rRNAs, and 9 tRNAs encoded in 59,190 bp, it is the least gene-rich plastid genome known to date apart from the fragmented plastid genome of some dinoflagellates. Despite numerous differences, striking similarities with plastid genomes from unrelated parasitic plants identify a minimal set of protein-encoding and tRNA genes required to reside in plant plastids. This prime example of convergent evolution implies shared selective constraints on gene loss or transfer.

Journal ArticleDOI
TL;DR: A probabilistic framework for testing the coupling between continuous characters and parameters of the molecular substitution process is developed and a negative correlation between the rate of substitution and mass and longevity is observed, thus in partial agreement with the nearly neutral theory.
Abstract: The comparative approach is routinely used to test for possible correlations between phenotypic or life-history traits. To correct for phylogenetic inertia, the method of independent contrasts assumes that continuous characters evolve along the phylogeny according to a multivariate Brownian process. Brownian diffusion processes have also been used to describe time variations of the parameters of the substitution process, such as the rate of substitution or the ratio of synonymous to nonsynonymous substitutions. Here, we develop a probabilistic framework for testing the coupling between continuous characters and parameters of the molecular substitution process. Rates of substitution and continuous characters are jointly modeled as a multivariate Brownian diffusion process of unknown covariance matrix. The covariance matrix, the divergence times and the phylogenetic variations of substitution rates and continuous characters are all jointly estimated in a Bayesian Monte Carlo framework, imposing on the covariance matrix a prior conjugate to the Brownian process so as to achieve a greater computational efficiency. The coupling between rates and phenotypes is assessed by measuring the posterior probability of positive or negative covariances, whereas divergence dates and phenotypic variations are marginally reconstructed in the context of the joint analysis. As an illustration, we apply the model to a set of 410 mammalian cytochrome b sequences. We observe a negative correlation between the rate of substitution and mass and longevity, which was previously observed. We also find a positive correlation between ω = dN/dS and mass and longevity, which we interpret as an indirect effect of variations of effective population size, thus in partial agreement with the nearly neutral theory. The method can easily be extended to any parameter of the substitution process and to any continuous phenotypic or environmental character.

Journal ArticleDOI
TL;DR: It is suggested that within rosids there have been independent transfers of rpl22 to the nucleus in Fabaceae and Fagaceae and a putative third transfer in Passiflora and this does not predate the origin of angiosperms as suggested in an earlier study.
Abstract: Functional gene transfer from the plastid to the nucleus is rare among land plants despite evidence that DNA transfer to the nucleus is relatively frequent. During the course of sequencing plastid genomes from representative species from three rosid genera (Castanea, Prunus, Theobroma) and ongoing projects focusing on the Fagaceae and Passifloraceae, we identified putative losses of rpl22 in these two angiosperm families. We further characterized rpl22 from three species of Passiflora and one species of Quercus and identified sequences that likely represent pseudogenes. In Castanea and Quercus, both members of the Fagaceae, we identified a nuclear copy of rpl22, which consisted of two exons separated by an intron. Exon 1 encodes a transit peptide that likely targets the protein product back to the plastid and exon 2 encodes rpl22. We performed phylogenetic analyses of 97 taxa, including 93 angiosperms and four gymnosperm outgroups using alignments of 81 plastid genes to examine the phylogenetic distribution of rpl22 loss and transfer to the nucleus. Our results indicate that within rosids there have been independent transfers of rpl22 to the nucleus in Fabaceae and Fagaceae and a putative third transfer in Passiflora. The high level of sequence divergence between the transit peptides in Fabaceae and Fagaceae strongly suggests that these represent independent transfers. Furthermore, BLAST searches did not identify the "donor" genes of the transit peptides, suggesting a de novo origin. We also performed phylogenetic analyses of rpl22 for 87 angiosperms and four gymnosperms, including nuclear-encoded copies for five species of Fabaceae and Fagaceae. The resulting trees indicated that the transfer of rpl22 to the nucleus does not predate the origin of angiosperms as suggested in an earlier study. Using previously published angiosperm divergence time estimates, we suggest that these transfers occurred approximately 56-58, 34-37, and 26-27 Ma for the Fabaceae, Fagaceae, and Passifloraceae, respectively.

Journal ArticleDOI
TL;DR: A stochastic mapping method utilizing advanced likelihood-based evolutionary models is used to quantify gene family acquisition events by HGT and shows that the biological function of a gene family is an insignificant factor in the determination of transferability when proteins with similar levels of connectivity are compared.
Abstract: Horizontal gene transfer (HGT) is a prevalent and a highly important phenomenon in microbial species evolution. One of the important challenges in HGT research is to better understand the factors that determine the tendency of genes to be successfully transferred and retained in evolution (i.e., transferability). It was previously observed that transferability of genes depends on the cellular process in which they are involved where genes involved in transcription or translation are less likely to be transferred than metabolic genes. It was further shown that gene connectivity in the protein-protein interaction network affects HGT. These two factors were shown to be correlated, and their influence on HGT is collectively termed the "complexity hypothesis". In this study, we used a stochastic mapping method utilizing advanced likelihood-based evolutionary models to quantify gene family acquisition events by HGT. We applied our methodology to an extensive across-species genome-wide dataset that enabled us to estimate the overall extent of transfer events in evolution and to study the trends and barriers to gene transferability. Focusing on the biological function and the connectivity of genes, we obtained novel insights regarding the "complexity hypothesis." Specifically, we aimed to disentangle the relationships between protein connectivity, cellular function, and transferability and to quantify the relative contribution of each of these factors in determining transferability. We show that the biological function of a gene family is an insignificant factor in the determination of transferability when proteins with similar levels of connectivity are compared. In contrast, we found that connectivity is an important and a statistically significant factor in determining transferability when proteins with a similar function are compared.

Journal ArticleDOI
Yu Fan1, Rui Wu1, Ming-Hui Chen1, Lynn Kuo1, Paul O. Lewis1 
TL;DR: A new more accurate method for estimating the marginal likelihood of a model and a comparison with the HM method on both simulated and empirical data shows that the generalized SS method tends to choose simpler partition schemes that are more in line with expectation based on inferred patterns of molecular evolution.
Abstract: Bayesian phylogenetic analyses often depend on Bayes factors (BFs) to determine the optimal way to partition the data. The marginal likelihoods used to compute BFs, in turn, are most commonly estimated using the harmonic mean (HM) method, which has been shown to be inaccurate. We describe a new more accurate method for estimating the marginal likelihood of a model and compare it with the HM method on both simulated and empirical data. The new method generalizes our previously described stepping-stone (SS) approach by making use of a reference distribution parameterized using samples from the posterior distribution. This avoids one challenging aspect of the original SS method, namely the need to sample from distributions that are close (in the Kullback–Leibler sense) to the prior. We specifically address the choice of partition models and find that using the HM method can lead to a strong preference for an overpartitioned model. In contrast to the HM method and the original SS method, we show using simulated data that the generalized SS method is strikingly more precise (repeatable BF values of the same data and partition model) and yields BF values that are much more reasonable than those produced by the HM method. Comparisons of HM and generalized SS methods on an empirical data set demonstrate that the generalized SS method tends to choose simpler partition schemes that are more in line with expectation based on inferred patterns of molecular evolution. The generalized SS method shares with thermodynamic integration the need to sample from a series of distributions in addition to the posterior. Such dedicated path-based Markov chain Monte Carlo analyses appear to be a cost of estimating marginal likelihoods accurately.
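As a rough illustration of the two kinds of estimator contrasted in this abstract, the sketch below compares the harmonic mean (HM) estimator with the original stepping-stone (SS) estimator, not the generalized SS method the paper introduces. It assumes a toy conjugate normal model (data with unit variance and a normal prior on the mean) rather than a phylogenetic model, because that lets the power posteriors be sampled exactly and the marginal likelihood be computed in closed form. All function names, the prior width, the number of steps, and the Beta(0.3, 1)-style spacing of the powers are assumptions chosen for illustration.

```python
import math
import random


def log_mean_exp(values):
    """Numerically stable log of the arithmetic mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def make_model(y, tau):
    """Toy model: y_i ~ Normal(mu, 1), prior mu ~ Normal(0, tau^2)."""
    n = len(y)
    s1 = sum(y)
    s2 = sum(v * v for v in y)

    def log_lik(mu):
        # Log-likelihood written with sufficient statistics.
        return (-0.5 * n * math.log(2 * math.pi)
                - 0.5 * (s2 - 2 * mu * s1 + n * mu * mu))

    def sample_power_posterior(beta, rng):
        # Density proportional to likelihood^beta * prior is itself normal.
        precision = beta * n + 1.0 / tau ** 2
        mean = beta * s1 / precision
        return rng.gauss(mean, 1.0 / math.sqrt(precision))

    # Closed-form log marginal likelihood for this conjugate model.
    exact = (-0.5 * n * math.log(2 * math.pi)
             - 0.5 * math.log(1 + n * tau ** 2)
             - 0.5 * (s2 - tau ** 2 * s1 ** 2 / (1 + n * tau ** 2)))
    return log_lik, sample_power_posterior, exact


def stepping_stone(log_lik, sampler, rng, n_steps=32, n_samples=2000, alpha=0.3):
    """Original stepping-stone estimate of the log marginal likelihood."""
    # Powers run from 0 (prior) to 1 (posterior), concentrated near 0.
    betas = [(k / n_steps) ** (1.0 / alpha) for k in range(n_steps + 1)]
    log_z = 0.0
    for b_prev, b_next in zip(betas, betas[1:]):
        draws = [sampler(b_prev, rng) for _ in range(n_samples)]
        # Each ratio of normalizing constants is an importance-sampling average.
        log_z += log_mean_exp([(b_next - b_prev) * log_lik(mu) for mu in draws])
    return log_z


def harmonic_mean(log_lik, sampler, rng, n_samples=2000):
    """Harmonic mean estimate using draws from the posterior (beta = 1)."""
    draws = [sampler(1.0, rng) for _ in range(n_samples)]
    return -log_mean_exp([-log_lik(mu) for mu in draws])


rng = random.Random(1)
y = [rng.gauss(0.5, 1.0) for _ in range(20)]
log_lik, sampler, exact = make_model(y, tau=10.0)
ss = stepping_stone(log_lik, sampler, rng)
hm = harmonic_mean(log_lik, sampler, rng)
print("exact log marginal likelihood:", exact)
print("stepping-stone estimate:      ", ss)
print("harmonic mean estimate:       ", hm)
```

Repeating the run with different seeds gives a feel for the repeatability of each estimator on this toy problem. It says nothing about behavior on real partitioned phylogenetic models.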

Journal ArticleDOI
TL;DR: Adaptations in mitochondrial enzyme kinetics and O2 transport capacity may contribute to the exceptional ability of bar-headed geese to fly high.
Abstract: Bar-headed geese (Anser indicus) fly at up to 9,000 m elevation during their migration over the Himalayas, sustaining high metabolic rates in the severe hypoxia at these altitudes. We investigated the evolution of cardiac energy metabolism and O2 transport in this species to better understand the molecular and physiological mechanisms of high-altitude adaptation. Compared with low-altitude geese (pink-footed geese and barnacle geese), bar-headed geese had larger lungs and higher capillary densities in the left ventricle of the heart, both of which should improve O2 diffusion during hypoxia. Although myoglobin abundance and the activities of many metabolic enzymes (carnitine palmitoyltransferase, citrate synthase, 3-hydroxyacyl-CoA dehydrogenase, lactate dehydrogenase, and pyruvate kinase) showed only minor variation between species, bar-headed geese had a striking alteration in the kinetics of cytochrome c oxidase (COX), the heteromeric enzyme that catalyzes O2 reduction in oxidative phosphorylation. This was reflected by a lower maximum catalytic activity and a higher affinity for reduced cytochrome c. There were small differences between species in messenger RNA and protein expression of COX subunits 3 and 4, but these were inconsistent with the divergence in enzyme kinetics. However, the COX3 gene of bar-headed geese contained a nonsynonymous substitution at a site that is otherwise conserved across vertebrates and resulted in a major functional change of amino acid class (Trp-116 → Arg). This mutation was predicted by structural modeling to alter the interaction between COX3 and COX1. Adaptations in mitochondrial enzyme kinetics and O2 transport capacity may therefore contribute to the exceptional ability of bar-headed geese to fly high.

Journal ArticleDOI
TL;DR: A surprisingly large core gene set present in all genomes and a high number of diverse accessory genes in those Chlamydiae that do not primarily infect humans or animals are identified, including a chemosensory system in P. acanthamoebae and a type IV secretion system.
Abstract: Chlamydiae are evolutionarily well-separated bacteria that live exclusively within eukaryotic host cells. They include important human pathogens such as Chlamydia trachomatis as well as symbionts of protozoa. As these bacteria are experimentally challenging and genetically intractable, our knowledge about them is still limited. In this study, we obtained the genome sequences of Simkania negevensis Z, Waddlia chondrophila 2032/99, and Parachlamydia acanthamoebae UV-7. This enabled us to perform the first comprehensive comparative and phylogenomic analysis of representative members of four major families of the Chlamydiae, including the Chlamydiaceae. We identified a surprisingly large core gene set present in all genomes and a high number of diverse accessory genes in those Chlamydiae that do not primarily infect humans or animals, including a chemosensory system in P. acanthamoebae and a type IV secretion system. In S. negevensis, the type IV secretion system is encoded on a large conjugative plasmid (pSn, 132 kb). Phylogenetic analyses suggested that a plasmid similar to the S. negevensis plasmid was originally acquired by the last common ancestor of all four families and that it was subsequently reduced, integrated into the chromosome, or lost during diversification, ultimately giving rise to the extant virulence-associated plasmid of pathogenic chlamydiae. Other virulence factors, including a type III secretion system, are conserved among the Chlamydiae to variable degrees and together with differences in the composition of the cell wall reflect adaptation to different host cells including convergent evolution among the four chlamydial families. Phylogenomic analysis focusing on chlamydial proteins with homology to plant proteins provided evidence for the acquisition of 53 chlamydial genes by a plant progenitor, lending further support for the hypothesis of an early interaction between a chlamydial ancestor and the primary photosynthetic eukaryote.

Journal ArticleDOI
TL;DR: The repertoire of 17 metazoan TFs is uncovered in the amoeboid holozoan Capsaspora owczarzaki, a representative of a unicellular lineage that is closely related to choanoflagellates and metazoans, and new hypotheses regarding the origin and evolution of developmental metazoan TFs are formulated.
Abstract: How animals (metazoans) originated from their single-celled ancestors remains a major question in biology. As transcriptional regulation is crucial to animal development, deciphering the early evolution of associated transcription factors (TFs) is critical to understanding metazoan origins. In this study, we uncovered the repertoire of 17 metazoan TFs in the amoeboid holozoan Capsaspora owczarzaki, a representative of a unicellular lineage that is closely related to choanoflagellates and metazoans. Phylogenetic and comparative genomic analyses with the broadest possible taxonomic sampling allowed us to formulate new hypotheses regarding the origin and evolution of developmental metazoan TFs. We show that the complexity of the TF repertoire in C. owczarzaki is strikingly high, pushing back further the origin of some TFs formerly thought to be metazoan specific, such as T-box or Runx. Nonetheless, TF families whose beginnings antedate the origin of the animal kingdom, such as homeodomain or basic helix-loop-helix, underwent significant expansion and diversification along metazoan and eumetazoan stems.

Journal ArticleDOI
TL;DR: It is demonstrated how pervasive purifying selection can mask the ancient origins of recently sampled pathogens, in part due to the inability of nucleotide-based substitution models to properly account for complex patterns of spatial and temporal variability in selective pressures.
Abstract: Statistical methods for molecular dating of viral origins have been used extensively to infer the time of the most recent common ancestor for many rapidly evolving pathogens. However, there are a number of cases in which epidemiological, historical, or genomic evidence suggests much older viral origins than those obtained via molecular dating. We demonstrate how pervasive purifying selection can mask the ancient origins of recently sampled pathogens, in part due to the inability of nucleotide-based substitution models to properly account for complex patterns of spatial and temporal variability in selective pressures. We use codon-based substitution models to infer the length of branches in viral phylogenies; these models produce estimates that are often considerably longer than those obtained with traditional nucleotide-based substitution models. Correcting the apparent underestimation of branch lengths suggests substantially older origins for measles, Ebola, and avian influenza viruses. This work helps to reconcile some of the inconsistencies between molecular dating and other types of evidence concerning the age of viral lineages.

Journal ArticleDOI
TL;DR: It is hypothesized that levels and patterns of expression are not only the major determinants that explain nonsynonymous rate variation among genes but also a critical determinant of gene retention after duplication.
Abstract: Surprisingly, few studies have described evolutionary rate variation among plant nuclear genes, with little investigation of the causes of rate variation. Here, we describe evolutionary rates for 11,492 ortholog pairs between Arabidopsis thaliana and A. lyrata and investigate possible contributors to rate variation among these genes. Rates of evolution at synonymous sites vary along chromosomes, suggesting that mutation rates vary on genomic scales, perhaps as a function of recombination rate. Rates of evolution at nonsynonymous sites correlate most strongly with expression patterns, but they also vary as to whether a gene is duplicated and retained after a whole-genome duplication (WGD) event. WGD genes evolve more slowly, on average, than nonduplicated genes and non-WGD duplicates. We hypothesize that levels and patterns of expression are not only the major determinants that explain nonsynonymous rate variation among genes but also a critical determinant of gene retention after duplication.

Journal ArticleDOI
TL;DR: There is strong evidence that TEs in Drosophila across all orders and families are subject to purifying selection at the level of ectopic recombination, and it is shown that the strength of this selection varies predictably with recombination rate, length of individual TEs, and copy number and length of other TEs in the same family.
Abstract: Transposable elements (TEs) are the primary contributors to the genome bulk in many organisms and are major players in genome evolution. A clear and thorough understanding of the population dynamics of TEs is therefore essential for full comprehension of the eukaryotic genome evolution and function. Although TEs in Drosophila melanogaster have received much attention, population dynamics of most TE families in this species remains entirely unexplored. It is not clear whether the same population processes can account for the population behaviors of all TEs in Drosophila or whether, as has been suggested previously, different orders behave according to very different rules. In this work, we analyzed population frequencies for a large number of individual TEs (755 TEs) in five North American and one sub-Saharan African D. melanogaster populations (75 strains in total). These TEs have been annotated in the reference D. melanogaster euchromatic genome and have been sampled from all three major orders (non-LTR, LTR, and TIR) and from all families with more than 20 TE copies (55 families in total). We find strong evidence that TEs in Drosophila across all orders and families are subject to purifying selection at the level of ectopic recombination. We showed that strength of this selection varies predictably with recombination rate, length of individual TEs, and copy number and length of other TEs in the same family. Importantly, these rules do not appear to vary across orders. Finally, we built a statistical model that considered only individual TE-level (such as the TE length) and family-level properties (such as the copy number) and were able to explain more than 40% of the variation in TE frequencies in D. melanogaster.

Journal ArticleDOI
TL;DR: It is shown that the SLC superfamily is ancient with multiple branches that were present before early divergence of Bilateria and is valuable for annotation and prediction of substrates for the many SLCs that have not been tested in experimental transport assays.
Abstract: The Solute Carriers (SLCs) are membrane proteins that regulate transport of many types of substances over the cell membrane. The SLCs are found in at least 46 gene families in the human genome. Here, we performed the first evolutionary analysis of the entire SLC family based on whole genome sequences. We systematically mined and analyzed the genomes of 17 species to identify SLC genes. In all, we identified 4,813 SLC sequences in these genomes, and we delineated the evolutionary history of each of the subgroups. Moreover, we also identified ten new human sequences not previously classified as SLCs, which most likely belong to the SLC family. We found that 43 of the 46 SLC families found in Homo sapiens were also found in Caenorhabditis elegans, whereas 42 of them were also found in insects. Mammals have a higher number of SLC genes in most families, perhaps reflecting important roles for these in central nervous system functions. This study provides a systematic analysis of the evolutionary history of the SLC families in Eukaryotes showing that the SLC superfamily is ancient with multiple branches that were present before early divergence of Bilateria. The results provide foundation for overall classification of SLC genes and are valuable for annotation and prediction of substrates for the many SLCs that have not been tested in experimental transport assays.

Journal ArticleDOI
TL;DR: The first characterization of the entire TLR multigene family in non-model avian species is conducted and the list of candidate loci for avian eco-immunogenetics beyond the widely employed genes of the Major Histocompatibility Complex is extended.
Abstract: Toll-like receptors (TLRs) are membrane-bound sensors of the innate immune system that recognize invariant and distinctive molecular features of invading microbes and are also essential for initiating adaptive immunity in vertebrates. The genetic variation at TLR genes has been directly related to differential pathogen outcomes in humans and livestock. Nonetheless, new insights about the impact of TLR polymorphism on the evolutionary ecology of infectious diseases can be gained through the investigation of additional vertebrate groups not yet investigated in detail. In this study, we have conducted the first characterization of the entire TLR multigene family (N = 10 genes) in non-model avian species. Using primers targeting conserved coding regions, we aimed at amplifying large segments of the extracellular domains (275-435 aa) involved in pathogen recognition across seven phylogenetically diverse bird species. Our analyses suggest avian TLRs are dominated by stabilizing selection, suggesting that slow rates of nonsynonymous substitution help preserve biological function. Overall, mean values of ω (= dN/dS) at each TLR locus ranged from 0.196 to 0.517. However, we also found patterns of positive selection acting on specific amino acid sites that could be linked to species-specific differences in pathogen-associated molecular pattern recognition. Only 39 of the 2,875 codons analyzed (∼1.35%) exhibited significant patterns of positive selection. At least one half of these positively selected codons can be mapped to putative ligand-binding regions, as suggested by crystallographic structures of TLRs and their ligands and mutagenic analyses. We also surveyed TLR polymorphism in wild populations of two bird species, the Lesser Kestrel Falco naumanni and the House Finch Carpodacus mexicanus.
In general, avian TLRs displayed low to moderate single nucleotide polymorphism levels and an excess of silent nucleotide substitutions, but also conspicuous instances of positive directional selection. In particular, TLR5 and TLR15 exhibited the highest degree of genetic polymorphism and the highest occurrence of nonconservative amino acid substitutions. This study provides critical primers and a first look at the evolutionary patterns and implications of TLR polymorphism in non-model avian species and extends the list of candidate loci for avian eco-immunogenetics beyond the widely employed genes of the Major Histocompatibility Complex (MHC).

Journal ArticleDOI
TL;DR: It is proposed that AA speakers in India today are derived from dispersal from southeast Asia, followed by extensive sex-specific admixture with local Indian populations, strongly supporting the first of the two hypotheses.
Abstract: The geographic origin and time of dispersal of Austroasiatic (AA) speakers, presently settled in south and southeast Asia, remain disputed. Two rival hypotheses, both assuming a demic component to the language dispersal, have been proposed. The first of these places the origin of Austroasiatic speakers in southeast Asia with a later dispersal to south Asia during the Neolithic, whereas the second hypothesis advocates pre-Neolithic origins and dispersal of this language family from south Asia. To test the two alternative models, this study combines the analysis of uniparentally inherited markers with 610,000 common single nucleotide polymorphism loci from the nuclear genome. Indian AA speakers have high frequencies of Y chromosome haplogroup O2a; our results show that this haplogroup has significantly higher diversity and coalescent time (17-28 thousand years ago) in southeast Asia, strongly supporting the first of the two hypotheses. Nevertheless, the results of principal component and "structure-like" analyses on autosomal loci also show that the population history of AA speakers in India is more complex, being characterized by two ancestral components: one represented in the pattern of Y chromosomal and EDAR results and the other by mitochondrial DNA diversity and genomic structure. We propose that AA speakers in India today are derived from dispersal from southeast Asia, followed by extensive sex-specific admixture with local Indian populations.

Journal ArticleDOI
TL;DR: The three genomes of N. sylvestris, N. tomentosiformis, and N. tabacum have experienced different evolutionary trajectories, with genomes that are dynamic, stable, and downsized, respectively.
Abstract: We used next generation sequencing to characterize and compare the genomes of the recently derived allotetraploid, Nicotiana tabacum (<200,000 years old), with its diploid progenitors, Nicotiana sylvestris (maternal, S-genome donor), and Nicotiana tomentosiformis (paternal, T-genome donor). Analysis of 14,634 repetitive DNA sequences in the genomes of the progenitor species and N. tabacum reveals all major types of retroelements found in angiosperms (genome proportions range between 17-22.5% and 2.3-3.5% for Ty3-gypsy elements and Ty1-copia elements, respectively). The diploid N. sylvestris genome exhibits evidence of recent bursts of sequence amplification and/or homogenization, whereas the genome of N. tomentosiformis lacks this signature and has considerably fewer homogeneous repeats. In the derived allotetraploid N. tabacum, there is evidence of genome downsizing and sequence loss across most repeat types. This is particularly evident amongst the Ty3-gypsy retroelements in which all families identified are underrepresented in N. tabacum, as is 35S ribosomal DNA. Analysis of all repetitive DNA sequences indicates the T-genome of N. tabacum has experienced greater sequence loss than the S-genome, revealing preferential loss of paternally derived repetitive DNAs at a genome-wide level. Thus, the three genomes of N. sylvestris, N. tomentosiformis, and N. tabacum have experienced different evolutionary trajectories, with genomes that are dynamic, stable, and downsized, respectively.

Journal ArticleDOI
TL;DR: It is shown that mtDNA intraorganismal heteroplasmy can have deterministic underpinnings and persist for hundreds of millions of years and support the hypothesis that proteins coded by the highly divergent maternally and paternally transmitted mt genomes could be directly involved in sex determination in freshwater mussels.
Abstract: Mitochondrial (mt) function depends critically on optimal interactions between components encoded by mt and nuclear DNAs. Strict maternal mitochondrial DNA (mtDNA) inheritance (SMI) is thought to have evolved in animal species to maintain mitonuclear complementarity by preventing the spread of selfish mt elements, thus typically rendering mtDNA heteroplasmy evolutionarily ephemeral. Here, we show that mtDNA intraorganismal heteroplasmy can have deterministic underpinnings and persist for hundreds of millions of years. We demonstrate that the only exception to SMI in the animal kingdom, that is, the doubly uniparental mtDNA inheritance system in bivalves, with its three-way interactions among egg mt-, sperm mt- and nucleus-encoded gene products, is tightly associated with the maintenance of separate male and female sexes (dioecy) in freshwater mussels. Specifically, this mother-through-daughter and father-through-son mtDNA inheritance system, containing highly differentiated mt genomes, is found in all dioecious freshwater mussel species. Conversely, all hermaphroditic species lack the paternally transmitted mtDNA (= possess SMI) and have heterogeneous macromutations in the recently discovered, novel protein-coding gene (F-orf) in their maternally transmitted mt genomes. Using immunoelectron microscopy, we have localized the F-open reading frame (ORF) protein, likely involved in specifying separate sexes, in mitochondria and in the nucleus. Our results support the hypothesis that proteins coded by the highly divergent maternally and paternally transmitted mt genomes could be directly involved in sex determination in freshwater mussels. Concomitantly, our study demonstrates novel features for animal mt genomes: the existence of additional, lineage-specific, mtDNA-encoded proteins with functional significance and the involvement of mtDNA-encoded proteins in extra-mt functions.
Our results open new avenues for the identification, characterization, and functional analyses of ORFs in the intergenic regions, previously defined as "noncoding," found in a large proportion of animal mt genomes.

Journal ArticleDOI
TL;DR: Several lines of evidence indicate that HLA-G polymorphisms at the 5'-upstream regulatory region (5' URR) and 3'-untranslated region (3' UTR) may influence the HLA-G expression levels, and several lines of evidence for balancing selection acting on the regulatory regions are reported.
Abstract: The HLA-G molecule plays an important role in immune response regulation and has been implicated in the inhibition of T and natural killer cell cytolytic function and inhibition of allogeneic T-cell proliferation. Due to its immune-modulator properties, the HLA-G gene expression has been associated with the outcome of allograft and of autoimmune, infectious, and malignant disorders. Several lines of evidence indicate that HLA-G polymorphisms at the 5'-upstream regulatory region (5' URR) and 3'-untranslated region (3' UTR) may influence the HLA-G expression levels. Because Brazilians represent one of the most heterogeneous populations in the world with the widest HLA-G coding region variability already detected among the studied populations, a high level of variability and haplotype diversity would be expected in Brazilians. On this basis, the 5' URR, coding, and 3' UTR variability were evaluated in a Brazilian series consisting of 100 healthy bone marrow donors, as well as the linkage disequilibrium pattern along the gene and the extended haplotypes encompassing several gene segment variations. The HLA-G locus seems to present six different HLA-G lineages showing functional variations mainly in nucleotides of the regulatory regions. Differences were observed at the 5' URR in positions that either coincide with or are close to transcription factor-binding sites and at the 3' UTR mainly in positions that have already been reported to influence HLA-G mRNA availability. We report several lines of evidence for balancing selection acting on the regulatory regions, which may indicate that these HLA-G lineages may be related to the differential HLA-G expression profiles.

Journal ArticleDOI
TL;DR: It is suggested that at least 97% of the GPCR sequences found in humans share common descent from the cAMP family, and convincing evidence is found that the Rhodopsin family is parent to the important sensory families, Taste 2 and Vomeronasal type 1, as well as the Nematode chemoreceptor families.
Abstract: Several families of G protein-coupled receptors (GPCRs) show no significant sequence similarities, and it has been debated which groups of GPCRs share a common origin. We developed and performed ...