
Showing papers in "Molecular Biology and Evolution in 2004"


Journal ArticleDOI
TL;DR: The results suggest that the complexity of the pattern of substitution of real sequences is better captured by the CAT model, offering the possibility of studying its impact on phylogenetic reconstruction and its connections with structure-function determinants.
Abstract: Most current models of sequence evolution assume that all sites of a protein evolve under the same substitution process, characterized by a 20 x 20 substitution matrix. Here, we propose to relax this assumption by developing a Bayesian mixture model that allows the amino-acid replacement pattern at different sites of a protein alignment to be described by distinct substitution processes. Our model, named CAT, assumes the existence of distinct processes (or classes) differing by their equilibrium frequencies over the 20 residues. Through the use of a Dirichlet process prior, the total number of classes and their respective amino-acid profiles, as well as the affiliations of each site to a given class, are all free variables of the model. In this way, the CAT model is able to adapt to the complexity actually present in the data, and it yields an estimate of the substitutional heterogeneity through the posterior mean number of classes. We show that a significant level of heterogeneity is present in the substitution patterns of proteins, and that the standard one-matrix model fails to account for this heterogeneity. By evaluating the Bayes factor, we demonstrate that the standard model is outperformed by CAT on all of the data sets which we analyzed. Altogether, these results suggest that the complexity of the pattern of substitution of real sequences is better captured by the CAT model, offering the possibility of studying its impact on phylogenetic reconstruction and its connections with structure-function determinants.

1,399 citations


Journal ArticleDOI
TL;DR: SSRs within genes evolve through mutational processes similar to those for SSRs located in other genomic regions including replication slippage, point mutation, and recombination and may provide a molecular basis for fast adaptation to environmental changes in both prokaryotes and eukaryotes.
Abstract: An increasing number of microsatellites, or simple sequence repeats (SSRs), have recently been found and characterized within protein-coding genes and their untranslated regions (UTRs). These data provide useful information to study possible SSR functions. Here, we review SSR distributions within expressed sequence tags (ESTs) and genes, including protein-coding regions, 3'-UTRs, 5'-UTRs, and introns, and discuss the consequences of SSR repeat-number changes in those regions of both prokaryotes and eukaryotes. Strong evidence shows that SSRs are nonrandomly distributed across protein-coding regions, UTRs, and introns. Substantial data indicate that SSR expansions and/or contractions in protein-coding regions can lead to a gain or loss of gene function via frameshift mutation or expanded toxic mRNA. SSR variations in 5'-UTRs could regulate gene expression by affecting transcription and translation. SSR expansions in the 3'-UTRs cause transcription slippage and produce expanded mRNA, which can accumulate as nuclear foci and can disrupt splicing and, possibly, other cellular functions. Intronic SSRs can affect gene transcription, mRNA splicing, or export to the cytoplasm. Triplet SSRs located in UTRs or introns can also induce heterochromatin-mediated-like gene silencing. All these effects caused by SSR expansions or contractions within genes can eventually lead to phenotypic changes. SSRs within genes evolve through mutational processes similar to those for SSRs located in other genomic regions, including replication slippage, point mutation, and recombination. These mutational processes generate DNA changes that should be corrected by the DNA mismatch repair (MMR) system. Mutations that escape MMR correction become new alleles at the SSR loci, and then regulate and/or change gene products, and eventually lead to phenotypic changes. Therefore, SSRs within genes should be subjected to stronger selective pressure than other genomic regions because of their functional importance. These SSRs may provide a molecular basis for fast adaptation to environmental changes in both prokaryotes and eukaryotes.

1,039 citations


Journal ArticleDOI
TL;DR: An ancient (late Paleoproterozoic) origin of photosynthetic eukaryotes with the primary endosymbiosis that gave rise to the first alga having occurred after the split of the Plantae from the opisthokonts sometime before 1,558 MYA is supported.
Abstract: The appearance of photosynthetic eukaryotes (algae and plants) dramatically altered the Earth's ecosystem, making possible all vertebrate life on land, including humans. Dating algal origin is, however, frustrated by a meager fossil record. We generated a plastid multi-gene phylogeny with Bayesian inference and then used maximum likelihood molecular clock methods to estimate algal divergence times. The plastid tree was used as a surrogate for algal host evolution because of recent phylogenetic evidence supporting the vertical ancestry of the plastid in the red, green, and glaucophyte algae. Nodes in the plastid tree were constrained with six reliable fossil dates and a maximum age of 3,500 MYA based on the earliest known eubacterial fossil. Our analyses support an ancient (late Paleoproterozoic) origin of photosynthetic eukaryotes with the primary endosymbiosis that gave rise to the first alga having occurred after the split of the Plantae (i.e., red, green, and glaucophyte algae plus land plants) from the opisthokonts sometime before 1,558 MYA. The split of the red and green algae is calculated to have occurred about 1,500 MYA, and the putative single red algal secondary endosymbiosis that gave rise to the plastid in the cryptophyte, haptophyte, and stramenopile algae (chromists) occurred about 1,300 MYA. These dates, which are consistent with fossil evidence for putative marine algae (i.e., acritarchs) from the early Mesoproterozoic (1,500 MYA) and with a major eukaryotic diversification in the very late Mesoproterozoic and Neoproterozoic, provide a molecular timeline for understanding algal evolution.

876 citations


Journal ArticleDOI
TL;DR: The complex genetic history of the nitrogenase family is explored, which is replete with gene duplication, recruitment, fusion, and horizontal gene transfer and is discussed in light of the hypothesized presence of nitrogenase in the last common ancestor of modern organisms.
Abstract: In recent years, our understanding of biological nitrogen fixation has been bolstered by a diverse array of scientific techniques. Still, the origin and extant distribution of nitrogen fixation have been perplexing from a phylogenetic perspective, largely because of factors that confound molecular phylogeny such as sequence divergence, paralogy, and horizontal gene transfer. Here, we make use of 110 publicly available complete genome sequences to understand how the core components of nitrogenase, including NifH, NifD, NifK, NifE, and NifN proteins, have evolved. These genes are universal in nitrogen-fixing organisms (typically found within highly conserved operons) and, overall, have remarkably congruent phylogenetic histories. Additional clues to the early origins of this system are available from two distinct clades of nitrogenase paralogs: a group composed of genes essential to photosynthetic pigment biosynthesis and a group of uncharacterized genes present in methanogens and in some photosynthetic bacteria. We explore the complex genetic history of the nitrogenase family, which is replete with gene duplication, recruitment, fusion, and horizontal gene transfer, and discuss these events in light of the hypothesized presence of nitrogenase in the last common ancestor of modern organisms, as well as the additional possibility that nitrogen fixation might have evolved later, perhaps in methanogenic archaea, and was subsequently transferred into the bacterial domain.

759 citations


Journal ArticleDOI
TL;DR: A systematic comparison of the draft genome sequences of Fugu and humans is made to identify paralogous chromosomal regions ("paralogons") in the Fugu that arose in the ray-finned fish lineage ("fish-specific").
Abstract: With about 24,000 extant species, teleosts are the largest group of vertebrates. They constitute more than 99% of the ray-finned fishes (Actinopterygii) that diverged from the lobe-finned fish lineage (Sarcopterygii) about 450 MYA. Although the role of genome duplication in the evolution of vertebrates is now established, its role in structuring the teleost genomes has been controversial. At least two hypotheses have been proposed: a whole-genome duplication in an ancient ray-finned fish and independent gene duplications in different lineages. These hypotheses are, however, based on small data sets and lack adequate statistical and phylogenetic support. In this study, we have made a systematic comparison of the draft genome sequences of Fugu and humans to identify paralogous chromosomal regions ("paralogons") in the Fugu that arose in the ray-finned fish lineage ("fish-specific"). We identified duplicate genes in the Fugu by phylogenetic analyses of the Fugu, human, and invertebrate sequences. Our analyses provide evidence for 425 fish-specific duplicate genes in the Fugu and show that at least 6.6% of the genome is represented by fish-specific paralogons. We estimated the ages of Fugu duplicate genes and paralogons using the molecular clock. Remarkably, the ages of duplicate genes and paralogons are clustered, with a peak around 350 MYA. These data strongly suggest a whole-genome duplication event early during the evolution of ray-finned fishes, probably before the origin of teleosts.

526 citations


Journal ArticleDOI
TL;DR: The reversible jump Markov chain Monte Carlo algorithm described here allows estimation of phylogeny (and other phylogenetic model parameters) to be performed while accounting for uncertainty in the model of DNA substitution.
Abstract: A common problem in molecular phylogenetics is choosing a model of DNA substitution that does a good job of explaining the DNA sequence alignment without introducing superfluous parameters. A number of methods have been used to choose among a small set of candidate substitution models, such as the likelihood ratio test, the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and Bayes factors. Current implementations of any of these criteria suffer from the limitation that only a small set of models is examined, or that the test does not allow easy comparison of non-nested models. In this article, we expand the pool of candidate substitution models to include all possible time-reversible models. This set includes seven models that have already been described. We show how Bayes factors can be calculated for these models using reversible jump Markov chain Monte Carlo, and apply the method to 16 DNA sequence alignments. For each data set, we compare the model with the best Bayes factor to the best models chosen using AIC and BIC. We find that the best model under any of these criteria is not necessarily the most complicated one; models with an intermediate number of substitution types typically do best. Moreover, almost all of the models that are chosen as best do not constrain a transition rate to be the same as a transversion rate, suggesting that it is the transition/transversion rate bias that plays the largest role in determining which models are selected. Importantly, the reversible jump Markov chain Monte Carlo algorithm described here allows estimation of phylogeny (and other phylogenetic model parameters) to be performed while accounting for uncertainty in the model of DNA substitution.

480 citations
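
To give a sense of how large the candidate pool in the model-selection entry above becomes, the following is a minimal sketch that enumerates the models by brute force. It rests on an assumption not spelled out in the abstract: that each time-reversible model corresponds to one way of grouping the six exchangeability rates (AC, AG, AT, CG, CT, GT) into classes that share a rate. The names `set_partitions`, `RATES`, and `mixes_ts_and_tv` are hypothetical and chosen only for illustration.

```python
def set_partitions(items):
    """Yield every partition of `items` into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing group in turn...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ...or give it a group of its own.
        yield [[first]] + partition


# Assumption: one model per grouping of the six exchangeability rates.
RATES = ["AC", "AG", "AT", "CG", "CT", "GT"]
TRANSITIONS = {"AG", "CT"}  # purine<->purine and pyrimidine<->pyrimidine changes


def mixes_ts_and_tv(group):
    """True if a rate class contains both a transition and a transversion."""
    kinds = {rate in TRANSITIONS for rate in group}
    return len(kinds) == 2


models = list(set_partitions(RATES))
separated = [m for m in models if not any(mixes_ts_and_tv(g) for g in m)]

print(len(models))     # 203 groupings in total
print(len(separated))  # 30 never tie a transition rate to a transversion rate
```

Under this assumption, the single-class grouping is the equal-rates model and the six-class grouping is the general time-reversible model; only a minority of the groupings keep transitions and transversions in separate classes.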


Journal ArticleDOI
TL;DR: Analysis of 13 eukaryotic species with sequenced mitochondrial and nuclear genomes reveals a large interspecific variation of NUMT number and size.
Abstract: Mitochondrial DNA sequences are frequently transferred to the nucleus, giving rise to the so-called nuclear mitochondrial DNA (NUMT). Analysis of 13 eukaryotic species with sequenced mitochondrial and nuclear genomes reveals a large interspecific variation of NUMT number and size. Copy number ranges from none or few copies in Anopheles, Caenorhabditis, Plasmodium, Drosophila, and Fugu to more than 500 in human, rice, and Arabidopsis. The average size ranges from 62 bp (baker's yeast) to 647 bp (Neurospora). A correlation between the abundance of NUMTs and the size of the nuclear or the mitochondrial genomes, or of the nuclear gene density, is not evident. Other factors, such as the number and/or stability of mitochondria in the germline, or species-specific mechanisms controlling accumulation/loss of nuclear DNA, might be responsible for the interspecific diversity in NUMT accumulation.

450 citations


Journal ArticleDOI
TL;DR: Diverse GFP-like proteins from previously undersampled and completely new sources are described, including hydromedusae and planktonic Copepoda, and a new yellow protein seems to follow exactly the same structural solution to achieving the yellow color of fluorescence as YFP, an engineered yellow-emitting mutant variant of GFP.
Abstract: Homologs of the green fluorescent protein (GFP), including the recently described GFP-like domains of certain extracellular matrix proteins in Bilaterian organisms, are remarkably similar at the protein structure level, yet they often perform totally unrelated functions, thereby warranting recognition as a superfamily. Here we describe diverse GFP-like proteins from previously undersampled and completely new sources, including hydromedusae and planktonic Copepoda. In hydromedusae, yellow and nonfluorescent purple proteins were found in addition to greens. Notably, the new yellow protein seems to follow exactly the same structural solution to achieving the yellow color of fluorescence as YFP, an engineered yellow-emitting mutant variant of GFP. The addition of these new sequences made it possible to resolve deep-level phylogenetic relationships within the superfamily. Fluorescence (most likely green) must have already existed in the common ancestor of Cnidaria and Bilateria, and therefore GFP-like proteins may be responsible for fluorescence and/or coloration in virtually any animal. At least 15 color diversification events can be inferred following the maximum parsimony principle in Cnidaria. Origination of red fluorescence and nonfluorescent purple-blue colors on several independent occasions provides a remarkable example of convergent evolution of complex features at the molecular level.

421 citations


Journal ArticleDOI
TL;DR: It is shown that when branch lengths are unknown, it is better first to estimate branch lengths and then to estimate site-specific rates, and that a Bayesian approach is superior to maximum-likelihood under a wide range of conditions.
Abstract: The degree to which an amino acid site is free to vary is strongly dependent on its structural and functional importance. An amino acid that plays an essential role is unlikely to change over evolutionary time. Hence, the evolutionary rate at an amino acid site is indicative of how conserved this site is and, in turn, allows evaluation of its importance in maintaining the structure/function of the protein. When using probabilistic methods for site-specific rate inference, few alternatives are possible. In this study we use simulations to compare the maximum-likelihood and Bayesian paradigms. We study the dependence of inference accuracy on such parameters as number of sequences, branch lengths, the shape of the rate distribution, and sequence length. We also study the possibility of simultaneously estimating branch lengths and site-specific rates. Our results show that a Bayesian approach is superior to maximum-likelihood under a wide range of conditions, indicating that the prior that is incorporated into the Bayesian computation significantly improves performance. We show that when branch lengths are unknown, it is better first to estimate branch lengths and then to estimate site-specific rates. This procedure was found to be superior to estimating both the branch lengths and site-specific rates simultaneously. Finally, we illustrate the difference between maximum-likelihood and Bayesian methods when analyzing site-conservation for the apoptosis regulator protein Bcl-x(L).

420 citations


Journal ArticleDOI
TL;DR: This large data set provides a reliable phylogenetic framework for studying eukaryotic and animal evolution and will be easily extendable when large amounts of sequence information become available from a broader taxonomic range.
Abstract: Resolving the relationships between Metazoa and other eukaryotic groups as well as between metazoan phyla is central to the understanding of the origin and evolution of animals. The current view is based on limited data sets, either a single gene with many species (e.g., ribosomal RNA) or many genes but with only a few species. Because a reliable phylogenetic inference simultaneously requires numerous genes and numerous species, we assembled a very large data set containing 129 orthologous proteins (approximately 30,000 aligned amino acid positions) for 36 eukaryotic species. Included in the alignments are data from the choanoflagellate Monosiga ovata, obtained through the sequencing of about 1,000 cDNAs. We provide conclusive support for choanoflagellates as the closest relative of animals and for fungi as the second closest. The monophyly of Plantae and chromalveolates was recovered but without strong statistical support. Within animals, in contrast to the monophyly of Coelomata observed in several recent large-scale analyses, we recovered a paraphyletic Coelomata, with nematodes and platyhelminths nested within. To include a diverse sample of organisms, data from EST projects were used for several species, resulting in a large amount of missing data in our alignment (about 25%). By using different approaches, we verify that the inferred phylogeny is not sensitive to these missing data. Therefore, this large data set provides a reliable phylogenetic framework for studying eukaryotic and animal evolution and will be easily extendable when large amounts of sequence information become available from a broader taxonomic range.

408 citations


Journal ArticleDOI
TL;DR: A compositional bias is identified as responsible for conflicting, fully supported trees inferred from the same genome-scale yeast data set, and it is shown to be reduced effectively by coding the nucleotides as purines and pyrimidines (RY-coding), reinforcing the original tree.
Abstract: Phylogenetic inference from sequences can be misled by both sampling (stochastic) error and systematic error (nonhistorical signals where reality differs from our simplified models). A recent study of eight yeast species using 106 concatenated genes from complete genomes showed that even small internal edges of a tree received 100% bootstrap support. This effective negation of stochastic error from large data sets is important, but longer sequences exacerbate the potential for biases (systematic error) to be positively misleading. Indeed, when we analyzed the same data set using minimum evolution optimality criteria, an alternative tree received 100% bootstrap support. We identified a compositional bias as responsible for this inconsistency and showed that it is reduced effectively by coding the nucleotides as purines and pyrimidines (RY-coding), reinforcing the original tree. Thus, a comprehensive exploration of potential systematic biases is still required, even though genome-scale data sets greatly reduce sampling error.
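
The RY-coding mentioned in this abstract can be illustrated with a minimal sketch. It shows only the recoding step (purines A and G become R, pyrimidines C and T become Y), not the authors' analysis pipeline; the function name `ry_code` and the choice to pass gaps and ambiguity codes through unchanged are assumptions made for illustration.

```python
# Purines (A, G) map to R; pyrimidines (C, T, and U for RNA) map to Y.
RY_TABLE = str.maketrans({"A": "R", "G": "R", "C": "Y", "T": "Y", "U": "Y"})


def ry_code(sequence: str) -> str:
    """Recode a nucleotide sequence into purine (R) and pyrimidine (Y) states.

    Assumption: any other character (gaps, ambiguity codes) is left unchanged.
    """
    return sequence.upper().translate(RY_TABLE)


print(ry_code("ATGCCGTA"))  # RYRYYRYR
```

After recoding, only transversions change a site's state, so differences in base composition within the purine or pyrimidine class no longer contribute to the signal.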

Journal ArticleDOI
TL;DR: The divergence of two subspecies of common chimpanzees and the bonobo was studied using a recently developed method for analyzing population divergence using large multilocus DNA sequence data sets drawn from the literature.
Abstract: The divergence of two subspecies of common chimpanzees (Pan troglodytes troglodytes and P. t. verus) and the bonobo (P. paniscus) was studied using a recently developed method for analyzing population divergence. Under the isolation with migration model, the posterior probability distributions of divergence time, migration rates, and effective population sizes were estimated for large multilocus DNA sequence data sets drawn from the literature. The bonobo and the common chimpanzee are estimated to have diverged approximately 0.86 to 0.89 MYA, and the divergence of the two common chimpanzee subspecies is estimated to have occurred 0.42 MYA. P. t. troglodytes appears to have had a larger effective population size (22,400 to 27,900) compared with P. paniscus, P. t. verus, and the ancestral populations of these species. No evidence of gene flow was found in the comparisons involving P. paniscus; however, a clear signal of unidirectional gene flow was found from P. t. verus to P. t. troglodytes (2Nm = 0.51).

Journal ArticleDOI
TL;DR: The results show that, in comparison to tissue-specific genes, housekeeping genes on average evolve more slowly and are under stronger selective constraints, as reflected by significantly smaller values of Ka/Ks, and that, contrary to the old textbook concept, approximately 74% of the housekeeping genes in this study belong to multigene families, a proportion not significantly different from that of the tissue-specific genes.
Abstract: Do housekeeping genes, which are turned on most of the time in almost every tissue, evolve more slowly than genes that are turned on only at specific developmental times or tissues? Recent large-scale gene expression studies enable us to have a better definition of housekeeping genes and to address the above question in detail. In this study, we examined 1581 human-mouse orthologous gene pairs for their patterns of sequence evolution, contrasting housekeeping genes with tissue-specific genes. Our results show that, in comparison to tissue-specific genes, housekeeping genes on average evolve more slowly and are under stronger selective constraints as reflected by significantly smaller values of Ka/Ks. Besides stronger purifying selection, we explored several other factors that can possibly slow down nonsynonymous rates in housekeeping genes. Although mutational bias might slightly slow the nonsynonymous rates in housekeeping genes, it is unlikely to be the major cause of the rate difference between the two types of genes. The codon usage pattern of housekeeping genes does not seem to differ from that of tissue-specific genes. Moreover, contrary to the old textbook concept, we found that approximately 74% of the housekeeping genes in our study belong to multigene families, not significantly different from that of the tissue-specific genes (approximately 70%). Therefore, the stronger selective constraints on housekeeping genes are not due to a lower degree of genetic redundancy.

Journal ArticleDOI
TL;DR: Important changes in the genome of E. coli have occurred during the diversification of the species, allowing the virulence factors associated with severe acute diarrhea to arrive in the population.
Abstract: In bacteria, the evolution of pathogenicity seems to be the result of the constant arrival of virulence factors (VFs) into the bacterial genome. However, the integration, retention, and/or expression of these factors may be the result of the interaction between the newly arriving genes and the bacterial genomic background. To test this hypothesis, a phylogenetic analysis was done on a collection of 98 Escherichia coli/Shigella strains representing the pathogenic and commensal diversity of the species. The distribution of 17 VFs associated with the different E. coli pathovars was superimposed on the phylogenetic tree. Three major types of VFs can be recognized: (1) VFs that arrive and are expressed in different genetic backgrounds (such as VFs associated with the pathovars of mild chronic diarrhea: enteroaggregative, enteropathogenic, and diffusely-adhering E. coli), (2) VFs that arrive in different genetic backgrounds but are preferentially found, associated with a specific pathology, in only one particular background (such as VFs associated with extraintestinal diseases), and (3) VFs that require a particular genetic background for the arrival and expression of their virulence potential (such as VFs associated with pathovars typical of severe acute diarrhea: enterohemorrhagic, enterotoxigenic, and enteroinvasive E. coli strains). The possibility of a single arrival of VFs by chance, followed by vertical transmission, was ruled out by comparing the evolutionary histories of some of these VFs to the strain phylogeny. This evidence suggests that important changes in the genome of E. coli have occurred during the diversification of the species, allowing the virulence factors associated with severe acute diarrhea to arrive in the population. Thus, the E. coli genome seems to be formed by an "ancestral" and a "derived" background, each one responsible for the acquisition and expression of different virulence factors.

Journal ArticleDOI
TL;DR: It is demonstrated that recombination drives the evolution of base composition in human (probably via the process of biased gene conversion) and the pattern of neutral substitutions in 14.3 Mb of primate noncoding regions suggests changes of recombination rates occur relatively frequently during evolution.
Abstract: Unraveling the evolutionary forces responsible for variations of neutral substitution patterns among taxa or along genomes is a major issue in the identification of functional sequence features. Mammalian genomes show large-scale regional variations of GC-content (the isochores), but the substitution processes at the origin of this structure are poorly understood. We have analyzed the pattern of neutral substitutions in 14.3 Mb of primate noncoding regions. We show that the GC-content toward which sequences are evolving is strongly correlated (r(2) = 0.61, P

Journal ArticleDOI
TL;DR: Notably, differential gene loss played an important role in the evolution of different nuclear receptor sets in bilaterian lineages and was also shaped by periods of gene duplication in vertebrates, as well as a lineage-specific duplication burst in nematodes.
Abstract: Bilaterian animals are notably characterized by complex endocrine systems. The receptors for many steroids, retinoids, and other hormones belong to the superfamily of nuclear receptors, which are transcription factors regulating many aspects of development and homeostasis. Despite a diversity of regulatory mechanisms and physiological roles, nuclear receptors share a common protein organization. To obtain the broad picture of bilaterian nuclear hormone receptor evolution, we have characterized the complete set of nuclear receptor genes from nine animal genome sequences and analyzed it in a phylogenetic framework. In addition, expressed sequence tags from key lineages with no available genome sequence were also searched. This allows us to date the evolutionary events that led from an ancestral nuclear receptor gene, in an early metazoan, to present day diversity. We show that there were ~25 nuclear receptor genes in Urbilateria, the ancestor of bilaterians, at which point the fundamental diversity of the subfamily was already established. Surprisingly, differential gene loss played an important role in the evolution of different nuclear receptor sets in bilaterian lineages. The nuclear receptor distribution was also shaped by periods of gene duplication, essentially in vertebrates, as well as a lineage-specific duplication burst in nematodes. Our results imply that the genes for major receptors such as steroid receptors or thyroid hormone receptors were present in Urbilateria.

Journal ArticleDOI
TL;DR: This study used an empirical example based on 100 mitochondrial genomes from higher teleost fishes to compare the accuracy of parsimony-based jackknife values with Bayesian support values, and found that the higher Bayesian support values are inappropriate and should not be interpreted as probabilities that clades are correctly resolved.
Abstract: In this study, we used an empirical example based on 100 mitochondrial genomes from higher teleost fishes to compare the accuracy of parsimony-based jackknife values with Bayesian support values. Phylogenetic analyses of 366 partitions, using differential taxon and character sampling from the entire data matrix of 100 taxa and 7,990 characters, were performed for both phylogenetic methods. The tree topology and branch-support values from each partition were compared with the tree inferred from all taxa and characters. Using this approach, we quantified the accuracy of the branch-support values assigned by the jackknife and Bayesian methods, with respect to each of 15 basal clades. In comparing the jackknife and Bayesian methods, we found that (1) both measures of support differ significantly from an ideal support index; (2) the jackknife underestimated support values; (3) the Bayesian method consistently overestimated support; (4) the magnitude by which Bayesian values overestimate support exceeds the magnitude by which the jackknife underestimates support; and (5) both methods performed poorly when taxon sampling was increased and character sampling was not increased. These results indicate that (1) the higher Bayesian support values are inappropriate (in magnitude), and (2) Bayesian support values should not be interpreted as probabilities that clades are correctly resolved. We advocate the continued use of the relatively conservative bootstrap and jackknife approaches to estimating branch support rather than the more extreme overestimates provided by the Markov Chain Monte Carlo-based Bayesian methods.

Journal ArticleDOI
TL;DR: A classification of the genus Botrytis was constructed based on DNA sequence data of three nuclear protein-coding genes and compared with the traditional classification, finding that loss of sexual reproduction has occurred at least three times and is supposed to be a consequence of negative selection.
Abstract: The cosmopolitan genus Botrytis contains 22 recognized species and one hybrid. The current classification is largely based on morphological characters and, to a minor extent, on physiology and host range. In this study, a classification of the genus was constructed based on DNA sequence data of three nuclear protein-coding genes (RPB2, G3PDH, and HSP60) and compared with the traditional classification. Sexual reproduction and the host range, important fitness traits, were traced in the tree and used for the identification of major evolutionary events during speciation. The phylogenetic analysis corroborated the classical species delineation. In addition, the hybrid status of B. allii (B. byssoidea x B. aclada) was confirmed. Both individual gene trees and combined trees show that the genus Botrytis can be divided into two clades, radiating after the separation of Botrytis from other Sclerotiniaceae genera. Clade 1 contains four species that all colonize exclusively eudicot hosts, whereas clade 2 contains 18 species that are pathogenic on either eudicot (3) or monocot (15) hosts. A comparison of Botrytis and angiosperm phylogenies shows that cospeciation of pathogens and their hosts has not occurred during their respective evolution. Rather, we propose that host shifts have occurred during Botrytis speciation, possibly by the acquisition of novel pathogenicity factors. Loss of sexual reproduction has occurred at least three times and is supposed to be a consequence of negative selection.

Journal ArticleDOI
TL;DR: Pairwise amino acid sequence identity was examined in comparisons of 6,214 nuclear protein-coding genes from Saccharomyces cerevisiae to 177,117 proteins encoded in sequenced genomes from 45 eubacteria and 15 archaebacteria.
Abstract: Analyses of 55 individual and 31 concatenated protein data sets encoded in Reclinomonas americana and Marchantia polymorpha mitochondrial genomes revealed that current methods for constructing phylogenetic trees are insufficiently sensitive (or artifact-insensitive) to ascertain the sister of mitochondria among the current sample of eight alpha-proteobacterial genomes using mitochondrially-encoded proteins. However, Rhodospirillum rubrum came as close to mitochondria as any alpha-proteobacterium investigated. This prompted a search for methods to directly compare eukaryotic genomes to their prokaryotic counterparts to investigate the origin of the mitochondrion and its host from the standpoint of nuclear genes. We examined pairwise amino acid sequence identity in comparisons of 6,214 nuclear protein-coding genes from Saccharomyces cerevisiae to 177,117 proteins encoded in sequenced genomes from 45 eubacteria and 15 archaebacteria. The results reveal that approximately 75% of yeast genes having homologues among the present prokaryotic sample share greater amino acid sequence identity to eubacterial than to archaebacterial homologues. At high stringency comparisons, only the eubacterial component of the yeast genome is detectable. Our findings indicate that at the levels of overall amino acid sequence identity and gene content, yeast shares a sister-group relationship with eubacteria, not with archaebacteria, in contrast to the current phylogenetic paradigm based on ribosomal RNA. Among eubacteria and archaebacteria, proteobacterial and methanogen genomes, respectively, shared more similarity with the yeast genome than other prokaryotic genomes surveyed.

Journal ArticleDOI
TL;DR: Results indicate that only a small proportion of scored AFLP loci might be linked to genes implicated in the adaptive radiation of lake whitefish, and provide further support for the role of selection in driving their divergence.
Abstract: Under the ecological theory of adaptive radiation, adaptation and reproductive isolation are thought to evolve as a result of divergent natural selection. Accordingly, elucidating the genetic basis of these processes is essential toward understanding the role of selection in shaping biological diversity. In this respect, the number of genes that evolved by selection remains contentious. To address this issue, the pattern of genetic differentiation obtained using 440 AFLP loci was compared with that expected under neutrality in four sympatric pairs of lake whitefish ecotypes that evolved adaptive phenotypic differences associated with the exploitation of distinct ecological niches. On average, 14 loci showed restricted gene flow relative to neutral expectation, suggesting a role of directional selection on their divergence. Among all loci that are most likely under directional selection, six exhibited parallel patterns of divergence, which provided further support for the role of selection in driving their divergence. Overall, these results indicate that only a small proportion of scored AFLP loci (between 1.4% and 3.2%) might be linked to genes implicated in the adaptive radiation of lake whitefish.
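The percentage range quoted in this abstract appears consistent with dividing the two locus counts it reports (6 parallel loci and an average of 14 outlier loci) by the 440 scored AFLP loci. The abstract does not state how the bounds were derived, so the following is only a sketch under that assumption; the function and constant names are illustrative.

```python
# Counts quoted in the abstract. Treating them as the source of the
# 1.4%-3.2% range is an assumption, not something the abstract states.
TOTAL_LOCI = 440
MEAN_OUTLIER_LOCI = 14   # loci showing restricted gene flow, on average
PARALLEL_LOCI = 6        # loci with parallel patterns of divergence


def percent_of_loci(count: int, total: int = TOTAL_LOCI) -> float:
    """Return `count` as a percentage of `total`, rounded to one decimal."""
    return round(100.0 * count / total, 1)


if __name__ == "__main__":
    print(percent_of_loci(PARALLEL_LOCI))
    print(percent_of_loci(MEAN_OUTLIER_LOCI))
```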

Journal ArticleDOI
TL;DR: In this paper, the authors made regression analyses combining data available on these variables and on evolutionary rates, in two well-documented model bacteria, Escherichia coli and Bacillus subtilis.
Abstract: The variation of amino acid substitution rates in proteins depends on several variables. Among these, the protein's expression level, functional category, essentiality, or metabolic costs of its amino acid residues may play an important role. However, the relative importance of each variable has not yet been evaluated in comparative analyses. To this aim, we made regression analyses combining data available on these variables and on evolutionary rates, in two well-documented model bacteria, Escherichia coli and Bacillus subtilis. In both bacteria, the level of expression of the protein in the cell was by far the most important driving force constraining the amino acids substitution rate. Subsequent inclusion in the analysis of the other variables added little further information. Furthermore, when the rates of synonymous substitutions were included in the analysis of the E. coli data, only the variable expression levels remained statistically significant. The rate of nonsynonymous substitution was shown to correlate with expression levels independently of the rate of synonymous substitution. These results suggest an important direct influence of expression levels, or at least codon usage bias for translation optimization, on the rates of nonsynonymous substitutions in bacteria. They also indicate that when a control for this variable is included, essentiality plays no significant role in the rate of protein evolution in bacteria, as is the case in eukaryotes.

Journal ArticleDOI
TL;DR: To better understand the domestication process, the coalescent with recombination is used to simulate bottlenecks under various lengths and population sizes and it is found that demography is unlikely to account for the previously observed positive correlation between nucleotide diversity and the population-recombination parameter in maize.
Abstract: The domestication of maize (Zea mays ssp. mays) from its wild ancestor (Zea mays ssp. parviglumis) led to a loss of genetic diversity both through a population bottleneck and through directional selection at agronomically important genes. In order to discriminate between those effects and to investigate the nature of the domestication bottleneck, we analyzed nucleotide diversity data from 12 chromosome 1 loci in parviglumis. We found an average loss of nucleotide diversity of 38% across genes, but this average was skewed downward by four putatively selected loci (tb1, d8, ts2, and zagl1). To better understand the domestication process, we used the coalescent with recombination to simulate bottlenecks under various lengths and population sizes. For each locus, we determined the likelihood of the observed data using three summary statistics: the number of segregating sites, an estimate of the population recombination parameter, and Tajima's D. Based on the eight neutrally evolving loci, a model with a bottleneck had a significantly higher likelihood than a model without one. The four putatively selected loci had significantly different likelihood optimums than the neutral loci, and this approach confirmed that ts2 and d8 were selected either during domestication or breeding. Overall, the best-fitting models had a bottleneck in which the population size and the bottleneck duration had a ratio of approximately 4 to approximately 5; for example, if the initial domestication event occurred over a 500-year period, the population size was roughly 2,000 to 2,500 individuals. However, this range did vary with the summary statistic used to assess the fit of simulations to data. In this context, Tajima's D performed poorly as a goodness-of-fit statistic, probably because Z. mays ssp. parviglumis has a frequency spectrum that is significantly skewed toward low-frequency variants. Finally, we found that demography is unlikely to account for the previously observed positive correlation between nucleotide diversity and the population-recombination parameter in maize, leaving this observation difficult to interpret.
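The worked example in this abstract (a 500-year bottleneck implying roughly 2,000 to 2,500 individuals) follows from reading the reported ratio as population size divided by bottleneck duration. A minimal sketch of that arithmetic is given here; the function name is hypothetical, and measuring duration in years is taken from the abstract's own example rather than from the full paper.

```python
def bottleneck_population_size(duration_years: float, size_to_duration_ratio: float) -> float:
    """Population size implied by a bottleneck duration and a size/duration ratio.

    Assumes the ratio in the abstract means (population size) / (duration).
    """
    if duration_years <= 0 or size_to_duration_ratio <= 0:
        raise ValueError("duration and ratio must both be positive")
    return size_to_duration_ratio * duration_years


if __name__ == "__main__":
    # Bounds of the ratio quoted in the abstract, applied to its 500-year example.
    for ratio in (4, 5):
        print(ratio, bottleneck_population_size(500, ratio))
```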

Journal ArticleDOI
TL;DR: It is hypothesized that as a substantial fraction of nonsynonymous divergence has been shown to be adaptive, much of the observed expression divergence is likewise adaptive.
Abstract: Sequence divergence scaled by variation within species has been used to infer the action of selection upon individual genes. Applying this approach to expression, we compared whole-genome whole-body RNA levels in 10 heterozygous Drosophila simulans genotypes and a pooled sample of 10 D. melanogaster lines using Affymetrix Genechip. For 972 genes expressed in D. melanogaster, the transcript level was below detection threshold in D. simulans, which may be explained either by sequence divergence between the primers on the chip and the mRNA transcripts or by downregulation of these genes. Out of 6,707 genes that were expressed in both species, transcript level was significantly different between species for 534 genes (at P < 0.001). Genes whose expression is under stabilizing selection should exhibit reduced genetic variation within species and reduced divergence between species. Expression of genes under directional selection in D. simulans should be highly divergent from D. melanogaster, while showing low genetic variation in D. simulans. Finally, the genes with large variation within species but modest divergence between species are candidates for balancing selection. Rapidly diverging, low-polymorphism genes included those involved in reproduction (e.g., Mst 3Ba, 98Cb; Acps 26Aa, 63F; and sperm-specific dynein). Genes with high variation in transcript abundance within species included metallothionein and hairless, both hypothesized to be segregating in nature because of gene-by-environment interactions. Further, we compared expression divergence and DNA substitution rate in 195 genes. Synonymous substitution rate and expression divergences were uncorrelated, whereas there was a significant positive correlation between nonsynonymous substitution rate and expression divergence. We hypothesize that as a substantial fraction of nonsynonymous divergence has been shown to be adaptive, much of the observed expression divergence is likewise adaptive.

Journal ArticleDOI
TL;DR: It is shown that all vertebrate mono-CRD galectins known to date belong to either the F3- or F4- subtype, and a sequence of duplication and divergence events of the different galectin in chordates is proposed.
Abstract: Galectins form a family of structurally related carbohydrate binding proteins (lectins) that have been identified in a large variety of metazoan phyla. They are involved in many biological processes such as morphogenesis, control of cell death, immunological response, and cancer. To elucidate the evolutionary history of galectins and galectin-like proteins in chordates, we have exploited three independent lines of evidence: (i) location of galectin encoding genes (LGALS) in the human genome; (ii) exon-intron organization of galectin encoding genes; and (iii) sequence comparison of carbohydrate recognition domains (CRDs) of chordate galectins. Our results suggest that a duplication of a mono-CRD galectin gene gave rise to an original bi-CRD galectin gene, before or early in chordate evolution. The N-terminal and C-terminal CRDs of this original galectin subsequently diverged into two different subtypes, defined by exon-intron structure (F4-CRD and F3-CRD). We show that all vertebrate mono-CRD galectins known to date belong to either the F3- or F4- subtype. A sequence of duplication and divergence events of the different galectins in chordates is proposed.

Journal ArticleDOI
TL;DR: The analysis of several Drosophila data sets suggests that approximately 25% ± 20% of amino acid substitutions were driven by positive selection in the divergence between D. simulans and D. yakuba.
Abstract: The proportion of amino acid substitutions driven by adaptive evolution can potentially be estimated from polymorphism and divergence data by an extension of the McDonald-Kreitman test. We have developed a maximum-likelihood method to do this and have applied our method to several data sets from three Drosophila species: D. melanogaster, D. simulans, and D. yakuba. The estimated number of adaptive substitutions per codon is not uniformly distributed among genes, but follows a leptokurtic distribution. However, the proportion of amino acid substitutions fixed by adaptive evolution seems to be remarkably constant across the genome (i.e., the proportion of amino acid substitutions that are adaptive appears to be the same in fast-evolving and slow-evolving genes; fast-evolving genes have higher numbers of both adaptive and neutral substitutions). Our estimates do not seem to be significantly biased by selection on synonymous codon use or by the assumption of independence among sites. Nevertheless, an accurate estimate is hampered by the existence of slightly deleterious mutations and variations in effective population size. The analysis of several Drosophila data sets suggests that approximately 25% ± 20% of amino acid substitutions were driven by positive selection in the divergence between D. simulans and D. yakuba.
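For orientation, the quantity being estimated can be illustrated with the simple count-based estimator commonly derived from the McDonald-Kreitman table, alpha = 1 - (Ds * Pn) / (Dn * Ps). This is not the maximum-likelihood method the abstract describes, only a sketch of the underlying idea; the function name and the counts in the example are made up for illustration.

```python
def proportion_adaptive(dn: int, ds: int, pn: int, ps: int) -> float:
    """Count-based estimate of the proportion of adaptive amino acid substitutions.

    dn, ds: nonsynonymous and synonymous divergence (fixed differences)
    pn, ps: nonsynonymous and synonymous polymorphism
    Uses alpha = 1 - (ds * pn) / (dn * ps); this is a simple point estimate,
    not the likelihood method of the paper.
    """
    if dn <= 0 or ps <= 0:
        raise ValueError("dn and ps must be positive for the estimate to be defined")
    return 1.0 - (ds * pn) / (dn * ps)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to show the calculation.
    print(proportion_adaptive(dn=40, ds=60, pn=20, ps=40))
```

Negative values are possible with this estimator; they are usually read as a sign of segregating slightly deleterious mutations, which the abstract names as one obstacle to an accurate estimate.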

Journal ArticleDOI
TL;DR: Direct evidence is provided supporting the suggestion that Central Asia is the location of genetic admixture of the East and the West: the frequency of western Eurasian-specific haplogroups was highest in Uygur and Uzbek samples, and no western Eurasian type was found in Han Chinese samples from the same place.
Abstract: Previous studies have shown that there were extensive genetic admixtures in the Silk Road region. In the present study, we analyzed 252 mtDNAs of five ethnic groups (Uygur, Uzbek, Kazak, Mongolian, and Hui) from Xinjiang Province, China (through which the Silk Road once ran) together with some reported data from the adjacent regions in Central Asia. In a simple way, we classified the mtDNAs into different haplogroups (monophyletic clades in the rooted mtDNA tree) according to the available phylogenetic information and compared their frequencies to show the differences among the matrilineal genetic structures of these populations with different demographic histories. With the exception of eight unassigned M*, N*, and R* mtDNAs, all the mtDNA types identified here belonged to defined subhaplogroups of haplogroups M and N (including R) and consisted of subsets of both the eastern and western Eurasian pools, thus providing direct evidence supporting the suggestion that Central Asia is the location of genetic admixture of the East and the West. Although our samples were from the same geographic location, a decreasing tendency of the western Eurasian-specific haplogroup frequency was observed, with the highest frequency present in Uygur (42.6%) and Uzbek (41.4%) samples, followed by Kazak (30.2%), Mongolian (14.3%), and Hui (6.7%). No western Eurasian type was found in Han Chinese samples from the same place. The frequencies of the eastern Eurasian-specific haplogroups also varied in these samples. Combined with the historical records, ethno-origin, migratory history, and marriage customs might play different roles in shaping the matrilineal genetic structure of different ethnic populations residing in this region.

Journal ArticleDOI
TL;DR: The African-American population shows both a higher level of nucleotide diversity and more negative values of Tajima's D statistic than does a European-American population, and the frequency spectrum of mutations--corrected for levels of polymorphism--is correlated with recombination rate only in European-Americans.
Abstract: Demographic events affect all genes in a genome, whereas natural selection has only local effects. Using publicly available data from 151 loci sequenced in both European-American and African-American populations, we attempt to distinguish the effects of demography and selection. To analyze large sets of population genetic data such as this one, we introduce "Perlymorphism," a Unix-based suite of analysis tools. Our analyses show that the demographic histories of human populations can account for a large proportion of effects on the level and frequency of variation across the genome. The African-American population shows both a higher level of nucleotide diversity and more negative values of Tajima's D statistic than does a European-American population. Using coalescent simulations, we show that the significantly negative values of the D statistic in African-Americans and the positive values in European-Americans are well explained by relatively simple models of population admixture and bottleneck, respectively. Working within these nonequilibrium frameworks, we are still able to show deviations from neutral expectations at a number of loci, including ABO and TRPV6. In addition, we show that the frequency spectrum of mutations--corrected for levels of polymorphism--is correlated with recombination rate only in European-Americans. These results are consistent with repeated selective sweeps in non-African populations, in agreement with recent reports using microsatellite data.
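Since the argument in this abstract turns on the sign of Tajima's D, a short sketch of the standard statistic (Tajima 1989) may help: D compares the mean number of pairwise differences with Watterson's estimate based on the number of segregating sites, scaled by an estimated standard deviation. This is a generic textbook implementation, not code from the "Perlymorphism" suite mentioned above; the function and parameter names are illustrative.

```python
import math


def tajimas_d(n_sequences: int, segregating_sites: int, mean_pairwise_diffs: float) -> float:
    """Tajima's D from sample size, number of segregating sites, and the mean
    number of pairwise differences (all for the same stretch of sequence)."""
    if n_sequences < 4:
        raise ValueError("need at least 4 sequences (the variance term is zero below that)")
    if segregating_sites < 1:
        raise ValueError("D is undefined when there are no segregating sites")

    n, s = n_sequences, segregating_sites
    # Harmonic-type sums over 1..n-1.
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    # Constants from Tajima's variance derivation.
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    watterson = s / a1  # Watterson's estimator for the region
    return (mean_pairwise_diffs - watterson) / math.sqrt(e1 * s + e2 * s * (s - 1))
```

Negative values indicate an excess of low-frequency variants relative to the neutral equilibrium expectation, and positive values indicate an excess of intermediate-frequency variants. The abstract attributes the negative values in African-Americans and the positive values in European-Americans to admixture and bottleneck models, respectively.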

Journal ArticleDOI
TL;DR: A comprehensive sequence alignment and the first phylogenetic analysis of the cation/Ca(2+) exchanger superfamily of 147 sequences are reported, suggesting unique signature motifs of conserved residues that may underlie divergent functional properties.
Abstract: Cation/Ca(2+) exchangers are an essential component of Ca(2+) signaling pathways and function to transport cytosolic Ca(2+) across membranes against its electrochemical gradient by utilizing the downhill gradients of other cation species such as H(+), Na(+), or K(+). The cation/Ca(2+) exchanger superfamily is composed of H(+)/Ca(2+) exchangers and Na(+)/Ca(2+) exchangers, which have been investigated extensively in both plant cells and animal cells. Recently, information from completely sequenced genomes of bacteria, archaea, and eukaryotes has revealed the presence of genes that encode homologues of cation/Ca(2+) exchangers in many organisms in which the role of these exchangers has not been clearly demonstrated. In this study, we report a comprehensive sequence alignment and the first phylogenetic analysis of the cation/Ca(2+) exchanger superfamily of 147 sequences. The results present a framework for structure-function relationships of cation/Ca(2+) exchangers, suggesting unique signature motifs of conserved residues that may underlie divergent functional properties. Construction of a phylogenetic tree with inclusion of cation/Ca(2+) exchangers with known functional properties defines five protein families and the evolutionary relationships between the members. Based on this analysis, the cation/Ca(2+) exchanger superfamily is classified into the YRBG, CAX, NCX, and NCKX families and a newly recognized family, designated CCX. These findings will provide guides for future studies concerning structures, functions, and evolutionary origins of the cation/Ca(2+) exchangers.

Journal ArticleDOI
TL;DR: Analysis of snake venom toxin families provides strong additional evidence that venom evolved once, at the base of the advanced snake radiation, rather than multiple times in different lineages, with these toxins also present in the venoms of the "colubrid" snake families.
Abstract: We analyzed the origin and evolution of snake venom toxin families represented in both viperid and elapid snakes by means of phylogenetic analysis of the amino acid sequences of the toxins and related nonvenom proteins. Out of eight toxin families analyzed, five provided clear evidence of recruitment into the snake venom proteome before the diversification of the advanced snakes (Kunitz-type protease inhibitors, CRISP toxins, galactose-binding lectins, M12B peptidases, nerve growth factor toxins), and one was equivocal (cystatin toxins). In two others (phospholipase A2 and natriuretic toxins), the nonmonophyly of venom toxins demonstrates that presence of these proteins in elapids and viperids results from independent recruitment events. The ANP/BNP natriuretic toxins are likely to be basal, whereas the CNP/BPP toxins are Viperidae only. Similarly, the lectins were recruited twice. In contrast to the basal recruitment of the galactose-binding lectins, the C-type lectins were shown to be Viperidae only, with the α-chains and β-chains resulting from an early duplication event. These results provide strong additional evidence that venom evolved once, at the base of the advanced snake radiation, rather than multiple times in different lineages, with these toxins also present in the venoms of the "colubrid" snake families. Moreover, they provide a first insight into the composition of the earliest ophidian venoms and point the way toward a research program that could elucidate the functional context of the evolution of the snake venom proteome.

Journal ArticleDOI
TL;DR: A systems level view of genome-scale sequence and expression data is employed to examine the interplay between two sources of order, natural selection and physical self-organization, in the evolution of human gene regulation, and a model is proposed for how natural selection could influence gene expression divergence.
Abstract: The role of natural selection in biology is well appreciated. Recently, however, a critical role for physical principles of network self-organization in biological systems has been revealed. Here, we employ a systems level view of genome-scale sequence and expression data to examine the interplay between these two sources of order, natural selection and physical self-organization, in the evolution of human gene regulation. The topology of a human gene coexpression network, derived from tissue-specific expression profiles, shows scale-free properties that imply evolutionary self-organization via preferential node attachment. Genes with numerous coexpressed partners (the hubs of the coexpression network) evolve more slowly on average than genes with fewer coexpressed partners, and genes that are coexpressed show similar rates of evolution. Thus, the strength of selective constraints on gene sequences is affected by the topology of the gene coexpression network. This connection is strong for the coding regions and 3' untranslated regions (UTRs), but the 5' UTRs appear to evolve under a different regime. Surprisingly, we found no connection between the rate of gene sequence divergence and the extent of gene expression profile divergence between human and mouse. This suggests that distinct modes of natural selection might govern sequence versus expression divergence, and we propose a model, based on rapid, adaptation-driven divergence and convergent evolution of gene expression patterns, for how natural selection could influence gene expression divergence.