Showing papers in "Molecular Biology and Evolution in 2013"


Journal ArticleDOI
TL;DR: An advanced version of the Molecular Evolutionary Genetics Analysis (MEGA) software is released. It contains facilities for building sequence alignments, inferring phylogenetic histories, and conducting molecular evolutionary analysis, and it now enables the inference of timetrees by implementing the RelTime method for estimating divergence times for all branching points in a phylogeny.
Abstract: We announce the release of an advanced version of the Molecular Evolutionary Genetics Analysis (MEGA) software, which currently contains facilities for building sequence alignments, inferring phylogenetic histories, and conducting molecular evolutionary analysis. In version 6.0, MEGA now enables the inference of timetrees, as it implements the RelTime method for estimating divergence times for all branching points in a phylogeny. A new Timetree Wizard in MEGA6 facilitates this timetree inference by providing a graphical user interface (GUI) to specify the phylogeny and calibration constraints step-by-step. This version also contains enhanced algorithms to search for the optimal trees under evolutionary criteria and implements a more advanced memory management that can double the size of sequence data sets to which MEGA can be applied. Both GUI and command-line versions of MEGA6 can be downloaded from www.megasoftware.net free of charge.

37,956 citations


Journal ArticleDOI
TL;DR: This version of MAFFT has several new features, including options for adding unaligned sequences into an existing alignment, adjustment of direction in nucleotide alignment, constrained alignment and parallel processing, which were implemented after the previous major update.
Abstract: We report a major update of the MAFFT multiple sequence alignment program. This version has several new features, including options for adding unaligned sequences into an existing alignment, adjustment of direction in nucleotide alignment, constrained alignment and parallel processing, which were implemented after the previous major update. This report shows actual examples to explain how these features work, alone and in combination. Some examples incorrectly aligned by MAFFT are also shown to clarify its limitations. We discuss how to avoid misalignments, and our ongoing efforts to overcome such limitations.

27,771 citations


Journal ArticleDOI
TL;DR: This work proposes an ultrafast bootstrap approximation approach (UFBoot) to compute the support of phylogenetic groups in maximum likelihood (ML) based trees and offers an efficient and easy-to-use software to perform the UFBoot analysis with ML tree inference.
Abstract: Nonparametric bootstrap has been a widely used tool in phylogenetic analysis to assess the clade support of phylogenetic trees. However, with the rapidly growing amount of data, this task remains a computational bottleneck. Recently, approximation methods such as the RAxML rapid bootstrap (RBS) and the Shimodaira-Hasegawa-like approximate likelihood ratio test have been introduced to speed up the bootstrap. Here, we suggest an ultrafast bootstrap approximation approach (UFBoot) to compute the support of phylogenetic groups in maximum likelihood (ML) based trees. To achieve this, we combine the resampling estimated log-likelihood method with a simple but effective collection scheme of candidate trees. We also propose a stopping rule that assesses the convergence of branch support values to automatically determine when to stop collecting candidate trees. UFBoot achieves a median speed up of 3.1 (range: 0.66-33.3) to 10.2 (range: 1.32-41.4) compared with RAxML RBS for real DNA and amino acid alignments, respectively. Moreover, our extensive simulations show that UFBoot is robust against moderate model violations and the support values obtained appear to be relatively unbiased compared with the conservative standard bootstrap. This provides a more direct interpretation of the bootstrap support. We offer an efficient and easy-to-use software (available at http://www.cibiv.at/software/iqtree) to perform the UFBoot analysis with ML tree inference.
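The resampling estimated log-likelihood (RELL) idea named in the abstract can be illustrated with a minimal sketch. This is not the UFBoot implementation: it assumes a fixed set of candidate trees with precomputed per-site log-likelihoods, reports support per whole tree rather than per split, and the function name `rell_support` and its parameters are hypothetical.

```python
import random


def rell_support(site_lnl, n_replicates=1000, seed=0):
    """Plain RELL over a fixed set of candidate trees (illustrative only).

    site_lnl maps a tree label to its list of per-site log-likelihoods;
    all lists must have the same length (one entry per alignment column).
    Returns the fraction of replicates in which each tree scores best.
    """
    rng = random.Random(seed)
    names = list(site_lnl)
    n_sites = len(site_lnl[names[0]])
    if any(len(site_lnl[name]) != n_sites for name in names):
        raise ValueError("all trees need the same number of sites")

    wins = {name: 0 for name in names}
    for _ in range(n_replicates):
        # Resample alignment columns with replacement.
        cols = [rng.randrange(n_sites) for _ in range(n_sites)]
        # Re-sum the stored per-site values; nothing is re-optimized.
        totals = {name: sum(site_lnl[name][c] for c in cols) for name in names}
        best = max(names, key=lambda name: totals[name])
        wins[best] += 1
    return {name: wins[name] / n_replicates for name in names}
```

Because each replicate only re-sums stored values, the cost per replicate is linear in the number of sites and trees, which is what makes this kind of approximation fast compared with re-running a full tree search on every bootstrap sample.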

2,469 citations


Journal ArticleDOI
TL;DR: A step-by-step protocol is presented in sufficient detail to allow a novice to start with a sequence of interest and to build a publication-quality tree illustrating the evolution of an appropriate set of homologs of that sequence.
Abstract: Phylogenetic analysis is sometimes regarded as being an intimidating, complex process that requires expertise and years of experience. In fact, it is a fairly straightforward process that can be learned quickly and applied effectively. This Protocol describes the several steps required to produce a phylogenetic tree from molecular data for novices. In the example illustrated here, the program MEGA is used to implement all those steps, thereby eliminating the need to learn several programs, and to deal with multiple file formats from one step to another (Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S. 2011. MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol Biol Evol. 28:2731‐2739). The first step, identification of a set of homologous sequences and downloading those sequences, is implemented by MEGA’s own browser built on top of the Google Chrome toolkit. For the second step, alignment of those sequences, MEGA offers two different algorithms: ClustalW and MUSCLE. For the third step, construction of a phylogenetic tree from the aligned sequences, MEGA offers many different methods. Here we illustrate the maximum likelihood method, beginning with MEGA’s Models feature, which permits selecting the most suitable substitution model. Finally, MEGA provides a powerful and flexible interface for the final step, actually drawing the tree for publication. Here a step-by-step protocol is presented in sufficient detail to allow a novice to start with a sequence of interest and to build a publication-quality tree illustrating the evolution of an appropriate set of homologs of that sequence. MEGA is available for use on PCs and Macs from www.megasoftware.net.

1,057 citations


Journal ArticleDOI
Xuhua Xia
TL;DR: Since its first release in 2001 as mainly a software package for phylogenetic analysis, data analysis for molecular biology and evolution (DAMBE) has gained many new functions that may be classified into six categories.
Abstract: Since its first release in 2001 as mainly a software package for phylogenetic analysis, data analysis for molecular biology and evolution (DAMBE) has gained many new functions that may be classified into six categories: 1) sequence retrieval, editing, manipulation, and conversion among more than 20 standard sequence formats including MEGA, NEXUS, PHYLIP, GenBank, and the new NeXML format for interoperability, 2) motif characterization and discovery functions such as position weight matrix and Gibbs sampler, 3) descriptive genomic analysis tools with improved versions of codon adaptation index, effective number of codons, protein isoelectric point profiling, RNA and protein secondary structure prediction and calculation of minimum folding energy, and genomic skew plots with optimized window size, 4) molecular phylogenetics including sequence alignment, testing substitution saturation, distance-based, maximum parsimony, and maximum-likelihood methods for tree reconstructions, testing the molecular clock hypothesis with either a phylogeny or with relative-rate tests, dating gene duplication and speciation events, choosing the best-fit substitution models, and estimating rate heterogeneity over sites, 5) phylogeny-based comparative methods for continuous and discrete variables, and 6) graphic functions including secondary structure display, optimized skew plot, hydrophobicity plot, and many other plots of amino acid properties along a protein sequence, tree display and drawing by dragging nodes to each other, and visual searching of the maximum parsimony tree. DAMBE features a graphic, user-friendly, and intuitive interface and is freely available from http://dambe.bio.uottawa.ca (last accessed April 16, 2013).

989 citations


Journal ArticleDOI
TL;DR: This work presents an approximate hierarchical Bayesian method using a Markov chain Monte Carlo (MCMC) routine that ensures robustness against model misspecification by averaging over a large number of predefined site classes, and leaves the distribution of selection parameters essentially unconstrained.
Abstract: Model-based analyses of natural selection often categorize sites into a relatively small number of site classes. Forcing each site to belong to one of these classes places unrealistic constraints on the distribution of selection parameters, which can result in misleading inference due to model misspecification. We present an approximate hierarchical Bayesian method using a Markov chain Monte Carlo (MCMC) routine that ensures robustness against model misspecification by averaging over a large number of predefined site classes. This leaves the distribution of selection parameters essentially unconstrained, and also allows sites experiencing positive and purifying selection to be identified orders of magnitude faster than by existing methods. We demonstrate that popular random effects likelihood methods can produce misleading results when sites assigned to the same site class experience different levels of positive or purifying selection—an unavoidable scenario when using a small number of site classes. Our Fast Unconstrained Bayesian AppRoximation (FUBAR) is unaffected by this problem, while achieving higher power than existing unconstrained (fixed effects likelihood) methods. The speed advantage of FUBAR allows us to analyze larger data sets than other methods: We illustrate this on a large influenza hemagglutinin data set (3,142 sequences). FUBAR is available as a batch file within the latest HyPhy distribution (http://www.hyphy.org), as well as on the Datamonkey web server (http://www.datamonkey.org/).

939 citations


Journal ArticleDOI
TL;DR: New algorithms based on population genetics, ecological modeling, and statistical learning techniques are proposed to screen genomes for signatures of local adaptation and demonstrate that LFMM can efficiently estimate random effects due to population history and isolation-by-distance patterns when computing gene-environment correlations.
Abstract: Adaptation to local environments often occurs through natural selection acting on a large number of loci, each having a weak phenotypic effect. One way to detect these loci is to identify genetic polymorphisms that exhibit high correlation with environmental variables used as proxies for ecological pressures. Here, we propose new algorithms based on population genetics, ecological modeling, and statistical learning techniques to screen genomes for signatures of local adaptation. Implemented in the computer program “latent factor mixed model” (LFMM), these algorithms employ an approach in which population structure is introduced using unobserved variables. These fast and computationally efficient algorithms detect correlations between environmental and genetic variation while simultaneously inferring background levels of population structure. Comparing these new algorithms with related methods provides evidence that LFMM can efficiently estimate random effects due to population history and isolation-by-distance patterns when computing gene-environment correlations, and decrease the number of false-positive associations in genome scans. We then apply these models to plant and human genetic data, identifying several genes with functions related to development that exhibit strong correlations with climatic gradients.

605 citations


Journal ArticleDOI
TL;DR: MitoFish contains re-annotations of previously sequenced fish mitogenomes, enabling researchers to refer to them when they find annotations that are likely to be erroneous or while conducting comparative mitogenomic analyses.
Abstract: Mitofish is a database of fish mitochondrial genomes (mitogenomes) that includes powerful and precise de novo annotations for mitogenome sequences. Fish occupy an important position in the evolution of vertebrates and the ecology of the hydrosphere, and mitogenomic sequence data have served as a rich source of information for resolving fish phylogenies and identifying new fish species. The importance of a mitogenomic database continues to grow at a rapid pace as massive amounts of mitogenomic data are generated with the advent of new sequencing technologies. A severe bottleneck seems likely to occur with regard to mitogenome annotation because of the overwhelming pace of data accumulation and the intrinsic difficulties in annotating sequences with degenerating transfer RNA structures, divergent start/stop codons of the coding elements, and the overlapping of adjacent elements. To ease this data backlog, we developed an annotation pipeline named MitoAnnotator. MitoAnnotator automatically annotates a fish mitogenome with a high degree of accuracy in approximately 5 min; thus, it is readily applicable to data sets of dozens of sequences. MitoFish also contains re-annotations of previously sequenced fish mitogenomes, enabling researchers to refer to them when they find annotations that are likely to be erroneous or while conducting comparative mitogenomic analyses. For users who need more information on the taxonomy, habitats, phenotypes, or life cycles of fish, MitoFish provides links to related databases. MitoFish and MitoAnnotator are freely available at http://mitofish.aori.u-tokyo.ac.jp/ (last accessed August 28, 2013); all of the data can be batch downloaded, and the annotation pipeline can be used via a web interface.

590 citations


Journal ArticleDOI
TL;DR: The results show that errors in genome annotation do lead to higher inferred rates of gene gain and loss but that CAFE 3 sufficiently accounts for these errors to provide accurate estimates of important evolutionary parameters.
Abstract: Current sequencing methods produce large amounts of data, but genome assemblies constructed from these data are often fragmented and incomplete. Incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. This means that methods attempting to estimate rates of gene duplication and loss often will be misled by such errors and that rates of gene family evolution will be consistently overestimated. Here, we present a method that takes these errors into account, allowing one to accurately infer rates of gene gain and loss among genomes even with low assembly and annotation quality. The method is implemented in the newest version of the software package CAFE, along with several other novel features. We demonstrate the accuracy of the method with extensive simulations and reanalyze several previously published data sets. Our results show that errors in genome annotation do lead to higher inferred rates of gene gain and loss but that CAFE 3 sufficiently accounts for these errors to provide accurate estimates of important evolutionary parameters.

572 citations


Journal ArticleDOI
TL;DR: Two upgrades to the Bayesian Analysis of Population Structure (BAPS) software are introduced, which enable 1) spatially explicit modeling of variation in DNA sequences and 2) hierarchical clustering of DNA sequence data to reveal nested genetic population structures.
Abstract: Phylogeographical analyses have become commonplace for a myriad of organisms with the advent of cheap DNA sequencing technologies. Bayesian model-based clustering is a powerful tool for detecting important patterns in such data and can be used to decipher even quite subtle signals of systematic differences in molecular variation. Here, we introduce two upgrades to the Bayesian Analysis of Population Structure (BAPS) software, which enable 1) spatially explicit modeling of variation in DNA sequences and 2) hierarchical clustering of DNA sequence data to reveal nested genetic population structures. We provide a direct interface to map the results from spatial clustering with Google Maps using the portal http://www.spatialepidemiology.net/ and illustrate this approach using sequence data from Borrelia burgdorferi. The usefulness of hierarchical clustering is demonstrated through an analysis of the metapopulation structure within a bacterial population experiencing a high level of local horizontal gene transfer. The tools that are introduced are freely available at http://www.helsinki.fi/bsg/software/BAPS/.

507 citations


Journal ArticleDOI
TL;DR: A generalization of the Gaussian Markov random field (GMRF) model is presented that allows for the analysis of multilocus sequence data and improves recovery of true population trajectories and the time to the most recent common ancestor (TMRCA).
Abstract: Effective population size is fundamental in population genetics and characterizes genetic diversity. To infer past population dynamics from molecular sequence data, coalescent-based models have been developed for Bayesian nonparametric estimation of effective population size over time. Among the most successful is a Gaussian Markov random field (GMRF) model for a single gene locus. Here, we present a generalization of the GMRF model that allows for the analysis of multilocus sequence data. Using simulated data, we demonstrate the improved performance of our method to recover true population trajectories and the time to the most recent common ancestor (TMRCA). We analyze a multilocus alignment of HIV-1 CRF02_AG gene sequences sampled from Cameroon. Our results are consistent with HIV prevalence data and uncover some aspects of the population history that go undetected in Bayesian parametric estimation. Finally, we recover an older and more reconcilable TMRCA for a classic ancient DNA data set.

Journal ArticleDOI
TL;DR: It is shown that an increase of sample size results in more precise detection of positive selection and the ability to analyze substantially larger sample sizes by using SweeD leads to more accurate sweep detection.
Abstract: The advent of modern DNA sequencing technology is the driving force in obtaining complete intra-specific genomes that can be used to detect loci that have been subject to positive selection in the recent past. Based on selective sweep theory, beneficial loci can be detected by examining the single nucleotide polymorphism patterns in intraspecific genome alignments. In the last decade, a plethora of algorithms for identifying selective sweeps have been developed. However, the majority of these algorithms have not been designed for analyzing whole-genome data. We present SweeD (Sweep Detector), an open-source tool for the rapid detection of selective sweeps in whole genomes. It analyzes site frequency spectra and represents a substantial extension of the widely used SweepFinder program. The sequential version of SweeD is up to 22 times faster than SweepFinder and, more importantly, is able to analyze thousands of sequences. We also provide a parallel implementation of SweeD for multi-core processors. Furthermore, we implemented a checkpointing mechanism that allows to deploy SweeD on cluster systems with queue execution time restrictions, as well as to resume long-running analyses after processor failures. In addition, the user can specify various demographic models via the command-line to calculate their theoretically expected site frequency spectra. Therefore, (in contrast to SweepFinder) the neutral site frequencies can optionally be directly calculated from a given demographic model. We show that an increase of sample size results in more precise detection of positive selection. Thus, the ability to analyze substantially larger sample sizes by using SweeD leads to more accurate sweep detection. We validate SweeD via simulations and by scanning the first chromosome from the 1000 human Genomes project for selective sweeps. We compare SweeD results with results from a linkage-disequilibrium-based approach and identify common outliers.

Journal ArticleDOI
TL;DR: These analyses demonstrate that missing data perturb phylogenetic inference slightly beyond the expected decrease in resolving power, and confirm that including incomplete yet short-branch taxa can help to eschew artifacts, as predicted by simulations.
Abstract: Progress in sequencing technology allows researchers to assemble ever-larger supermatrices for phylogenomic inference. However, current phylogenomic studies often rest on patchy data sets, with some having 80% missing (or ambiguous) data or more. Though early simulations had suggested that missing data per se do not harm phylogenetic inference when using sufficiently large data sets, Lemmon et al. (Lemmon AR, Brown JM, Stanger-Hall K, Lemmon EM. 2009. The effect of ambiguous data on phylogenetic estimates obtained by maximum likelihood and Bayesian inference. Syst Biol. 58:130-145.) have recently cast doubt on this consensus in a study based on the introduction of parsimony-uninformative incomplete characters. In this work, we empirically reassess the issue of missing data in phylogenomics while exploring possible interactions with the model of sequence evolution. First, we note that parsimony-uninformative incomplete characters are actually informative in a probabilistic framework. A reanalysis of Lemmon's data set with this in mind gives a very different interpretation of their results and shows that some of their conclusions may be unfounded. Second, we investigate the effect of the progressive introduction of missing data in a complete supermatrix (126 genes × 39 species) capable of resolving animal relationships. These analyses demonstrate that missing data perturb phylogenetic inference slightly beyond the expected decrease in resolving power. In particular, they exacerbate systematic errors by reducing the number of species effectively available for the detection of multiple substitutions. Consequently, large sparse supermatrices are more sensitive to phylogenetic artifacts than smaller but less incomplete data sets, which argue for experimental designs aimed at collecting a modest number (~50) of highly covered genes. 
Our results further confirm that including incomplete yet short-branch taxa (i.e., slowly evolving species or close outgroups) can help to eschew artifacts, as predicted by simulations. Finally, it appears that selecting an adequate model of sequence evolution (e.g., the site-heterogeneous CAT model instead of the site-homogeneous WAG model) is more beneficial to phylogenetic accuracy than reducing the level of missing data.

Journal ArticleDOI
TL;DR: This note presents pamlX, a graphical user interface/front end for the paml (Phylogenetic Analysis by Maximum Likelihood) program package.
Abstract: This note announces pamlX, a graphical user interface/front end for the paml (for Phylogenetic Analysis by Maximum Likelihood) program package (Yang Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci. 13:555-556; Yang Z. 2007. PAML 4: Phylogenetic analysis by maximum likelihood. Mol Biol Evol. 24:1586-1591). pamlX is written in C++ using the Qt library and communicates with paml programs through files. It can be used to create, edit, and print control files for paml programs and to launch paml runs. The interface is available for free download at http://abacus.gene.ucl.ac.uk/software/paml.html.

Journal ArticleDOI
TL;DR: This work provides a user-friendly and effective tool to perform homology searches with operons or gene clusters as basic units, instead of single genes, to get a better understanding of the function, evolutionary history, and practical applications of such genomic regions.
Abstract: The genes encoding many biomolecular systems and pathways are genomically organized in operons or gene clusters. With MultiGeneBlast, we provide a user-friendly and effective tool to perform homology searches with operons or gene clusters as basic units, instead of single genes. The contextualization offered by MultiGeneBlast allows users to get a better understanding of the function, evolutionary history, and practical applications of such genomic regions. The tool is fully equipped with applications to generate search databases from GenBank or from the user's own sequence data. Finally, an architecture search mode allows searching for gene clusters with novel configurations, by detecting genomic regions with any user-specified combination of genes. Sources, precompiled binaries, and a graphical tutorial of MultiGeneBlast are freely available from http://multigeneblast.sourceforge.net/.

Journal ArticleDOI
TL;DR: In this paper, the authors collected nine new 454 transcriptome data sets from Ostracoda, an ancient and diverse group with a dense fossil record that is often undersampled in broader studies.
Abstract: An ambitious, yet fundamental goal for comparative biology is to understand the evolutionary relationships for all of life. However, many important taxonomic groups have remained recalcitrant to inclusion into broader scale studies. Here, we focus on collection of 9 new 454 transcriptome data sets from Ostracoda, an ancient and diverse group with a dense fossil record, which is often undersampled in broader studies. We combine the new transcriptomes with a new morphological matrix (including fossils) and existing expressed sequence tag, mitochondrial genome, nuclear genome, and ribosomal DNA data. Our analyses lead to new insights into ostracod and pancrustacean phylogeny. We obtained support for three epic pancrustacean clades that likely originated in the Cambrian: Oligostraca (Ostracoda, Mystacocarida, Branchiura, and Pentastomida); Multicrustacea (Copepoda, Malacostraca, and Thecostraca); and a clade we refer to as Allotriocarida (Hexapoda, Remipedia, Cephalocarida, and Branchiopoda). Within the Oligostraca clade, our results support the unresolved question of ostracod monophyly. Within Multicrustacea, we find support for Thecostraca plus Copepoda, for which we suggest the name Hexanauplia. Within Allotriocarida, some analyses support the hypothesis that Remipedia is the sister taxon to Hexapoda, but others support Branchiopoda+ Cephalocarida as the sister group of hexapods. In multiple different analyses, we see better support for equivocal nodes using slow-evolving genes or when excluding distant outgroups, highlighting the increased importance of conditional data combination in this age of abundant, often anonymous data. However, when we analyze the same set of species and ignore rate of gene evolution, we find higher support when including all data, more in line with a “total evidence” philosophy. 
By concatenating molecular and morphological data, we place pancrustacean fossils in the phylogeny, which can be used for studies of divergence times in Pancrustacea, Arthropoda, or Metazoa. Our results and new data will allow for attributes of Ostracoda, such as its amazing fossil record and diverse biology, to be leveraged in broader scale comparative studies. Further, we illustrate how adding extensive next-generation sequence data from understudied groups can yield important new phylogenetic insights into long-standing questions, especially when carefully analyzed in combination with other data.

Journal ArticleDOI
TL;DR: The first genetic signatures of early domestic pigs in the Near Eastern Neolithic core zone are revealed and it is demonstrated that these early pigs differed genetically from those in western Anatolia that were introduced to Europe during the Neolithic expansion.
Abstract: Zooarcheological evidence suggests that pigs were domesticated in Southwest Asia ~8,500 BC. They then spread across the Middle and Near East and westward into Europe alongside early agriculturalists. European pigs were either domesticated independently or more likely appeared so as a result of admixture between introduced pigs and European wild boar. As a result, European wild boar mtDNA lineages replaced Near Eastern/Anatolian mtDNA signatures in Europe and subsequently replaced indigenous domestic pig lineages in Anatolia. The specific details of these processes, however, remain unknown. To address questions related to early pig domestication, dispersal, and turnover in the Near East, we analyzed ancient mitochondrial DNA and dental geometric morphometric variation in 393 ancient pig specimens representing 48 archeological sites (from the Pre-Pottery Neolithic to the Medieval period) from Armenia, Cyprus, Georgia, Iran, Syria, and Turkey. Our results reveal the first genetic signatures of early domestic pigs in the Near Eastern Neolithic core zone. We also demonstrate that these early pigs differed genetically from those in western Anatolia that were introduced to Europe during the Neolithic expansion. In addition, we present a significantly more refined chronology for the introduction of European domestic pigs into Asia Minor that took place during the Bronze Age, at least 900 years earlier than previously detected. By the 5th century AD, European signatures completely replaced the endemic lineages possibly coinciding with the widespread demographic and societal changes that occurred during the Anatolian Bronze and Iron Ages.

Journal ArticleDOI
TL;DR: The results show that past interactions with pathogens have elicited widespread and coordinated genomic responses, and suggest that adaptation to pathogens can be considered as a primary example of polygenic selection.
Abstract: Most approaches aiming at finding genes involved in adaptive events have focused on the detection of outlier loci, which resulted in the discovery of individually “significant” genes with strong effects. However, a collection of small effect mutations could have a large effect on a given biological pathway that includes many genes, and such a polygenic mode of adaptation has not been systematically investigated in humans. We propose here to evidence polygenic selection by detecting signals of adaptation at the pathway or gene set level instead of analyzing single independent genes. Using a gene-set enrichment test to identify genome-wide signals of adaptation among human populations, we find that most pathways globally enriched for signals of positive selection are either directly or indirectly involved in immune response. We also find evidence for long-distance genotypic linkage disequilibrium, suggesting functional epistatic interactions between members of the same pathway. Our results show that past interactions with pathogens have elicited widespread and coordinated genomic responses, and suggest that adaptation to pathogens can be considered as a primary example of polygenic selection.

Journal ArticleDOI
TL;DR: It is found that the derived allele of this site is less efficient than the ancestral allele in activating transcription from a reporter construct, and is a plausible candidate for having caused a recent selective sweep in the FOXP2 gene.
Abstract: The FOXP2 gene is required for normal development of speech and language. By isolating and sequencing FOXP2 genomic DNA fragments from a 49,000-year-old Iberian Neandertal and 50 present-day humans, we have identified substitutions in the gene shared by all or nearly all present-day humans but absent or polymorphic in Neandertals. One such substitution is localized in intron 8 and affects a binding site for the transcription factor POU3F2, which is highly conserved among vertebrates. We find that the derived allele of this site is less efficient than the ancestral allele in activating transcription from a reporter construct. The derived allele also binds less POU3F2 dimers than POU3F2 monomers compared with the ancestral allele. Because the substitution in the POU3F2 binding site is likely to alter the regulation of FOXP2 expression, and because it is localized in a region of the gene associated with a previously described signal of positive selection, it is a plausible candidate for having caused a recent selective sweep in the FOXP2 gene.

Journal ArticleDOI
TL;DR: In this article, the authors analyzed the phylogeny of key conjugation proteins to infer the evolutionary history of conjugation and type IV secretion systems (T4SS) and showed that single-stranded DNA (ssDNA) and double-stranded DNA (dsDNA) conjugation, while both based on a key AAA+ ATPase, diverged before the last common ancestor of bacteria.
Abstract: Genetic exchange by conjugation is responsible for the spread of resistance, virulence, and social traits among prokaryotes. Recent works unraveled the functioning of the underlying type IV secretion systems (T4SS) and their distribution and recruitment for other biological processes (exaptation), notably pathogenesis. We analyzed the phylogeny of key conjugation proteins to infer the evolutionary history of conjugation and T4SS. We show that single-stranded DNA (ssDNA) and double-stranded DNA (dsDNA) conjugation, while both based on a key AAA+ ATPase, diverged before the last common ancestor of bacteria. The two key ATPases of ssDNA conjugation are monophyletic, having diverged at an early stage from dsDNA translocases. Our data suggest that ssDNA conjugation arose first in diderm bacteria, possibly Proteobacteria, and then spread to other bacterial phyla, including bacterial monoderms and Archaea. Identifiable T4SS fall within the eight monophyletic groups, determined by both taxonomy and structure of the cell envelope. Transfer to monoderms might have occurred only once, but followed diverse adaptive paths. Remarkably, some Firmicutes developed a new conjugation system based on an atypical relaxase and an ATPase derived from a dsDNA translocase. The observed evolutionary rates and patterns of presence/absence of specific T4SS proteins show that conjugation systems are often and independently exapted for other functions. This work brings a natural basis for the classification of all kinds of conjugative systems, thus tackling a problem that is growing as fast as genomic databases. Our analysis provides the first global picture of the evolution of conjugation and shows how a self-transferrable complex multiprotein system has adapted to different taxa and often been recruited by the host.
As conjugation systems became specific to certain clades and cell envelopes, they may have biased the rate and direction of gene transfer by conjugation within prokaryotes.

Journal ArticleDOI
TL;DR: The view that Ethiopian, Andean, and Tibetan populations living at high altitude have adapted to hypoxia differently is supported, with convergent evolution affecting different genes from the same pathway.
Abstract: The Tibetan and Andean Plateaus and Ethiopian highlands are the largest regions to have long-term high-altitude residents. Such populations are exposed to lower barometric pressures and hence atmospheric partial pressures of oxygen. Such "hypobaric hypoxia" may limit physical functional capacity, reproductive health, and even survival. As such, selection of genetic variants advantageous to hypoxic adaptation is likely to have occurred. Identifying signatures of such selection is likely to help understanding of hypoxic adaptive processes. Here, we seek evidence of such positive selection using five Ethiopian populations, three of which are from high-altitude areas in Ethiopia. As these populations may have been recipients of Eurasian gene flow, we correct for this admixture. Using single-nucleotide polymorphism genotype data from multiple populations, we find the strongest signal of selection in BHLHE41 (also known as DEC2 or SHARP1). Remarkably, a major role of this gene is regulation of the same hypoxia response pathway on which selection has most strikingly been observed in both Tibetan and Andean populations. Because it is also an important player in the circadian rhythm pathway, BHLHE41 might also provide insights into the mechanisms underlying the recognized impacts of hypoxia on the circadian clock. These results support the view that Ethiopian, Andean, and Tibetan populations living at high altitude have adapted to hypoxia differently, with convergent evolution affecting different genes from the same pathway.

Journal ArticleDOI
TL;DR: An efficient method for sequencing anuran mitochondrial DNAs by amplifying the mitochondrial genome in 12 overlapping fragments using frog-specific universal primer sets is developed and mtDNA performs well for both phylogenetic and divergence time inferences and will provide important reference hypotheses for the phylogeny and evolution of frogs.
Abstract: Anura (frogs and toads) constitute over 88% of living amphibian diversity but many important questions about their phylogeny and evolution remain unresolved. For this study, we developed an efficient method for sequencing anuran mitochondrial DNAs (mtDNAs) by amplifying the mitochondrial genome in 12 overlapping fragments using frog-specific universal primer sets. Based on this method, we generated 47 nearly complete, new anuran mitochondrial genomes and discovered nine novel gene arrangements. By combining the new data and published anuran mitochondrial genomes, we assembled a large mitogenomic data set (11,007 nt) including 90 frog species, representing 39 of 53 recognized anuran families, to investigate their phylogenetic relationships and evolutionary history. The resulting tree strongly supported a paraphyletic arrangement of archaeobatrachian (=nonneobatrachian) frogs, with Leiopelmatoidea branching first, followed by Discoglossoidea, Pipoidea, and Pelobatoidea. Within Neobatrachia, the South African Heleophrynidae is the sister-taxon to all other neobatrachian frogs and the Seychelles-endemic Sooglossidae is recovered as the sister-taxon to Ranoidea. These phylogenetic relationships agree with many nuclear gene studies. The chronogram derived from two Bayesian relaxed clock methods (MultiDivTime and BEAST) suggests that modern frogs (Anura) originated in the early Triassic about 244 Ma and the appearance of Neobatrachia took place in the late Jurassic about 163 Ma. The initial diversifications of two species-rich superfamilies Hyloidea and Ranoidea commenced 110 and 133 Ma, respectively. These times are older than some other estimates by approximately 30–40 My. Compared with nuclear data, mtDNA produces compatible time estimates for deep nodes (>150 Ma), but apparently older estimates for more shallow nodes.
Our study shows that, although it evolves relatively rapidly and behaves much as a single locus, mtDNA performs well for both phylogenetic and divergence time inferences and will provide important reference hypotheses for the phylogeny and evolution of frogs.

Journal ArticleDOI
TL;DR: Analysis of the core genome suggested that among 73 genes present in all isolates of S. mutans but absent in other species of the mutans taxonomic group, the majority can be associated with metabolic processes that could have contributed to the successful adaptation of the species to its new niche, the human mouth, and with the dietary changes that accompanied the onset of human agriculture.
Abstract: Streptococcus mutans is widely recognized as one of the key etiological agents of human dental caries. Despite its role in this important disease, our present knowledge of gene content variability across the species and its relationship to adaptation is minimal. Estimates of its demographic history are not available. In this study, we generated genome sequences of 57 S. mutans isolates, as well as representative strains of the most closely related species to S. mutans (S. ratti, S. macaccae, and S. criceti), to identify the overall structure and potential adaptive features of the dispensable and core components of the genome. We also performed population genetic analyses on the core genome of the species aimed at understanding the demographic history, and impact of selection shaping its genetic variation. The maximum gene content divergence among strains was approximately 23%, with the majority of strains diverging by 5–15%. The core genome consisted of 1,490 genes and the pan-genome approximately 3,296. Maximum likelihood analysis of the synonymous site frequency spectrum (SFS) suggested that the S. mutans population started expanding exponentially approximately 10,000 years ago (95% confidence interval [CI]: 3,268–14,344 years ago), coincidental with the onset of human agriculture. Analysis of the replacement SFS indicated that a majority of these substitutions are under strong negative selection, and the remainder evolved neutrally. A set of 14 genes was identified as being under positive selection, most of which were involved in either sugar metabolism or acid tolerance. Analysis of the core genome suggested that among 73 genes present in all isolates of S. mutans but absent in other species of the mutans taxonomic group, the majority can be associated with metabolic processes that could have contributed to the successful adaptation of S. mutans to its new niche, the human mouth, and with the dietary changes that accompanied the origin of agriculture.
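The core genome and pan-genome mentioned in this abstract can be illustrated with simple set operations. The following is a minimal sketch, assuming each isolate is represented as a set of gene identifiers; the isolate names, gene names, and the divergence measure (fraction of genes not shared by a pair of isolates) are hypothetical illustrations and are not taken from the paper, whose exact definitions are not given in the abstract.

```python
from typing import Dict, Set, Tuple


def core_and_pan_genome(gene_content: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (core, pan): genes present in every isolate, and genes present in any isolate."""
    genomes = list(gene_content.values())
    if not genomes:
        raise ValueError("at least one isolate is required")
    core = set.intersection(*genomes)
    pan = set.union(*genomes)
    return core, pan


def gene_content_divergence(a: Set[str], b: Set[str]) -> float:
    """Fraction of genes found in only one of two isolates (an assumed measure)."""
    union = a | b
    if not union:
        return 0.0
    return len(a ^ b) / len(union)


# Hypothetical toy data: three isolates with made-up gene identifiers.
isolates = {
    "isolate_A": {"g1", "g2", "g3", "g4"},
    "isolate_B": {"g1", "g2", "g3", "g5"},
    "isolate_C": {"g1", "g2", "g6"},
}

core, pan = core_and_pan_genome(isolates)
print(sorted(core))  # genes shared by all isolates
print(len(pan))      # total number of distinct genes
print(gene_content_divergence(isolates["isolate_A"], isolates["isolate_B"]))
```

In this toy example the core genome is the intersection of all gene sets and the pan-genome is their union; real analyses first need orthology clustering to decide which genes count as "the same" across isolates.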

Journal ArticleDOI
TL;DR: Gene duplications, domain rearrangement, and post-transcriptional regulation have enabled a subtle control of auxin signaling through ARF proteins that may have contributed to the critical importance of these regulators in plant development and evolution.
Abstract: Auxin response factors (ARF) are key players in plant development. They mediate the cellular response to the plant hormone auxin by activating or repressing the expression of downstream developmental genes. The pivotal activation function of ARF proteins is enabled by their four-domain architecture, which includes both DNA-binding and protein dimerization motifs. To determine the evolutionary origin of this characteristic architecture, we built a comprehensive data set of 224 ARF-related protein sequences that represents all major living divisions of land plants, except hornworts. We found that ARFs are split into three subfamilies that could be traced back to the origin of the land plants. We also show that repeated events of extensive gene duplication contributed to the expansion of those three original subfamilies. Further examination of our data set uncovered a broad diversity in the structure of ARF transcripts and allowed us to identify an additional conserved motif in ARF proteins. We found that additional structural diversity in ARF proteins is mainly generated by two mechanisms: genomic truncation and alternative splicing. We propose that the loss of domains from the canonical, four-domain ARF structure has promoted functional shifts within the ARF family by disrupting either dimerization or DNA-binding capabilities. For instance, the loss of dimerization domains in some ARFs from moss and spikemoss genomes leads to proteins that are reminiscent of Aux/IAA proteins, possibly providing a clue on the evolution of these modulators of ARF function. We also assessed the functional impact of alternative splicing in the case of ARF4, for which we have identified a novel isoform in Arabidopsis thaliana. Genetic analysis showed that these two transcripts exhibit markedly different developmental roles in A. thaliana. 
Gene duplications, domain rearrangement, and post-transcriptional regulation have thus enabled a subtle control of auxin signaling through ARF proteins that may have contributed to the critical importance of these regulators in plant development and evolution.

Journal ArticleDOI
TL;DR: The second major release of the Bio++ libraries is presented, which provides an extended set of classes and methods that notably provide built-in access to sequence databases and new data structures for handling and manipulating sequences from the omics era, such as multiple genome alignments and sequencing reads libraries.
Abstract: Efficient algorithms and programs for the analysis of the ever-growing amount of biological sequence data are strongly needed in the genomics era. The pace at which new data and methodologies are generated calls for the use of pre-existing, optimized, yet extensible, code, typically distributed as libraries or packages. This motivated the Bio++ project, aiming at developing a set of C++ libraries for sequence analysis, phylogenetics, population genetics, and molecular evolution. The main attractiveness of Bio++ is the extensibility and reusability of its components through its object-oriented design, without compromising the computer-efficiency of the underlying methods. We present here the second major release of the libraries, which provides an extended set of classes and methods. These extensions notably provide built-in access to sequence databases and new data structures for handling and manipulating sequences from the omics era, such as multiple genome alignments and sequencing reads libraries. More complex models of sequence evolution, such as mixture models and generic n-tuples alphabets, are also included.

Journal ArticleDOI
TL;DR: The history and degree of isolation of two cryptic and partially sympatric model species are clarified and a methodological framework to investigate genome-wide heterogeneity (GWH) at various stages of the speciation process is provided.
Abstract: Inferring a realistic demographic model from genetic data is an important challenge to gain insights into the historical events during the speciation process and to detect molecular signatures of selection along genomes. Recent advances in divergence population genetics have reported that speciation in the face of gene flow occurred more frequently than theoretically expected, but the approaches used did not account for genome-wide heterogeneity (GWH) in introgression rates. Here, we investigate the impact of GWH on the inference of divergence with gene flow between two cryptic species of the marine model Ciona intestinalis by analyzing polymorphism and divergence patterns in 852 protein-coding sequence loci. These morphologically similar entities are highly diverged molecular-wise, but evidence of hybridization has been reported in both laboratory and field studies. We compare various speciation models and test for GWH under the approximate Bayesian computation framework. Our results demonstrate the presence of significant extents of gene flow resulting from a recent secondary contact after >3 My of divergence in isolation. The inferred rates of introgression are relatively low, highly variable across loci and mostly unidirectional, which is consistent with the idea that numerous genetic incompatibilities have accumulated over time throughout the genomes of these highly diverged species. A genomic map of the level of gene flow identified two hotspots of introgression, that is, large genome regions of unidirectional introgression. This study clarifies the history and degree of isolation of two cryptic and partially sympatric model species and provides a methodological framework to investigate GWH at various stages of the speciation process.

Journal ArticleDOI
TL;DR: The genetic data indicate that Tibetans have been adapted to a high altitude environment since initial colonization of the Tibetan Plateau in the early Upper Paleolithic, before the last glacial maximum, followed by a rapid population expansion that coincided with the establishment of farming and yak pastoralism on the Plateau in the early Neolithic.
Abstract: Tibetans live on the highest plateau in the world, their current population size is approximately 5 million, and most of them live at an altitude exceeding 3,500 m. Therefore, the Tibetan Plateau is a remarkable area for cultural and biological studies of human population history. However, the chronological profile of the Tibetan Plateau's colonization remains an unsolved question of human prehistory. To reconstruct the prehistoric colonization and demographic history of modern humans on the Tibetan Plateau, we systematically sampled 6,109 Tibetan individuals from 41 geographic populations across the entire region of the Tibetan Plateau and analyzed the phylogeographic patterns of both paternal (n = 2,354) and maternal (n = 6,109) lineages as well as genome-wide single nucleotide polymorphism markers (n = 50) in Tibetan populations. We found that there have been two distinct, major prehistoric migrations of modern humans into the Tibetan Plateau. The first migration was marked by ancient Tibetan genetic signatures dated to approximately 30,000 years ago, indicating that the initial peopling of the Tibetan Plateau by modern humans occurred during the Upper Paleolithic rather than Neolithic. We also found evidence for relatively young (only 7-10 thousand years old) shared Y chromosome and mitochondrial DNA haplotypes between Tibetans and Han Chinese, suggesting a second wave of migration during the early Neolithic. Collectively, the genetic data indicate that Tibetans have been adapted to a high altitude environment since initial colonization of the Tibetan Plateau in the early Upper Paleolithic, before the last glacial maximum, followed by a rapid population expansion that coincided with the establishment of farming and yak pastoralism on the Plateau in the early Neolithic.

Journal ArticleDOI
TL;DR: The first survey of skeletal organic matrix proteins in the staghorn coral Acropora millepora suggests that co-option and domain shuffling may be general mechanisms by which the trait of calcification has evolved.
Abstract: In corals, biocalcification is a major function that may be drastically affected by ocean acidification (OA). Scleractinian corals grow by building up aragonitic exoskeletons that provide support and protection for soft tissues. Although this process has been extensively studied, the molecular basis of biocalcification is poorly understood. Notably lacking is a comprehensive catalog of the skeleton-occluded proteins, namely the skeletal organic matrix proteins (SOMPs) that are thought to regulate the mineral deposition. Using a combination of proteomics and transcriptomics, we report the first survey of such proteins in the staghorn coral Acropora millepora. The organic matrix (OM) extracted from the coral skeleton was analyzed by mass spectrometry and bioinformatics, enabling the identification of 36 SOMPs. These results provide novel insights into the molecular basis of coral calcification and the macroevolution of metazoan calcifying systems, while establishing a platform for studying the impact of OA at the molecular level. Besides secreted proteins, extracellular regions of transmembrane proteins are also present, suggesting a close control of aragonite deposition by the calicoblastic epithelium. In addition to the expected SOMPs (Asp/Glu-rich, galaxins), the skeletal repertoire included several proteins containing known extracellular matrix domains. From an evolutionary perspective, the number of coral-specific proteins is low, many SOMPs having counterparts in the noncalcifying cnidarians. Extending the comparison with the skeletal OM proteomes of other metazoans allowed the identification of a pool of functional domains shared between phyla. These data suggest that co-option and domain shuffling may be general mechanisms by which the trait of calcification has evolved.

Journal ArticleDOI
TL;DR: It is argued that the GC content, because it is a reliable indicator of the long-term recombination rate, is an informative criterion that could help in identifying the most reliable molecular markers for species tree inference.
Abstract: Despite the rapid increase of size in phylogenomic data sets, a number of important nodes on animal phylogeny are still unresolved. Among these, the rooting of the placental mammal tree is still a controversial issue. One difficulty lies in the pervasive phylogenetic conflicts among genes, with each one telling its own story, which may be reliable or not. Here, we identified a simple criterion, that is, the GC content, which substantially helps in determining which gene trees best reflect the species tree. We assessed the ability of 13,111 coding sequence alignments to correctly reconstruct the placental phylogeny. We found that GC-rich genes induced a higher amount of conflict among gene trees and performed worse than AT-rich genes in retrieving well-supported, consensual nodes on the placental tree. We interpret this GC effect mainly as a consequence of genome-wide variations in recombination rate. Indeed, recombination is known to drive GC-content evolution through GC-biased gene conversion and might be problematic for phylogenetic reconstruction, for instance, in an incomplete lineage sorting context. When we focused on the AT-richest fraction of the data set, the resolution level of the placental phylogeny was greatly increased, and a strong support was obtained in favor of an Afrotheria rooting, that is, Afrotheria as the sister group of all other placentals. We show that in mammals most conflicts among gene trees, which have so far hampered the resolution of the placental tree, are concentrated in the GC-rich regions of the genome. We argue that the GC content—because it is a reliable indicator of the long-term recombination rate—is an informative criterion that could help in identifying the most reliable molecular markers for species tree inference.
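The idea of ranking genes by GC content and keeping the AT-richest fraction can be sketched as follows. This is only an illustrative example, assuming each alignment is a list of nucleotide strings; the gene names, the toy sequences, and the choice to compute GC content over all unambiguous positions (rather than, for instance, third codon positions) are assumptions and may differ from the procedure used in the paper.

```python
from typing import Dict, List


def gc_content(sequences: List[str]) -> float:
    """GC fraction over all unambiguous bases (A, C, G, T); gaps and N are ignored."""
    gc = at = 0
    for seq in sequences:
        for base in seq.upper():
            if base in ("G", "C"):
                gc += 1
            elif base in ("A", "T"):
                at += 1
    total = gc + at
    if total == 0:
        raise ValueError("alignment contains no unambiguous bases")
    return gc / total


def at_richest(alignments: Dict[str, List[str]], fraction: float) -> List[str]:
    """Names of the alignments with the lowest GC content, keeping the given fraction."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(alignments, key=lambda name: gc_content(alignments[name]))
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


# Hypothetical toy alignments (two sequences per gene).
alignments = {
    "geneA": ["ATATAT-A", "ATTTATAA"],
    "geneB": ["GCGCGGCC", "GCGCGACC"],
    "geneC": ["ATGCATGC", "ATGCATGN"],
}

for name, seqs in alignments.items():
    print(name, round(gc_content(seqs), 4))
print(at_richest(alignments, 0.5))
```

The retained alignments would then be passed to whatever tree-inference pipeline is in use; the fraction to keep is a tuning choice, not a value prescribed by the abstract.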

Journal ArticleDOI
TL;DR: An updated version of DIVERGE is released with the following improvements: a feasible approach to examining functional divergence in nearly complete sequences by including deletions and insertions (indels); the calculation of the false discovery rate of functionally diverging sites; estimation of the effective number of functional divergence-related sites that is reliable and insensitive to cutoffs.
Abstract: DIVERGE is a software system for phylogeny-based analyses of protein family evolution and functional divergence. It provides a suite of statistical tools for selection and prioritization of the amino acid sites that are responsible for the functional divergence of a gene family. The synergistic efforts of DIVERGE and other methods have convincingly demonstrated that the pattern of rate change at a particular amino acid site may contain insightful information about the underlying functional divergence following gene duplication. These predicted sites may be used as candidates for further experiments. We are now releasing an updated version of DIVERGE with the following improvements: 1) a feasible approach to examining functional divergence in nearly complete sequences by including deletions and insertions (indels); 2) the calculation of the false discovery rate of functionally diverging sites; 3) estimation of the effective number of functional divergence-related sites that is reliable and insensitive to cutoffs; 4) a statistical test for asymmetric functional divergence; and 5) a new method to infer functional divergence specific to a given duplicate cluster. In addition, we have made efforts to improve software design and produce a well-written software manual for the general user.
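To illustrate what a false discovery rate for predicted sites can mean in practice, here is a minimal sketch of one common approach: when each site has a posterior probability of being functionally divergent, the expected FDR at a given cutoff is the mean of (1 - posterior) over the selected sites. This generic calculation is an assumption made for illustration; the abstract does not describe how DIVERGE computes its FDR, and the function name and numbers below are hypothetical.

```python
from typing import List


def expected_fdr(posteriors: List[float], cutoff: float) -> float:
    """Expected false discovery rate among sites with posterior >= cutoff.

    Each selected site is a false discovery with probability (1 - posterior),
    so the expected FDR is the average of that quantity over selected sites.
    Returns 0.0 when no site passes the cutoff.
    """
    selected = [p for p in posteriors if p >= cutoff]
    if not selected:
        return 0.0
    return sum(1.0 - p for p in selected) / len(selected)


# Hypothetical per-site posterior probabilities of functional divergence.
site_posteriors = [0.99, 0.95, 0.80, 0.60, 0.30]

for cutoff in (0.9, 0.7, 0.5):
    print(cutoff, round(expected_fdr(site_posteriors, cutoff), 4))
```

Lowering the cutoff selects more sites at the price of a higher expected proportion of false positives, which is the trade-off an FDR estimate makes explicit.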