Showing papers in "Molecular Biology and Evolution in 2014"


Journal ArticleDOI
TL;DR: PopGenome is a population genomics package for the R software environment that offers a wide range of diverse population genetics analyses, including neutrality tests as well as statistics for population differentiation, linkage disequilibrium, and recombination.
Abstract: Although many computer programs can perform population genetics calculations, they are typically limited in the analyses and data input formats they offer; few applications can process the large data sets produced by whole-genome resequencing projects. Furthermore, there is no coherent framework for the easy integration of new statistics into existing pipelines, hindering the development and application of new population genetics and genomics approaches. Here, we present PopGenome, a population genomics package for the R software environment (a de facto standard for statistical analyses). PopGenome can efficiently process genome-scale data as well as large sets of individual loci. It reads DNA alignments and single-nucleotide polymorphism (SNP) data sets in most common formats, including those used by the HapMap, 1000 human genomes, and 1001 Arabidopsis genomes projects. PopGenome also reads associated annotation files in GFF format, enabling users to easily define regions or classify SNPs based on their annotation; all analyses can also be applied to sliding windows. PopGenome offers a wide range of diverse population genetics analyses, including neutrality tests as well as statistics for population differentiation, linkage disequilibrium, and recombination. PopGenome is linked to Hudson’s MS and Ewing’s MSMS programs to assess statistical significance based on coalescent simulations. PopGenome’s integration in R facilitates effortless and reproducible downstream analyses as well as the production of publication-quality graphics. Developers can easily incorporate new analysis methods into the PopGenome framework. PopGenome and R are freely available from CRAN (http://cran.r-project.org/) for all major operating systems under the GNU General Public License.

761 citations
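
To make the "neutrality tests" and "sliding windows" mentioned in the PopGenome abstract concrete, the following is a minimal Python sketch of Tajima's D computed over fixed-width windows of an alignment. It is an illustration only: it does not use or reproduce PopGenome's R interface, all function names are hypothetical, and it assumes gap-free aligned sequences of equal length and at least four sequences per sample.

```python
from itertools import combinations
from math import sqrt


def pairwise_diversity(seqs):
    """Mean number of pairwise differences (pi) among aligned sequences."""
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return diffs / len(pairs)


def segregating_sites(seqs):
    """Number of alignment columns with more than one allele (S)."""
    return sum(len(set(column)) > 1 for column in zip(*seqs))


def tajimas_d(seqs):
    """Tajima's D; None when it is undefined (n < 4 or no segregating sites)."""
    n = len(seqs)
    s = segregating_sites(seqs)
    if n < 4 or s == 0:
        return None
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Difference between the two estimators of theta, scaled by its std. dev.
    return (pairwise_diversity(seqs) - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


def sliding_windows(seqs, width, step):
    """Yield (window start, Tajima's D) for each full window of the alignment."""
    length = len(seqs[0])
    for start in range(0, length - width + 1, step):
        yield start, tajimas_d([s[start:start + width] for s in seqs])


if __name__ == "__main__":
    toy = ["AAAA", "AAAT", "AATT", "ATTT"]  # hypothetical toy alignment
    for start, d in sliding_windows(toy, width=2, step=2):
        print(start, d)
```

Assessing significance would still need a null distribution, which the abstract says PopGenome obtains from coalescent simulations; that step is not sketched here.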


Journal ArticleDOI
TL;DR: selscan is an efficient multithreaded application that implements Extended Haplotype Homozygosity (EHH), Integrated Haplotype Score (iHS), and Cross-population EHH (XPEHH).
Abstract: Haplotype-based scans to detect natural selection are useful to identify recent or ongoing positive selection in genomes. As both real and simulated genomic data sets grow larger, spanning thousands of samples and millions of markers, there is a need for a fast and efficient implementation of these scans for general use. Here, we present selscan, an efficient multithreaded application that implements Extended Haplotype Homozygosity (EHH), Integrated Haplotype Score (iHS), and Cross-population EHH (XPEHH). selscan accepts phased genotypes in multiple formats, including TPED, and performs extremely well on both simulated and real data and is over an order of magnitude faster than existing available implementations. It calculates iHS on chromosome 22 (22,147 loci) across 204 CEU haplotypes in 353 s on one thread (33 s on 16 threads) and calculates XPEHH for the same data relative to 210 YRI haplotypes in 578 s on one thread (52 s on 16 threads). Source code and binaries (Windows, OSX, and Linux) are available at https://github.com/szpiech/selscan.

474 citations
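
As a rough illustration of the EHH statistic named in the selscan abstract, the sketch below uses the usual definition: the probability that two randomly chosen haplotypes carrying a given core allele are identical at every marker between the core and a second position. This is not selscan's code. The function name, the string encoding of haplotypes, and the toy data are assumptions made for the example.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core, position, allele):
    """EHH of the `allele` carriers at marker `core`, extended to `position`.

    `haplotypes` is a list of equal-length strings, one character per marker.
    Returns None when fewer than two haplotypes carry the core allele.
    """
    carriers = [h for h in haplotypes if h[core] == allele]
    if len(carriers) < 2:
        return None
    lo, hi = sorted((core, position))
    # Group carriers by their exact haplotype over the interval [lo, hi].
    groups = Counter(h[lo:hi + 1] for h in carriers)
    identical_pairs = sum(comb(count, 2) for count in groups.values())
    return identical_pairs / comb(len(carriers), 2)


if __name__ == "__main__":
    haps = ["0110", "0111", "0100", "1100", "0011"]  # hypothetical phased data
    for pos in range(4):
        print(pos, ehh(haps, core=1, position=pos, allele="1"))
```

EHH is 1 at the core itself and decays as the interval grows. Statistics such as iHS and XPEHH build on integrals of this decay curve, which the sketch does not attempt.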


Journal ArticleDOI
TL;DR: A method for simultaneous Bayesian inference of species delimitation and species phylogeny under the multispecies coalescent model is presented; simulations indicate that its power to delimit species increases with the divergence times in the species tree and with the number of gene loci, and the impact of the prior on population size parameters (θ) on Bayesian species delimitation is discussed.
Abstract: A method was developed for simultaneous Bayesian inference of species delimitation and species phylogeny using the multispecies coalescent model. The method eliminates the need for a user-specified guide tree in species delimitation and incorporates phylogenetic uncertainty in a Bayesian framework. The nearest-neighbor interchange algorithm was adapted to propose changes to the species tree, with the gene trees for multiple loci altered in the proposal to avoid conflicts with the newly proposed species tree. We also modify our previous scheme for specifying priors for species delimitation models to construct joint priors for models of species delimitation and species phylogeny. As in our earlier method, the modified algorithm integrates over gene trees, taking account of the uncertainty of gene tree topology and branch lengths given the sequence data. We conducted a simulation study to examine the statistical properties of the method using six populations (two sequences each) and a true number of three species, with values of divergence times and ancestral population sizes that are realistic for recently diverged species. The results suggest that the method tends to be conservative with high posterior probabilities being a confident indicator of species status. Simulation results also indicate that the power of the method to delimit species increases with an increase of the divergence times in the species tree, and with an increased number of gene loci. Reanalyses of two data sets of cavefish and coast horned lizards suggest considerable phylogenetic uncertainty even though the data are informative about species delimitation. We discuss the impact of the prior on models of species delimitation and species phylogeny and of the prior on population size parameters (θ) on Bayesian species delimitation.

442 citations


Journal ArticleDOI
TL;DR: The program GrowthRates is introduced that uses plate reader output files to automatically determine the exponential portion of the curve and to automatically calculate the growth rate, the maximum culture density, and the duration of the growth lag phase.
Abstract: In the 1960s-1980s, determination of bacterial growth rates was an important tool in microbial genetics, biochemistry, molecular biology, and microbial physiology. The exciting technical developments of the 1990s and the 2000s eclipsed that tool; as a result, many investigators today lack experience with growth rate measurements. Recently, investigators in a number of areas have started to use measurements of bacterial growth rates for a variety of purposes. Those measurements have been greatly facilitated by the availability of microwell plate readers that permit the simultaneous measurements on up to 384 different cultures. Only the exponential (logarithmic) portions of the resulting growth curves are useful for determining growth rates, and manual determination of that portion and calculation of growth rates can be tedious for high-throughput purposes. Here, we introduce the program GrowthRates that uses plate reader output files to automatically determine the exponential portion of the curve and to automatically calculate the growth rate, the maximum culture density, and the duration of the growth lag phase. GrowthRates is freely available for Macintosh, Windows, and Linux. We discuss the effects of culture volume, the classical bacterial growth curve, and the differences between determinations in rich media and minimal (mineral salts) media. This protocol covers calibration of the plate reader, growth of culture inocula for both rich and minimal media, and experimental setup. As a guide to reliability, we report typical day-to-day variation in growth rates and variation within experiments with respect to position of wells within the plates.

397 citations
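
The GrowthRates abstract notes that only the exponential portion of a growth curve is useful for estimating growth rate. One common way to locate that portion, sketched below, is to regress ln(OD) on time in a sliding window and keep the window with the steepest slope. This is an assumed, simplified approach for illustration and is not claimed to be the algorithm GrowthRates uses; the function names, the window size, and the synthetic data are hypothetical.

```python
from math import exp, log


def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def max_growth_rate(times, ods, window=5):
    """Return (rate, start index) of the window with the steepest ln(OD) slope.

    `ods` must be positive (blank-corrected) optical densities.
    """
    log_ods = [log(od) for od in ods]
    best_rate, best_start = None, None
    for i in range(len(times) - window + 1):
        rate = slope(times[i:i + window], log_ods[i:i + window])
        if best_rate is None or rate > best_rate:
            best_rate, best_start = rate, i
    return best_rate, best_start


if __name__ == "__main__":
    # Synthetic curve: lag until t = 4 h, growth at 0.5 per hour, plateau after t = 12 h.
    times = list(range(20))
    ods = [0.01 * exp(0.5 * (min(max(t, 4), 12) - 4)) for t in times]
    rate, start = max_growth_rate(times, ods)
    print("rate per hour:", rate, "doubling time (h):", log(2) / rate)
    print("maximum density:", max(ods))
```

Real plate-reader data are noisy, so the window size and any blank correction would need tuning per instrument and medium.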


Journal ArticleDOI
TL;DR: A novel, user-friendly software package engineered for conducting state-of-the-art Bayesian tree inferences on data sets of arbitrary size is introduced and first experiences with Bayesian inferences at the whole-genome level are reported on.
Abstract: Modern sequencing technology now allows biologists to collect the entirety of molecular evidence for reconstructing evolutionary trees. We introduce a novel, user-friendly software package engineered for conducting state-of-the-art Bayesian tree inferences on data sets of arbitrary size. Our software introduces a nonblocking parallelization of Metropolis-coupled chains, modifications for efficient analyses of data sets comprising thousands of partitions and memory saving techniques. We report on first experiences with Bayesian inferences at the whole-genome level using the SuperMUC supercomputer and simulated data.

369 citations


Journal ArticleDOI
TL;DR: A new method is developed that combines alignments from mappings to multiple reference sequences and is shown to remove biases from the reconstructed phylogenies; it is implemented as the web server REALPHY, which fully automates phylogenetic reconstruction from raw sequencing reads.
Abstract: Studies of microbial evolutionary dynamics are being transformed by the availability of affordable high-throughput sequencing technologies, which allow whole-genome sequencing of hundreds of related taxa in a single study. Reconstructing a phylogenetic tree of these taxa is generally a crucial step in any evolutionary analysis. Instead of constructing genome assemblies for all taxa, annotating these assemblies, and aligning orthologous genes, many recent studies 1) directly map raw sequencing reads to a single reference sequence, 2) extract single nucleotide polymorphisms (SNPs), and 3) infer the phylogenetic tree using maximum likelihood methods from the aligned SNP positions. However, here we show that, when using such methods to reconstruct phylogenies from sets of simulated sequences, both the exclusion of nonpolymorphic positions and the alignment to a single reference genome, introduce systematic biases and errors in phylogeny reconstruction. To address these problems, we developed a new method that combines alignments from mappings to multiple reference sequences and show that this successfully removes biases from the reconstructed phylogenies. We implemented this method as a web server named REALPHY (Reference sequence Alignment-based Phylogeny builder), which fully automates phylogenetic reconstruction from raw sequencing reads.

333 citations


Journal ArticleDOI
TL;DR: A new haplotype-based statistic for detecting both soft and hard sweeps in population genomic data from a single population is presented and it is shown that nSL has at least as much power as other methods under a number of different selection scenarios.
Abstract: We present a new haplotype-based statistic (nSL) for detecting both soft and hard sweeps in population genomic data from a single population. We compare our new method with classic single-population haplotype and site frequency spectrum (SFS)-based methods and show that it is more robust, particularly to recombination rate variation. However, all statistics show some sensitivity to the assumptions of the demographic model. Additionally, we show that nSL has at least as much power as other methods under a number of different selection scenarios, most notably in the cases of sweeps from standing variation and incomplete sweeps. This conclusion holds up under a variety of demographic models. In many aspects, our new method is similar to the iHS statistic; however, it is generally more robust and does not require a genetic map. To illustrate the utility of our new method, we apply it to HapMap3 data and show that in the Yoruban population, there is strong evidence of selection on genes relating to lipid metabolism. This observation could be related to the known differences in cholesterol levels, and lipid metabolism more generally, between African Americans and other populations. We propose that the underlying causes for the selection on these genes are pleiotropic effects relating to blood parasites rather than their role in lipid metabolism.

328 citations


Journal ArticleDOI
TL;DR: It is found that genome content variation, in the form of presence or absence as well as copy number of genetic material, is higher within S. cerevisiae than within S. paradoxus, despite genetic distances as measured in single-nucleotide polymorphisms being vastly smaller within the former species.
Abstract: The question of how genetic variation in a population influences phenotypic variation and evolution is of major importance in modern biology. Yet much is still unknown about the relative functional importance of different forms of genome variation and how they are shaped by evolutionary processes. Here we address these questions by population level sequencing of 42 strains from the budding yeast Saccharomyces cerevisiae and its closest relative S. paradoxus. We find that genome content variation, in the form of presence or absence as well as copy number of genetic material, is higher within S. cerevisiae than within S. paradoxus, despite genetic distances as measured in single-nucleotide polymorphisms being vastly smaller within the former species. This genome content variation, as well as loss-of-function variation in the form of premature stop codons and frameshifting indels, is heavily enriched in the subtelomeres, strongly reinforcing the relevance of these regions to functional evolution. Genes affected by these likely functional forms of variation are enriched for functions mediating interaction with the external environment (sugar transport and metabolism, flocculation, metal transport, and metabolism). Our results and analyses provide a comprehensive view of genomic diversity in budding yeast and expose surprising and pronounced differences between the variation within S. cerevisiae and that within S. paradoxus. We also believe that the sequence data and de novo assemblies will constitute a useful resource for further evolutionary and population genomics studies.

278 citations


Journal ArticleDOI
TL;DR: These DFEs provide insight into the inherent benefits of the genetic code’s architecture, support for the hypothesis that mRNA stability dictates codon usage at the beginning of genes, an extensive framework for understanding protein mutational tolerance, and evidence that mutational effects on protein thermodynamic stability shape the DFE.
Abstract: Mutations are central to evolution, providing the genetic variation upon which selection acts. A mutation's effect on the suitability of a gene to perform a particular function (gene fitness) can be positive, negative, or neutral. Knowledge of the distribution of fitness effects (DFE) of mutations is fundamental for understanding evolutionary dynamics, molecular-level genetic variation, complex genetic disease, the accumulation of deleterious mutations, and the molecular clock. We present comprehensive DFEs for point and codon mutants of the Escherichia coli TEM-1 β-lactamase gene and missense mutations in the TEM-1 protein. These DFEs provide insight into the inherent benefits of the genetic code's architecture, support for the hypothesis that mRNA stability dictates codon usage at the beginning of genes, an extensive framework for understanding protein mutational tolerance, and evidence that mutational effects on protein thermodynamic stability shape the DFE. Contrary to prevailing expectations, we find that deleterious effects of mutation primarily arise from a decrease in specific protein activity and not cellular protein levels.

276 citations


Journal ArticleDOI
TL;DR: This work presents a procedure that uses phylogenies for both homology and orthology assignment, and finds that data sets that are more recently diverged and/or include more high-coverage genomes had more complete sets of orthologs.
Abstract: Orthology inference is central to phylogenomic analyses. Phylogenomic data sets commonly include transcriptomes and low-coverage genomes that are incomplete and contain errors and isoforms. These properties can severely violate the underlying assumptions of orthology inference with existing heuristics. We present a procedure that uses phylogenies for both homology and orthology assignment. The procedure first uses similarity scores to infer putative homologs that are then aligned, constructed into phylogenies, and pruned of spurious branches caused by deep paralogs, misassembly, frameshifts, or recombination. These final homologs are then used to identify orthologs. We explore four alternative tree-based orthology inference approaches, of which two are new. These accommodate gene and genome duplications as well as gene tree discordance. We demonstrate these methods in three published data sets including the grape family, Hymenoptera, and millipedes with divergence times ranging from approximately 100 to over 400 Ma. The procedure significantly increased the completeness and accuracy of the inferred homologs and orthologs. We also found that data sets that are more recently diverged and/or include more high-coverage genomes had more complete sets of orthologs. To explicitly evaluate sources of conflicting phylogenetic signals, we applied serial jackknife analyses of gene regions keeping each locus intact. The methods described here can scale to over 100 taxa. They have been implemented in python with independent scripts for each step, making it easy to modify or incorporate them into existing pipelines. All scripts are available from https://bitbucket.org/yangya/phylogenomic_dataset_construction.

271 citations


Journal ArticleDOI
TL;DR: This study based on transcriptomic data comprising 68,750-170,497 amino acid sites from 305 to 622 proteins resolves annelid relationships, including Chaetopteridae, Amphinomidae, Sipuncula, Oweniidae, and Magelonidae in the basal part of the tree.
Abstract: Annelida is one of three animal groups possessing segmentation and is central in considerations about the evolution of different character traits. It has even been proposed that the bilaterian ancestor resembled an annelid. However, a robust phylogeny of Annelida, especially with respect to the basal relationships, has been lacking. Our study based on transcriptomic data comprising 68,750-170,497 amino acid sites from 305 to 622 proteins resolves annelid relationships, including Chaetopteridae, Amphinomidae, Sipuncula, Oweniidae, and Magelonidae in the basal part of the tree. Myzostomida, which have been indicated to belong to the basal radiation as well, are now found deeply nested within Annelida as sister group to Errantia in most analyses. On the basis of our reconstruction of a robust annelid phylogeny, we show that the basal branching taxa include a huge variety of life styles such as tube dwelling and deposit feeding, endobenthic and burrowing, tubicolous and filter feeding, and errant and carnivorous forms. Ancestral character state reconstruction suggests that the ancestral annelid possessed a pair of either sensory or grooved palps, bicellular eyes, biramous parapodia bearing simple chaeta, and lacked nuchal organs. Because the oldest fossil of Annelida is reported for Sipuncula (520 Ma), we infer that the early diversification of annelids took place at least in the Lower Cambrian.

Journal ArticleDOI
TL;DR: The genomic comparison demonstrates that a series of inverted repeat boundary shifts and inversions played a major role in shaping genome organization in the Geraniaceae family and is correlated with the acceleration in nonsynonymous substitution rates but not with synonymous substitution rates.
Abstract: Geraniaceae plastid genomes are highly rearranged, and each of the four genera already sequenced in the family has a distinct genome organization. This study reports plastid genome sequences of six additional species, Francoa sonchifolia, Melianthus villosus, and Viviania marifolia from Geraniales, and Pelargonium alternans, California macrophylla, and Hypseocharis bilobata from Geraniaceae. These genome sequences, combined with previously published species, provide sufficient taxon sampling to reconstruct the ancestral plastid genome organization of Geraniaceae and the rearrangements unique to each genus. The ancestral plastid genome of Geraniaceae has a 4 kb inversion and a reduced, Pelargonium-like small single copy region. Our ancestral genome reconstruction suggests that a few minor rearrangements occurred in the stem branch of Geraniaceae followed by independent rearrangements in each genus. The genomic comparison demonstrates that a series of inverted repeat boundary shifts and inversions played a major role in shaping genome organization in the family. The distribution of repeats is strongly associated with breakpoints in the rearranged genomes, and the proportion and the number of large repeats (>20 bp and >60 bp) are significantly correlated with the degree of genome rearrangements. Increases in the degree of plastid genome rearrangements are correlated with the acceleration in nonsynonymous substitution rates (dN) but not with synonymous substitution rates (dS). Possible mechanisms that might contribute to this correlation, including DNA repair system and selection, are discussed.

Journal ArticleDOI
TL;DR: Novel measures that use information theory to quantify the degree of conflict or incongruence among all nontrivial bipartitions present in a set of trees are introduced.
Abstract: Phylogenies inferred from different data matrices often conflict with each other necessitating the development of measures that quantify this incongruence. Here, we introduce novel measures that use information theory to quantify the degree of conflict or incongruence among all nontrivial bipartitions present in a set of trees. The first measure, internode certainty (IC), calculates the degree of certainty for a given internode by considering the frequency of the bipartition defined by the internode (internal branch) in a given set of trees jointly with that of the most prevalent conflicting bipartition in the same tree set. The second measure, IC All (ICA), calculates the degree of certainty for a given internode by considering the frequency of the bipartition defined by the internode in a given set of trees in conjunction with that of all conflicting bipartitions in the same underlying tree set. Finally, the tree certainty (TC) and TC All (TCA) measures are the sum of IC and ICA values across all internodes of a phylogeny, respectively. IC, ICA, TC, and TCA can be calculated from different types of data that contain nontrivial bipartitions, including from bootstrap replicate trees to gene trees or individual characters. Given a set of phylogenetic trees, the IC and ICA values of a given internode reflect its specific degree of incongruence, and the TC and TCA values describe the global degree of incongruence between trees in the set. All four measures are implemented and freely available in version 8.0.0 and subsequent versions of the widely used program RAxML.
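
The measures described in this abstract can be sketched numerically. The Python below assumes the Shannon-entropy form in which bipartition frequencies are normalized to proportions and certainty is one minus the entropy, taken in base 2 for IC and in base n for ICA with n competing bipartitions. It is an illustration under that assumption, not RAxML's implementation: function names are hypothetical, and details such as which conflicting bipartitions are admitted to ICA are left out.

```python
from math import log


def internode_certainty(freq_internode, freq_conflict):
    """IC from the internode's bipartition and its most prevalent conflict."""
    return ic_all([freq_internode, freq_conflict])


def ic_all(freqs):
    """ICA from the internode's bipartition plus all conflicting bipartitions.

    `freqs` holds raw counts or support values; the internode's own
    bipartition is included in the list.
    """
    n = len(freqs)
    if n < 2:
        return 1.0  # no conflict observed
    total = sum(freqs)
    certainty = 1.0
    for f in freqs:
        p = f / total
        if p > 0:
            certainty += p * log(p) / log(n)  # log base n
    return certainty


def tree_certainty(per_internode_values):
    """TC (or TCA): sum of IC (or ICA) values over all internodes."""
    return sum(per_internode_values)


if __name__ == "__main__":
    # Hypothetical supports: 75 trees contain the internode, 25 the main conflict.
    print(internode_certainty(75, 25))
    print(ic_all([60, 25, 15]))
```

Under this form a value near 1 means the internode is essentially uncontested, and a value near 0 means the competing bipartitions are about equally frequent.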

Journal ArticleDOI
TL;DR: It is shown that phylogenetic signal for the monophyly of Arachnida is restricted to the 500 slowest-evolving genes in the data set, and that outgroup selection without regard for branch length distribution exacerbates long-branch attraction artifacts and does not mitigate gene-tree discordance, regardless of high gene representation for outgroups that are model organisms.
Abstract: Chelicerata represents one of the oldest groups of arthropods, with a fossil record extending to the Cambrian, and is sister group to the remaining extant arthropods, the mandibulates. Attempts to resolve the internal phylogeny of chelicerates have achieved little consensus, due to marked discord in both morphological and molecular hypotheses of chelicerate phylogeny. The monophyly of Arachnida, the terrestrial chelicerates, is generally accepted, but has garnered little support from molecular data, which have been limited either in breadth of taxonomic sampling or in depth of sequencing. To address the internal phylogeny of this group, we employed a phylogenomic approach, generating transcriptomic data for 17 species in combination with existing data, including two complete genomes. We analyzed multiple data sets containing up to 1,235,912 sites across 3,644 loci, using alternative approaches to optimization of matrix composition. Here, we show that phylogenetic signal for the monophyly of Arachnida is restricted to the 500 slowest-evolving genes in the data set. Accelerated evolutionary rates in Acariformes, Pseudoscorpiones, and Parasitiformes potentially engender long-branch attraction artifacts, yielding nonmonophyly of Arachnida with increasing support upon incrementing the number of concatenated genes. Mutually exclusive hypotheses are supported by locus groups of variable evolutionary rate, revealing significant conflicts in phylogenetic signal. Analyses of gene-tree discordance indicate marked incongruence in relationships among chelicerate orders, whereas derived relationships are demonstrably robust. Consistently recovered and supported relationships include the monophyly of Chelicerata, Euchelicerata, Tetrapulmonata, and all orders represented by multiple terminals. Relationships supported by subsets of slow-evolving genes include Ricinulei+Solifugae; a clade comprised of Ricinulei, Opiliones, and Solifugae; and a clade comprised of Tetrapulmonata, Scorpiones, and Pseudoscorpiones. We demonstrate that outgroup selection without regard for branch length distribution exacerbates long-branch attraction artifacts and does not mitigate gene-tree discordance, regardless of high gene representation for outgroups that are model organisms. Arachnopulmonata (new name) is proposed for the clade comprising Scorpiones+Tetrapulmonata (previously named Pulmonata).

Journal ArticleDOI
TL;DR: It is suggested that selection strength is an important parameter contributing to the complexity of the antibiotic resistance problem and that the use of high doses of antibiotics to clear infections has the potential to promote an increase in cross-resistance in clinics.
Abstract: Revealing the genetic changes responsible for antibiotic resistance can be critical for developing novel antibiotic therapies. However, systematic studies correlating genotype to phenotype in the context of antibiotic resistance have been missing. In order to fill in this gap, we evolved 88 isogenic Escherichia coli populations against 22 antibiotics for 3 weeks. For every drug, two populations were evolved under strong selection and two populations were evolved under mild selection. By quantifying evolved populations' resistances against all 22 drugs, we constructed two separate cross-resistance networks for strongly and mildly selected populations. Subsequently, we sequenced representative colonies isolated from evolved populations to reveal the genetic basis for novel phenotypes. Bacterial populations that evolved resistance against antibiotics under strong selection acquired high levels of cross-resistance against several antibiotics, whereas other bacterial populations evolved under milder selection acquired relatively weaker cross-resistance. In addition, we found that strongly selected strains against aminoglycosides became more susceptible to five other drug classes compared with their wild-type ancestor as a result of a point mutation on TrkH, an ion transporter protein. Our findings suggest that selection strength is an important parameter contributing to the complexity of the antibiotic resistance problem and that the use of high doses of antibiotics to clear infections has the potential to promote an increase in cross-resistance in clinics.

Journal ArticleDOI
TL;DR: It is suggested that genetic complexity arose early in evolution as shown by the presence of these genes in most of the animal lineages, which suggests sponges either possess cryptic physiological and morphological complexity and/or have lost ancestral cell types or physiological processes.
Abstract: Sponges (Porifera) are among the earliest evolving metazoans. Their filter-feeding body plan based on choanocyte chambers organized into a complex aquiferous system is so unique among metazoans that it either reflects an early divergence from other animals prior to the evolution of features such as muscles and nerves, or that sponges lost these characters. Analyses of the Amphimedon and Oscarella genomes support this view of uniqueness (many key metazoan genes are absent in these sponges), but whether this is generally true of other sponges remains unknown. We studied the transcriptomes of eight sponge species in four classes (Hexactinellida, Demospongiae, Homoscleromorpha, and Calcarea) specifically seeking genes and pathways considered to be involved in animal complexity. For reference, we also sought these genes in transcriptomes and genomes of three unicellular opisthokonts, two sponges (A. queenslandica and O. carmela), and two bilaterian taxa. Our analyses showed that all sponge classes share an unexpectedly large complement of genes with other metazoans. Interestingly, hexactinellid, calcareous, and homoscleromorph sponges share more genes with bilaterians than with nonbilaterian metazoans. We were surprised to find representatives of most molecules involved in cell-cell communication, signaling, complex epithelia, immune recognition, and germ-lineage/sex, with only a few, but potentially key, absences. A noteworthy finding was that some important genes were absent from all demosponges (transcriptomes and the Amphimedon genome), which might reflect divergence from main-stem lineages including hexactinellids, calcareous sponges, and homoscleromorphs. Our results suggest that genetic complexity arose early in evolution as shown by the presence of these genes in most of the animal lineages, which suggests sponges either possess cryptic physiological and morphological complexity and/or have lost ancestral cell types or physiological processes.

Journal ArticleDOI
TL;DR: It is found that under a realistic model of within-host evolution, reconstructions of simulated outbreaks contain substantial uncertainty even when genomic data reflect a high substitution rate, and that Bayesian reconstructions derived from sequence data may form a useful starting point for a genomic epidemiology investigation.
Abstract: Genomics is increasingly being used to investigate disease outbreaks, but an important question remains unanswered—how well do genomic data capture known transmission events, particularly for pathogens with long carriage periods or large within-host population sizes? Here we present a novel Bayesian approach to reconstruct densely sampled outbreaks from genomic data while considering within-host diversity. We infer a time-labeled phylogeny using Bayesian evolutionary analysis by sampling trees (BEAST), and then infer a transmission network via a Monte Carlo Markov chain. We find that under a realistic model of within-host evolution, reconstructions of simulated outbreaks contain substantial uncertainty even when genomic data reflect a high substitution rate. Reconstruction of a real-world tuberculosis outbreak displayed similar uncertainty, although the correct source case and several clusters of epidemiologically linked cases were identified. We conclude that genomics cannot wholly replace traditional epidemiology but that Bayesian reconstructions derived from sequence data may form a useful starting point for a genomic epidemiology investigation.

Journal ArticleDOI
TL;DR: This work reconstructs the most comprehensive phylogeny of Brachyura to date using sequence data of six nuclear protein-coding genes and two mitochondrial rRNA genes from more than 140 species belonging to 58 families, and confirms that the "Podotremata" are paraphyletic.
Abstract: Crabs of the infra-order Brachyura are one of the most diverse groups of crustaceans with approximately 7,000 described species in 98 families, occurring in marine, freshwater, and terrestrial habitats. The relationships among the brachyuran families are poorly understood due to the high morphological complexity of the group. Here, we reconstruct the most comprehensive phylogeny of Brachyura to date using sequence data of six nuclear protein-coding genes and two mitochondrial rRNA genes from more than 140 species belonging to 58 families. The gene tree confirms that the "Podotremata" are paraphyletic. Within the monophyletic Eubrachyura, the reciprocal monophyly of the two subsections, Heterotremata and Thoracotremata, is supported. Monophyly of many superfamilies, however, is not recovered, indicating the prevalence of morphological convergence and the need for further taxonomic studies. Freshwater crabs were derived early in the evolution of Eubrachyura and are shown to have at least two independent origins. Bayesian relaxed molecular methods estimate that freshwater crabs separated from their closest marine sister taxa ~135 Ma, that is, after the break up of Pangaea (~200 Ma) and that a Gondwanan origin of these freshwater representatives is untenable. Most extant families and superfamilies arose during the late Cretaceous and early Tertiary.

Journal ArticleDOI
TL;DR: The method was applied to study the higher level phylogenetic relationships in the weevils, producing 92 newly assembled mitogenomes obtained in a single Illumina MiSeq run, and supported a separate origin of wood-boring behavior by the subfamilies Scolytinae, Platypodinae and Cossoninae.
Abstract: Complete mitochondrial genomes have been shown to be reliable markers for phylogeny reconstruction among diverse animal groups. However, the relative difficulty and high cost associated with obtaining de novo full mitogenomes have frequently led to conspicuously low taxon sampling in ensuing studies. Here, we report the successful use of an economical and accessible method for assembling complete or near-complete mitogenomes through shot-gun next-generation sequencing of a single library made from pooled total DNA extracts of numerous target species. To avoid the use of separate indexed libraries for each specimen, and an associated increase in cost, we incorporate standard polymerase chain reaction-based “bait” sequences to identify the assembled mitogenomes. The method was applied to study the higher level phylogenetic relationships in the weevils (Coleoptera: Curculionoidea), producing 92 newly assembled mitogenomes obtained in a single Illumina MiSeq run. The analysis supported a separate origin of wood-boring behavior by the subfamilies Scolytinae, Platypodinae, and Cossoninae. This finding contradicts morphological hypotheses proposing a close relationship between the first two of these but is congruent with previous molecular studies, reinforcing the utility of mitogenomes in phylogeny reconstruction. Our methodology provides a technically simple procedure for generating densely sampled trees from whole mitogenomes and is widely applicable to groups of animals for which bait sequences are the only required prior genome knowledge.

Journal ArticleDOI
TL;DR: This work shows that one can identify the most reliable portions of an MSA, as judged from BAliBASE and PREFAB structure-based reference alignments, using the transitive consistency score (TCS), an extended version of the T-Coffee scoring scheme.
Abstract: Multiple sequence alignment (MSA) is a key modeling procedure when analyzing biological sequences. Homology and evolutionary modeling are the most common applications of MSAs. Both are known to be sensitive to the underlying MSA accuracy. In this work, we show how this problem can be partly overcome using the transitive consistency score (TCS), an extended version of the T-Coffee scoring scheme. Using this local evaluation function, we show that one can identify the most reliable portions of an MSA, as judged from BAliBASE and PREFAB structure-based reference alignments. We also show how this measure can be used to improve phylogenetic tree reconstruction using both an established simulated data set and a novel empirical yeast data set. For this purpose, we describe a novel lossless alternative to site filtering that involves overweighting the trustworthy columns. Our approach relies on the T-Coffee framework; it uses libraries of pairwise alignments to evaluate any third party MSA. Pairwise projections can be produced using fast or slow methods, thus allowing a trade-off between speed and accuracy. We compared TCS with Heads-or-Tails, GUIDANCE, Gblocks, and trimAl and found it to lead to significantly better estimates of structural accuracy and more accurate phylogenetic trees. The software is available from www.tcoffee.org/Projects/tcs.

Journal ArticleDOI
TL;DR: A pattern of mesic-adapted lineages evolving to use more arid and open habitats, broadly consistent with regional climate and environmental change, is found; however, contrary to the general trend, several lineages subsequently appear to have reverted from drier to more mesic habitats.
Abstract: Marsupials exhibit great diversity in ecology and morphology. However, compared with their sister group, the placental mammals, our understanding of many aspects of marsupial evolution remains limited. We use 101 mitochondrial genomes and data from 26 nuclear loci to reconstruct a dated phylogeny including 97% of extant genera and 58% of modern marsupial species. This tree allows us to analyze the evolution of habitat preference and geographic distributions of marsupial species through time. We found a pattern of mesic-adapted lineages evolving to use more arid and open habitats, which is broadly consistent with regional climate and environmental change. However, contrary to the general trend, several lineages subsequently appear to have reverted from drier to more mesic habitats. Biogeographic reconstructions suggest that current views on the connectivity between Australia and New Guinea/Wallacea during the Miocene and Pliocene need to be revised. The antiquity of several endemic New Guinean clades strongly suggests a substantially older period of connection stretching back to the Middle Miocene and implies that New Guinea was colonized by multiple clades almost immediately after its principal formation.

Journal ArticleDOI
TL;DR: The observed patterns suggest that the recombination rate experienced by a gene is positively related to an increase in the efficiency of both positive and purifying selection.
Abstract: Genetic recombination associated with sexual reproduction increases the efficiency of natural selection by reducing the strength of Hill-Robertson interference. Such interference can be caused either by selective sweeps of positively selected alleles or by background selection (BGS) against deleterious mutations. Its consequences can be studied by comparing patterns of molecular evolution and variation in genomic regions with different rates of crossing over. We carried out a comprehensive study of the benefits of recombination in Drosophila melanogaster, both by contrasting five independent genomic regions that lack crossing over with the rest of the genome and by comparing regions with different rates of crossing over, using data on DNA sequence polymorphisms from an African population that is geographically close to the putatively ancestral population for the species, and on sequence divergence from a related species. We observed reductions in sequence diversity in noncrossover (NC) regions that are inconsistent with the effects of hard selective sweeps in the absence of recombination. Overall, the observed patterns suggest that the recombination rate experienced by a gene is positively related to an increase in the efficiency of both positive and purifying selection. The results are consistent with a BGS model with interference among selected sites in NC regions, and joint effects of BGS, selective sweeps, and a past population expansion on variability in regions of the genome that experience crossing over. In such crossover regions, the X chromosome exhibits a higher rate of adaptive protein sequence evolution than the autosomes, implying a Faster-X effect.

Journal ArticleDOI
TL;DR: Using high-throughput experimental strategies, this work creates an evolutionary model for influenza nucleoprotein that describes the gene phylogeny far better than existing models with dozens or even hundreds of free parameters.
Abstract: All modern approaches to molecular phylogenetics require a quantitative model for how genes evolve. Unfortunately, existing evolutionary models do not realistically represent the site-heterogeneous selection that governs actual sequence change. Attempts to remedy this problem have involved augmenting these models with a burgeoning number of free parameters. Here, I demonstrate an alternative: Experimental determination of a parameter-free evolutionary model via mutagenesis, functional selection, and deep sequencing. Using this strategy, I create an evolutionary model for influenza nucleoprotein that describes the gene phylogeny far better than existing models with dozens or even hundreds of free parameters. Emerging high-throughput experimental strategies such as the one employed here provide fundamentally new information that has the potential to transform the sensitivity of phylogenetic and genetic analyses.

Journal ArticleDOI
TL;DR: The inability to identify substantial plastid genome sequences from R. lagascae using multiple approaches suggests that the parasitic plant genus Rafflesia may be the first plant group for which there is no recognizable plastid genome, or, if present, it is found in cryptic form at very low levels.
Abstract: Rafflesia is a genus of holoparasitic plants endemic to Southeast Asia that has lost the ability to undertake photosynthesis. With short-read sequencing technology, we assembled a draft sequence of the mitochondrial genome of Rafflesia lagascae Blanco, a species endemic to the Philippine island of Luzon, with ~350× sequencing depth coverage. Using multiple approaches, however, we were only able to identify small fragments of plastid sequences at low coverage depth (<2×) and could not recover any substantial portion of a chloroplast genome. The gene fragments we identified included photosynthesis and energy production genes (atp, ndh, pet, psa, psb, rbcL), ribosomal RNA genes (rrn16, rrn23), ribosomal protein genes (rps7, rps11, rps16), transfer RNA genes, as well as matK, accD, ycf2, and multiple nongenic regions from the inverted repeats. None of the identified plastid gene sequences had intact reading frames. Phylogenetic analysis suggests that ~33% of these remnant plastid genes may have been horizontally transferred from the host plant genus Tetrastigma with the rest having ambiguous phylogenetic positions (<50% bootstrap support), except for psaB that was strongly allied with the plastid homolog in Nicotiana. Our inability to identify substantial plastid genome sequences from R. lagascae using multiple approaches, despite success in identifying and developing a draft assembly of the much larger mitochondrial genome, suggests that the parasitic plant genus Rafflesia may be the first plant group for which there is no recognizable plastid genome, or, if present, it is found in cryptic form at very low levels.

Journal ArticleDOI
TL;DR: The results show that horizontal gene transfer is an important and recurring mechanism driving coevolution between insects and their bacterial endosymbionts and highlight interesting similarities and contrasts with the evolutionary history of mitochondria and plastids.
Abstract: Bacteria confined to intracellular environments experience extensive genome reduction. In extreme cases, insect endosymbionts have evolved genomes that are so gene-poor that they blur the distinction between bacteria and endosymbiotically derived organelles such as mitochondria and plastids. To understand the host’s role in this extreme gene loss, we analyzed gene content and expression in the nuclear genome of the psyllid Pachypsylla venusta, a sap-feeding insect that harbors an ancient endosymbiont (Carsonella) with one of the most reduced bacterial genomes ever identified. Carsonella retains many genes required for synthesis of essential amino acids that are scarce in plant sap, but most of these biosynthetic pathways have been disrupted by gene loss. Host genes that are upregulated in psyllid cells housing Carsonella appear to compensate for endosymbiont gene losses, resulting in highly integrated metabolic pathways that mirror those observed in other sap-feeding insects. The host contribution to these pathways is mediated by a combination of native eukaryotic genes and bacterial genes that were horizontally transferred from multiple donor lineages early in the evolution of psyllids, including one gene that appears to have been directly acquired from Carsonella. By comparing the psyllid genome to a recent analysis of mealybugs, we found that a remarkably similar set of functional pathways have been shaped by independent transfers of bacterial genes to the two hosts. These results show that horizontal gene transfer is an important and recurring mechanism driving coevolution between insects and their bacterial endosymbionts and highlight interesting similarities and contrasts with the evolutionary history of mitochondria and plastids.

Journal ArticleDOI
TL;DR: A new class of ancient WGD is proposed, with Musa (alpha), poplar, and soybean as members, where genes are both deleted and expressed to an equal extent (unbiased fractionation and genome equivalence).
Abstract: Whole genome duplications (WGDs) occurred in the distant evolutionary history of many lineages and are particularly frequent in the flowering plant lineages. Following paleopolyploidization in plants, most duplicated genes are deleted by intrachromosomal recombination, a process referred to as fractionation. In the examples studied so far, genes are disproportionately lost from one of the parental subgenomes (biased fractionation) and the subgenome having lost the lowest number of genes is more expressed (genome dominance). In the present study, we analyzed the pattern of gene deletion and gene expression following the most recent WGD in banana (alpha event) and extended our analyses to seven other sequenced plant genomes: poplar, soybean, medicago, arabidopsis, sorghum, brassica, and maize. We propose a new class of ancient WGD, with Musa (alpha), poplar, and soybean as members, where genes are both deleted and expressed to an equal extent (unbiased fractionation and genome equivalence). We suggest that WGDs with genome dominance and biased fractionation (Class I) may result from ancient allotetraploidies, while WGDs without genome dominance or biased fractionation (Class II) may result from ancient autotetraploidies.

Journal ArticleDOI
TL;DR: Gene flow detected from brown into American black bears can explain the conflicting placement of the American black bear in mitochondrial and nuclear phylogenies, and highlights that both incomplete lineage sorting and introgression are prominent evolutionary forces even on time scales up to several million years.
Abstract: Ursine bears are a mammalian subfamily that comprises six morphologically and ecologically distinct extant species. Previous phylogenetic analyses of concatenated nuclear genes could not resolve all relationships among bears, and appeared to conflict with the mitochondrial phylogeny. Evolutionary processes such as incomplete lineage sorting and introgression can cause gene tree discordance and complicate phylogenetic inferences, but are not accounted for in phylogenetic analyses of concatenated data. We generated a high-resolution data set of autosomal introns from several individuals per species and of Y-chromosomal markers. Incorporating intraspecific variability in coalescence-based phylogenetic and gene flow estimation approaches, we traced the genealogical history of individual alleles. Considerable heterogeneity among nuclear loci and discordance between nuclear and mitochondrial phylogenies were found. A species tree with divergence time estimates indicated that ursine bears diversified within less than 2 My. Consistent with a complex branching order within a clade of Asian bear species, we identified unidirectional gene flow from Asian black into sloth bears. Moreover, gene flow detected from brown into American black bears can explain the conflicting placement of the American black bear in mitochondrial and nuclear phylogenies. These results highlight that both incomplete lineage sorting and introgression are prominent evolutionary forces even on time scales up to several million years. Complex evolutionary patterns are not adequately captured by strictly bifurcating models, and can only be fully understood when analyzing multiple independently inherited loci in a coalescence framework. Phylogenetic incongruence among gene trees hence needs to be recognized as a biologically meaningful signal.

Journal ArticleDOI
TL;DR: Platyzoan paraphyly suggests that the last common ancestor of Spiralia was a simple-bodied organism lacking coelomic cavities, segmentation, and complex brain structures, and that more complex animals such as annelids evolved from such a simply organized ancestor.
Abstract: Based on molecular data, three major clades have been recognized within Bilateria: Deuterostomia, Ecdysozoa, and Spiralia. Within Spiralia, small-sized and simply organized animals such as flatworms, gastrotrichs, and gnathostomulids have recently been grouped together as Platyzoa. However, the representation of putative platyzoans was low in the respective molecular phylogenetic studies, in terms of both taxon number and sequence data. Furthermore, increased substitution rates in platyzoan taxa raised the possibility that monophyletic Platyzoa represents an artifact due to long-branch attraction. In order to overcome such problems, we employed a phylogenomic approach, thereby substantially increasing 1) the number of sampled species within Platyzoa and 2) species-specific sequence coverage in data sets of up to 82,162 amino acid positions. Using established and new measures (long-branch score), we disentangled phylogenetic signal from misleading effects such as long-branch attraction. In doing so, our phylogenomic analyses did not recover a monophyletic origin of platyzoan taxa that, instead, appeared paraphyletic with respect to the other spiralians. Platyhelminthes and Gastrotricha formed a monophylum, which we name Rouphozoa. To the exclusion of Gnathifera, Rouphozoa and all other spiralians represent a monophyletic group, which we name Platytrochozoa. Platyzoan paraphyly suggests that the last common ancestor of Spiralia was a simple-bodied organism lacking coelomic cavities, segmentation, and complex brain structures, and that more complex animals such as annelids evolved from such a simply organized ancestor. This conclusion contradicts alternative evolutionary scenarios proposing an annelid-like ancestor of Bilateria and Spiralia and several independent events of secondary reduction.

Journal ArticleDOI
TL;DR: It is demonstrated that many of the candidate SNPs were false positives that were linked to selected sites over distances much larger than the typical linkage disequilibrium range of Drosophila melanogaster, and computer simulations confirm that such patterns could be readily replicated when strong selection acts on rare haplotypes.
Abstract: Experimental evolution in combination with whole-genome sequencing (evolve and resequence [E&R]) is a promising approach to define the genotype–phenotype map and to understand adaptation in evolving populations. Many previous studies have identified a large number of putative selected sites (i.e., candidate loci), but it remains unclear to what extent these loci are genuine targets of selection or experimental noise. To address this question, we exposed the same founder population to two different selection regimes (a hot environment and a cold environment) and quantified the genomic response in each. We detected large numbers of putative selected loci in both environments, albeit with little overlap between the two sets of candidates, indicating that most resulted from habitat-specific selection. By quantifying changes across multiple independent biological replicates, we demonstrate that many of the candidate SNPs were false positives that were linked to selected sites over distances much larger than the typical linkage disequilibrium range of Drosophila melanogaster. We show that many of these mid- to long-range associations were attributable to large segregating inversions and confirm by computer simulations that such patterns could be readily replicated when strong selection acts on rare haplotypes. In light of our findings, we outline recommendations to improve the performance of future Drosophila E&R studies which include using species with negligible inversion loads, such as D. mauritiana and D. simulans, instead of D. melanogaster.

Journal ArticleDOI
TL;DR: Insight is provided into the adaptation process and lessons important for the future implementation of ALE as a tool for scientific research and engineering are yielded.
Abstract: Adaptive laboratory evolution (ALE) has emerged as a valuable method by which to investigate microbial adaptation to a desired environment. Here, we performed ALE to 42 °C of ten parallel populations of Escherichia coli K-12 MG1655 grown in glucose minimal media. Tightly controlled experimental conditions allowed selection based on exponential-phase growth rate, yielding strains that uniformly converged toward a similar phenotype along distinct genetic paths. Adapted strains possessed as few as 6 and as many as 55 mutations, and of the 144 genes that mutated in total, 14 arose independently across two or more strains. This mutational recurrence pointed to the key genetic targets underlying the evolved fitness increase. Genome engineering was used to introduce the novel ALE-acquired alleles in random combinations into the ancestral strain, and competition between these engineered strains reaffirmed the impact of the key mutations on the growth rate at 42 °C. Interestingly, most of the identified key gene targets differed significantly from those found in similar temperature adaptation studies, highlighting the sensitivity of genetic evolution to experimental conditions and ancestral genotype. Additionally, transcriptomic analysis of the ancestral and evolved strains revealed a general trend for restoration of the global expression state back toward preheat stressed levels. This restorative effect was previously documented following evolution to metabolic perturbations, and thus may represent a general feature of ALE experiments. The widespread evolved expression shifts were enabled by a comparatively scant number of regulatory mutations, providing a net fitness benefit but causing suboptimal expression levels for certain genes, such as those governing flagellar formation, which then became targets for additional ameliorating mutations. Overall, the results of this study provide insight into the adaptation process and yield lessons important for the future implementation of ALE as a tool for scientific research and engineering.