Showing papers in "Molecular Biology and Evolution in 2021"


Journal ArticleDOI
TL;DR: New functionalities and major improvements of the BUSCO software are presented, as well as the renewal and expansion of the underlying datasets in sync with the OrthoDB v10 release.
Abstract: Methods for evaluating the quality of genomic and metagenomic data are essential to aid genome assembly procedures and to correctly interpret the results of subsequent analyses. BUSCO estimates the completeness and redundancy of processed genomic data based on universal single-copy orthologs. Here, we present new functionalities and major improvements of the BUSCO software, as well as the renewal and expansion of the underlying data sets in sync with the OrthoDB v10 release. Among the major novelties, BUSCO now enables phylogenetic placement of the input sequence to automatically select the most appropriate BUSCO data set for the assessment, allowing the analysis of metagenome-assembled genomes of unknown origin. A newly introduced genome workflow improves efficiency and runtimes, especially on large eukaryotic genomes. BUSCO is the only tool capable of assessing both eukaryotic and prokaryotic species, and can be applied to various data types, from genome assemblies and metagenomic bins, to transcriptomes and gene sets.
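To make the idea of "completeness and redundancy" concrete, the following is a minimal sketch of how counts of expected single-copy orthologs could be turned into summary percentages. It assumes the commonly reported categories (complete single-copy, complete duplicated, fragmented, missing); the function name `ortholog_summary` and the example counts are hypothetical and are not part of the BUSCO software.

```python
def ortholog_summary(single: int, duplicated: int, fragmented: int, missing: int) -> str:
    """Summarize ortholog counts as percentages of the total marker set.

    Illustrative only: the four categories are assumed inputs, and this
    function is a hypothetical helper, not a BUSCO API.
    """
    total = single + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("at least one marker is required")

    def pct(count: int) -> str:
        # Percentage of the full marker set, formatted to one decimal place.
        return f"{100 * count / total:.1f}%"

    complete = single + duplicated  # completeness: markers found in full
    return (
        f"C:{pct(complete)}[S:{pct(single)},D:{pct(duplicated)}],"
        f"F:{pct(fragmented)},M:{pct(missing)},n:{total}"
    )


if __name__ == "__main__":
    # Hypothetical counts for a 100-marker data set.
    print(ortholog_summary(single=90, duplicated=5, fragmented=3, missing=2))
```

In this sketch, the duplicated fraction serves as the redundancy signal, while the combined complete fraction serves as the completeness signal.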

1,181 citations


Journal ArticleDOI
TL;DR: eggNOG-mapper v2 is a tool for functional annotation based on precomputed orthology assignments, optimized for vast (meta)genomic data sets.
Abstract: Even though automated functional annotation of genes represents a fundamental step in most genomic and metagenomic workflows, it remains challenging at large scales. Here, we describe a major upgrade to eggNOG-mapper, a tool for functional annotation based on precomputed orthology assignments, now optimized for vast (meta)genomic data sets. Improvements in version 2 include a full update of both the genomes and functional databases to those from eggNOG v5, as well as several efficiency enhancements and new features. Most notably, eggNOG-mapper v2 now allows for: (i) de novo gene prediction from raw contigs, (ii) built-in pairwise orthology prediction, (iii) fast protein domain discovery, and (iv) automated GFF decoration. eggNOG-mapper v2 is available as a standalone tool or as an online service at http://eggnog-mapper.embl.de.

576 citations


Journal ArticleDOI
TL;DR: Results of phylogenetic analyses, in particular those conducted under the default settings of current phylogenetic inference tools, as well as downstream analyses on the inferred phylogenies, should be considered and interpreted with extreme caution.
Abstract: Numerous studies covering some aspects of SARS-CoV-2 data analyses are being published on a daily basis, including a regularly updated phylogeny on nextstrain.org. Here, we review the difficulties of inferring reliable phylogenies by example of a data snapshot comprising a quality-filtered subset of 8,736 out of all 16,453 virus sequences available on May 5, 2020 from gisaid.org. We find that it is difficult to infer a reliable phylogeny on these data due to the large number of sequences in conjunction with the low number of mutations. We further find that rooting the inferred phylogeny with some degree of confidence either via the bat and pangolin outgroups or by applying novel computational methods on the ingroup phylogeny does not appear to be credible. Finally, an automatic classification of the current sequences into subclasses using the mPTP tool for molecular species delimitation is also, as might be expected, not possible, as the sequences are too closely related. We conclude that, although the application of phylogenetic methods to disentangle the evolution and spread of COVID-19 provides some insight, results of phylogenetic analyses, in particular those conducted under the default settings of current phylogenetic inference tools, as well as downstream analyses on the inferred phylogenies, should be considered and interpreted with extreme caution.

120 citations


Journal ArticleDOI
TL;DR: As 3′-UTR-based phylotranscriptomics resolved the avian family-level tree of life, it is suggested that this procedure will also resolve the all-species avian tree of life.
Abstract: Presumably, due to a rapid early diversification, major parts of the higher-level phylogeny of birds are still resolved controversially in different analyses or are considered unresolvable. To address this problem, we produced an avian tree of life, which includes molecular sequences of one or several species of ∼90% of the currently recognized family-level taxa (429 species, 379 genera) including all 106 family-level taxa of the nonpasserines and 115 of the passerines (Passeriformes). The unconstrained analyses of noncoding 3-prime untranslated region (3'-UTR) sequences and those of coding sequences yielded different trees. In contrast to the coding sequences, the 3'-UTR sequences resulted in a well-resolved and stable tree topology. The 3'-UTR contained, unexpectedly, transcription factor binding motifs that were specific for different higher-level taxa. In this tree, grebes and flamingos are the sister clade of all other Neoaves, which are subdivided into five major clades. All nonpasserine taxa were placed with robust statistical support including the long-time enigmatic hoatzin (Opisthocomiformes), which was found to be the sister taxon of the Caprimulgiformes. The comparatively late radiation of family-level clades of the songbirds (oscine Passeriformes) contrasts with the attenuated diversification of nonpasseriform taxa since the early Miocene. This correlates with the evolution of vocal production learning, an important speciation factor, which is ancestral for songbirds and evolved convergently only in hummingbirds and parrots. As 3'-UTR-based phylotranscriptomics resolved the avian family-level tree of life, we suggest that this procedure will also resolve the all-species avian tree of life.

84 citations


Journal ArticleDOI
TL;DR: The ggtreeExtra package is presented for visualizing heterogeneous data with a phylogenetic tree in a circular or rectangular layout.
Abstract: We present the ggtreeExtra package for visualizing heterogeneous data with a phylogenetic tree in a circular or rectangular layout (https://www.bioconductor.org/packages/ggtreeExtra). The package supports more data types and visualization methods than other tools. It supports using the grammar of graphics syntax to present data on a tree with richly annotated layers and allows evolutionary statistics inferred by commonly used software to be integrated and visualized with external data. ggtreeExtra is a universal tool for tree data visualization. It extends the applications of the phylogenetic tree in different disciplines by making more domain-specific data available to visualize and interpret in the evolutionary context.

84 citations


Journal ArticleDOI
TL;DR: A rapid analytical pipeline is described and applied to analyze the spatiotemporal dispersal history and dynamics of SARS-CoV-2 lineages and has the potential to be quickly applied to other countries or regions, with key benefits in complementing epidemiological analyses in assessing the impact of intervention measures or their progressive easement.
Abstract: Since the start of the COVID-19 pandemic, an unprecedented number of genomic sequences of SARS-CoV-2 have been generated and shared with the scientific community. The unparalleled volume of available genetic data presents a unique opportunity to gain real-time insights into the virus transmission during the pandemic, but also a daunting computational hurdle if analyzed with gold-standard phylogeographic approaches. To tackle this practical limitation, we here describe and apply a rapid analytical pipeline to analyze the spatiotemporal dispersal history and dynamics of SARS-CoV-2 lineages. As a proof of concept, we focus on the Belgian epidemic, which has had one of the highest spatial densities of available SARS-CoV-2 genomes. Our pipeline has the potential to be quickly applied to other countries or regions, with key benefits in complementing epidemiological analyses in assessing the impact of intervention measures or their progressive easement.

77 citations


Journal ArticleDOI
TL;DR: An evolutionarily informed approach to attenuation is proposed that, unusually, seeks to increase usage of the already most common synonymous codons in SARS-CoV-2 genes.
Abstract: Large-scale re-engineering of synonymous sites is a promising strategy to generate vaccines either through synthesis of attenuated viruses or via codon optimized genes in DNA vaccines. Attenuation typically relies on de-optimisation of codon pairs and maximization of CpG dinucleotide frequencies. So as to formulate evolutionarily-informed attenuation strategies that aim to force nucleotide usage against the direction favoured by selection, here we examine available whole-genome sequences of SARS-CoV-2 to infer patterns of mutation and selection on synonymous sites. Analysis of mutational profiles indicates a strong mutation bias towards U. In turn, analysis of observed synonymous site composition implicates selection against U. Accounting for dinucleotide effects reinforces this conclusion, observed UU content being a quarter of that expected under neutrality. Possible mechanisms of selection against U mutations include selection for higher expression, for high mRNA stability or lower immunogenicity of viral genes. Consistent with gene-specific selection against CpG dinucleotides, we observe systematic differences of CpG content between SARS-CoV-2 genes. We propose an evolutionarily-informed approach to attenuation that, unusually, seeks to increase usage of the already most common synonymous codons. Comparable analysis of H1N1 and Ebola finds that GC3 deviated from neutral equilibrium is not a universal feature, cautioning against generalization of results.
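The comparison of observed and expected dinucleotide content mentioned above can be illustrated with a minimal sketch. Note the assumption: here the expectation comes from a simple independence null (the product of the two single-base frequencies), which is a stand-in and not the neutral mutational-equilibrium expectation used in the paper. The function name `dinucleotide_obs_exp` and the toy sequences are hypothetical.

```python
from collections import Counter


def dinucleotide_obs_exp(seq: str, dinuc: str = "UU") -> float:
    """Observed/expected ratio for one dinucleotide in an RNA sequence.

    Assumption: "expected" is the count predicted if adjacent bases were
    independent, i.e. f(first) * f(second) * (number of adjacent pairs).
    This is a simplified null, not the paper's neutral-equilibrium model.
    """
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    dinuc = dinuc.upper().replace("T", "U")
    n = len(seq)
    if n < 2 or len(dinuc) != 2:
        raise ValueError("need a sequence of length >= 2 and a 2-base motif")

    base_counts = Counter(seq)
    # Count overlapping occurrences of the dinucleotide.
    observed = sum(1 for i in range(n - 1) if seq[i:i + 2] == dinuc)
    expected = (base_counts[dinuc[0]] / n) * (base_counts[dinuc[1]] / n) * (n - 1)
    if expected == 0:
        raise ValueError("a base of the dinucleotide is absent from the sequence")
    return observed / expected


if __name__ == "__main__":
    # Toy sequence: a ratio below 1 indicates depletion relative to the null.
    print(dinucleotide_obs_exp("AUGCUAGCUAUCGAUCGUA", "UU"))
```

A ratio well below 1 under this null indicates that the dinucleotide is rarer than base composition alone would predict; a ratio of about 0.25 would correspond to the kind of fourfold depletion described in the abstract, although the paper's own expectation is derived differently.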

68 citations


Journal ArticleDOI
TL;DR: In this paper, the authors systematically test these hypotheses by synthesizing 15 previous phylogenomic studies and performing new standardized analyses under consistent conditions with additional models, and they find that Ctenophora-sister is recovered across the full range of examined conditions, and Porifera-sister is recovered in some analyses under narrow conditions when most outgroups are excluded and site-heterogeneous CAT models are used.
Abstract: Identifying our most distant animal relatives has emerged as one of the most challenging problems in phylogenetics. This debate has major implications for our understanding of the origin of multicellular animals and of the earliest events in animal evolution, including the origin of the nervous system. Some analyses identify sponges as our most distant animal relatives (Porifera-sister hypothesis), and others identify comb jellies (Ctenophora-sister hypothesis). These analyses vary in many respects, making it difficult to interpret previous tests of these hypotheses. To gain insight into why different studies yield different results, an important next step in the ongoing debate, we systematically test these hypotheses by synthesizing 15 previous phylogenomic studies and performing new standardized analyses under consistent conditions with additional models. We find that Ctenophora-sister is recovered across the full range of examined conditions, and Porifera-sister is recovered in some analyses under narrow conditions when most outgroups are excluded and site-heterogeneous CAT models are used. We additionally find that the number of categories in site-heterogeneous models is sufficient to explain the Porifera-sister results. Furthermore, our cross-validation analyses show CAT models that recover Porifera-sister have hundreds of additional categories and fail to fit significantly better than site-heterogeneous models with far fewer categories. Systematic and standardized testing of diverse phylogenetic models suggests that we should be skeptical of Porifera-sister results both because they are recovered under such narrow conditions and because the models in these conditions fit the data no better than other models that recover Ctenophora-sister.

67 citations


Journal ArticleDOI
TL;DR: In this article, the authors identified homologs of the genes underlying this phenotype, cifA and cifB, in 52 of 71 new and published Wolbachia genome sequences.
Abstract: Cytoplasmic incompatibility is a selfish reproductive manipulation induced by the endosymbiont Wolbachia in arthropods. In males Wolbachia modifies sperm, leading to embryonic mortality in crosses with Wolbachia-free females. In females, Wolbachia rescues the cross and allows development to proceed normally. This provides a reproductive advantage to infected females, allowing the maternally transmitted symbiont to spread rapidly through host populations. We identified homologs of the genes underlying this phenotype, cifA and cifB, in 52 of 71 new and published Wolbachia genome sequences. They are strongly associated with cytoplasmic incompatibility. There are up to seven copies of the genes in each genome, and phylogenetic analysis shows that Wolbachia frequently acquires new copies due to pervasive horizontal transfer between strains. In many cases, the genes have subsequently acquired loss-of-function mutations to become pseudogenes. As predicted by theory, this tends to occur first in cifB, whose sole function is to modify sperm, and then in cifA, which is required to rescue the cross in females. Although cif genes recombine, recombination is largely restricted to closely related homologs. This is predicted under a model of coevolution between sperm modification and embryonic rescue, where recombination between distantly related pairs of genes would create a self-incompatible strain. Together, these patterns of gene gain, loss, and recombination support evolutionary models of cytoplasmic incompatibility.

62 citations


Journal ArticleDOI
TL;DR: The peroxidase evolution in Agaricomycetes is analyzed by ancestral-sequence reconstruction revealing several major evolutionary pathways, and the appearance of the different enzyme types in a time-calibrated species tree is mapped.
Abstract: As actors in the global carbon cycle, Agaricomycetes (Basidiomycota) have developed complex enzymatic machineries that allow them to decompose all plant polymers, including lignin. Among them, saprotrophic Agaricales are characterized by an unparalleled diversity of habitats and lifestyles. Comparative analysis of 52 Agaricomycetes genomes (14 of them sequenced de novo) reveals that Agaricales possess a large diversity of hydrolytic and oxidative enzymes for lignocellulose decay. Based on the gene families with the predicted highest evolutionary rates (namely cellulose-binding CBM1, glycoside hydrolase GH43, lytic polysaccharide monooxygenase AA9, class-II peroxidases, glucose-methanol-choline oxidase/dehydrogenases, laccases, and unspecific peroxygenases), we reconstructed the lifestyles of the ancestors that led to the extant lignocellulose-decomposing Agaricomycetes. The changes in the enzymatic toolkit of ancestral Agaricales are correlated with the evolution of their ability to grow not only on wood but also on leaf litter and decayed wood, with grass-litter decomposers as the most recent eco-physiological group. In this context, the above families were analyzed in detail in connection with lifestyle diversity. Peroxidases appear as a central component of the enzymatic toolkit of saprotrophic Agaricomycetes, consistent with their essential role in lignin degradation and high evolutionary rates. This includes not only expansions/losses in peroxidase genes common to other basidiomycetes but also the widespread presence in Agaricales (and Russulales) of new peroxidase types not found in wood-rotting Polyporales and other Agaricomycetes orders. Therefore, we analyzed the peroxidase evolution in Agaricomycetes by ancestral-sequence reconstruction, revealing several major evolutionary pathways, and mapped the appearance of the different enzyme types in a time-calibrated species tree.

57 citations


Journal ArticleDOI
TL;DR: Molecular dating indicates that the Acropora ancestor survived warm periods without sea ice from the mid or late Cretaceous to the Early Eocene and that diversification of Acroporid taxa may have been enhanced by subsequent cooling periods, and the scleractinian gene repertoire is highly conserved.
Abstract: The genus Acropora comprises the most diverse and abundant scleractinian corals (Anthozoa, Cnidaria) in coral reefs, the most diverse marine ecosystems on Earth. However, the genetic basis for the success and wide distribution of Acropora is unknown. Here, we sequenced complete genomes of 15 Acropora species and 3 other acroporid taxa belonging to the genera Montipora and Astreopora to examine genomic novelties that explain their evolutionary success. We successfully obtained reasonable draft genomes of all 18 species. Molecular dating indicates that the Acropora ancestor survived warm periods without sea ice from the mid or late Cretaceous to the Early Eocene and that diversification of Acropora may have been enhanced by subsequent cooling periods. In general, the scleractinian gene repertoire is highly conserved; however, coral- or cnidarian-specific possible stress response genes are tandemly duplicated in Acropora. Enzymes that cleave dimethylsulfoniopropionate into dimethyl sulfide, which promotes cloud formation and combats greenhouse gasses, are the most duplicated genes in the Acropora ancestor. These may have been acquired by horizontal gene transfer from algal symbionts belonging to the family Symbiodiniaceae, or from coccolithophores, suggesting that although functions of this enzyme in Acropora are unclear, Acropora may have survived warmer marine environments in the past by enhancing cloud formation. In addition, possible antimicrobial peptides and symbiosis-related genes are under positive selection in Acropora, perhaps enabling adaptation to diverse environments. Our results suggest unique Acropora adaptations to ancient, warm marine environments and provide insights into its capacity to adjust to rising seawater temperatures.

Journal ArticleDOI
TL;DR: In this article, the authors show that application of three widely used herbicides (glyphosate, glufosinate, and dicamba) increases the prevalence of antibiotic resistance genes (ARGs) and mobile genetic elements (MGEs) in soil microbiomes without clear changes in the abundance, diversity and composition of bacterial communities.
Abstract: Herbicides are among the most widely used chemicals in agriculture. While they are known to be harmful to nontarget organisms, the effects of herbicides on the composition and functioning of soil microbial communities remain unclear. Here we show that application of three widely used herbicides (glyphosate, glufosinate, and dicamba) increases the prevalence of antibiotic resistance genes (ARGs) and mobile genetic elements (MGEs) in soil microbiomes without clear changes in the abundance, diversity and composition of bacterial communities. Mechanistically, these results could be explained by a positive selection for more tolerant genotypes that acquired several mutations in previously well-characterized herbicide and ARGs. Moreover, herbicide exposure increased cell membrane permeability and conjugation frequency of multidrug resistance plasmids, promoting ARG movement between bacteria. A similar pattern was found in agricultural soils across 11 provinces in China, where herbicide application, and the levels of glyphosate residues in soils, were associated with increased ARG and MGE abundances relative to herbicide-free control sites. Together, our results show that herbicide application can enrich ARGs and MGEs by changing the genetic composition of soil microbiomes, potentially contributing to the global antimicrobial resistance problem in agricultural environments.

Journal ArticleDOI
TL;DR: In this article, the authors analyzed 6,749 experimentally determined variant effects from multiplexed assays on abundance and activity in two proteins (NUDT15 and PTEN) to quantify these effects and find that a third of the variants cause loss of function, and about half of loss-of-function variants also have low cellular abundance.
Abstract: Understanding and predicting how amino acid substitutions affect proteins are keys to our basic understanding of protein function and evolution. Amino acid changes may affect protein function in a number of ways including direct perturbations of activity or indirect effects on protein folding and stability. We have analyzed 6,749 experimentally determined variant effects from multiplexed assays on abundance and activity in two proteins (NUDT15 and PTEN) to quantify these effects and find that a third of the variants cause loss of function, and about half of loss-of-function variants also have low cellular abundance. We analyze the structural and mechanistic origins of loss of function and use the experimental data to find residues important for enzymatic activity. We performed computational analyses of protein stability and evolutionary conservation and show how we may predict positions where variants cause loss of activity or abundance. In this way, our results link thermodynamic stability and evolutionary conservation to experimental studies of different properties of protein fitness landscapes.

Journal ArticleDOI
TL;DR: Beyond providing a rich empirical resource for delineating the precise functions of H. armigera ORs, the results enable a comparative analysis of insect ORs that have apparently facilitated and currently sustain the intimate adaptations and ecological interactions among nectar feeding insects and flowering plants.
Abstract: Odorant receptors (ORs) are essential for plant-insect interactions. However, despite the global impacts of Lepidoptera (moths and butterflies) as major herbivores and pollinators, little functional data are available about Lepidoptera ORs involved in plant-volatile detection. Here, we initially characterized the plant-volatile-sensing function(s) of 44 ORs from the cotton bollworm Helicoverpa armigera, and subsequently conducted a large-scale comparative analysis that establishes how most orthologous ORs have functionally diverged among closely related species whereas some rare ORs are functionally conserved. Specifically, our systematic analysis of H. armigera ORs cataloged the wide functional scope of the H. armigera OR repertoire, and also showed that HarmOR42 and its Spodoptera littoralis ortholog are functionally conserved. Pursuing this, we characterized the HarmOR42-orthologous ORs from 11 species across the Glossata suborder and confirmed the HarmOR42 orthologs form a unique OR lineage that has undergone strong purifying selection in Glossata species and whose members are tuned with strong specificity to phenylacetaldehyde, a floral scent component common to most angiosperms. In vivo studies via HarmOR42 knockout support that HarmOR42-related ORs are essential for host-detection by sensing phenylacetaldehyde. Our work also supports that these ORs coevolved with the tube-like proboscis, and have maintained functional stability throughout the long-term coexistence of Lepidoptera with angiosperms. Thus, beyond providing a rich empirical resource for delineating the precise functions of H. armigera ORs, our results enable a comparative analysis of insect ORs that have apparently facilitated and currently sustain the intimate adaptations and ecological interactions among nectar feeding insects and flowering plants.

Journal ArticleDOI
TL;DR: Observed genotypic associations with hypoxia-induced phenotypes may reflect second-order consequences of selection-mediated changes in other (unmeasured) traits that are coupled with the focal trait via feedback regulation.
Abstract: Population genomic analyses of high-altitude humans and other vertebrates have identified numerous candidate genes for hypoxia adaptation, and the physiological pathways implicated by such analyses suggest testable hypotheses about underlying mechanisms. Studies of highland natives that integrate genomic data with experimental measures of physiological performance capacities and subordinate traits are revealing associations between genotypes (e.g., hypoxia-inducible factor [HIF] gene variants) and hypoxia-responsive phenotypes. The subsequent search for causal mechanisms is complicated by the fact that observed genotypic associations with hypoxia-induced phenotypes may reflect second-order consequences of selection-mediated changes in other (unmeasured) traits that are coupled with the focal trait via feedback regulation. Manipulative experiments to decipher circuits of feedback control and patterns of phenotypic integration can help identify causal relationships that underlie observed genotype-phenotype associations. Such experiments are critical for correct inferences about phenotypic targets of selection and mechanisms of adaptation.

Journal ArticleDOI
TL;DR: This study provides new insights into the diversification of incipient sex chromosomes in flowering plants by showing how transposition and rearrangement of a single gene can control sex in both XY and ZW systems.
Abstract: This article is available under the Creative Commons CC-BY-NC license and permits non-commercial use, distribution and reproduction in any medium, provided the original work is properly cited.

Journal ArticleDOI
TL;DR: In this article, the divergence time of crown embryophytes is estimated based on three fossil calibration strategies, and it is shown that maximum calibration constraints have a major effect on estimating the time of origin of land plants.
Abstract: The relationships among the four major embryophyte lineages (mosses, liverworts, hornworts, vascular plants) and the timing of the origin of land plants are enigmatic problems in plant evolution. Here, we resolve the monophyly of bryophytes by improving taxon sampling of hornworts and eliminating the effect of synonymous substitutions. We then estimate the divergence time of crown embryophytes based on three fossil calibration strategies, and reveal that maximum calibration constraints have a major effect on estimating the time of origin of land plants. Moreover, comparison of priors and posteriors provides a guide for evaluating the optimal calibration strategy. By considering the reliability of fossil calibrations and the influences of molecular data, we estimate that land plants originated in the Precambrian (980-682 Ma), much older than widely recognized. Our study highlights the important contribution of molecular data when faced with contentious fossil evidence, and that fossil calibrations used in estimating the timescale of plant evolution require critical scrutiny.

Journal ArticleDOI
TL;DR: In this paper, the authors investigated the effect of taxonomic sampling via sequential deletion of basally branching pseudoscorpion superfamilies, as well as varying gene occupancy thresholds in supermatrices.
Abstract: Long-branch attraction is a systematic artifact that results in erroneous groupings of fast-evolving taxa. The combination of short, deep internodes in tandem with long-branch attraction artifacts has produced empirically intractable parts of the Tree of Life. One such group is the arthropod subphylum Chelicerata, whose backbone phylogeny has remained unstable despite improvements in phylogenetic methods and genome-scale data sets. Pseudoscorpion placement is particularly variable across data sets and analytical frameworks, with this group either clustering with other long-branch orders or with Arachnopulmonata (scorpions and tetrapulmonates). To surmount long-branch attraction, we investigated the effect of taxonomic sampling via sequential deletion of basally branching pseudoscorpion superfamilies, as well as varying gene occupancy thresholds in supermatrices. We show that concatenated supermatrices and coalescent-based summary species tree approaches support a sister group relationship of pseudoscorpions and scorpions, when more of the basally branching taxa are sampled. Matrix completeness had demonstrably less influence on tree topology. As an external arbiter of phylogenetic placement, we leveraged the recent discovery of an ancient genome duplication in the common ancestor of Arachnopulmonata as a litmus test for competing hypotheses of pseudoscorpion relationships. We generated a high-quality developmental transcriptome and the first genome for pseudoscorpions to assess the incidence of arachnopulmonate-specific duplications (e.g., homeobox genes and miRNAs). Our results support the inclusion of pseudoscorpions in Arachnopulmonata (new definition), as the sister group of scorpions. Panscorpiones (new name) is proposed for the clade uniting Scorpiones and Pseudoscorpiones.

Journal ArticleDOI
TL;DR: It is suggested that phylogenetic evidence alone is unlikely to identify the origin of the SARS-CoV-2 virus and it is cautioned against strong inferences regarding the early spread of the virus based solely on such evidence.
Abstract: The rooting of the SARS-CoV-2 phylogeny is important for understanding the origin and early spread of the virus. Previously published phylogenies have used different rootings that do not always provide consistent results. We investigate several different strategies for rooting the SARS-CoV-2 tree and provide measures of statistical uncertainty for all methods. We show that methods based on the molecular clock tend to place the root in the B clade, whereas methods based on outgroup rooting tend to place the root in the A clade. The results from the two approaches are statistically incompatible, possibly as a consequence of deviations from a molecular clock or excess back-mutations. We also show that none of the methods provide strong statistical support for the placement of the root in any particular edge of the tree. These results suggest that phylogenetic evidence alone is unlikely to identify the origin of the SARS-CoV-2 virus and we caution against strong inferences regarding the early spread of the virus based solely on such evidence.

Journal ArticleDOI
TL;DR: The likely most recent common ancestor of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was reconstructed through a novel application and advancement of computational methods initially developed to infer the mutational history of tumor cells in a patient.
Abstract: Global sequencing of genomes of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has continued to reveal new genetic variants that are the key to unraveling its early evolutionary history and tracking its global spread over time. Here we present the heretofore cryptic mutational history and spatiotemporal dynamics of SARS-CoV-2 from an analysis of thousands of high-quality genomes. We report the likely most recent common ancestor of SARS-CoV-2, reconstructed through a novel application and advancement of computational methods initially developed to infer the mutational history of tumor cells in a patient. This progenitor genome differs from genomes of the first coronaviruses sampled in China by three variants, implying that none of the earliest patients represent the index case or gave rise to all the human infections. However, multiple coronavirus infections in China and the United States harbored the progenitor genetic fingerprint in January 2020 and later, suggesting that the progenitor was spreading worldwide months before and after the first reported cases of COVID-19 in China. Mutations of the progenitor and its offshoots have produced many dominant coronavirus strains that have spread episodically over time. Fingerprinting based on common mutations reveals that the same coronavirus lineage has dominated North America for most of the pandemic in 2020. There have been multiple replacements of predominant coronavirus strains in Europe and Asia as well as continued presence of multiple high-frequency strains in Asia and North America. We have developed a continually updating dashboard of global evolution and spatiotemporal trends of SARS-CoV-2 spread (http://sars2evo.datamonkey.org/).

Journal ArticleDOI
TL;DR: A database of SARS-CoV-2 phylogenetic trees inferred with unrestricted public sequences is presented; it is updated daily to incorporate new sequences.
Abstract: The vast scale of SARS-CoV-2 sequencing data has made it increasingly challenging to comprehensively analyze all available data using existing tools and file formats. To address this, we present a database of SARS-CoV-2 phylogenetic trees inferred with unrestricted public sequences, which we update daily to incorporate new sequences. Our database uses the recently proposed mutation-annotated tree (MAT) format to efficiently encode the tree with branches labeled with parsimony-inferred mutations, as well as Nextstrain clade and Pango lineage labels at clade roots. As of June 9, 2021, our SARS-CoV-2 MAT consists of 834,521 sequences and provides a comprehensive view of the virus' evolutionary history using public data. We also present matUtils, a command-line utility for rapidly querying, interpreting and manipulating the MATs. Our daily-updated SARS-CoV-2 MAT database and matUtils software are available at http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/ and https://github.com/yatisht/usher, respectively.

Journal ArticleDOI
TL;DR: The first high-quality genome assembly of the American paddlefish (Polyodon spathula) at a chromosome level is generated, providing a genetic resource for understanding chromosomal evolution in polyploid nonteleost fishes and bone mineralization in early vertebrates.
Abstract: Sturgeons and paddlefishes (Acipenseriformes) occupy the basal position of ray-finned fishes, although they have cartilaginous skeletons as in Chondrichthyes. This evolutionary status and their morphological specializations make them a research focus, but their complex genomes (polyploidy and the presence of microchromosomes) bring obstacles and challenges to molecular studies. Here, we generated the first high-quality genome assembly of the American paddlefish (Polyodon spathula) at a chromosome level. Comparative genomic analyses revealed a recent species-specific whole-genome duplication event, and extensive chromosomal changes, including head-to-head fusions of pairs of intact, large ancestral chromosomes within the paddlefish. We also provide an overview of the paddlefish SCPP (secretory calcium-binding phosphoprotein) repertoire that is responsible for tissue mineralization, demonstrating that the earliest flourishing of SCPP members occurred at least before the split between Acipenseriformes and teleosts. In summary, this genome assembly provides a genetic resource for understanding chromosomal evolution in polyploid nonteleost fishes and bone mineralization in early vertebrates.

Journal ArticleDOI
TL;DR: In this article, the authors examine how the presence of direct purifying selection and background selection may bias demographic inference by evaluating two commonly-used methods (MSMC and fastsimcoal2), specifically studying how the underlying shape of the distribution of fitness effects and the fraction of directly selected sites interact with demographic parameter estimation.
Abstract: Current procedures for inferring population history generally assume complete neutrality; that is, they neglect both direct selection and the effects of selection on linked sites. We here examine how the presence of direct purifying selection and background selection may bias demographic inference by evaluating two commonly used methods (MSMC and fastsimcoal2), specifically studying how the underlying shape of the distribution of fitness effects and the fraction of directly selected sites interact with demographic parameter estimation. The results show that, even after masking functional genomic regions, background selection may cause the mis-inference of population growth under models of both constant population size and decline. This effect is amplified as the strength of purifying selection and the density of directly selected sites increase, as indicated by the distortion of the site frequency spectrum and levels of nucleotide diversity at linked neutral sites. We also show how simulated changes in background selection effects caused by population size changes can be predicted analytically. We propose a potential method for correcting for the mis-inference of population growth caused by selection. By treating the distribution of fitness effects as a nuisance parameter and averaging across all potential realizations, we demonstrate that even directly selected sites can be used to infer demographic histories with reasonable accuracy.

Journal ArticleDOI
TL;DR: This study reveals how bumblebee genes and genomes have evolved across the Bombus phylogeny and identifies variations potentially linked to key ecological and behavioral traits of these important pollinators.
Abstract: Bumblebees are a diverse group of globally important pollinators in natural ecosystems and for agricultural food production. With both eusocial and solitary life-cycle phases, and some social parasite species, they are especially interesting models to understand social evolution, behavior, and ecology. Reports of many species in decline point to pathogen transmission, habitat loss, pesticide usage, and global climate change, as interconnected causes. These threats to bumblebee diversity make our reliance on a handful of well-studied species for agricultural pollination particularly precarious. To broadly sample bumblebee genomic and phenotypic diversity, we de novo sequenced and assembled the genomes of 17 species, representing all 15 subgenera, producing the first genus-wide quantification of genetic and genomic variation potentially underlying key ecological and behavioral traits. The species phylogeny resolves subgenera relationships, whereas incomplete lineage sorting likely drives high levels of gene tree discordance. Five chromosome-level assemblies show a stable 18-chromosome karyotype, with major rearrangements creating 25 chromosomes in social parasites. Differential transposable element activity drives changes in genome sizes, with putative domestications of repetitive sequences influencing gene coding and regulatory potential. Dynamically evolving gene families and signatures of positive selection point to genus-wide variation in processes linked to foraging, diet and metabolism, immunity and detoxification, as well as adaptations for life at high altitudes. Our study reveals how bumblebee genes and genomes have evolved across the Bombus phylogeny and identifies variations potentially linked to key ecological and behavioral traits of these important pollinators.

Journal ArticleDOI
TL;DR: This study focuses on foxl2, a central player in the ovary, and characterizes its functional divergence in gibel carp (Carassius gibelio), a recurrent auto-allo-hexaploid fish.
Abstract: Evolutionary fates of duplicated genes have been widely investigated in many polyploid plants and animals, but research is scarce in recurrent polyploids. In this study, we focused on foxl2, a central player in the ovary, and elaborated the functional divergence in gibel carp (Carassius gibelio), a recurrent auto-allo-hexaploid fish. First, we identified three divergent foxl2 homeologs (Cgfoxl2a-B, Cgfoxl2b-A, and Cgfoxl2b-B), each of them possessing three highly conserved alleles, and revealed their biased retention/loss. Then, their abundant sexual dimorphism and biased expression were uncovered in the hypothalamic-pituitary-gonadal axis. Significantly, granulosa cells and three subpopulations of thecal cells were distinguished by cellular localization of CgFoxl2a and CgFoxl2b, and their functional roles and the processes involved were traced in folliculogenesis. Finally, we successfully edited multiple foxl2 homeologs and/or alleles by using CRISPR/Cas9. Cgfoxl2a-B deficiency led to ovary development arrest or complete sex reversal, whereas complete disruption of Cgfoxl2b-A and Cgfoxl2b-B resulted in the depletion of germ cells. Taken together, the detailed cellular localization and functional differences indicate that Cgfoxl2a and Cgfoxl2b have subfunctionalized and cooperated to regulate folliculogenesis and gonad differentiation, and Cgfoxl2b has evolved a new function in oogenesis. Therefore, the current study provides a typical case of homeolog/allele diversification, retention/loss, biased expression, and sub-/neofunctionalization in the evolution of duplicated genes driven by polyploidy and subsequent diploidization from the recurrent polyploid fish.

Journal ArticleDOI
TL;DR: In this paper, the authors identify numerous genetic transfers between distantly related phages and aim at understanding their frequency, consequences, and the conditions favoring them, finding that gene flow tends to occur between phages that are enriched for recombinases, transposases, and nonhomologous end joining, suggesting that both homologous and illegitimate recombination contribute to gene flow.
Abstract: Bacteriophages (phages) evolve rapidly by acquiring genes from other phages. This results in mosaic genomes. Here, we identify numerous genetic transfers between distantly related phages and aim at understanding their frequency, consequences, and the conditions favoring them. Gene flow tends to occur between phages that are enriched for recombinases, transposases, and nonhomologous end joining, suggesting that both homologous and illegitimate recombination contribute to gene flow. Phage family and host phyla are strong barriers to gene exchange, but phage lifestyle is not. Even if we observe four times more recent transfers between temperate phages than between other pairs, there is extensive gene flow between temperate and virulent phages, and between the latter. These predominantly involve virulent phages with large genomes previously classed as low gene flux, and lead to the preferential transfer of genes encoding functions involved in cell energetics, nucleotide metabolism, DNA packaging and injection, and virion assembly. Such exchanges may contribute to the observed twice larger genomes of virulent phages. We used genetic transfers, which occur upon coinfection of a host, to compare phage host range. We found that virulent phages have broader host ranges and can mediate genetic exchanges between narrow host range temperate phages infecting distant bacterial hosts, thus contributing to gene flow between virulent phages, as well as between temperate phages. This gene flow drastically expands the gene repertoires available for phage and bacterial evolution, including the transfer of functional innovations across taxa.

Journal ArticleDOI
TL;DR: Low-occupancy data sets analyzed as nucleotides can result in more congruent relationships than high-occupancy data sets analyzed as amino acids, as in phylotranscriptomics; omitting data, through amino acid translation or via retention of only high-occupancy loci, may have a deleterious effect in phylogenetic reconstruction.
Abstract: Genome-scale data sets are converging on robust, stable phylogenetic hypotheses for many lineages; however, some nodes have shown disagreement across classes of data. We use spiders (Araneae) as a system to identify the causes of incongruence in phylogenetic signal between three classes of data: exons (as in phylotranscriptomics), noncoding regions (included in ultraconserved elements [UCE] analyses), and a combination of both (as in UCE analyses). Gene orthologs, coded as amino acids and nucleotides (with and without third codon positions), were generated by querying published transcriptomes for UCEs, recovering 1,931 UCE loci (codingUCEs). We expected that congeners represented in the codingUCE and UCEs data would form clades in the presence of phylogenetic signal. Noncoding regions derived from UCE sequences were recovered to test the stability of relationships. Phylogenetic relationships resulting from all analyses were largely congruent. All nucleotide data sets from transcriptomes, UCEs, or a combination of both recovered similar topologies in contrast with results from transcriptomes analyzed as amino acids. Most relationships inferred from low-occupancy data sets, containing several hundreds of loci, were congruent across Araneae, as opposed to high occupancy data matrices with fewer loci, which showed more variation. Furthermore, we found that low-occupancy data sets analyzed as nucleotides (as is typical of UCE data sets) can result in more congruent relationships than high occupancy data sets analyzed as amino acids (as in phylotranscriptomics). Thus, omitting data, through amino acid translation or via retention of only high occupancy loci, may have a deleterious effect in phylogenetic reconstruction.

Journal ArticleDOI
TL;DR: In this paper, the authors used whole nuclear, plastid, and organellar genomes from 12 species in the rapidly radiated, ecologically diverse, and actively hybridizing genus of peatmoss (Sphagnum) to reconstruct the species phylogeny and quantify introgression using a suite of phylogenomic methods.
Abstract: The relative importance of introgression for diversification has long been a highly disputed topic in speciation research and remains an open question despite the great attention it has received over the past decade. Gene flow leaves traces in the genome similar to those created by incomplete lineage sorting (ILS), and identification and quantification of gene flow in the presence of ILS is challenging and requires knowledge about the true phylogenetic relationship among the species. We use whole nuclear, plastid, and organellar genomes from 12 species in the rapidly radiated, ecologically diverse, actively hybridizing genus of peatmoss (Sphagnum) to reconstruct the species phylogeny and quantify introgression using a suite of phylogenomic methods. We found extensive phylogenetic discordance among nuclear and organellar phylogenies, as well as across the nuclear genome and the nodes in the species tree, best explained by extensive ILS following the rapid radiation of the genus rather than by postspeciation introgression. Our analyses support the idea of ancient introgression among the ancestral lineages followed by ILS, whereas recent gene flow among the species is highly restricted despite widespread interspecific hybridization known in the group. Our results contribute to phylogenomic understanding of how speciation proceeds in rapidly radiated, actively hybridizing species groups, and demonstrate that employing a combination of diverse phylogenomic methods can facilitate untangling complex phylogenetic patterns created by ILS and introgression.

Journal ArticleDOI
TL;DR: The host–symbiont genomes show not only tight metabolic complementarity but also distinct signatures of coevolution allowing the vesicomyids to thrive in chemosynthesis-based ecosystems.
Abstract: Endosymbiosis with chemosynthetic bacteria has enabled many deep-sea invertebrates to thrive at hydrothermal vents and cold seeps, but most previous studies on this mutualism have focused on the bacteria only. Vesicomyid clams dominate global deep-sea chemosynthesis-based ecosystems. They differ from most deep-sea symbiotic animals in passing their symbionts from parent to offspring, enabling intricate coevolution between the host and the symbiont. Here, we sequenced the genomes of the clam Archivesica marissinica (Bivalvia: Vesicomyidae) and its bacterial symbiont to understand the genomic/metabolic integration behind this symbiosis. At 1.52 Gb, the clam genome encodes 28 genes horizontally transferred from bacteria, a large number of pseudogenes and transposable elements whose massive expansion corresponded to the timing of the rise and subsequent divergence of symbiont-bearing vesicomyids. The genome exhibits gene family expansion in cellular processes that likely facilitate chemoautotrophy, including gas delivery to support energy and carbon production, metabolite exchange with the symbiont, and regulation of the bacteriocyte population. Contraction in cellulase genes is likely adaptive to the shift from phytoplankton-derived to bacteria-based food. It also shows contraction in bacterial recognition gene families, indicative of suppressed immune response to the endosymbiont. The gammaproteobacterium endosymbiont has a reduced genome of 1.03 Mb but retains complete pathways for sulfur oxidation, carbon fixation, and biosynthesis of 20 common amino acids, indicating the host's high dependence on the symbiont for nutrition. Overall, the host-symbiont genomes show not only tight metabolic complementarity but also distinct signatures of coevolution allowing the vesicomyids to thrive in chemosynthesis-based ecosystems.

Journal ArticleDOI
TL;DR: It is found that aphid autosomes have undergone dramatic reorganization over the last 30 My, to the extent that chromosome homology cannot be determined between aphids from the tribes Macrosiphini and Aphidini, making them an important emerging model system for studying the role of large-scale genome rearrangements in evolution.
Abstract: Chromosome rearrangements are arguably the most dramatic type of mutations, often leading to rapid evolution and speciation. However, chromosome dynamics have only been studied at the sequence level in a small number of model systems. In insects, Diptera and Lepidoptera have conserved genome structure at the scale of whole chromosomes or chromosome arms. Whether this reflects the diversity of insect genome evolution is questionable given that many species exhibit rapid karyotype evolution. Here, we investigate chromosome evolution in aphids, an important group of hemipteran plant pests, using newly generated chromosome-scale genome assemblies of the green peach aphid (Myzus persicae) and the pea aphid (Acyrthosiphon pisum), and a previously published assembly of the corn-leaf aphid (Rhopalosiphum maidis). We find that aphid autosomes have undergone dramatic reorganization over the last 30 My, to the extent that chromosome homology cannot be determined between aphids from the tribes Macrosiphini (Myzus persicae and Acyrthosiphon pisum) and Aphidini (Rhopalosiphum maidis). In contrast, gene content of the aphid sex (X) chromosome remained unchanged despite rapid sequence evolution, low gene expression, and high transposable element load. To test whether rapid evolution of genome structure is a hallmark of Hemiptera, we compared our aphid assemblies with chromosome-scale assemblies of two blood-feeding Hemiptera (Rhodnius prolixus and Triatoma rubrofasciata). Despite being more diverged, the blood-feeding hemipterans have conserved synteny. The exceptional rate of structural evolution of aphid autosomes renders them an important emerging model system for studying the role of large-scale genome rearrangements in evolution.