scispace - formally typeset
Search or ask a question

Showing papers by "Detlef Weigel published in 2022"


Journal ArticleDOI
TL;DR: In this paper , a large survey of de novo mutations in the plant Arabidopsis thaliana was conducted and it was shown that mutations occur less often in functionally constrained regions of the genome.
Abstract: Abstract Since the first half of the twentieth century, evolutionary theory has been dominated by the idea that mutations occur randomly with respect to their consequences 1 . Here we test this assumption with large surveys of de novo mutations in the plant Arabidopsis thaliana . In contrast to expectations, we find that mutations occur less often in functionally constrained regions of the genome—mutation frequency is reduced by half inside gene bodies and by two-thirds in essential genes. With independent genomic mutation datasets, including from the largest Arabidopsis mutation accumulation experiment conducted to date, we demonstrate that epigenomic and physical features explain over 90% of variance in the genome-wide pattern of mutation bias surrounding genes. Observed mutation frequencies around genes in turn accurately predict patterns of genetic polymorphisms in natural Arabidopsis accessions ( r = 0.96). That mutation bias is the primary force behind patterns of sequence evolution around genes in natural accessions is supported by analyses of allele frequencies. Finally, we find that genes subject to stronger purifying selection have a lower mutation rate. We conclude that epigenome-associated mutation bias 2 reduces the occurrence of deleterious mutations in Arabidopsis , challenging the prevailing paradigm that mutation is a directionless force in evolution.

161 citations


Journal ArticleDOI
TL;DR: In this article , the authors study interactions between groups of naturally coexisting commensal and pathogenic Pseudomonas strains in the Arabidopsis thaliana phyllosphere.
Abstract: The community structure in the plant-associated microbiome depends collectively on host-microbe, microbe-microbe and host-microbe-microbe interactions. The ensemble of interactions between the host and microbial consortia may lead to outcomes that are not easily predicted from pairwise interactions. Plant-microbe-microbe interactions are important to plant health but could depend on both host and microbe strain variation. Here we study interactions between groups of naturally co-existing commensal and pathogenic Pseudomonas strains in the Arabidopsis thaliana phyllosphere. We find that commensal Pseudomonas prompt a host response that leads to selective inhibition of a specific pathogenic lineage, resulting in plant protection. The extent of protection depends on plant genotype, supporting that these effects are host-mediated. Strain-specific effects are also demonstrated by one individual Pseudomonas isolate eluding the plant protection provided by commensals. Our work highlights how within-species genetic differences in both hosts and microbes can affect host-microbe-microbe dynamics.

22 citations


Journal ArticleDOI
TL;DR: A chromosome‐level genome assembly of var.
Abstract: Summary Thlaspi arvense (field pennycress) is being domesticated as a winter annual oilseed crop capable of improving ecosystems and intensifying agricultural productivity without increasing land use. It is a selfing diploid with a short life cycle and is amenable to genetic manipulations, making it an accessible field‐based model species for genetics and epigenetics. The availability of a high‐quality reference genome is vital for understanding pennycress physiology and for clarifying its evolutionary history within the Brassicaceae. Here, we present a chromosome‐level genome assembly of var. MN106‐Ref with improved gene annotation and use it to investigate gene structure differences between two accessions (MN108 and Spring32‐10) that are highly amenable to genetic transformation. We describe non‐coding RNAs, pseudogenes and transposable elements, and highlight tissue‐specific expression and methylation patterns. Resequencing of forty wild accessions provided insights into genome‐wide genetic variation, and QTL regions were identified for a seedling colour phenotype. Altogether, these data will serve as a tool for pennycress improvement in general and for translational research across the Brassicaceae.

12 citations


Journal ArticleDOI
TL;DR: The HiFi technology consistently enabled us to reconstruct gapless centromeres and 5S rDNA clusters, and the value of the approach is demonstrated by comparing these previously inaccessible regions of the genome between two A. thaliana accessions.
Abstract: Although long-read sequencing can often enable chromosome-level reconstruction of genomes, it is still unclear how one can routinely obtain gapless assemblies. In the model plant Arabidopsis thaliana, other than the reference accession Col-0, all other accessions de novo assembled with long-reads until now have used PacBio continuous long reads (CLR). Although these assemblies sometimes achieved chromosome-arm level contigs, they inevitably broke near the centromeres, excluding megabases of DNA from analysis in pan-genome projects. Since PacBio high-fidelity (HiFi) reads circumvent the high error rate of CLR technologies, albeit at the expense of read length, we compared a CLR assembly of accession Ey15-2 to HiFi assemblies of the same sample performed by five different assemblers starting from subsampled data sets, allowing us to evaluate the impact of coverage and read length. We found that centromeres and rDNA clusters are responsible for 71% of contig breaks in the CLR scaffolds, while relatively short stretches of GA/TC repeats are at the core of >85% of the unfilled gaps in our best HiFi assemblies. Since the HiFi technology consistently enabled us to reconstruct gapless centromeres and 5S rDNA clusters, we demonstrate the value of the approach by comparing these previously inaccessible regions of the genome between two A. thaliana accessions.

9 citations


Journal ArticleDOI
TL;DR: MethylScore uses an unsupervised machine learning approach to segment the genome by classification into states of high and low methylation, and can identify DMRs from hundreds of samples and how its data-driven approach can stratify associated samples without prior information.
Abstract: Abstract Abstract Whole-genome bisulfite sequencing (WGBS) is the standard method for profiling DNA methylation at single-nucleotide resolution. Different tools have been developed to extract differentially methylated regions (DMRs), often built upon assumptions from mammalian data. Here, we present MethylScore, a pipeline to analyse WGBS data and to account for the substantially more complex and variable nature of plant DNA methylation. MethylScore uses an unsupervised machine learning approach to segment the genome by classification into states of high and low methylation. It processes data from genomic alignments to DMR output and is designed to be usable by novice and expert users alike. We show how MethylScore can identify DMRs from hundreds of samples and how its data-driven approach can stratify associated samples without prior information. We identify DMRs in the A. thaliana 1,001 Genomes dataset to unveil known and unknown genotype–epigenotype associations .

7 citations


Journal ArticleDOI
19 Oct 2022-Science
TL;DR: The authors studied the impact of changes in farming on the extent and tempo of evolution across the native range of common waterhemp (Amaranthus tuberculatus), a now pervasive agricultural weed.
Abstract: North America has seen a massive increase in cropland use since 1800, accompanied more recently by the intensification of agricultural practices. Through genome analysis of present-day and historical samples spanning environments over the last two centuries, we studied the impact of these changes in farming on the extent and tempo of evolution across the native range of common waterhemp (Amaranthus tuberculatus), a now pervasive agricultural weed. Modern agriculture has imposed strengths of selection rarely observed in the wild, with striking shifts in allele frequency trajectories since agricultural intensification in the 1960s. An evolutionary response to this extreme selection was facilitated by a concurrent human-mediated range shift. By reshaping genome-wide diversity across the landscape, agriculture has driven the success of this weed in the 21st-century. One Sentence Summary Modern agriculture has shaped the evolution of a native plant into a weed by driving range shifts and strengths of selection rarely observed in the wild.

7 citations


editorialDOI
20 Oct 2022-eLife
TL;DR: eLife is changing its editorial process to emphasize public reviews and assessments of preprints by eliminating accept/reject decisions after peer review as mentioned in this paper , which is similar to our approach.
Abstract: eLife is changing its editorial process to emphasize public reviews and assessments of preprints by eliminating accept/reject decisions after peer review.

6 citations


Posted ContentDOI
04 Feb 2022-bioRxiv
TL;DR: A set of experiments using hundreds of natural genotypes of the model plant Arabidopsis thaliana are presented to showcase the power of the Evolve & Resequence approach to study rapid evolution at large scale and provide open source tools that streamline sequencing data curation and calculate various population genetic statistics two orders of magnitude faster than current software.
Abstract: The change in allele frequencies within a population over time represents a fundamental process of evolution. By monitoring allele frequencies, we can analyze the effects of natural selection and genetic drift on populations. To efficiently track time-resolved genetic change, large experimental or wild populations can be sequenced as pools of individuals sampled over time using high-throughput genome sequencing (called the Evolve & Resequence approach, E&R). Here, we present a set of experiments using hundreds of natural genotypes of the model plant Arabidopsis thaliana to showcase the power of this approach to study rapid evolution at large scale. First, we validate that sequencing DNA directly extracted from pools of flowers from multiple plants -- organs that are relatively consistent in size and easy to sample -- produces comparable results to other, more expensive state-of-the-art approaches such as sampling and sequencing of individual leaves. Sequencing pools of flowers from 25-50 individuals at ∼40X coverage recovers genome-wide frequencies in diverse populations with accuracy r > 0.95. Secondly, to enable analyses of evolutionary adaptation using E&R approaches of plants in highly replicated environments, we provide open source tools that streamline sequencing data curation and calculate various population genetic statistics two orders of magnitude faster than current software. To directly demonstrate the usefulness of our method, we conducted a two-year outdoor evolution experiment with A. thaliana to show signals of rapid evolution in multiple genomic regions. We demonstrate how these laboratory and computational Pool-seq-based methods can be scaled to study hundreds of populations across many climates.

6 citations


Posted ContentDOI
10 Apr 2022-bioRxiv
TL;DR: The reproducible and predictable associations between specific microbes and water availability raise the possibility that drought not only directly shapes genetic variation in A. thaliana, but does so also indirectly through its effects on the leaf microbiome.
Abstract: Microbes affect plant health, stress tolerance1 and life history2. In different regions of the globe, plants are colonized by distinct pathogenic and commensal microbiomes, but the factors driving their geographic variation are largely unknown3. We identified and measured the core leaf microbiome of Arabidopsis thaliana in its native range, from almost 300 populations across Europe. Comparing the distribution of the approximately 500 major bacterial phylotypes, we discovered marked, geography-dependent differences in microbiome composition within A. thaliana and between A. thaliana and other Brassicaceae, with two distinct microbiome types segregating along a latitudinal gradient. The differences in microbiome composition mirror the spatial genetics of A. thaliana, with 52-68% of variance in the first two principal coordinates of microbiome type explained by host genotype. Microbiome composition is best predicted by drought-associated metrics that are well known to be a major selective agent on A. thaliana populations. The reproducible and predictable associations between specific microbes and water availability raise the possibility that drought not only directly shapes genetic variation in A. thaliana, but does so also indirectly through its effects on the leaf microbiome.

5 citations


Posted ContentDOI
18 Jul 2022-bioRxiv
TL;DR: Spatial metaTranscriptomics (SmT) is a sequencing-based approach that leverages 16S/18S/ITS/poly-d(T) multimodal arrays for simultaneous host transcriptome- and microbiome-wide characterization of tissues at 55-µm resolution and is pivotal to answering fundamental questions on host-microbiome interplay.
Abstract: All multicellular organisms are closely associated with microbes, which have a major impact on the health of their host. The interactions of microbes among themselves and with the host take place at the microscale, forming complex networks and spatial patterns that are rarely well understood due to the lack of suitable analytical methods. The importance of high-resolution spatial molecular information has become widely appreciated with the recent advent of spatially resolved transcriptomics. Here, we present Spatial metaTranscriptomics (SmT), a sequencing-based approach that leverages 16S/18S/ITS/poly-d(T) multimodal arrays for simultaneous host transcriptome- and microbiome-wide characterization of tissues at 55-µm resolution. We showcase SmT in outdoor-grown Arabidopsis thaliana leaves as a model system, and found tissue-scale bacterial and fungal hotspots. By network analysis, we study inter- and intra-kingdom spatial interactions among microbes, as well as the host response to microbial hotspots. SmT is a powerful new strategy that will be pivotal to answering fundamental questions on host-microbiome interplay.

5 citations


Posted ContentDOI
06 Jan 2022-bioRxiv
TL;DR: MethylScore is an accessible pipeline for plant WGBS data, with unprecedented features for DMR calling in small- and large-scale datasets, and uses an unsupervised machine learning approach to segment the genome by classification into states of high and low methylation.
Abstract: Whole-genome bisulfite sequencing (WGBS) is the standard method for profiling DNA methylation at single-nucleotide resolution. Many WGBS-based studies aim to identify biologically relevant loci that display differential methylation between genotypes, treatment groups, tissues, or developmental stages. Over the years, different tools have been developed to extract differentially methylated regions (DMRs) from whole-genome data. Often, such tools are built upon assumptions from mammalian data and do not consider the substantially more complex and variable nature of plant DNA methylation. Here, we present MethylScore, a pipeline to analyze WGBS data and to account for plant-specific DNA methylation properties. MethylScore processes data from genomic alignments to DMR output and is designed to be usable by novice and expert users alike. It uses an unsupervised machine learning approach to segment the genome by classification into states of high and low methylation, substantially reducing the number of necessary statistical tests while increasing the signal-to-noise ratio and the statistical power. We show how MethylScore can identify DMRs from hundreds of samples and how its data-driven approach can stratify associated samples without prior information. We identify DMRs in the A. thaliana 1001 Genomes dataset to unveil known and unknown genotype-epigenotype associations. MethylScore is an accessible pipeline for plant WGBS data, with unprecedented features for DMR calling in small- and large-scale datasets; it is built as a Nextflow pipeline and its source code is available at https://github.com/Computomics/MethylScore.

Journal ArticleDOI
17 Jan 2022-eLife
TL;DR:
Abstract: Causal mutations and their frequency in agricultural fields are well-characterized for herbicide resistance. However, we still lack understanding of their evolutionary history: the extent of parallelism in the origins of target-site resistance (TSR), how long these mutations persist, how quickly they spread, and allelic interactions that mediate their selective advantage. We addressed these questions with genomic data from 19 agricultural populations of common waterhemp (Amaranthus tuberculatus), which we show to have undergone a massive expansion over the past century, with a contemporary effective population size estimate of 8 x 107. We found variation at seven characterized TSR loci, two of which had multiple amino acid substitutions, and three of which were common. These three common resistance variants show extreme parallelism in their mutational origins, with gene flow having shaped their distribution across the landscape. Allele age estimates supported a strong role of adaptation from de novo mutations, with a median age of 30 suggesting that most resistance alleles arose soon after the onset of herbicide use. However, resistant lineages varied in both their age and evidence for selection over two different timescales, implying considerable heterogeneity in the forces that govern their persistence. Two such forces are intra- and inter-locus allelic interactions; we report a signal of extended haplotype competition between two common TSR alleles, and extreme linkage with genome-wide alleles with known functions in resistance adaptation. Together, this work reveals a remarkable example of spatial parallel evolution in a metapopulation, with important implications for the management of herbicide resistance.

Posted ContentDOI
15 Feb 2022
TL;DR: In this paper , the authors compared a PacBio continuous long reads (CLR) assembly of accession Ey15-2 to HiFi assemblies of the same sample performed by five different assemblers starting from subsampled data sets, allowing them to evaluate the impact of coverage and read length.
Abstract: ABSTRACT Although long-read sequencing can often enable chromosome-level reconstruction of genomes, it is still unclear how one can routinely obtain gapless assemblies. In the model plant Arabidopsis thaliana , other than the reference accession Col-0, all other accessions de novo assembled with long-reads until now have used PacBio continuous long reads (CLR). Although these assemblies sometimes achieved chromosome-arm level contigs, they inevitably broke near the centromeres, excluding megabases of DNA from analysis in pan-genome projects. Since PacBio high-fidelity (HiFi) reads circumvent the high error rate of CLR technologies, albeit at the expense of read length, we compared a CLR assembly of accession Ey15-2 to HiFi assemblies of the same sample performed by five different assemblers starting from subsampled data sets, allowing us to evaluate the impact of coverage and read length. We found that centromeres and rDNA clusters are responsible for 71% of contig breaks in the CLR scaffolds, while relatively short stretches of GA/TC repeats are at the core of >85% of the unfilled gaps in our best HiFi assemblies. Since the HiFi technology consistently enabled us to reconstruct gapless centromeres and 5S rDNA clusters, we demonstrate the value of the approach by comparing these previously inaccessible regions of the genome between two A. thaliana accessions.

Journal ArticleDOI
TL;DR: In this paper , the role of PICLN in pre-mRNA splicing and in mediating plant adaptation to daily and seasonal fluctuations in environmental conditions has been investigated in both humans and Arabidopsis.
Abstract: Plants undergo transcriptome reprogramming to adapt to daily and seasonal fluctuations in light and temperature conditions. While most efforts have focused on the role of master transcription factors, the importance of splicing factors modulating these processes is now emerging. Efficient pre-mRNA splicing depends on proper spliceosome assembly, which in plants and animals requires the methylosome complex. Ion Chloride nucleotide-sensitive protein (PICLN) is part of the methylosome complex in both humans and Arabidopsis (Arabidopsis thaliana), and we show here that the human PICLN ortholog rescues phenotypes of Arabidopsis picln mutants. Altered photomorphogenic and photoperiodic responses in Arabidopsis picln mutants are associated with changes in pre-mRNA splicing that partially overlap with those in PROTEIN-ARGININE METHYL TRANSFERASE5 (prmt5) mutants. Mammalian PICLN also acts in concert with the Survival Motor Neuron (SMN) complex component GEMIN2 to modulate the late steps of UsnRNP assembly, and many alternative splicing events regulated by PICLN but not PRMT5, the main protein of the methylosome, are controlled by Arabidopsis GEMIN2. As with GEMIN2 and SM PROTEIN E1/PORCUPINE (SME1/PCP), low temperature, which increases PICLN expression, aggravates morphological and molecular defects of picln mutants. Taken together, these results establish a key role for PICLN in the regulation of pre-mRNA splicing and in mediating plant adaptation to daily and seasonal fluctuations in environmental conditions.

Posted ContentDOI
22 Aug 2022-bioRxiv
TL;DR: It is shown that neither homopolymer errors nor elevated mutation rates at transposable elements are likely to entirely explain reported mutation rate biases, and models derived from the drift-barrier hypothesis demonstrate that mechanisms linking DNA repair to chromatin marks and other epigenomic features can evolve in response to second-order selection on emergent mutation biases.
Abstract: It has recently been proposed that the uneven distribution of epigenomic features might facilitate reduced mutation rate in constrained regions of the Arabidopsis thaliana genome, even though previous work had shown that it would be difficult for reduced mutation rates to evolve on a gene-by-gene basis. A solution to Lynch’s equations for the barrier imposed by genetic drift on the evolution of targeted hypomutation can, however, come from epigenomic features that are enriched in certain portions of the genome, for example, coding regions of essential genes, and which simultaneously affect mutation rate. Such theoretical considerations draw on what is known about DNA repair guided by epigenomic features. A recent publication challenged these conclusions, because several mutation data sets that support a lower mutation rate in constrained regions suffered from variant calling errors. Here we show that neither homopolymer errors nor elevated mutation rates at transposable elements are likely to entirely explain reported mutation rate biases. Observed mutation biases are also supported by a meta-analysis of several independent germline mutation data sets, with complementary experimental data providing a mechanistic basis for reduced mutation rate in genes and specifically in essential genes. Finally, models derived from the drift-barrier hypothesis demonstrate that mechanisms linking DNA repair to chromatin marks and other epigenomic features can evolve in response to second-order selection on emergent mutation biases.

Posted ContentDOI
16 Mar 2022-bioRxiv
TL;DR: Compared clonally propagated Arabidopsis thaliana plants derived from somatic embryogenesis using two different embryonic transcription factors, it is found that both the epi(genetic) status of explant and the regeneration protocol employed play critical roles in shaping the molecular and phenotypic state of clonal plants.
Abstract: Although clonal propagation is frequently used in commercial plant breeding and plant biotechnology programs because it minimizes genetic variation, it is not uncommon to observe clonal plants with stable phenotypic changes, a phenomenon known as somaclonal variation. Several studies have shown that epigenetic modifications induced during regeneration are associated with this newly acquired phenotypic variation. However, the factors that determine the extent of somaclonal variation and the molecular changes associated with it remain poorly understood. To address this gap in our knowledge, we compared clonally propagated Arabidopsis thaliana plants derived from somatic embryogenesis using two different embryonic transcription factors-RWP-RK DOMAIN-CONTAINING 4 (RKD4) or LEAFY COTYLEDON2 (LEC2) and from two epigenetically distinct tissues. We found that both the epi(genetic) status of explant and the regeneration protocol employed play critical roles in shaping the molecular and phenotypic state of clonal plants. Phenotypic variation of regenerated plants can be largely explained by the inheritance of tissue-specific DNA methylation imprints, which are associated with specific transcriptional and metabolic changes in sexual progeny of clonal plants. Moreover, regenerants from roots were particularly affected by the inheritance of epigenetic imprints, which resulted in increased accumulation of salicylic acid in leaves and accelerated plant senescence. Collectively, our data reveal pathways for targeted manipulation of phenotypic variation in clonal plants.

Posted ContentDOI
05 Jul 2022-bioRxiv
TL;DR: In this article , the authors identify in the Arabidopsis thaliana genome hundreds of inverted repeat (IR) located near genes that are transcribed by RNA Polymerase II, resulting in the production of 24-nt small RNAs that trigger methylation of the IRs.
Abstract: Transposons are mobile elements that are commonly silenced to protect eukaryotic genome integrity. In plants, transposable elements (TEs) can be activated during stress conditions and subsequently insert into gene-rich regions. TE-derived inverted repeats (IRs) are commonly found near plant genes, where they affect host gene expression with potentially positive effects on adaptation. However, the molecular mechanisms by which these IRs control gene expression is unclear in most cases. Here, we identify in the Arabidopsis thaliana genome hundreds of IRs located near genes that are transcribed by RNA Polymerase II, resulting in the production of 24-nt small RNAs that trigger methylation of the IRs. The expression of these IRs is associated with drastic changes in the local 3D chromatin organization, which alter the expression pattern of the hosting genes. Notably, the presence and structure of many IRs differ between A. thaliana accessions. Capture-C sequencing experiments revealed that such variation changes short-range chromatin interactions, which translates into changes in gene expression patterns. CRISPR/Cas9-mediated disruption of two of such IRs leads to a switch in genome topology and gene expression, with phenotypic consequences. Our data demonstrate that the insertion of an IR near a gene provides an anchor point for chromatin interactions that can profoundly impact the activity of neighboring loci. This turns IRs into powerful evolutionary agents that can contribute to rapid adaptation.

Posted ContentDOI
24 Jun 2022-bioRxiv
TL;DR: This proof-of-concept study sequenced with PacBio High Fidelity reads long-range amplicons encompassing the entire ACCase gene in pools of over hundred individuals, and resolved them into haplotypes using the clustering algorithm PacBio amplicon analysis (pbaa), a new application for pools and for plants.
Abstract: Rapid adaptation of weeds to herbicide applications in agriculture through resistance development is a widespread phenomenon. In particular, the grass Alopecurus myosuroides is an extremely problematic weed in cereal crops with the potential to manifest resistance in the course of only a few generations. Target-site resistances (TSRs), with their strong phenotypic response, play an important role in this rapid adaptive response. Recently, using PacBio’s long-read amplicon sequencing technology in hundreds of individuals, we were able to decipher the genomic context in which TSR mutations occur. However, sequencing individual amplicons is both costly and time consuming, thus impractical to implement for other resistance loci or applications. Alternatively, pool-based approaches overcome these limitations and provide reliable allele frequencies, albeit at the expense of not preserving haplotype information. In this proof-of-concept study, we sequenced with PacBio High Fidelity (HiFi) reads long-range amplicons (13.2 kb) encompassing the entire ACCase gene in pools of over hundred individuals, and resolved them into haplotypes using the clustering algorithm PacBio amplicon analysis (pbaa), a new application for pools and for plants. From these amplicon pools, we were able to recover most haplotypes from previously sequenced individuals of the same population. In addition, we analyzed new pools from a Germany-wide collection of A. myosuroides populations and found that TSR mutations originating from soft sweeps of independent origin were common. Forward-in-time simulations indicate that TSR haplotypes will persist for decades even at relatively low frequencies and without selection, pointing to the importance of accurate measurement of TSR haplotype prevalence for weed management.

Peer ReviewDOI
25 Aug 2022
TL;DR: Parallel changes in organismal evolution can be observed even with disparate genomic changes due to similarity in functional outcomes, such as changes in molecular phenotypes like gene expression patterns as mentioned in this paper .
Abstract: Parallel changes in organismal evolution can be observed even with disparate genomic changes due to similarity in functional outcomes, such as changes in molecular phenotypes like gene expression patterns.


Journal ArticleDOI
TL;DR: In this paper , the authors investigated the effect of removing MET1 in Arabidopsis thaliana accessions on gene expression and transposable element (TE) silencing.
Abstract: Abstract Background Despite its conserved role on gene expression and transposable element (TE) silencing, genome-wide CG methylation differs substantially between wild Arabidopsis thaliana accessions. Results To test our hypothesis that global reduction of CG methylation would reduce epigenomic, transcriptomic, and phenotypic diversity in A. thaliana accessions, we knock out MET1 , which is required for CG methylation, in 18 early-flowering accessions. Homozygous met1 mutants in all accessions suffer from common developmental defects such as dwarfism and delayed flowering, in addition to accession-specific abnormalities in rosette leaf architecture, silique morphology, and fertility. Integrated analysis of genome-wide methylation, chromatin accessibility, and transcriptomes confirms that MET1 inactivation greatly reduces CG methylation and alters chromatin accessibility at thousands of loci. While the effects on TE activation are similarly drastic in all accessions, the quantitative effects on non-TE genes vary greatly. The global expression profiles of accessions become considerably more divergent from each other after genome-wide removal of CG methylation, although a few genes with diverse expression profiles across wild-type accessions tend to become more similar in mutants. Most differentially expressed genes do not exhibit altered chromatin accessibility or CG methylation in cis , suggesting that absence of MET1 can have profound indirect effects on gene expression and that these effects vary substantially between accessions. Conclusions Systematic analysis of MET1 requirement in different A. thaliana accessions reveals a dual role for CG methylation: for many genes, CG methylation appears to canalize expression levels, with methylation masking regulatory divergence. However, for a smaller subset of genes, CG methylation increases expression diversity beyond genetically encoded differences.

Peer ReviewDOI
29 Sep 2022
TL;DR: In this article , the authors investigated the natural variation of cardiac performance in the sequenced inbred lines of the Drosophila Genetic Reference Panel (DGRP) and found correlations between genes associated with cardiac phenotypes in both flies and humans, which supports a conserved genetic architecture regulating adult cardiac function from arthropods to mammals.
Abstract: Article Figures and data Abstract Editor's evaluation Introduction Results Discussion Materials and methods Data availability References Decision letter Author response Article and author information Metrics Abstract Deciphering the genetic architecture of human cardiac disorders is of fundamental importance but their underlying complexity is a major hurdle. We investigated the natural variation of cardiac performance in the sequenced inbred lines of the Drosophila Genetic Reference Panel (DGRP). Genome-wide associations studies (GWAS) identified genetic networks associated with natural variation of cardiac traits which were used to gain insights as to the molecular and cellular processes affected. Non-coding variants that we identified were used to map potential regulatory non-coding regions, which in turn were employed to predict transcription factors (TFs) binding sites. Cognate TFs, many of which themselves bear polymorphisms associated with variations of cardiac performance, were also validated by heart-specific knockdown. Additionally, we showed that the natural variations associated with variability in cardiac performance affect a set of genes overlapping those associated with average traits but through different variants in the same genes. Furthermore, we showed that phenotypic variability was also associated with natural variation of gene regulatory networks. More importantly, we documented correlations between genes associated with cardiac phenotypes in both flies and humans, which supports a conserved genetic architecture regulating adult cardiac function from arthropods to mammals. Specifically, roles for PAX9 and EGR2 in the regulation of the cardiac rhythm were established in both models, illustrating that the characteristics of natural variations in cardiac function identified in Drosophila can accelerate discovery in humans. Editor's evaluation The authors investigated natural variation and new genetic mechanisms underlying cardiac performance using sequenced inbred lines of the Drosophila Genetic Reference Panel. The study provides insights into the genetic architecture of complex cardiac performance traits and represents an important resource for researchers studying cardiac performance. https://doi.org/10.7554/eLife.82459.sa0 Decision letter eLife's review process Introduction Heart diseases is a major cause of mortality (Bezzina et al., 2015). Although a large number of genome-wide association studies (GWAS) have identified hundreds of genetic variants related to cardiovascular traits (Roselli et al., 2018; van Setten et al., 2018; Shah et al., 2020; Verweij et al., 2020), we are very far from a comprehensive understanding of the genetic architecture of these complex traits. Deciphering the impact of genetic variations on quantitative traits is however critical for the prediction of disease risk. But disentangling the relative genetic and environmental contributions to pathologies is challenging due to the difficulty in accounting for environmental influences and disease comorbidities. Underlying epistatic interactions may also contribute to problems with replication in human GWAS performed in distinct populations which rarely take epistatic effects into account. In addition, linking a trait associated locus to a candidate gene or a set of genes for prioritization is not straightforward (Mackay, 2014, Boyle et al., 2017). Furthermore, the analysis of genetic factors related to cardiac traits is complicated by their interactions with several risk factors, such as increasing age, hypertension, diabetes mellitus, ischemic, and structural heart disease (Paludan-Müller et al., 2016). These pitfalls can be overcome using animal models. Model organisms allow precise controlling of the genetic background and environmental rearing conditions. They can provide generally applicable insights into the genetic underpinnings of complex traits and human diseases, due to the evolutionary conservation of biological pathways. Numerous studies have highlighted the conservation of cardiac development and function from flies to mammals. Indeed, orthologous genes control the early development as well as the essential functional elements of the heart. The fly is the simplest genetic model with a heart muscle and is increasingly used to identify the genes involved in heart disease and aging (Ocorr et al., 2007b; Diop and Bodmer, 2015; Rosenthal et al., 2010). Although a large number of genes are implicated in establishing and maintaining cardiac function in Drosophila (Neely et al., 2010), the extent to which genes identified from mutant analysis reflect naturally occurring variants is neither known, nor do we know how allelic variants at several segregating loci combine to affect cardiac performance. We previously showed that wild populations of flies harbor rare polymorphisms of major effects that predispose them to cardiac dysfunction (Ocorr et al., 2007a). Here, we analyzed the genetic architecture of the natural variation of cardiac performance in Drosophila. Our aims were to (i) identify the variants associated with cardiac traits found in a natural population, (ii) decipher how these variants interact with each other and with the environment to impact cardiac performance, and (iii) gain insights into the molecular and cellular processes affected. For this, we used the Drosophila Genetic Reference Panel (DGRP) (Mackay et al., 2012; Huang et al., 2014), a community resource of sequenced inbred lines. Previous GWAS performed in the DGRP indicate that inheritance of most quantitative traits in Drosophila is complex, involving many genes with small additive effects as well as epistatic interactions (Mackay and Huang, 2018). The use of inbred lines allows us to assess the effects of genetic variations in distinct but constant genetic backgrounds and discriminate genetic and environmental effects. We demonstrated substantial among-lines variations of cardiac performance and identified genetic variants associated with the cardiac traits together with epistatic interactions among polymorphisms. Candidate loci were enriched for genes encoding transcription factors (TFs) and signaling pathways, which we validated in vivo. We used non-coding variants - which represented the vast majority of identified polymorphisms – for predicting transcriptional regulators of associated genes. Corresponding TFs were further validated in vivo by heart-specific RNAi-mediated knockdown (KD). This illustrates that natural variations of gene regulatory networks have widespread impact on cardiac function. In addition, we analyzed the phenotypic variability of cardiac traits between individuals within each of the DGRP lines (i.e., with the same genotype), and we documented significant diversity in phenotypic variability among the DGRP lines, suggesting genetic variations influenced phenotypic variability of cardiac performance. Genetic variants associated with this phenotypic variability were identified and shown to affect a set of genes that overlapped with those associated with trait means, although through different genetic variants in the same genes. Comparison of human GWAS of cardiac disorders with results in flies identified a set of orthologous genes associated with cardiac traits both in Drosophila and in humans, supporting the conservation of the genetic architecture of cardiac performance traits, from arthropods to mammals. siRNA-mediated gene KD were performed in human induced pluripotent stem cells derived cardiomyocytes (hiPSC-CMs) to indeed show that dmPox-meso/hPAX9 and dmStripe/hEGR2 have conserved functions in cardiomyocytes from both flies and humans. These new insights into the fly’s genetic architecture and the connections between natural variations and cardiac performance permit the accelerated identification of essential cardiac genes and pathways in humans. Results Quantitative genetics of heart performances in the DGRP In this study, we aimed to evaluate how naturally occurring genetic variations affect cardiac performance in young Drosophila adults and identify variants and genes involved in the genetic architecture of cardiac traits. To assess the magnitude of naturally occurring variations of the traits, we measured heart parameters in 1-week-old females for 167 lines from the DGRP, a publicly available population of sequenced inbred lines derived from wild caught females (Figure 1A). Briefly, semi-intact preparations of individual flies (Ocorr et al., 2007c) were used for high-speed video recording combined with Semi-automated Heartbeat Analysis (SOHA) software (http://www.sohasoftware.com/) which allows precise quantification of a number of cardiac function parameters (Fink et al., 2009; Cammarato et al., 2015). Fly cardiac function parameters are highly influenced by sex (Wessells et al., 2004). Due to the experimental burden of analyzing individual cardiac phenotypes, we focused on female flies only and designed our experiment in the following way: we randomly selected 14 lines out of 167 that were replicated twice. The remaining 153 lines were replicated once. Each replicate was composed of 12 individuals. No block effect was observed due to the replicates in the 14 selected lines (see Supplementary file 1a). This allowed us to perform our final analysis on one replicate of each of the 167 lines. A total sample of 1956 individuals was analyzed. Seven cardiac traits were analyzed across the whole population (Figure 1—source data 1 and Table 1). As illustrated in Figure 1A, we analyzed phenotypes related to the rhythmicity of cardiac function: the systolic interval (SI) is the time elapsed between the beginning and the end of one contraction, the diastolic interval (DI) is the time elapsed between two contractions and the heart period (HP) is the duration of a total cycle (contraction+relaxation (DI+SI)). The arrhythmia index (AI, std-dev(HP)/mean (HP)) is used to evaluate the variability of the cardiac rhythm. In addition, three traits related to contractility were measured. The diameters of the heart in diastole (end diastolic diameter [EDD]), in systole (end systolic diameter [ESD]), and the fractional shortening (FS), which measures the contraction efficacy (EDD −ESD/EDD). We found significant genetic variation for all traits (Figure 1B and Figure 1—figure supplement 1) with broad sense heritability ranging from 0.30 (AI) to 0.56 (EDD) (Table 1). Except for EDD/ESD and HP/DI, quantitative traits were poorly correlated with each other (Figure 1—figure supplement 1). Figure 1 with 2 supplements see all Download asset Open asset Quantitative genetics and genome-wide associations studies (GWAS) for cardiac traits in the Drosophila Genetic Reference Panel (DGRP). (A) Left: Cardiac performance traits were analyzed in 167 sequenced inbred lines from the DGRP population. Approximately 12 females per line were analyzed. Right panels: Schematic of the Drosophila adult heart assay and example of M-mode generated from video recording of a beating fly heart. Semi-intact preparations of 1-week-old adult females were used for high-speed video recording followed by automated and quantitative assessment of heart size and function. The representative M-mode trace illustrate the cardiac traits analyzed. DI: diastolic interval; SI: systolic interval; HP: heart period (duration of one heartbeat); EDD: end diastolic diameter (fully relaxed cardiac tube diameter); ESD: end systolic diameter (fully contracted cardiac tube diameter). Fractional shortening (FS=EDD − ESD/EDD) and arrythmia index (AI=Std Dev (HP)/HP) were additionally calculated and analyzed. (B) Distribution of line means and within lines variations (box plots) from 167 measured DGRP lines for HP and EDD. DGRP lines are ranked by their increasing mean phenotypic values. For both phenotypes, representative M-modes from extreme lines are shown below (other traits are displayed in Figure 1—figure supplement 1). (C) Pearson residuals of chi-square test from the comparison of indicated single nucleotide polymorphism (SNP) categories in the DGRP and among variants associated with cardiac traits. According to DGRP annotations, SNPs are attributed to genes if they are within the gene transcription unit (5’ and 3’ UTR, synonymous and non-synonymous coding, introns) or within 1 kb from transcription start and end sites (1 kb upstream, 1 kb downstream). NA: SNPs not attributed to genes (>1 kb from transcription start site [TSS] and transcription end sites [TES]). (D) Comparison of gene sets identified by single marker using Fast-LMM (LMM) and in interaction using FastEpistasis (Epistasis). The Venn diagram illustrates the size of the two populations and their overlap. (E) Overlap coefficient of gene sets associated with the different cardiac traits analyzed. Figure 1—source data 1 Individual values for cardiac traits analyzed across the 167 Drosophila Genetic Reference Panel (DGRP) lines. Individual and DGRP line number are indicated. Phenotypic values were determined from high-speed video recording on dissected flies and movie analysis using Semi-automated Heartbeat Analysis (SOHA) (Mackay et al., 2012). https://cdn.elifesciences.org/articles/82459/elife-82459-fig1-data1-v1.xlsx Download elife-82459-fig1-data1-v1.xlsx Figure 1—source data 2 Variants identified by FastLMM as associated to indicated phenotypes. Among the 100 best ranked associations, only variants with MAF >4% were retained. Tables for variants mapped to genes and for variants that are not within gene mapping criteria (>1 kb from transcription start site [TSS] and transcription end sites [TES]) are indicated. https://cdn.elifesciences.org/articles/82459/elife-82459-fig1-data2-v1.xlsx Download elife-82459-fig1-data2-v1.xlsx Figure 1—source data 3 All FastEpistasis data on mean phenotypes, per quantitative trait. Single nucleotide polymorphism (SNP) ID, position, associated genes, and statistics are indicated for both focal SNPs (left) and their interacting SNPs (right). Each sheet displays the results for indicated quantitative traits, except for the first one which is a merge of all quantitative traits association analyses. https://cdn.elifesciences.org/articles/82459/elife-82459-fig1-data3-v1.xlsx Download elife-82459-fig1-data3-v1.xlsx Table 1 Quantitative genetics of cardiac traits in the Drosophila Genetic Reference Panel (DGRP). Summary statistics over all DGRP genotypes assayed. Number of lines and individuals (after outlier removal, see Materials and methods) analyzed for each cardiac trait is indicated. Mean, standard deviation (Std dev.), and coefficient of variation (Coef. Var) among the whole population are indicated. Genetic, environment, and phenotypic variance (respectively Genet. var, Env. var, and Phen. var) were calculated for each trait. Broad sense heritability of traits means (H2) suggested heritability of corresponding traits. Levene test indicated significant heterogeneity of the variance among the lines. DiastolicintervalsSystolicintervalsHeartperiodDiastolic DiameterSystolic diameterFractional shorteningArrhythmia Indextotal.nb.lines167167167167167167167mean0.46380.21660.688379.420051.05000.35380.2475Std dev.0.263300.032160.2769014.090009.493000.068370.29230Coef. var0.56770.14850.40220.17740.18600.19331.1810lines (mean)165166165159157158166Indiv. (mean)1914191119201779175317671832lines (Cve)165166165159157158166Indiv. (Cve)1914191119201779175317671832Genet. var2.59e-025.03e-042.87e-021.13e+024.39e+011.57e-032.21e-02Env. var4.36e-025.35e-044.82e-028.64e+014.65e+013.11e-036.35e-02Phen. var6.95e-021.04e-037.68e-021.99e+029.04e+014.68e-038.56e-02H20.3730.4850.3730.5660.4850.3350.258F value76,86411,68674,71546,95015,04111,16465,308Pr(F)8.8e-1202.3e-1875.8e-1167.1e-621.9e-2318.8e-1751.8e-96Levene test1.9e-101.9e-101.7e-081.6e-052.1e-131.6e-052.1e-13 GWAS analyses of heart performance To identify candidate variants associated with cardiac performance variation, we performed GWAS analyses and evaluated single marker associations of line means with common variants using a linear mixed model (Lippert et al., 2011) and after accounting for effects of Wolbachia infection and common polymorphism inversions (see Materials and methods). Genotype-phenotype associations were performed separately for all seven quantitative traits and variants were ranked based on their p-values. For most of the phenotypes analyzed, quantile-quantile (QQ) plots were uniform (Figure 1—figure supplement 2) and none of the variants reached the strict Bonferroni correction threshold for multiple tests (2 · 10–8), which is usual in the DGRP given the size of the population. However, the decisive advantage of the Drosophila system is that we can use GWA analyses as primary screens for candidate genes and mechanisms that can be subsequently validated by different means. We therefore chose to analyze the 100 top ranked variants for each quantitative trait. This choice is based on our strategy to test the selected single nucleotide polymorphisms (SNPs) and associated genes by a variety of approaches – data mining and experimental validation (see below) – in order to provide a global validation of association results and to gain insights into the characteristics of the genetic architecture of the cardiac traits. This cut-off was chosen in order to be able to test a significant number of variants while being globally similar to the nominal cut-off (10–5) generally used in DGRP analyses. A large proportion of the variants retained have indeed a p-value below 10–5. Selected variants were further filtered on the basis of minor allele frequency (MAF >4%) (Figure 1—source data 2, Supplementary file 1b). Among the seven quantitative traits analyzed, we identified 530 unique variants. These variants were associated to genes if they were within 1 kb of transcription start site (TSS) or transcription end sites (TES). Using these criteria, 417 variants were mapped to 332 genes (Supplementary file 1c). We performed a chi-squared test to determine if the genomic location of variants associated with cardiac traits is biased toward any particular genomic region when compared with the whole set of variants with MAF >4% in the DGRP population and obtained a p-value of 2.778e-13. Genomic locations of the variants were biased toward regions within 1 kb upstream of genes TSS, and, to a lesser extent, to genes 5’ UTR (Figure 1C). Variants not mapped to genes (located at >1 kb from TSS or TES) were slightly depleted in our set. In GWAS analyses, loci associated with a complex trait collectively account for only a small proportion of the observed genetic variation (Manolio et al., 2009) and part of this ‘missing heritability’ is thought to come from interactions between variants (Flint and Mackay, 2009; Manolio et al., 2009; Huang et al., 2012; Shorter et al., 2015). As a first step toward identifying such interactions, we used FastEpistasis (Schüpbach et al., 2010). SNP identified by GWAS were used as focal SNPs and were tested for interactions with all other SNPs in the DGRP. FastEpistasis reports best ranked interacting SNP for each starting focal SNP, thus extending the network of variants and genes associated to natural variation of cardiac performance, which were used for hypothesis generation and functional validations; 288 unique SNPs were identified, which were mapped to 261 genes (Figure 1—source data 3, Supplementary file 1e). While none of the focal SNPs interacted with each other, there is a significant overlap between the 332 genes associated with single marker GWAS and the 261 genes identified by epistasis (n=31, Figure 1D and Supplementary file 1e, fold change (FC)=6; hypergeometric pval=6.8 × 10–16). This illustrates that the genes that contribute to quantitative variations in cardiac performance have a tendency to interact with each other, although through distinct alleles. Taken together, single marker GWAS and epistatic interactions performed on the seven cardiac phenotypes identified a compendium of 562 genes associated with natural variations of heart performance (Supplementary file 1f). In line with the correlation noted between their phenotypes (Figure 1—figure supplement 1B), the GWAS for HP and DI identified partially overlapping gene sets (overlap index 0.23, Figure 1E). The same was true, to a lesser extent, for ESD and EDD (0.15). Otherwise, the sets of genes associated with each of the cardiac traits are poorly correlated with each other. Functional annotations and network analyses of association results Our next objective was to identify the biological processes potentially affected by natural variation in cardiac performance. Gene Ontology (GO) enrichment analysis of the combined single marker GWAS and epistatic interactions analyses indicated that genes encoding signaling receptors, TFs, and cell adhesion molecules were over-represented among these gene sets (pval=1.4 × 10–9 [FC=2.9], 5×10–4 [FC=2], and 3×10–3 [FC=4.6], respectively). There was also a bias for genes encoding proteins located at the plasma membrane, at ion channel complexes as well as components of contractile fibers (pval=3.4 × 10–10 [FC=3], 7×10–5 [FC=4.2], and 4×10–2 [FC=3.6]; Figure 2A; Supplementary file 2a). Of note, although a number of genes have previously been identified as being required during heart development or for the establishment and maintenance of cardiac function by single gene approaches, we found no enrichment for these gene categories in our analysis. In addition, genes identified in a global screen for stress-induced lethality following heart-specific RNAi KD (Neely et al., 2010) were also not enriched in GWAS detected genes (FC=1; Supplementary file 2b). This indicates that genes associated to natural variations of cardiac traits are typically missed by traditional forward or reverse genetics approaches, which highlights the value of our approach. Figure 2 with 2 supplements see all Download asset Open asset Functional annotations and validations of genes associated with genome-wide associations studies (GWAS) for cardiac performance. (A) Gene Ontology (GO) enrichment analyses. Selected molecular functions (MF, left) and cellular components (CC, right) associated with cardiac performances at FDR < 0.05 are shown. Enrichment analysis was performed using G:profiler with a correction for multitesting (see Materials and methods). (B) Interaction network of genes associated with natural variations of cardiac performance. Direct genetic and physical interactions between cardiac fly GWAS genes are displayed. Nodes represent genes and/or proteins, edges represent interactions (blue: genetic; black: physical). Node shapes refer to single marker and/or epistasis GWAS, node color to the cardiac performances phenotype(s) for which associations were established. Genes and proteins highlighted in pink point to transcription factors, in green and blue to signaling pathways (FGF and TGFb, respectively), and in yellow to ion channels. (C) Heatmap representing the effects on indicated cardiac traits of heart-specific RNAi-mediated knockdown (KD) of 42 genes identified in GWAS for cardiac performance. Results of Wilcoxon rank sum test of the effects of indicated heart-specific RNAi-mediated gene KD (rows) for cardiac performance traits (columns), analyzed on semi-intact 1-week females flies. Detailed data are presented in Figure 2—source data 2. Thirty-eight (out of 42) genes tested lead to significant effects on cardiac performance traits upon KD. Black dots indicate the trait(s) for which the corresponding gene was associated in GWAS. ns: not significant; *: pval <0.05; **: pval <0.01; ***: pval <0.005; ****pval <0.0001 (p-values were adjusted for multiple testing using Bonferroni correction). Comparison with heart-specific effect of random selected genes is displayed in Figure 2—figure supplement 1. (D) Schematic drawing of BMP and activin pathways in Drosophila. (E) Genetic interactions between BMP and activin pathway genes. Genetic interactions tested between snooBSC234 and sogU2 for SI and between dppd14 and babo32 for FS (other phenotypes are shown in Figure 2—figure supplement 2). Cardiac traits were measured on each single heterozygotes and on double heterozygotes flies. Two-way ANOVA reveals that the interaction between snooBSC234 and sogU2 for SI and between dppd14 and babo32 for FS are significant. Detailed data for interaction effect corresponding to all phenotypes are displayed in Figure 2—figure supplement 2. Figure 2—source data 1 Collection of physical (IP, Y2H) and genetic interactions identified in Drosophila. https://cdn.elifesciences.org/articles/82459/elife-82459-fig2-data1-v1.xlsx Download elife-82459-fig2-data1-v1.xlsx Figure 2—source data 2 Data from validation experiments. (Heart-specific RNAi validations and tests for genetic interactions among BMP members.) https://cdn.elifesciences.org/articles/82459/elife-82459-fig2-data2-v1.xlsx Download elife-82459-fig2-data2-v1.xlsx In order to gain additional knowledge about the cellular and molecular pathways affected by natural variations of cardiac traits, we have mapped the associated genes and gene products onto characterized interaction networks. Of the 562 identified genes, 419 were mapped to the fly interactome that includes both physical (protein-protein) and genetic interactions from both DroID (Murali et al., 2011) and flybi databases (see Materials and methods and Figure 2—source data 1). Remarkably, a high proportion (148) of the GWAS identified genes were directly connected within the fly interactome and formed a large network of interacting genes/proteins (Supplementary file 2c and d, Figure 2B), suggesting that they participate in common biological processes. This network encompasses several TFs and ion channel complex genes, consistent with their potential role in the genetic architecture of natural variation of heart performance. Several components of signaling pathways are also present in the network, including members of the FGF and TGFβ pathways (see below). Functional validations of candidate genes To assess in an extensive way whether mutations in genes harboring SNPs associated with variation in cardiac traits contributed to these phenotypes, we selected 42 GWAS associated genes for cardiac-specific RNAi KD and tested the effects on cardiac performance. We selected genes that were identified in at least two independent GWAS for two traits or that were known to be dynamically expressed in the adult heart (Monnier et al., 2012) and for which inducible RNAi lines were available. Genes were tested in 1-week-old adult female flies, using the heart-specific Hand>Gal4 driver (Popichenko et al., 2007) and the same semi-intact heart preparations and SOHA analyses as for DGRP lines screening. Notably, 38 of the 42 selected genes led to various levels of cardiac performance defects following heart-specific KD (Figure 2C). In parallel, we tested the effect on cardiac performance of knocking down 18 genes randomly selected in the genome – the GWAS associated genes being excluded from the selection (see Materials and methods and Figure 2—figure supplement 1). Although a number of these genes lead to cardiac phenotypes when inactivated – which is consistent with published observations that quantitative traits can be influenced by a large number of genes (Zhang et al., 2021) – when inactivated in the heart, the genes selected from GWAS lead to significantly more frequent phenotypes compared to the randomly selected genes (Figure 2—figure supplement 1). These results therefore supported our association results. It is important to emphasize that our approach is limited to testing the effects of tissue-specific gene KD. Since some of the variants may lead to increased gene function and/or expression, this can lead to a false negative rate that is difficult to estimate. In addition, some of the associated variants may influence heart function by non-cell-autonomous mechanisms, which would not be replicated by cardiac-specific RNAi KD. We further focused on the TGFβ pathway, since members of both BMP and activin pathways were identified in our analyses. We tested different members of the TGFβ pathway for cardiac phenotypes using cardiac-specific RNAi KD (Figure 2C), and confirmed the involvement of the activin agonist snoo (Ski orthologue) and the BMP antagonist sog (chordin orthologue). Notably, activin and BMP pathways are usually antagonistic (Figure 2D). Their joint identification in our GWAS suggest that they act in a coordinated fashion to regulate heart function. Alternatively, it may simply reflect their involvement in different aspects of cardiac development and/or functional maturation. In order to discriminate between these two hypotheses, we tested if different components of these pathways interacted genetically. Single heterozygotes for loss of function alleles show dosage-dependent effects of snoo and sog on several phenotypes, providing an independent confirmation of their involvement in several cardiac traits (Figure 2—figure supplement 2). Importantly, compared to each single heterozygotes, snooBSC234/sogUU2 double heterozygotes flies showed non-additive SI phenotypes (two-way ANOVA pval: 2.1 · 10–7), suggesting a genetic interaction (Figure 2E and Figure 2—figure supplement 2). It is worth noting however that snoo is also a transcriptional repressor of the BMP pathway (Takaesu et al., 2006). The effect observed in snooBSC234/sogUU2 double heterozygotes can therefore alternatively arise as a consequence of an increased BMP signaling without affecting the activin pathway. We thus tested other allelic combinations for loss of function alleles of BMP and activin pathways. babo/


Posted ContentDOI
14 Jul 2022-bioRxiv
TL;DR: A systematic analysis of MET1 requirement for genome function in different A. thaliana accessions revealed a dual role forCG methylation: for many genes, CG methylation appears to canalize expression levels, with methylation masking regulatory divergence, but for a smaller subset of genes, methylation increases expression diversity beyond genetically encoded differences.
Abstract: Background Despite its conserved role on gene expression and transposable element (TE) silencing, genome-wide CG methylation differs substantially between wild Arabidopsis thaliana accessions. Results To test our hypothesis that global reduction of CG methylation would reduce epigenomic, transcriptomic, and phenotypic diversity in A. thaliana accessions, we knock out MET1 , which is required for CG methylation, in 18 early-flowering accessions. Homozygous met1 mutants in all accessions suffer from common developmental defects such as dwarfism and delayed flowering, in addition to accession-specific abnormalities in rosette leaf architecture, silique morphology, and fertility. Integrated analysis of genome-wide methylation, chromatin accessibility, and transcriptomes confirms that MET1 inactivation greatly reduces CG methylation and alters chromatin accessibility at thousands of loci. While the effects on TE activation are similarly drastic in all accessions, the quantitative effects on non-TE genes vary greatly. The global expression profiles of accessions become considerably more divergent from each other after genome-wide removal of CG methylation, although a few genes with diverse expression profiles across wild-type accessions tend to become more similar in mutants. Most differentially expressed genes do not exhibit altered chromatin accessibility or CG methylation in cis , suggesting that absence of MET1 can have profound indirect effects on gene expression and that these effects vary substantially between accessions. Conclusions Systematic analysis of MET1 requirement in different A. thaliana accessions reveals a dual role for CG methylation: for many genes, CG methylation appears to canalize expression levels, with methylation masking regulatory divergence. However, for a smaller subset of genes, CG methylation increases expression diversity beyond genetically encoded differences.

TL;DR: It is suggested that while additive complementation is an intrinsic property of F1 hybrids, the major driver of growth in hybrids derives from the quantitative nature of non-additive gene expression, especially under-dominance and thus lower expression in hybrids than predicted from the parents.
Abstract: Heterosis, the generally superior performance in hybrids compared to their inbred parents, is one of the most enigmatic biological phenomena. Many different explanations have been put forward for heterosis, which begs the question whether common principles underpinning it do exist at all. We performed a systematic transcriptomic study in Arabidopsis thaliana involving 141 random crosses, to search for the general principles, if any, that heterotic hybrids share. Consistent additive expression in F1 hybrids was observed for only about 300 genes enriched for roles in stress response and cell death. Regulatory rare-allele burden affects the expression level of these genes but does not correlate with heterosis. Non-additive gene expression in F1 hybrids is much more common, with the vast majority of genes (over 90%) being expressed below parental average. These include genes that are quantitatively correlated with biomass accumulation in both parents and F1 hybrids, as well as genes strongly associated with heterosis. Unlike in the additive genes, regulatory rare allele burden in this non-additive gene set is strongly correlated with growth heterosis, even though it does not covary with the expression level of these genes. Together, our study suggests that while additive complementation is an intrinsic property of F1 hybrids, the major driver of growth in hybrids derives from the quantitative nature of non-additive gene expression, especially under-dominance and thus lower expression in hybrids than predicted from the parents.

Peer ReviewDOI
11 Nov 2022
TL;DR: Vallebueno-Estrada et al. as discussed by the authors showed that Paredones maize originated from the same domestication event as Mexican maize and was domesticated by ~6700 BP, implying rapid dispersal followed by improvement.
Abstract: Full text Figures and data Side by side Abstract Editor's evaluation eLife digest Introduction Results Discussion Materials and methods Data availability References Decision letter Author response Article and author information Metrics Abstract Archaeological cobs from Paredones and Huaca Prieta (Peru) represent some of the oldest maize known to date, yet they present relevant phenotypic traits corresponding to domesticated maize. This contrasts with the earliest Mexican macro-specimens from Guila Naquitz and San Marcos, which are phenotypically intermediate for these traits, even though they date more recently in time. To gain insights into the origins of ancient Peruvian maize, we sequenced DNA from three Paredones specimens dating ~6700–5000 calibrated years before present (BP), conducting comparative analyses with two teosinte subspecies (Zea mays ssp. mexicana and parviglumis) and extant maize, that include highland and lowland landraces from Mesoamerica and South America. We show that Paredones maize originated from the same domestication event as Mexican maize and was domesticated by ~6700 BP, implying rapid dispersal followed by improvement. Paredones maize shows no relevant gene flow from mexicana, smaller than that observed in teosinte parviglumis. Thus, Paredones samples represent the only maize without confounding mexicana variation found to date. It also harbors significantly fewer alleles previously found to be adaptive to highlands, but not of alleles adaptive to lowlands, supporting a lowland migration route. Our overall results imply that Paredones maize originated in Mesoamerica, arrived in Peru without mexicana introgression through a rapid lowland migration route, and underwent improvements in both Mesoamerica and South America. Editor's evaluation In this important article, the authors characterize ancient DNA from maize unearthed in archaeological contexts from Paredones and Huaca Prieta in the Chicama river valley of Peru, recovered by painstakingly controlled excavation. The genetic evidence, while from a small number of samples, is compelling, although the dating evidence has to rely on archaeological context, which fortunately is excellent. The difficulties of direct radiocarbon dating of the samples in this case are appropriately discussed by the authors. https://doi.org/10.7554/eLife.83149.sa0 Decision letter Reviews on Sciety eLife's review process eLife digest The plant we know today as maize or corn began its story 9,000 years ago in modern-day Mexico, when farmers of the Balsas River basin started to carefully breed its ancestor, the wild grass teosinte parviglumis. Recent discoveries suggest the crop may have started to travel to South America before its domestication was fully complete, leading to a complex history of semi-tamed lineages evolving in parallel in different regions. For example, 5,300-year-old corn specimens found in Tehuacán, in central Mexico, still genetically and morphologically resemble teosinte. Meanwhile, cobs harvested about 6,700 to 5,000 years ago on the northern coast of Peru – 3800km away from where maize was first domesticated – look like the ones we know today. Vallebueno-Estrada et al. aimed to explore the evolutionary history of this Peruvian maize, which was discovered at the archaeological coastal site of Paredones. To do so, they extracted and sequenced its genetic information, and compared these sequences with those from modern varieties of lowland and highland maize, as well as from teosinte parviglumis and teosinte mexicana. The analyses showed that the ancestor of the Paredones maize emerged from teosinte parviglumis like any other lineage, but that it was already domesticated when it started to spread South; by the time it was present in Peru 6,700 years ago, it was genetically closer to modern-day crops. This early departure is consistent with the fact that the Paredones specimens lacked teosinte mexicana genetic variants; this highland relative of lowland parviglumis is believed to have interbred with maize lineages from Central America more recently, when these were brought to higher altitudes. The presence of genetic marks tailored to low-elevation regions suggested that the Paredones maize lineage migrated through a coastal corridor connecting Central and South America, arriving in northern Peru about 2,500 years after first arising from teosinte parviglumis in Central America around 9,000 years ago. Under the care of rapidly developing Central Andean societies, the crop then evolved to adapt to its local conditions. Maize today has spread to all continents besides Antarctica; we produce more of it than wheat, rice or any other grain. How our modern varieties will adapt to the environmental constraints brought by climate change remains unclear. By peering into the history of maize, Vallebueno-Estrada et al. hope to find genetic variations which could inform new breeding strategies that improve the future of this crop. Introduction Maize constituted 12% of global crop production in 2019, second only to sugar cane (Food and Agriculture Organization of the United Nations, 2021). Like many crop plants, global maize production is threatened by climate change, especially in the middle to low latitudes (Li et al., 2022) where maize is dominant. Maize has the allelic diversity to adapt, but much of this variation is partitioned differentially across populations (Hufford et al., 2012b; Romay et al., 2013; Zila et al., 2013). Understanding the development of population dynamics in maize not only allows better understanding of the evolutionary processes that produced a globally important crop but will highlight populations that can be used in breeding to adapt to changing climates. Although the origins of maize (Zea mays ssp. mays L.) based on archeological data puzzled the scientific community for several decades (Beadle, 1939; Mangelsdorf, 1974; Merrill et al., 1940), the integration of genomic, archeological, and botanical evidence has identified the Balsas basin in central Mexico as the only center of origin for maize (Iltis, 1983; Matsuoka et al., 2002; Ranere et al., 2009), and that the divergence from its wild ancestor, Zea mays ssp. parviglumis (hereafter parviglumis), occurred about 9000 years ago (Matsuoka et al., 2002). Domestication occurred in a single event creating a monophyletic clade that includes all domesticated maize landraces and diverges from both parviglumis and Zea mays ssp. mexicana (hereafter mexicana) populations (Matsuoka et al., 2002). Genomic investigations of archeological samples from the Tehuacan highland site suggested that the dispersal of maize to the highlands of México was complex, as early-arriving maize populations retained higher levels of genomic diversity than expected for domesticated plants (Ramos-Madrigal et al., 2016; Vallebueno-Estrada et al., 2016). The constant gene flow between domesticated maize with already divergent populations of parviglumis and mexicana has contributed to the adaptation of maize to new environments and remains embedded in the genetic structure of its populations (Swarts et al., 2017; van Heerwaarden et al., 2011). Geographic areas of contact have been stable over time, as these teosinte populations have maintained a discrete distribution in central Mexico since the last glacial maximum (Hufford et al., 2012a). Gene flow from a sympatric mexicana population to domesticated maize populations has been associated with an altitudinal cline in the highlands of Mexico and Guatemala (Hufford et al., 2013; van Heerwaarden et al., 2011; Wang et al., 2017), and the genetic introgression from mexicana in the form of distinct chromosomal inversions has been associated with adaptation of maize to central Mexican highlands (Hufford et al., 2013; Wang et al., 2017). The genetic contribution of teosinte mexicana to Mexican highland landraces is about 20% (Hufford et al., 2013; van Heerwaarden et al., 2011), with an estimated time for this introgression of around 1000 generations (Calfee et al., 2021). Archaeological evidence supports the dispersal of maize populations to South America to be associated with a Pacific lowland coastal corridor (Bonavia and Grobman, 2017; Dillehay et al., 2008; Randolph, 1959). Population substructure and differentiation patterns suggested independent adaptations to highland environments in Mesoamerica and South America; meanwhile, minimal population sub-structuring was detected between the lowlands of Mesoamerica and South America (Swarts et al., 2017; Takuno et al., 2015) indicating continuous gene flow over time. While the introgression from mexicana of chromosomal inversions on chromosomes 3, 4, and 6 has been shown to contribute to adaptation to Mesoamerican highlands (Romero Navarro et al., 2017), those regions were not detected in highland maize populations of South America (Wang et al., 2017) or North America (Swarts et al., 2017) which were also isolated from direct gene flow with parviglumis (Kistler et al., 2018). Although these inversions are specific to Mexican landraces, many maize populations across the Americas including South America show genome-wide admixture with mexicana (Swarts et al., 2017). Highland South American landraces also show phenotypic diversity relative to lowlands (Bonavia, 2013), as well as specific cytogenetic characteristics such as the absence of supernumerary highly heterochromatic B chromosomes in Peruvian landraces, resulting in the so-called Andean Complex (McClintock et al., 1981). These unique characteristics have puzzled the scientific community regarding the origin and adaptation of the Andean Complexes (Bonavia, 2013; Goodman and Bird, 1977; Grobman, 1961; McClintock et al., 1981; Randolph, 1959; Wilkes, 1979). The archeological expeditions at the sites of Paredones and Huaca Prieta in the coastal desert of north Peru yielded a robust collection of ancient maize remains that provide a unique opportunity to investigate the chronology, landrace evolution, and cultural context associated with early maize dispersal in South America (Bonavia and Grobman, 2017; Dillehay et al., 2012; Grobman et al., 2012). These findings include one charred cob fragment dated 6775–6504 calibrated BP (at 95.4% probability), and other burned cobs stratigraphically dated to similar and later ages (Bonavia and Grobman, 2017; Dillehay et al., 2012; Grobman et al., 2012), representing the most ancient maize macro-specimens found to date. Strikingly, and in contrast to Mexican cob fragments from Guilá Naquitz dating approximately the same time or slightly younger (6235 calibrated BP) (Benz, 2001), the Paredones and Huaca Prieta specimens are robust, slender and cylindrical 2.4–3.1 cm long cobs, with eight rows of kernels consistent with the hypothetical Proto-Confite Morocho landrace (Benz, 2001; Dillehay et al., 2012; Grobman et al., 2012). Recently, it has been proposed that the first maize lineages arriving in South America were partial domesticates, locally evolving the full set of domestication traits due to reduced gene flow from wild relatives that enhanced anthropogenic pressures (Kistler et al., 2020; Kistler et al., 2018). However, it is not clear how fast this process could have been and if the earliest archeological samples found in South America were partially or fully domesticated. In addition, the expectation on the phenotype of those hypothetical samples is not clear. Here we present the genomic analysis of three ancient specimens belonging to the earliest cultural phase of Paredones and radiocarbon dating 6775–6504, ~5800 to 5400 (dated by direct association with wood charcoal), and 5583–5324 2σ calibrated years BP at 95.4% probability. To reveal the population context of their origin and domestication, we conducted comparisons with parviglumis, mexicana and extant maize landraces. Also, to explore if these ancient maize samples exhibit some evidence of mexicana gene flow, we performed D-statistics under several experimental designs, comparing them to extant maize and parviglumis populations. Finally, we did a comparison with previously published data from extant highland and lowland Mesoamerican and South American landraces, to identify signatures of specific adaptation that could bring insights into the specific improvements that this maize went through in both Mesoamerica and South America. Our results provide evidence that ancient Peruvian maize originated in Mesoamerica as all landraces found to date, followed by a rapid dispersal into the lowlands of South America, and subsequently was subjected to local adaptation processes. Results Paredones ancient maize sampling The maize macroremains were collected as part of published excavations at the Paredones and Huaca Prieta sites (Grobman et al., 2012). Macroremains from both sites were excavated in deeply stratified and undisturbed cultural floors. Stratigraphic Units 20 and 22 at Paredones are the archeological component with the largest and most diversified amount of maize remains, with the oldest 14C dated cobs. The oldest cobs derive from near the base of this unit, in a single, discrete and intact floor of ~2 cm in thickness and at 5.5 m in depth from the present-day surface (Bonavia and Grobman, 2017; Grobman et al., 2012). The dated remains at both sites are chrono-stratigraphically bracketed by and in agreement with more than 165 dates from mound and off-mound contexts that were obtained by Accelerator Mass Spectrometry (AMS) and Optically Stimulated Luminescence (OSL) (Bonavia and Grobman, 2017; Dillehay, 2017; Dillehay et al., 2022). No taphonomic or other disturbing cultural or geological features were observed in any excavation units that would have altered the integrity and intactness of strata containing the maize remains (Material and Methods). All radiocarbon-dated remains were assayed by the SHCal04 Southern Hemisphere Calibration 0–11.0 calibrated kyr BP curve (McCormac et al., 2004). In 2020, Dillehay and geologists Steven Goodbred and Elizabeth Chamberlain carried out excavations in a Preceramic domestic site (S-18) located ~3.2 km north of Huaca Prieta. Preceramic corn remains were encountered consistently in hearths and in the upper to lower intact cultural layers of the site. As with parts of the Paredones and Huaca Prieta sites, the lower hearths and strata contained uncharred cobs 2.6–3.1 cm long, slender and cylindrical with eight rows of kernels, of the smaller and earliest type of identified corn species Proto Confite Morocho (Grobman et al., 2012). The middle to upper strata yielded the known later and slightly larger Preceramic varieties of Confite Chavinese and Proto Alazan. An OSL date from a discrete and intact lower layer containing a hearth with two unburned cob fragments of the Proto Confite Morocho variety assayed ~7000 +/- 630 years ago or 7560–6300 BP (Chamberlain, 2019). Wood charcoal from the hearth was processed at 7162–6914 calibrated BP (at 95.4% probability, AA75398), suggesting that the associated cob fragments date ~6800 years ago. In South America, maize micro remains (e.g. starch grains, pollen, phytoliths) have been estimated to date 7200–7000 calibrated BP (Piperno, 2006; Piperno and Pearsall, 1998; Zarrillo et al., 2008) at sites in southwest coastal Ecuador, located ~450 km north of Huaca Prieta and Paredones, and in other localities across the continent at ~6500 calibrated BP and later (Kistler et al., 2018). Excavation Units 20 and 22 from Paredones are illustrated in Figure 1A and B. Three of the recovered maize samples (Par_N1, Par_N9, and Par_N16) are well-structured maize cobs deprived of seeds and showing morphological similarities to extant landraces. Par_N1 is the most ancient specimen, a burned maize husk and shank fragment dating 5900 ± 40 14 C years BP (6770–6504 calibrated BP at 95.4% probability, OS860020), and obtained from archeological Unit 22 (Figure 1C and Figure 1—figure supplement 1). The other two samples were found in Unit 20, Par_N9, a slightly charred maize husk fragment, that radiocarbon assayed at 5582–5321 calibrated BP (at 95.4% probability, AA86932), and Par_N16, an unburned cob fragment, which stratigraphically dated by direct association with wood charcoal in a hearth at 5603–5333 calibrated BP (at 95.4% probability, AA86937; see Figure 1C and Table 1). Other charred cobs from overlying, younger strata in these units or from other units assayed between 4800–3800 calibrated BP (at 95.4% probability). These dates are stratigraphically bracketed by radiocarbon assayed intact hearths and prepared floors that are in complete chrono-stratigraphic agreement (Bonavia and Grobman, 2017; Dillehay et al., 2012; Grobman et al., 2012). Par_N1 is older than any other maize macro-specimen found to date (Torres-Rodríguez et al., 2018). Figure 1 with 1 supplement see all Download asset Open asset Archeological site and specimens of Paredones. (A) Topographic contour map of Huaca Prieta and Paredones (units U20, U22, and U27) coastal sites, showing the location excavation units. (B) The Paredones mound during archeological excavations. (C) Maize specimens Par_N1 (dating 6775–6504 calibrated years BP), Par_N9 (dating to 5800–5400 calibrated years BP), and ParN16 (dating 5583–5324 calibrated years BP); Scale bar = 1 cm. Table 1 Radiocarbon and calibrated dates of maize specimens from Paredones. Lab no.SiteAssociated Dating no.Unit / stratum14 C years BP95.4% probablityδ13C valueDated MaterialPar_N1ParedonesOS8602022/185900±406775–6504–10.3Husk/shank fragment attached to cobPar_N9ParedonesAA8693220/6b-184770±355582–5321–23.5Husk fragment attached to cob*Par_N16ParedonesAA8693720/6b-184849±315603–5333–25.8Wood charcoal in associated hearth† * Aberrant δ13C assay. Attachment to cob and molecular data confirm maize. † Maize cob directly associated with hearth. Paleogenomic characterization of ancient maize samples To determine the genomic constitution and degree of genetic variability present in the 6775–5324 calibrated years BP (at 95.4% probability) maize of Paredones, we extracted DNA and conducted whole-genome shotgun sequencing in specimens Par_N1, Par_N9, and Par_N16. Since the endogenous DNA content of all three specimens was low (0.2% for Par_N1 and Par_N9; 1.1% for Par_N16), we conducted in-depth whole-genome shotgun sequencing of high-quality libraries under Illumina platforms, generating 622 million (M) quality-filtered reads for Par_N1, 423 M for Par_N9, and 392 M for Par_N16. Due to its higher endogenous DNA content (one order of magnitude larger), we further sequenced the Par_N16 library, obtaining 459 M additional reads, to generate a total of 851 M for this sample (Table 2). Comparison with version 3 of the B73 maize reference genome resulted in 1,320,284 (Par_N1), 1,034,544 (Par_N9), and 15,023,803 (Par_N16) reads mapping to either repetitive (33.4% for Par_N1; 34% for Par_N9; 34.8% for Par_N16) or unique (66.5% for Par_N1; 66% for Par_N9; 65.2% for Par_N16) genomic regions, for a total virtual length of 52.2 Mb (Par_N1), 40.8 Mb (Par_N9), and 471.65 Mb (Par_N16) of the unique maize genome (Table 2). Average mapping quality in Phred score was 31.9 for Par_N1, 32.1 for Par_N9 and 34 for Par_N16; this is reflected in the estimated error rate of 1.19E-02 for Par_N1, 9.59E-03 for Par_N9 and 1.05E-02 for Par_N16 (Table 2). Reads contained signatures of DNA damage typical of postmortem degradation in ancient samples, including overhangs of single-stranded DNA, 13–20% cytosine deamination and fragmentation due to depurination (Dabney et al., 2013) resulting in median fragment lengths of 36 bp for all three samples. A total of 42–53% of all covered sites had signatures of molecular damage (Figure 2). This damage pattern is an indication that this is ancient endogenous DNA and does not represent DNA contamination from extant sources. After mapping reads corresponding to unique genomic regions, Par_N1, Par_N9 and Par_N16 yielded approximately 16.9 M (Par_N1), 12.1 M (Par_N9), and 334.36 M (Par_N16) unique genomic sites spread across all 10 chromosomes at an average depth of 1.2 X (Table 3 and Figure 2—figure supplements 1–4), which were used as a platform for subsequent studies. Figure 2 with 5 supplements see all Download asset Open asset Post-mortem DNA damage and fragmentation patterns of ancient maize samples. DNA composition around read-termini (top four plots), and DNA mis-incorporation errors relative to the 5’ and 3’ read (bottom plot); the two distributions for post-mortem damage signatures (C to T and G to A) are shown in red and blue respectively, while other types of substitutions are shown in gray. Table 2 Paleogenomic characterization of three ancient maize samples from Paredones. Par_N1Par_N9Par_N16Total number of raw reads623,686,255423,856,284851,330,235Total number of quality sequences622,438,882423,472,877850,326,750Number of sequences mapping to genome1,320,2841,034,54415,023,803Number of sequences mapping to repetitive regions441,442351,8835,228,275Number of sequences mapping to the unique genome878,842682,6619,795,586Total length (Mb)52.240.80471.65Average read length (bp)59.4159.9048.15Total coverage (Mb)16.9012.100334.36Average quality (Phred)31.932.134Error rate (mismatches / bases mapped (cigar))1.19E-029.59E-030.01059832 Table 3 Total number of unique genomic sites covered at variable depths in ancient Paredones samples. DepthPar_N1Par_N9Par_N16115,679,84411,436,116278,622,39021,068,931583,34944,373,0183130,67060,8927,898,502424,92311,6851,875,32858,0105291644,804637592309299,729728921189167,14281423967103,1629102683768,2721086977447,003>1010,4199341155,805 When compared to the B73 reference genome, Par_N1, Par_N9, and Par_N16 yielded 21,123, 15,554, and 275,990 single nucleotide polymorphisms (SNPs), respectively. To eliminate any potential miscalls caused by postmortem damage, all SNPs corresponding to a possible cytosine (C) to thymine (T), or guanine (G) to adenine (A) transitions were not considered for subsequent analysis (molecular damage filter) (Gilbert et al., 2003; Hofreiter et al., 2001; Table 4). All SNPs corresponding to insertions or deletions (INDELs) were also eliminated. Using a previously reported pipeline (Vallebueno-Estrada et al., 2016), Par_N1, Par_N9, and Par_N16 yielded 2,886, 1,888, and 121,842 intersected positions with called genotypes (genotype calls) included in the HapMap3 maize diversity panel, most of which were only covered at 1 X due to the low amount of endogenous DNA recovered (Table 4). Despite this low coverage depth, the vast majority corresponded to a previously reported HapMap3 allele (98.8% for Par_N1, 98.9% for Par_N9, and 99.1% for Par_N16), suggesting that this dataset provides an accurate paleogenomic representation of maize that can be used to determine its evolutionary trend. To determine if the specific elimination of C to T and G to A modifications could bias the results in favor of maize rather than teosinte alleles, an additional database was generated in which all transitions were eliminated (i.e. only transversions were included) in Par_N16 only, because it was the only sample with enough sequencing data to conduct this experiment. This second database consisted of 64,118 transversions SNPs, intersected between this sample and the HapMap3 panel. With this database we conducted parallel analyses, which results are shown in the corresponding sections. Table 4 Number of SNPs and genotype calls recovered from ancient Paredones samples. Par-N1Par-N9Par-N16Par-N16 transversionsTotal number of SNPs21,12315,554275,990Total number of SNPs275,990Transitions C->T7505576641,811Transitions C<->T76,990Transitions G->A7527561741,669Transitions G<->A76,909INDELS60942319,790INDELS19,790Quality SNPs54823748192,510Quality SNPs102,302Genotype calls included in HapMap328861888121842Genotype calls included in HapMap364,118 Relationship between ancient maize, extant landraces, and Balsas teosinte To better understand the origin and domestication of South American maize, we explored the evolutionary relationship between Paredones specimens, teosinte parviglumis and mexicana, and extant maize landraces. We inferred a bootstrapped maximum-likelihood (ML) tree topology through patterns of population divergence applied to genome-wide polymorphisms. Intersected positions among the three ancient Paredones samples were scarce (Figure 2—figure supplement 5); therefore, topologies were constructed individually for each DNA sample, based on the intersection of genotype calls between each of the samples and the maize HapMap3 dataset that includes B73 as a reference genome (including major and minor frequency alleles), 22 maize landraces (including several originating in Mexico), 15 teosinte parviglumis inbred lines, two accessions of teosinte mexicana, and a single accession of Tripsacum dactyloides acting as the outgroup (Supplementary file 1). In the case of Par_N16, the resulting tree shows all maize landraces and teosinte accessions separated into two distinct clades, all derived from Tripsacum as previously reported (Hufford et al., 2012b; Matsuoka et al., 2002). Par_N16 is in a clade that includes extant maize landraces, and this is for all 10,000 bootstrap replicates tested. Par_N16 is not basal in its clade but fits robustly with Chullpi (AYA 32) – the only extant Peruvian landrace included in the reference panel – in a derived position, closely clustering with South American landraces such as Cravo Riogranense (RGSVII) and Araguito (VEN 568). These relationships indicate that the ancient samples are monophyletic with modern maize, supporting a single domestication event, and that they are most closely related to modern samples from the same region, strongly suggesting an ancestral relationship between them and modern South American germplasm (Figure 3 and Figure 3—figure supplement 1). In the case of Par_N1 and Par_N9, and although genotype calls intersected with HapMap3 were scarce (2886 and 1888, respectively), the resulting topology is equivalent, with both samples clustering at the same position as Par_N16 (Figure 3—figure supplements 2 and 3). As other ancient samples previously analyzed, Paredones samples tend to have long branches in phylogenies, which can be explained by isolation by time. On the other hand, the fact that 3 independent samples present the same position in the phylogeny indicates that molecular damage, which is random, is not driving their phylogenetic signal. A parallel analysis that only included transversions showed the same topology, where Par_N16 groups with the South American landraces within the maize monophyletic clade (Figure 3—figure supplement 4). This shows that the phylogenetic position of the Paredones ancient samples is not biased by the molecular damage filter. Thus, based on genome-wide relatedness, Paredones maize clusters with extant domesticated Andean landraces, supporting both, a single origin for maize and that these Peruvian samples were already domesticated by ~6700 BP. Figure 3 with 4 supplements see all Download asset Open asset Advanced domestication of ancient Peruvian maize. Evolutionary relationships between ancient Par_N16 maize and its wild and cultivated relatives. ML tree from an alignment of 121,842 genome-wide genotype calls covering non-repetitive regions of the reference maize genome. The teosinte group is highlighted in green, the maize landrace group in red, and the ancient maize sample from Paredones in blue. The teosinte and landrace accessions follow previously reported nomenclatures and are described in the Supplementary file 1. The Par_N16 branch was cut for format reasons; a tree with the complete branch can be seen in Figure 3—figure supplement 1. Tests of gene flow from mexicana Par_N16 was the only sample with enough DNA sequence data to perform this analysis. All the samples showed the same phylogenetic position; therefore, Par N 16 was considered to be representative of ancient Paredones maize. To investigate the genetic relationship of this maize with teosinte mexicana, we estimated D-statistics in the form D(parviglumis, mexicana, TEST, Tripsacum) that test the hypothesis of Incomplete Lineage Sorting (ILS) due to persistence of polymorphisms across different divergence events, against an imbalanced gene flow over derived alleles from parviglumis to TEST, and mexicana to TEST (Figure 4—figure supplement 1). We used highland Palomero Toluqueño (PT2233) as a positive control, in the form D(parviglumis, mexicana, PT2233, Tripsacum), and lowland Reventador (BKN022) as a negative control, in the form D(parviglumis, mexicana, BKN022, Tripsacum). The results of multiple D-statistic distributions show that the positive control PT2233, with D>0, deviated from the balanced gene flow towards mexicana. Meanwhile, BKN022 remains in ILS balance, with D around 0. The D-statistics distribution of Par_N16 D(parviglumis, mexicana, ParN16, Tripsacum) is statistically similar to the distribution of Reventador (BKN022) (two-sample Kolmogorov-Smirnov, P=0.5814) and significantly different from the distribution of Palomero Toluqueño (two-sample Kolmogorov-Smirnov, p<0.0001) (Figure 4—figure supplement 2). The standard deviation of all 1000 jackknife replications is narrow in all cases (SD <0.001), suggesting that D values are consistent across the genomes. These results agree with the D statistics analysis in which only transversions were used (Figure 4—figure supplement 3), showing that the absence of significant gene flow between Par_N16 and mexicana is not biased by the molecular damage filter. To further confirm the absence of mexicana introgression in Par_N

Peer ReviewDOI
26 Apr 2022
TL;DR: GenESPACE as mentioned in this paper is a tool that integrates conserved gene order and orthology to define the expected physical position of all genes across multiple genomes across three levels of biological organization: spanning 300 million years of vertebrate sex chromosome evolution, across the diversity of the Poaceae plant family, and among 26 maize cultivars.
Abstract: Article Figures and data Abstract Editor's evaluation eLife digest Introduction Results and discussion Materials and methods Data availability References Decision letter Author response Article and author information Metrics Abstract The development of multiple chromosome-scale reference genome sequences in many taxonomic groups has yielded a high-resolution view of the patterns and processes of molecular evolution. Nonetheless, leveraging information across multiple genomes remains a significant challenge in nearly all eukaryotic systems. These challenges range from studying the evolution of chromosome structure, to finding candidate genes for quantitative trait loci, to testing hypotheses about speciation and adaptation. Here, we present GENESPACE, which addresses these challenges by integrating conserved gene order and orthology to define the expected physical position of all genes across multiple genomes. We demonstrate this utility by dissecting presence–absence, copy-number, and structural variation at three levels of biological organization: spanning 300 million years of vertebrate sex chromosome evolution, across the diversity of the Poaceae (grass) plant family, and among 26 maize cultivars. The methods to build and visualize syntenic orthology in the GENESPACE R package offer a significant addition to existing gene family and synteny programs, especially in polyploid, outbred, and other complex genomes. Editor's evaluation GENESPACE is a new and straightforward computational tool to include synteny information in the calculation of genome-wide sets of orthologs. The development of this tool is very timely as more and more complete chromosome-scale assembled genomes are becoming available. While the assembly problem has been solved, this is not the case for multiple genome comparisons, and GENESPACE is an important step to help remedy this gap in our comparative genomics toolbox. https://doi.org/10.7554/eLife.78526.sa0 Decision letter Reviews on Sciety eLife's review process eLife digest The genome is the complete DNA sequence of an individual. It is a crucial foundation for many studies in medicine, agriculture, and conservation biology. Advances in genetics have made it possible to rapidly sequence, or read out, the genome of many organisms. For closely related species, scientists can then do detailed comparisons, revealing similar genes with a shared past or a common role, but comparing more distantly related organisms remains difficult. One major challenge is that genes are often lost or duplicated over evolutionary time. One way to be more confident is to look at ‘synteny’, or how genes are organized or ordered within the genome. In some groups of species, synteny persists across millions of years of evolution. Combining sequence similarity with gene order could make comparisons between distantly related species more robust. To do this, Lovell et al. developed GENESPACE, a software that links similarities between DNA sequences to the order of genes in a genome. This allows researchers to visualize and explore related DNA sequences and determine whether genes have been lost or duplicated. To demonstrate the value of GENESPACE, Lovell et al. explored evolution in vertebrates and flowering plants. The software was able to highlight the shared sequences between unique sex chromosomes in birds and mammals, and it was able to track the positions of genes important in the evolution of grass crops including maize, wheat, and rice. Exploring the genetic code in this way could lead to a better understanding of the evolution of important sections of the genome. It might also allow scientists to find target genes for applications like crop improvement. Lovell et al. have designed the GENESPACE software to be easy for other scientists to use, allowing them to make graphics and perform analyses with few programming skills. Introduction De novo genome assemblies and gene model annotations represent increasingly common resources that describe the sequence and positions of protein coding and intergenic regions within a single genotype. Evolutionary relationships among these DNA sequences form the foundation of many molecular tools in modern medical, breeding, and evolutionary biology research. Perhaps the most crucial inference to make when comparing genomes revolves around homologous genes, which share an evolutionary common ancestor and ensuing sequence or protein structure similarity. Analyses of homologs, including comparative gene expression, epigenetics, and sequence evolution, require the distinction between orthologs, which arise from speciation events, and paralogs, which arise from sequence duplications (see Box 1 for definitions). In some systems, this is a simple task where most genes are single copy, and orthologs are synonymous with reciprocal best-scoring protein BLAST hits. Other sequence similarity approaches such as OrthoFinder (Emms and Kelly, 2019; Emms and Kelly, 2015) leverage graphs and gene trees to test for orthology, permitting more robust analyses in systems with gene copy number (CNV) or presence–absence variation (PAV). However, whole-genome duplications (WGDs), chromosomal deletions, and variable rates of sequence evolution, such as subgenome dominance in polyploids, can confound the evidence of orthology from sequence similarity alone. Box 1 Definitions Orthogroup — a set of genes across multiple genomes derived from a single ancestral gene. Ortholog — a pair of orthogroup members in two species derived from a single gene in their most recent common ancestor. Paralog — orthogroup members derived from a duplication event since speciation. Homeolog — paralogs derived from a whole-genome duplication. Tandem array — paralogs in proximity on a chromosome within a genome. Gene collinearity — retained order of genes across species due to common ancestry. Synteny — like collinearity but at larger scales, like chromosome arms. Pan-genome annotation — A set of orthogroups across multiple genomes, placed along the coordinate system of a specified reference genome. The physical position of homologs offers a second line of evidence that can help overcome challenges posed by WGDs, tandem arrays, heterozygous-duplicated regions, and other genomic complexities (Drillon et al., 2020; Haug-Baltzell et al., 2017; Wang et al., 2012). Synteny, or the conserved order of DNA sequences among chromosomes that share a common ancestor, is a typical feature of eukaryotic genomes. In some taxa, synteny is preserved across hundreds of millions of years of evolution and is retained over multiple WGDs (Jiao et al., 2014; Simakov et al., 2020; Zhao and Schranz, 2019). Like chromosome-scale synteny, conserved gene order collinearity along local regions of chromosomes can provide evidence of homology, and in some cases enable determination of whether two regions diverged as a result of speciation or a large-scale duplication event (Drillon et al., 2020). Combined, evidence of gene collinearity and sequence similarity should improve the ability to classify paralogous and orthologous relationships beyond either approach in isolation. Integrating synteny and collinearity into comparative genomics pipelines also physically anchors the positions of related gene sequences onto the assemblies of each genome. For example, by exploring only syntenic orthologs it is possible to examine all putatively functional variants within a genomic region of interest, even those in genes that are absent in the focal reference genome (Lovell et al., 2018). Such a ‘pan-genome annotation’ framework (Lovell et al., 2021a) permits access to multi-genome networks of high-confidence orthologs and paralogs, regardless of ploidy or other complicating aspects of genome biology. Here, we present GENESPACE, an analytical pipeline that explicitly links synteny and sequence similarity to provide high-confidence inference about networks of genes that share a common ancestor and represents these networks as a ‘pan-genome annotation’. We then leverage this framework to explore gene family evolution in flowering plants, mammals, and reptiles. Results and discussion GENESPACE syntenic orthology methods to compare multiple complex genomes Comparative genomics across the complex evolutionary histories of eukaryotes typically requires equally varied input and analytical pipelines depending on researchers’ goals and study systems. For example, synteny between closely related haploid assemblies is often inferred by exploring only 1:1 reciprocal best-scoring hits with MCScanX (Wang et al., 2012). Alternatively, polyploid genomes are typically split into subgenomes so that homeologs are viewed by clustering algorithms like OrthoFinder (Emms and Kelly, 2019; Emms and Kelly, 2015) as orthologous and not paralogous. While expert knowledge that informs these analytical decisions can dramatically improve precision, this knowledge is not available in many systems. These issues boil down to a simple circular problem: a priori knowledge of gene copy number is needed to effectively infer orthology and synteny, yet measures of synteny and orthology are needed to infer copy number between a pair of sequences. GENESPACE resolves this circular problem by operating on a foundational assumption: homologs should be exactly single copy within any syntenic region between a pair of genomes. There are two major violations of this assumption that cause copy number variation within a syntenic block: (1) tandem arrays and (2) gene PAV. GENESPACE addresses these complexities directly in two ways. First, physically proximate multigene families (hereon, ‘tandem arrays’, Box 1) are condensed to the physically most central gene of the array. Gene rank order along the genome is recalculated on these ‘array representative’ genes, effectively masking copy number variation due to tandem arrays. Second, synteny is inferred in a pairwise manner only using ‘potential anchor’ protein BLAST hits (hereon ‘hits’) where both the query and target genes are in the same orthogroup. The rank-order positions of these potential anchor genes are also recalculated prior to synteny inference, which effectively masks orthogroups missing a gene in one genome (i.e., PAV). Thus, GENESPACE operates on OrthoFinder-derived orthogroups with one-to-one relationships for any accurately defined syntenic region, regardless of ploidy or level of sequence conservation. Given GENESPACE’s reliance on syntenic regions between genomes, errors in syntenic block coordinates can have major effects on downstream estimates. Therefore, we crafted a sensitive pipeline to infer syntenic regions (see pipeline overview in Figure 1). To demonstrate its functionality, we ran GENESPACE on seven pairs of genomes spanning closely related maize cultivars to ancient mammal–bird divergence (~300 M ya; Table 1, see Materials and methods). The three comparisons between vertebrate genomes have no recent history of WGDs, while WGDs predated evolutionary divergence of the four plant contrasts: maize is a 12 M ya paleo-tetraploid, cotton is a 1.6 M ya meso-tetraploid, and the ~70 M ya Rho WGD predated grass diversification. Paralogs derived from these WGDs notoriously obscure contrasts between orthologs in these systems. As such, we treated all genomes as haploid and did not include an outgroup; therefore, orthogroups should not include homeologs since the genomes share a common ancestor that arose after the WGD (see ‘additional considerations’ methods section: Outgroups and the phylogenetic context of orthology inference). In each run, we contrasted GENESPACE-derived orthogroups and syntenic blocks to the defaults from OrthoFinder and MCScanX, respectively (Table 1). Since it is a common practice to refine hits prior to running MCScanX, we also included syntenic block coverage from MCScanX run on orthogroup-constrained hits where both the target and query genes must be in the same orthogroup. To contrast each approach, we calculated the percent of genes in (1) orthogroups and (2) syntenic blocks that were placed on exactly one chromosome per genome (Table 1). Figure 1 Download asset Open asset GENESPACE synteny and pan-genome annotation methods. (A, grey panel) GENESPACE runs and parses OrthoFinder results into a synteny-constrained pan-genome annotation. (B, purple panel) Chromosome, gene rank order, and orthogroup membership are added to BLAST hits, which allows direct integration between estimates of orthology and synteny. The three dotplots present the efficacy of GENESPACE syntenic blocks by exploring a particularly challenging region on human (x-axis) and chimpanzee (y-axis) chr. 6. Each point is a BLAST hit rank-order position, colored by syntenic block; colors are recycled if there are more than eight blocks. (C, green panel) Synteny-constrained orthogroups and optionally non-syntenic orthologs are decomposed into a pan-genome annotation where each orthogroup is placed at its inferred syntenic position. Table 1 Comparison of synteny and orthogroup methods. To test the precision of GENESPACE syntenic orthogroups estimates, we contrasted seven pairs of haploid genome assemblies. We present the percent of genes that were found in an orthogroup that hit a single chromosome per genome from the default OrthoFinder and GENESPACE runs. The precision of syntenic block breakpoint estimates was calculated similarly, where the percentage of genes that are placed in a single syntenic block per genome are presented for MCScanX run on all hits, those where both the query and target genes are in the same orthogroup (‘OG’) or via the GENESPACE pipeline. (a) % genes in single-copy OGs(b) % genes in single-copy syntenic blocksAge (~M ya)OrthoFinderGENESPACEMCScanXMCScanX OGGENESPACEB73 vs. B97 maize*<0.0151.573.650.879.093.4Human Hg38 vs. T2T0–0.187.795.981.195.097.7Cotton*,+0.535.685.72.714.196.2HAL2 vs. FIL2 panicgrass*1.174.883.262.389.392.0Human-chimpanzee781.190.278.691.293.3Sorghum-Brachypodium*5046.750.249.367.476.3Human-chicken31066.768.566.471.273.0 * The plant genomes all have one or more WGDs that predate divergence of the genomes,.+Cotton species Gossypium barbadense and G. darwinii have the most recent WGD of ~1.6 M ya, which causes a large number of blocks to be included as two copies; to avoid confusion between subgenomes, blkSize, and nGaps parameters were increased from 5 (default) to 10 genes. For every pair of genomes, GENESPACE produced a greater percentage of single-copy syntenic blocks and genes in single chromosome orthogroups than either OrthoFinder or MCScanX in isolation. GENESPACE also outperformed simple integration between the two methods through MCScanX on orthogroup-constrained hits. The improved performance of GENESPACE synteny-constrained orthogroups was most subtle between highly diverged haploid animal assemblies. For example, in the comparison between human and chicken, GENESPACE resolved 2% and 1.8% more single-copy orthogroups and syntenic blocks than default OrthoFinder and orthogroup-constrained MCScanX, respectively. In contrast, the benefits of GENESPACE were most pronounced in recently diverged genomes with a history of WGDs. Single-copy orthogroups between two meso-tetraploid cotton species that share a WGD that predated speciation by ~1 M ya, were uncommon in the default OrthoFinder run (35.6%) but far more prevalent in GENESPACE syntenic orthogroups (85.7%). Similarly, homeologs derived from the cotton WGD impacted estimates of syntenic blocks: only 14% of the genomes were single-copy syntenic in orthogroup-constrained MCScanX but 96.2% were single copy in GENESPACE syntenic blocks. Combined, these results demonstrate significant flexibility and utility of GENESPACE across a range of evolutionary histories and divergence. It is important to note that some evolutionary processes, including small-scale translocations, can cause true orthologs to exist outside of syntenic regions. Closely related genomes without a history of WGDs tend to have few non-syntenic orthologs. For example, there are 1096 non-syntenic orthologs (6.3% of all orthologs) between human and chimpanzee. In contrast, the 50 M ya diverged Sorghum and Brachypodium genomes have 9002 non-syntenic orthologs, many of which are the result of over-retained Rho WGD-derived paralogs (see below). Since the non-syntenic orthologs can be important in some situations, GENESPACE embraces this complexity by including and flagging non-syntenic orthologs within the pan-genome annotation (Figure 1). Synteny-anchored exploration of vertebrate sex chromosomes GENESPACE facilitates the exploration and analysis of sequence evolution across multiple genomes within regions of interest (e.g., quantitative trait loci [QTL] intervals, see the next section). One particularly instructive example comes from the origin and evolution of the mammalian XY and avian ZW sex chromosome systems. To explore these chromosomes, we ran GENESPACE on 15 haploid avian and mammalian genome assemblies, spanning most major clades of birds, placental mammals, monotremes, and marsupials with available chromosome-scale annotated reference genomes (See Materials and methods). We also included two reptile genomes as outgroups to the avian genomes. The heteromorphic chromosomes (Y and W) are often unassembled, or, where assemblies exist, lack sufficient synteny to provide a useful metric for comparative genomics. As such, we chose to focus on the homomorphic X and Z chromosomes, which have remained surprisingly intact over the >100 M years of independent mammalian (Murphy et al., 1999) and avian evolution (Zhou et al., 2014; Figure 2, Figure 2—figure supplement 1). Figure 2 with 1 supplement see all Download asset Open asset Sex chromosome syntenic network across 17 representative vertebrate genomes. The plot was generated by the plot_riparian GENESPACE function. Genomes are ordered vertically to minimize the number of translocations between each pairwise combination. Chromosomes are ordered horizontally to maximize synteny with the human chromosomes [X, 1–22]. Regions containing syntenic orthogroup members to the mammalian X (gold) or avian Z (blue) chromosomes are highlighted. All sex chromosomes are represented by red segments while autosomes are white. Chromosome segment sizes are scaled by the total number of genes in syntenic networks and positions of the braids are the gene order along the chromosome sequence. See Figure 2—figure supplement 1 for the full synteny graph including autosomes and chromosome labels. While the same or similar genomic regions often recurrently evolve into sex chromosomes, perhaps due to ancestral gene functions involved in gonadogenesis, evidence about the nonrandomness of sex chromosome evolution is still contentious (Kratochvíl et al., 2021). Given our analysis, the avian Z chromosome clearly did not evolve from either of the two reptile Z chromosomes sampled here, but instead likely arose from autosomal regions or unsampled ancestral sex chromosomes. The situation in mammals is less clear, in part because both reptile genomes are more closely related to avian than mammalian genomes, which makes ancestral state reconstructions between the two groups less accurate. Nonetheless, the mammalian X and sand lizard Z chromosomes partially share syntenic orthology, an outcome that would be consistent with common descent from a shared ancestral sex chromosome or autosome containing sex-related genes. The shared 91.7 M bp region between the human X and sand lizard Z represents 59.0% of the human X chromosome genic sequence. The remaining 64.0 M bp of human X linked sequences are syntenic with autosomes in the sand lizard and garter snake genomes (Figure 2). The eutherian mammalian X chromosome is largely composed of two regions, an X-conserved ancestral sex chromosome region that arose in the common ancestor of therian mammals, and an X-added region that arose in the common ancestor of eutherians (Ross et al., 2005). Consistent with this evolutionary history, the X chromosome is syntenic across all five eutherian mammals studied here. Further, a 107.2 M bp (68.8%) segment of the human X, which corresponds with the X-conserved region, is syntenic with 77.8 M bp (93.9%) of the Tasmanian devil X chromosome and represents the entire syntenic region between the human and all three marsupial X chromosomes (Figure 2). Similarly, the chicken Z chromosome is retained in its entirety across all five avian genomes. The only notable exception being the budgie Z chromosome, which features a partial fusion between the Z and an otherwise autosomal 19.5 M bp segment of chicken chromosome 11 (Figure 2), potentially representing a neo-sex chromosome fusion that has not yet been described. In contrast to conserved eutherian and avian sex chromosomes, the complex monotreme XnYn sex chromosomes are only partially syntenic between the two sampled genomes. Only the first X chromosomes are ancestral to both echidna and platypus (Rens et al., 2007), and all are unrelated to the mammalian X chromosomes (Figure 2—figure supplement 1), consistent with their independent evolution (Rens et al., 2007). Interestingly, the entirety of the echidna X4 and 47.6 M bp (67.9%) of the genic region of the platypus X5 chromosomes are syntenic with the avian Z chromosome (Figure 2). The phylogenetic scale of the genomes presented here precludes evolutionary inference about the origin of these shared sex chromosome sequences; however, the possibility of parallel evolution of sex chromosomes between such diverged lineages may prove an interesting future line of inquiry. Exploiting synteny to track candidate genes in grasses The Poaceae grass plant family is one of the best studied lineages of all multicellular eukaryotes and includes experimental model species (Brachypodium distachyon; Panicum hallii; Setaria viridis) and many of the most productive (Zea mays – maize/corn; Triticum aestivum – wheat, Oryza sativa – rice) and emerging (Sorghum bicolor – sorghum; Panicum virgatum – switchgrass) agricultural crops. Despite the tremendous genetic resources of these and other grasses, genomic comparisons among grasses are difficult, in part because of an ancient polyploid origin (see the next section), and because subsequent WGDs are a feature of most clades of grasses. For example, maize is an 11.4 M ya paleo-polyploid (Gaut and Doebley, 1997), allo-tetraploid switchgrass formed 4–6 M ya (Lovell et al., 2021b), and allo-hexaploid bread wheat arose about 8 k ya (Haas et al., 2019). In some cases, homeologous gene duplications from polyploidy have generated genetic diversity that can be targeted for crop improvement; however, in other cases the genetic basis of trait variation may be restricted to sequences that arose in a single subgenome. Thus, it is crucial to contextualize comparative–quantitative genomics and explicitly explore only the orthologous or homeologous regions of interest when searching for markers or candidate genes underlying heritable trait variation — a significant challenge in the complex and polyploid grass genomes. To help overcome this challenge and provide tools for grass comparative genomics, we conducted a GENESPACE run and built an interactive viewer hosted on Phytozome (Goodstein et al., 2012). Owing to its use of within-block orthology and synteny constraints (Figure 1), GENESPACE is ideally suited to conduct comparisons across species with diverse polyploidy events. Default parameters produced a largely contiguous map of synteny even across notoriously difficult comparisons like the paleo-homeologs between the maize subgenomes (Figure 3A). Furthermore, the sensitive synteny construction pipeline implemented by GENESPACE effectively masks additional paralogous regions like those from the Rho duplication that gave rise to all extant grasses. Figure 3 with 1 supplement see all Download asset Open asset Comparative–quantitative genomics in the grasses. (A) The GENESPACE syntenic map (‘riparian plot’) of orthologous regions among eight grass genomes. Chromosomes are ordered horizontally to maximize synteny with rice and ribbons are color coded by synteny to rice chromosomes. Genomes are ordered vertically by general phylogenetic positions. (B) The upper bars display the proportion of maize gene models without syntenic orthologs (‘absent’) in each genome, split by the full background (dark colors) and 86 C3/C4 genes (light colors). (C) The proportion of absent genes is higher in the C3 genomes (green bars), even when controlling for more global gene absences (lower odds ratios). (D) Syntenic orthologs, excluding homeologs among the 26 maize nested association mapping (NAM) founder genomes, with two quantitative trait loci (QTL) intervals highlighted on chromosome 3 (‘Chr3’) and chromosome 6 (’Chr6’). (E) Focal QTL regions that affect productivity in drought where only the genome that drives the QTL effect (middle), the top (B73) and bottom (Tzi8) genomes are presented and the region plotted is restricted to the physical B73 QTL interval and a 25 M bp buffer on either side. Note that the Chr3 QTL disarticulates into two intervals. Due to a larger number of potential candidate genes, the larger Chr3 region, flagged with **, is explored separately in Figure 3—figure supplement 1. (F) Presence–absence and copy number variation are presented for two of the three intervals as heatmaps where each row is a genome (order following panel D), each column is a pan-genome entry (see Figure 1), and the color of each tile indicates absence (gray), single copy (light blue), and multicopy (dark blue). PAV/CNV of the focal genome is outlined. For each interval, the estimated QTL allelic effect relative to B73 of each genome is plotted as bars to the right of the heatmap. Breeders and molecular biologists can take two general approaches to understand the genetic basis of complex traits: studying variation caused by a priori-defined genes of interest or determining candidate genes from genomic regions of interest. As an example of the exploration of lists of a priori-defined candidate genes, we analyzed PAV of 86 genes shown to be involved in the transition between C3 and C4 photosynthesis (Ding et al., 2015), the latter permitting ecological dominance in arid climates and agricultural productivity under forecasted increased heat load of the next century. To conduct this analysis, we built pan-genome annotations across the seven grasses anchored to C4 maize, which was the genome in which these genes were discovered. This resulted in 159 pan-genome entries: nearly always two placements for each gene in the paleo-tetraploid maize genome. Given that many of these genes were discovered in part because of sequence similarity to genes in Arabidopsis and other diverged plant species, it is not surprising that PAV among C3/C4 genes was lower than the background (9.7% vs. 38.2%, odds ratio = 5.7, p < 1 × 10−16; Figure 3B). However, these ratios were highly variable among genomes, particularly among the C3 species (wheat, rice, B. distachyon), which had far more absences than the C4 species (15.3% vs. 5.5%, odds ratio = 3.1, p = 6.25 × 10−8, Figure 3B). This effect is undoubtedly due in part to the increased evolutionary distance between maize and the C3 species compared to the other C4 species. However, when controlling for the elevated level of absent genes globally in C3 species, the effect was still very strong: the odds of C3 species having more of these C3/C4 genes at syntenic pan-genome positions than the background was always lower than the C4 species (Figure 3C). Despite these interesting patterns, given only a single C3/C4 phylogenetic split in this dataset, it is not possible to test evolutionary hypotheses regarding the causes of such PAV. Nonetheless, this result suggests a possible role of gene loss or gain as an evolutionary mechanism for drought- and heat-adapted photosynthesis. Like the exploration of a priori-defined sets of genes, finding candidate genes within QTL intervals usually involves querying a single reference genome and extracting genes with promising annotations or putatively functional polymorphism. In the case of a biparental mapping population genotyped against a single reference, this is a trivial process where genes within physical bounds of a QTL are the candidates. However, many genetic mapping populations now have reference genome sequences for all parents; this offers an opportunity to explore variation among functional alleles and PAV, which would be impossible with a single reference genome. GENESPACE is ideally suited for this type of exploration, and indeed was originally designed to solve this problem between the two P. hallii reference genomes and their F2 progeny (Lovell et al., 2018) using synteny to project the positions of genes across multiple genomes onto the physical positions of a reference. To illustrate this approach, we reanalyzed QTL generated from the 26-parent USA maize nested association mapping (NAM) population (Li et al., 2016). Originally, candidates for these QTL were defined by the proximate gene models only in the B73 reference genome (Li et al., 2016); however, with GENESPACE and the recently released NAM parent genomes (Hufford et al., 2021), it is now possible to evaluate candidate genes present in the genomes of other NAM founder lines but either absent or unannotated in the B73 reference genome. To explore this possibility, we built a single-copy synteny map across all 26 NAM founders anchored to the B73 genome (Figure 3D). We opted to focus on QTL where the allelic effect of a single parental genome was an outlier relative to all other alleles. Such ‘private’ allelic contributions in mul

Posted ContentDOI
15 Dec 2022-bioRxiv
TL;DR: In this paper , the authors used Arabidopsis thaliana plants deficient for miR472 function to study the impact of releasing its NLR targets on plant growth and reproduction and on defence against the fungal pathogen Plectospharaella cucumerina.
Abstract: After having co-existed in plant genomes for at least 200 million years, the products of microRNA (miRNA) and Nucleotide-Binding Leucine Rich Repeat protein (NLR) genes formed a regulatory relationship in the common ancestor of modern gymnosperms and angiosperms. From then on, DNA polymorphisms occurring at miRNA target sequences within NLR transcripts must have been compensated by mutations in the corresponding mature miRNA sequence, therefore maintaining that regulatory relationship. The potential evolutionary advantage of such regulation remains largely unknown and might be related to two mutually non-exclusive scenarios: miRNA-dependent regulation of NLR levels might prevent defence mis-activation with negative effects on plant growth and reproduction; or reduction of active miRNA levels in response to pathogen derived molecules (PAMPS and silencing suppressors) might rapidly release otherwise silent NLR transcripts for rapid translation and thereby enhance defence. Here, we used Arabidopsis thaliana plants deficient for miR472 function to study the impact of releasing its NLR targets on plant growth and reproduction and on defence against the fungal pathogen Plectospharaella cucumerina. We show that miR472 regulation has a dual role, participating both in the tight regulation of plant defence and growth. MIM472 lines, with reduced active miR472, are more resistant to pathogens and, correlatively, have reduced relative growth compared to wild-type plants. However, despite MIM472 lines flower at the same time than their wild-type, the end of their reproductive phase is delayed, and they exhibit higher adult biomass, resulting in similar seed yield as the wild-type. Our study highlights how negative consequences of defence activation might be compensated by changes in phenology and that miR472 reduction is an integral part of plant defence responses.

TL;DR: A high resolution predictive model of mutation rates as a function of cytogenetic features across the genome is created and it is found that mutation rates are significantly predicted by features such as GC content, histone modifications, and chromatin accessibility.
Abstract: 29401, USA 7 ​ Department of Biology and Microbiology, South Dakota State University, Brookings, SD 57007, USA † ​ corresponding authors: ​ greymonroe@gmail.com ​ , ​ weigel@weigelworld.org Classical evolutionary theory maintains that mutation rate variation between genes should be random with respect to fitness ​ 1–4 and evolutionary optimization of genic mutation rates remains controversial ​ 3,5 ​ . However, it has now become known that cytogenetic (DNA sequence + epigenomic) features influence local mutation probabilities 6 ​ , which is predicted by more recent theory to be a prerequisite for beneficial mutation rates between different classes of genes to readily evolve ​ 7 ​ . To test this possibility, we used ​ de novo mutations in ​ Arabidopsis thaliana ​ to create a high resolution predictive model of mutation rates as a function of cytogenetic features across the genome ​ . As expected, mutation rates are significantly predicted by features such as GC content, histone modifications, and chromatin accessibility. Deeper analyses of predicted mutation rates reveal effects of introns and untranslated exon regions in distancing coding sequences from mutational hotspots at the start and end of transcribed regions in A. thaliana ​

Peer ReviewDOI
04 Sep 2022
TL;DR: In this article , the authors compare three ciliate species that share complex pathways for natural genome editing and reveal a pathway for the origin and evolution of extremely rearranged chromosomes, which allows capture of intermediate states in the acquisition of scrambled genes.
Abstract: The comparison of three ciliate species that share complex pathways for natural genome editing allows capture of intermediate states in the acquisition of scrambled genes and elucidating a pathway for the origin and evolution of extremely rearranged chromosomes.