
Asymmetric subgenome selection and cis-regulatory divergence during cotton domestication

Maojun Wang¹, Lili Tu¹, Min Lin¹, Zhongxu Lin¹, Pengcheng Wang¹, Qingyong Yang¹, Zhengxiu Ye¹, Chao Shen¹, Jianying Li¹, Lin Zhang¹, Xiaolin Zhou¹, Xinhui Nie², Zhonghua Li¹, Kai Guo¹, Yizan Ma¹, Cong Huang¹, Shuangxia Jin¹, Longfu Zhu¹, Xiyan Yang¹, Ling Min¹, Daojun Yuan¹, Qinghua Zhang¹, Keith Lindsey³, Xianlong Zhang¹

Huazhong Agricultural University¹, Xinjiang Production and Construction Corps², Durham University³

01 Apr 2017 - Nature Genetics (Nature Research) - Vol. 49, Iss. 4, pp. 579-587

TL;DR: A variation map for 352 wild and domesticated cotton accessions is described and evidence showing asymmetric subgenome domestication for directional selection of long fibers is provided, providing new insights into the evolution of gene organization, regulation and adaptation in a major crop.


Abstract: Comparative population genomics offers an excellent opportunity for unraveling the genetic history of crop domestication. Upland cotton (Gossypium hirsutum) has long been an important economic crop, but a genome-wide and evolutionary understanding of the effects of human selection is lacking. Here, we describe a variation map for 352 wild and domesticated cotton accessions. We scanned 93 domestication sweeps occupying 74 Mb of the A subgenome and 104 Mb of the D subgenome, and identified 19 candidate loci for fiber-quality-related traits through a genome-wide association study. We provide evidence showing asymmetric subgenome domestication for directional selection of long fibers. Global analyses of DNase I-hypersensitive sites and 3D genome architecture, linking functional variants to gene transcription, demonstrate the effects of domestication on cis-regulatory divergence. This study provides new insights into the evolution of gene organization, regulation and adaptation in a major crop, and should serve as a rich resource for genome-based cotton improvement.


Summary


A genome variation map for cotton

The 352 accessions collected for sequencing included 31 wild accessions and 321 cultivated accessions from around the world (Fig. 1a and Supplementary Table 1).
The re-sequencing data (6.1 Tb in total, at an average depth of 6.9×) were mapped against the TM-1 genome [9], identifying 7,497,568 SNPs, 351,013 small indels (shorter than 10 bp) and 93,786 structural variants (Table 1).
In addition, the authors selected 50 representative accessions (10 wild and 40 cultivated cottons) from the 352 accessions for RNA sequencing (Supplementary Table 5), and generated 78,728 SNPs, of which more than 93.6% overlapped with SNPs from re-sequencing data.
This integrated variation data set represents a new resource for cotton genetics and breeding.

Cotton population properties and linkage disequilibrium

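The authors explored the phylogenetic relationship between the 352 cotton accessions using a whole-genome SNP analysis, which divided them into three groups: the Wild group (Group-I), the ABI group (Group-II; mainly cottons from America, Brazil and India) and the Chinese group (Group-III).
The Chinese group could be further classified into two subclades (Group-III-1 and Group-III-2; Fig. 1b), which exhibit different geographic distribution patterns.
The subclade Group-III-1 is represented by cotton accessions from northern China (the Northwestern Inland Region, NIR, and the Northern Specific Early Maturation Region, NSEMR), while Group-III-2 includes the majority of accessions from southern China (the Yangtze River Region, YtRR).
Nucleotide diversity (π) decreased from the Wild group to the ABI group and then to the Chinese group, showing that a large amount of genetic variation in both subgenomes has been lost during cotton domestication, especially for the Dt. Compared with other major crops, cotton possesses narrow genetic diversity even within wild cotton accessions (Supplementary Table 6).
Population fixation statistics (F_ST) among groups reveal large population divergence between the Chinese group and the Wild group.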

Selection signals during cotton domestication

Millennia of domestication have brought many morphological transformations to cotton, including an annualized growth cycle, photoperiod insensitivity, loss of seed dormancy, and superior spinnable white fiber [7,8].
The authors investigated nucleotide diversity of genes residing in the 25 QTL hotspots to identify putative loci with selection signals underlying these domestication-related traits.
Strikingly, 19 of 25 QTL hotspots with 327 genes were located in the Dt.
Fiber quality improvement has been one of the most important breeding goals during cotton domestication.
Among the fiber-quality associations identified by GWAS, 16 signals were previously uncharacterized.
The authors also identified a GWAS signal associated with fiber elongation rate on chromosome D04 (Fig. 2g), where a gibberellin response gene is located.

Asymmetric subgenome domestication for long white fiber

Most fiber characteristics in wild Upland cotton were probably inherited directly from its wild A-genome diploid ancestor post-allopolyploidization [30], while fiber color is similar to that of its D-genome diploid ancestor.
The development of the long white fiber trait in cultivated Upland cotton is the result of millennia of strong directional selection from its wild counterpart.
The genetic basis of this developmental change remains largely unknown.
To understand the relative contributions of the co-existing At and Dt subgenomes during domestication, the authors constructed ancestral pseudochromosomes and compared selection signals at the subgenome level.
By comparing overlaps with domestication signals, the authors identified 620 homoeologous pairs that have been subject to domestication selection in the At or Dt (192 in the At and 428 in the Dt), and only 34 homoeologous pairs with selection signals in both subgenomes (Supplementary Fig. 6).

Effects of domestication on cis-regulatory elements in promoters

Human selection of desirable agronomic traits not only affects the organization of functional genes, but may also reshape the gene regulatory landscape.
Specifically, intergenic non-coding variants can affect the activity of cis-regulatory elements (CREs) [39-41], and can contribute to differential gene expression patterns between populations (Supplementary Fig. 7).
As predicted, the patterns of chromatin modification marks in cotton are different between genic and TE regions (Supplementary Fig. 10).
To investigate how variants in promoter DHSs might influence the expression of genes, the authors looked for associations between variants and transcription factor binding motifs.
The authors found that some well-known transcription factor binding motifs were under purifying selection in the cultivated groups, and some were under positive selection (Fig. 4i and Supplementary Table 14).

Genome variation underlies distant regulatory divergence

A range of high-throughput methods, such as high-throughput chromosome conformation capture (Hi-C) and chromatin interaction analysis by paired-end tag sequencing (ChIA-PET), have been developed to understand 3D genome architecture in the eukaryotic nucleus [47,48].
The authors generated 1.1 billion Hi-C paired-end reads, of which ca. 322 million were valid interaction reads (Supplementary Table 15).
The authors found that chromatin interactions are significantly enriched at promoters, at distal DHSs such as enhancers, and at regions marked by the active chromatin mark H3K4me3, but are less frequent at regions marked by H3K9me2 (Fig. 5d).
DNase I digestion of chromatin on a representative wild cotton accession revealed that more than 94% of enhancers are shared in wild and domesticated cottons (Fig. 5j), suggesting that domestication has had a limited effect on qualitative changes to enhancers.

DISCUSSION

Genome re-sequencing of 352 accessions of Upland cotton has provided new insights into the genetic history of this important crop.
Interestingly, the authors found no obvious population divergence between geographic groups in China, probably because of frequent migration of accessions for improvement breeding within a short period after introduction.
The authors primarily characterized some key molecular signatures of selection responsible for spinnable fine white fiber, of which some candidates were further identified by a GWAS analysis.

METHODS

In the GWAS figure, the horizontal grey dashed lines in panels b-d and f-i show the significance threshold (1/n; 6.3 on the −log10(P) scale).
The other significant associations are presented in Supplementary Table 10 .

Plant materials and re-sequencing

Based on the population structure analysis, a core germplasm set of 282 accessions was determined (Supplementary Table 1).
Cotton plants were cultivated in the greenhouse in Wuhan, China.
Young leaves were collected 4 weeks after planting and immediately frozen in liquid nitrogen until use.
Genomic DNA was extracted from leaves using the CTAB method [56].
Paired-end sequencing (PE 150-bp reads) of each library was performed on the Illumina HiSeq X Ten system.

Mapping and variation calling

The allotetraploid cotton genome (Gossypium hirsutum L. acc. TM-1) and its annotation [9] were downloaded from the Internet (see URLs).
Scaffolds with lengths less than 1000 bp were excluded from further analysis.
Paired-end re-sequencing reads were mapped to the TM-1 genome using BWA software with the default parameters.
To obtain high-quality SNPs and indels, only variation detected by both software tools with sequencing depth of at least 8 was retained for further analysis.
Indels in exons were classified according to whether they lead to a frame-shift effect.

Prediction of structural variation

Structural variations (SVs) were identified using three software tools: Breakdancer (version 1.3.6) [61], Delly (version 2) [62] and laSV (version 1.0.3) [63], which integrate most existing methods (read-depth, read-pair, split-reads and de novo assembly of sequencing reads) for SV discovery.
Breakdancer was run on all cotton accessions using the BWA alignment with the parameters (-q 20 -y 30).
Delly, which uses paired-end mapping and a split-read method to discover SVs in the genome, was run separately for each sample using default settings.
laSV, which first performs a reference-free de novo assembly of the sequencing reads and then compares the assembled contigs with the reference genome to identify SVs, was run separately for each sample using parameters (-k 75 -l 150 -s 20).
SVs (deletion, duplication, insertion and inversion) were retained if supported by at least two methods with a mapping depth of more than 10×.
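
The retention rule above (support from at least two of the three callers, and mapping depth above 10×) can be sketched as follows. This is only an illustration under stated assumptions: SVs from different tools are matched by identical (chromosome, start, end, type) tuples, whereas a real pipeline would likely merge calls by breakpoint proximity or reciprocal overlap; the function name, the tool keys and the toy records are hypothetical.

```python
from collections import Counter


def consensus_svs(calls_by_tool, depth, min_tools=2, min_depth=10):
    """Keep SVs reported by >= min_tools callers with mapping depth > min_depth.

    calls_by_tool: dict mapping a caller name to a list of SV tuples
                   (chrom, start, end, sv_type). Exact-tuple matching
                   across callers is an assumption made for brevity.
    depth:         dict mapping an SV tuple to its mapping depth.
    """
    support = Counter()
    for calls in calls_by_tool.values():
        for sv in set(calls):  # count each caller at most once per SV
            support[sv] += 1
    return sorted(
        sv for sv, n_tools in support.items()
        if n_tools >= min_tools and depth.get(sv, 0) > min_depth
    )


# Hypothetical toy records, not taken from the paper.
sv1 = ("A01", 100, 500, "DEL")
sv2 = ("A01", 9000, 9800, "DUP")
sv3 = ("D05", 200, 260, "INV")
calls = {
    "breakdancer": [sv1, sv2],
    "delly": [sv1, sv3],
    "lasv": [sv1, sv2],
}
depths = {sv1: 15, sv2: 8, sv3: 30}
kept = consensus_svs(calls, depths)
print(kept)
```

In the toy data only the first deletion passes both filters: the duplication has two supporting callers but insufficient depth, and the inversion has sufficient depth but a single supporting caller.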

Population-genetic analyses

To conduct the phylogenetic analysis, SNPs of all accessions were filtered at a minor allele frequency (MAF) threshold of 0.05.
These SNPs were used to construct a neighbour-joining tree using PHYLIP software [64] and visualized using the online tool iTOL (see URLs).
Principal component analysis (PCA) was performed using this SNP set with the smartpca program embedded in the EIGENSOFT package [65].
Population structure was analyzed using the Structure program, which infers the population structure by identifying different numbers of clusters (K) [66].
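
The MAF filter applied before tree building can be illustrated with a short sketch. The genotype encoding (0, 1 or 2 copies of the alternative allele, `None` for a missing call) and the function names are assumptions made for this example, and because the comparison operator is not legible in the text above, the sketch assumes sites with MAF at or above 0.05 are kept.

```python
def minor_allele_frequency(genotypes):
    """MAF of one biallelic site from diploid genotypes coded 0/1/2.

    Missing calls (None) are ignored. Returns 0.0 if nothing was called.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def filter_by_maf(sites, threshold=0.05):
    """Keep sites whose MAF is >= threshold (direction is an assumption)."""
    return {
        name: genos for name, genos in sites.items()
        if minor_allele_frequency(genos) >= threshold
    }


# Hypothetical toy genotype matrix: site name -> per-accession genotypes.
sites = {
    "snp_rare": [0] * 19 + [1],                     # MAF = 1/40 = 0.025
    "snp_common": [0, 1, 2, 1, 0, None, 2, 1, 0, 1],  # MAF = 8/18
    "snp_fixed": [0] * 50,                          # monomorphic
}
kept_sites = filter_by_maf(sites)
print(sorted(kept_sites))
```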

Identification of domestication sweeps

For domestication sweep analysis, the authors combined cultivated cotton groups (ABI and Chinese groups) into a single group to exclude the potential effect of genetic drift.
The genetic diversity in the wild group was compared with that in the cultivated group. Windows with an empirical F_ST cutoff (top 5%) were regarded as highly differentiated regions.
These regions were compared with the analysis of domestication sweeps.
Genes with nonsynonymous SNPs in these regions were selected as under selective pressure across groups.
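
An empirical top-5% cutoff of the kind described above amounts to ranking windows by their statistic and keeping the highest-scoring fraction. The sketch below assumes per-window F_ST values have already been computed by some other tool; the window identifiers, the function name and the rounding rule (floor, with at least one window kept) are illustrative choices, not details given in the paper.

```python
def top_fraction_windows(fst_by_window, fraction=0.05):
    """Return window ids whose F_ST falls in the top `fraction` of all windows.

    fst_by_window: dict mapping a window id to its F_ST value.
    The number of windows kept is floor(n * fraction), but at least one.
    """
    ranked = sorted(fst_by_window.items(), key=lambda kv: kv[1], reverse=True)
    n_keep = max(1, int(len(ranked) * fraction))
    return [window for window, _ in ranked[:n_keep]]


# Hypothetical toy input: 40 windows with increasing F_ST.
toy_fst = {f"w{i}": i / 100 for i in range(40)}
outliers = top_fraction_windows(toy_fst)
print(outliers)
```

With 40 toy windows, the top 5% corresponds to the two windows with the largest values.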

Genome-wide association studies for fiber quality-related traits

The traits include fiber length, fiber strength, micronaire value, fiber uniformity and fiber elongation rate.
The significant association threshold was set as 1/n (n, total SNP number).
The significant association regions were manually checked from the aligned re-sequencing reads against the TM-1 genome using SAMtools [59].
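
The 1/n threshold is easiest to read on the −log10(P) scale used in Manhattan plots. The following sketch converts between the two; the SNP count of 2,000,000 is purely illustrative (the number of SNPs entering the GWAS is not stated in this section), chosen because it gives a value close to the 6.3 quoted for the figure threshold, assuming that figure is expressed as −log10(P).

```python
import math


def gwas_threshold(n_snps):
    """Return (p_cutoff, -log10(p_cutoff)) for a 1/n significance threshold."""
    if n_snps <= 0:
        raise ValueError("n_snps must be positive")
    p_cutoff = 1.0 / n_snps
    return p_cutoff, -math.log10(p_cutoff)


def snps_implied_by(neg_log10_p):
    """Invert the rule: how many SNPs would give this -log10(P) threshold."""
    return 10 ** neg_log10_p


p_cut, neg_log = gwas_threshold(2_000_000)  # illustrative SNP count
print(f"P < {p_cut:.1e}, -log10(P) > {neg_log:.2f}")
print(f"a threshold of 6.3 implies about {snps_implied_by(6.3):,.0f} SNPs")
```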

Construction of ancestral karyotypes

To analyze selection signals at the subgenome level, the authors constructed the ancestral karyotype for each of the 13 chromosomes in putative diploid ancestors.
Homoeologous synteny blocks were identified in the 13 chromosome pairs between the At and the Dt subgenomes using MCScanX with default settings [70].
Syntenic gene pairs were identified in these syntenic blocks containing more than five aligned genes.
A reciprocal blastp was run using gene sequences from the At and Dt subgenomes.
Genomic sequences consisting of gene regions and their flanking 2 kb sequences were ordered based on the Dt subgenome and concatenated to construct ancestral karyotypes.
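
One common way to turn a reciprocal BLAST into gene pairs is the reciprocal-best-hit rule: gene a in At and gene d in Dt are paired only if each is the other's top-scoring hit. The text above does not say which pairing rule was used, so the sketch below should be read as an assumed illustration of that general technique; the gene identifiers and bit scores are invented.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict mapping (query, subject) to a similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_d, d_vs_a):
    """Return sorted (At gene, Dt gene) pairs that are mutual best hits."""
    best_a = best_hits(a_vs_d)
    best_d = best_hits(d_vs_a)
    return sorted((a, d) for a, d in best_a.items() if best_d.get(d) == a)


# Hypothetical toy BLAST results (gene names and scores are invented).
a_vs_d = {
    ("At_g1", "Dt_g1"): 500, ("At_g1", "Dt_g2"): 300,
    ("At_g2", "Dt_g2"): 450, ("At_g3", "Dt_g1"): 200,
}
d_vs_a = {
    ("Dt_g1", "At_g1"): 510, ("Dt_g2", "At_g2"): 440,
    ("Dt_g2", "At_g1"): 290,
}
pairs = reciprocal_best_hits(a_vs_d, d_vs_a)
print(pairs)
```

In the toy data At_g3 is left unpaired because its best Dt hit prefers At_g1.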

RNA-seq and data analysis

Cotton leaves were sampled for gene expression analysis at the same developmental stage as for DNA re-sequencing.
A total of 2 µg RNA was used for library construction using the Illumina TruSeq RNA Kit (Illumina, San Diego, CA, USA) following the manufacturer's instructions.
RNA sequencing was performed on the Illumina HiSeq 3000 system (paired-end 150-bp reads).
The expression level of each gene was determined using Cufflinks (version 2.2.1) with a multi-read and fragment bias correction method [73].

Bisulfite-treated DNA sequencing data analysis

The authors downloaded bisulfite-treated DNA sequencing data for leaf and fiber of TM-1 from the National Center for Biotechnology Information (NCBI) Sequence Read Archive collection (SRX710548-SRX710553).
Trimmomatic software was applied to clip sequencing adapters and filter low-quality reads [74].
The Bismark methylation extractor program was run to extract potentially methylated cytosines.
Cytosines in CG, CHG and CHH contexts covered by at least three sequencing reads were retained for a binomial test (P-value cutoff 1e-5).
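
A binomial test of this kind asks whether the number of reads supporting methylation at a cytosine is larger than expected from background error alone. The sketch below is a minimal standard-library version; the null probability (`error_rate`, for example a bisulfite non-conversion rate) is not given in the text above, so the value 0.005 used here is an assumption, as are the function names and the use of a strict "less than" comparison against the 1e-5 cutoff.

```python
from math import comb


def binom_sf(k, n, p):
    """Upper-tail binomial probability P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def is_methylated(meth_reads, total_reads, error_rate, min_cov=3, alpha=1e-5):
    """Call a cytosine methylated if coverage >= min_cov and the one-sided
    binomial P-value against the background error rate is below alpha."""
    if total_reads < min_cov:
        return False
    return binom_sf(meth_reads, total_reads, error_rate) < alpha


ERROR_RATE = 0.005  # assumed background rate, not taken from the paper
print(is_methylated(10, 10, ERROR_RATE))  # every read supports methylation
print(is_methylated(1, 10, ERROR_RATE))   # a single supporting read
print(is_methylated(2, 2, ERROR_RATE))    # below the coverage requirement
```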

DNase-seq and DHS identification

Purified DNA fragments of between 100 bp and 200 bp following DNase I digestion were isolated with a Pippin HT (Sage Science, Beverly, MA, USA).
A total of 10 ng of the isolated fragments was used for library construction using the Illumina TruSeq Sample Prep Kit.
Libraries were sequenced using the Illumina HiSeq 2000 system (paired-end 100-bp reads).
The unique mapping data were processed to identify DNase I hypersensitive sites (DHSs).
MACS (version 1.4.2) [80], another peak-calling algorithm, was also run to identify DHSs.

Motif discovery

The promoter DHSs were screened for transcription factor (TF) binding motifs using the findMotifsGenome.pl program in HOMER software (see URLs) [82], with the parameters '-size given -len 8,10,12 -chopify -mset plants'.
The 2 kb upstream sequences of genes were used for motif discovery by the Patch 1.0 program, which searches the TRANSFAC Public 6.0 database (see URLs), with the following parameters: 1) the minimum length of sites was 8; 2) the maximum number of mismatches was 1; 3) the mismatch penalty was 100; 4) the lower score boundary was 87.5.

ChIP-Seq and data analysis

For each sample, a total of 10 ng ChIP DNA and Input control DNA were used for library construction using the Illumina TruSeq Sample Prep Kit, according to the manufacturer's instructions.
ChIP libraries were sequenced on the Illumina HiSeq 3000 system (paired-end 150-bp reads).
After removing PCR duplicates and multi-mapping reads, the uniquely mapped data were used to call histone modification peaks using MACS software (version 2.1.0) [80].
The Input DNA sequencing data was used as a control.

Hi-C experiments and sequencing

The clean samples were ground to powder in liquid nitrogen.
DNA ends were labelled with biotin and incubated at 37°C for 45 min, and the enzyme was inactivated with 20% SDS solution.
After ligation, proteinase K was added to reverse cross-linking by incubation at 65°C overnight.
Hi-C libraries were sequenced on the Illumina HiSeq 3000 system.
The Hi-C experiment was carried out as two biological replicates.

Hi-C data analysis

Raw Hi-C data were processed to filter low-quality reads and trim adapters using Trimmomatic (version 0.32) [74].
Read pairs that did not map close to a restriction site, or were not within the expected fragment size following shearing, were first filtered.
The remaining valid read pairs were divided into intra-chromosomal pairs and inter-chromosomal pairs.
Results from the second pass after an initial fit were used for further analysis.
Fragments overlapping with intergenic DHSs or promoters were extracted to construct a regulatory interactome.
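
The last two steps, splitting valid pairs by chromosome and keeping contacts that touch regulatory regions, can be sketched in a few lines. This is a simplified illustration: it works on read-end positions rather than restriction fragments, it requires both ends of a pair to overlap a promoter or intergenic DHS (the text does not state whether one or both ends must overlap), and all names and coordinates are hypothetical.

```python
def split_valid_pairs(pairs):
    """Split read pairs into intra- and inter-chromosomal lists.

    Each pair is ((chrom1, pos1), (chrom2, pos2)).
    """
    intra, inter = [], []
    for pair in pairs:
        (chrom1, _pos1), (chrom2, _pos2) = pair
        (intra if chrom1 == chrom2 else inter).append(pair)
    return intra, inter


def overlaps_regulatory(end, regions):
    """True if a read end (chrom, pos) lies inside any (chrom, start, stop)."""
    chrom, pos = end
    return any(c == chrom and start <= pos <= stop for c, start, stop in regions)


def regulatory_interactome(pairs, regions):
    """Keep pairs whose two ends both overlap a regulatory region
    (the both-ends requirement is an assumption of this sketch)."""
    return [
        pair for pair in pairs
        if overlaps_regulatory(pair[0], regions)
        and overlaps_regulatory(pair[1], regions)
    ]


# Hypothetical toy data: three valid pairs and three regulatory intervals.
toy_pairs = [
    (("A01", 100), ("A01", 5000)),
    (("A01", 150), ("D01", 700)),
    (("D01", 720), ("D01", 9000)),
]
toy_regions = [("A01", 50, 200), ("A01", 4900, 5100), ("D01", 650, 750)]
intra_pairs, inter_pairs = split_valid_pairs(toy_pairs)
interactome = regulatory_interactome(toy_pairs, toy_regions)
print(len(intra_pairs), len(inter_pairs), len(interactome))
```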


Table 1. Summary of the numbers of genomic variants in cotton populations.

Durham Research Online

Deposited in DRO: 07 March 2017

Version of attached file: Accepted Version

Peer-review status of attached file: Peer-reviewed

Citation for published item: Wang, M. and Tu, L. and Lin, M. and Lin, Z. and Wang, P. and Yang, Q. and Ye, Z. and Shen, C. and Zhou, X. and Zhang, L. and Li, J. and Nie, X. and Li, Z. and Guo, K. and Ma, Y. and Jin, S. and Zhu, L. and Yang, X. and Min, L. and Zhang, Q. and Lindsey, K. and Zhang, X. (2017) 'Asymmetric subgenome selection and cis-regulatory divergence during cotton domestication.', Nature Genetics, 49 (4), pp. 579-587.

Further information on publisher's website: https://doi.org/10.1038/ng.3807


Asymmetric subgenome selection and cis-regulatory divergence during cotton domestication

Maojun Wang, Lili Tu, Min Lin¹,², Zhongxu Lin, Pengcheng Wang, Qingyong Yang¹,², Lin Zhang, Zhengxiu Ye, Chao Shen, Jianying Li, Kai Guo, Xiaolin Zhou, Xinhui Nie, Zhonghua Li, Yizan Ma, Cong Huang, Shuangxia Jin, Longfu Zhu, Xiyan Yang, Ling Min, Daojun Yuan, Qinghua Zhang, Keith Lindsey & Xianlong Zhang

¹ National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, Hubei, China.
² Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan 430070, Hubei, China.
³ Key Laboratory of Oasis Eco-agriculture of the Xinjiang Production and Construction Corps, College of Agronomy, Shihezi University, Shihezi, Xinjiang, China.
⁴ College of Plant Science and Technology, Huazhong Agricultural University, Wuhan 430070, Hubei, China.
⁵ Department of Biosciences, Durham University, Durham DH1 3LE, United Kingdom.

Correspondence should be addressed to X.Z. (xlzhang@mail.hzau.edu.cn) or K.L. (keith.lindsey@durham.ac.uk)

Tel: +86-27-87280510

Fax: +86-27-87280196

Comparative population genomics offers an excellent opportunity for unravelling the genetic history of crop domestication. Upland cotton (Gossypium hirsutum) has long been an important economic crop, but a genome-wide and evolutionary understanding of the effects of human selection is largely unresolved. Here, we describe an integrated variation map for 352 wild and domesticated cotton accessions. This has allowed us to scan 93 domestication sweeps and identify 19 candidate loci for fiber quality-related traits by a genome-wide association study. We provide evidence to show asymmetric subgenome domestication for directional selection of long white fibers. Global analyses of DNase I-hypersensitive sites and 3-dimensional genome architecture, linking functional variants to gene transcription, reveal the effects of domestication on cis-regulatory divergence. This study provides new insights into the evolution of gene organization, regulation and adaptation in a major crop, and represents a rich resource for genome-based cotton improvement.

Early human domestication of wild plants represented the first step in the development of modern crop varieties, and migration and differential directional selection over millennia has contributed to the adaptation of species in different environments for improved yield and quality traits. In the current genomic era, high-throughput ‘omics’ technologies provide significant opportunities for a detailed analysis of genetic change through domestication and for new, targeted and precise genome-based crop breeding strategies [2,3].

Cotton is one of the most important economic crops in the world, both as a source of natural and renewable fiber for textiles, and as a source of seed oil and protein. Allotetraploid Upland cotton is formed from an inter-genomic hybridization event approximately 1–2 million years ago. Originally native to the Yucatan peninsula in Mesoamerica, it was first domesticated at least 4,000 to 5,000 years ago, with subsequent directional selection. Modern varieties of cultivated cotton produce spinnable fine white fiber, which is preferable to the sparser, coarse brown fiber of wild cotton. Previous molecular studies have shown that domestication has dramatically rewired the transcriptome during fiber development [7,8]. What remains largely unknown, however, is the effect of human selection on the organization of the cotton genome and its gene regulatory landscape. Using as a comparator the recently published genome sequence of Texas Marker-1 (TM-1) [9,10], we can address this question through a comprehensive population genome analysis of multiple wild and cultivated cotton genotypes.

RESULTS

A genome variation map for cotton

To construct an integrated variation map of Upland cotton, we collected a total of 352 diverse accessions for genomic sequence analysis. These included 31 wild accessions and 321 cultivated accessions from around the world (Fig. 1a and Supplementary Table 1). A total of 6.1 Tb of sequence data were integrated, with an average depth of 6.9× (Supplementary Table 1). These data were mapped against the TM-1 genome [9] to identify genomic variants. We detected a total of 7,497,568 SNPs, 351,013 small indels (shorter than 10 bp) and 93,786 structural variants (SVs) (Table 1, Supplementary Fig. 1 and Supplementary Tables 2-4). The accuracy of SNPs was estimated to be 98.2%, determined by Sanger sequencing of 300 randomly selected SNPs in 3 individual accessions. In addition, we selected 50 representative accessions (10 wild and 40 cultivated cottons) from the 352 accessions for RNA sequencing (Supplementary Table 5), and generated 78,728 SNPs, of which more than 93.6% overlapped with SNPs from re-sequencing data. This integrated variation data set represents a new resource for cotton genetics and breeding.

Cotton population properties and linkage disequilibrium

We explored the phylogenetic relationship between the 352 cotton accessions using a whole-genome SNP analysis. These cottons can be divided into 3 groups (Fig. 1b and Supplementary Fig. 2), as supported by a principal component analysis (PCA; Fig. 1c). Wild cotton accessions cluster together (Group-I; the Wild group) except for a few accessions which cluster into a second group (Group-II; the ABI group), which mainly comprises cottons from America, Brazil and India. The third group (Group-III; the Chinese group) mostly consists of cotton cultivars in China, which were collected from the major Chinese cotton cultivation regions: the Northwestern Inland Region (NIR), the Northern Specific Early Maturation Region (NSEMR), the Yellow River Region (YRR) and the Yangtze River Region (YtRR). This group could be further classified into two subclades (Group-III-1 and Group-III-2; Fig. 1b), which exhibit different geographic distribution patterns. The subclade Group-III-1 is represented by cotton accessions from northern China (NIR and NSEMR), while Group-III-2 includes the majority of accessions from southern China (YtRR). We observed that a few cotton accessions, which were collected from North America, clustered into Group-III, which might be due to the introduction of Upland cotton to China from America during the first thirty years of the 20th century.

Crop species may experience population bottlenecks during domestication. To examine this possibility in cotton, genetic diversity for each group was measured by calculating π values. We found that genetic diversity decreased from the Wild cotton group (π = 1.32 × 10⁻³; the A-subgenome (At, the lower case t denotes tetraploid), 1.36 × 10⁻³; the D-subgenome (Dt), 1.25 × 10⁻³) to the ABI group (π = 0.88 × 10⁻³; At, 0.96 × 10⁻³; Dt, 0.66 × 10⁻³) and to the Chinese group (π = 0.67 × 10⁻³; At, 0.72 × 10⁻³; Dt, 0.56 × 10⁻³) (Fig. 1d and Supplementary Fig. 3). This shows that a large amount of genetic variation in both subgenomes has been lost during cotton domestication, especially for the Dt. Compared with other major crops, cotton possesses narrow genetic diversity even within wild cotton accessions (Supplementary Table 6). To investigate population divergence, we calculated the population fixation statistics (F_ST) among groups (Fig. 1d). This reveals large population divergence between the Chinese group and the Wild group. Population divergence between the Chinese group and the ABI group was observed, suggesting that Upland cottons in China have undergone population divergence after their introduction.
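
For readers unfamiliar with π, the sketch below shows one standard way to compute it for biallelic sites (the average number of pairwise differences per site) and then uses the subgenome values quoted above to express how much diversity the Chinese group retains relative to the Wild group. The estimator shown is the textbook formula, not necessarily the software used in the study, and the function names and toy allele counts are hypothetical.

```python
def site_pi(alt_count, n_chrom):
    """Pairwise diversity at one biallelic site: 2k(n-k) / (n(n-1))."""
    if n_chrom < 2:
        raise ValueError("need at least two sampled chromosomes")
    return 2.0 * alt_count * (n_chrom - alt_count) / (n_chrom * (n_chrom - 1))


def window_pi(alt_counts, n_chrom, window_len):
    """Per-site pi for a window; invariant sites contribute zero."""
    return sum(site_pi(k, n_chrom) for k in alt_counts) / window_len


# Toy window: 3 variable sites in 1,000 bp, 10 sampled chromosomes.
toy_pi = window_pi([1, 5, 2], n_chrom=10, window_len=1000)
print(f"toy pi = {toy_pi:.2e}")

# Subgenome pi values (x 10^-3) as reported in the text above.
reported = {
    "Wild": {"At": 1.36, "Dt": 1.25},
    "ABI": {"At": 0.96, "Dt": 0.66},
    "Chinese": {"At": 0.72, "Dt": 0.56},
}
retained = {
    sub: reported["Chinese"][sub] / reported["Wild"][sub] for sub in ("At", "Dt")
}
for sub, frac in retained.items():
    print(f"{sub}: Chinese group retains {frac:.1%} of Wild-group pi")
```

On these figures the Dt retains a smaller share of wild diversity than the At, which is the sense in which the loss is described as greater for the Dt.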

Linkage disequilibrium (LD; indicated by r²) was found to drop with physical distance between SNPs in all cotton groups (Fig. 1e). The LD extent for each group was measured as the chromosomal distance when LD dropped to half of its maximum


Frequently Asked Questions (1)

Q1. What are the contributions in this paper?

Wang et al. describe a variation map for 352 wild and domesticated cotton accessions and provide evidence for asymmetric subgenome selection and cis-regulatory divergence during cotton domestication.