
Asymmetric subgenome selection and cis-regulatory divergence during cotton domestication

TL;DR: A variation map for 352 wild and domesticated cotton accessions is described, together with evidence of asymmetric subgenome domestication under directional selection for long fibers, giving new insights into the evolution of gene organization, regulation and adaptation in a major crop.
Abstract: Comparative population genomics offers an excellent opportunity for unraveling the genetic history of crop domestication. Upland cotton (Gossypium hirsutum) has long been an important economic crop, but a genome-wide and evolutionary understanding of the effects of human selection is lacking. Here, we describe a variation map for 352 wild and domesticated cotton accessions. We scanned 93 domestication sweeps occupying 74 Mb of the A subgenome and 104 Mb of the D subgenome, and identified 19 candidate loci for fiber-quality-related traits through a genome-wide association study. We provide evidence showing asymmetric subgenome domestication for directional selection of long fibers. Global analyses of DNase I-hypersensitive sites and 3D genome architecture, linking functional variants to gene transcription, demonstrate the effects of domestication on cis-regulatory divergence. This study provides new insights into the evolution of gene organization, regulation and adaptation in a major crop, and should serve as a rich resource for genome-based cotton improvement.

Summary (5 min read)

A genome variation map for cotton

  • These included 31 wild accessions and 321 cultivated accessions from around the world (Fig. 1a and Supplementary Table 1).
  • These data were mapped against the TM-1 genome [9] to identify genomic variants.
  • In addition, the authors selected 50 representative accessions (10 wild and 40 cultivated cottons) from the 352 accessions for RNA sequencing (Supplementary Table 5), and generated 78,728 SNPs, of which more than 93.6% overlapped with SNPs from re-sequencing data.
  • This integrated variation data set represents a new resource for cotton genetics and breeding.

Cotton population properties and linkage disequilibrium

  • The authors explored the phylogenetic relationship between the 352 cotton accessions using a whole-genome SNP analysis.
  • This group could be further classified into two subclades (Group-III-1 and Group-III-2; Fig. 1b), which exhibit different geographic distribution patterns.
  • The subclade Group-III-1 is represented by cotton accessions from northern China (NIR and NSEMR), while Group-III-2 includes the majority of accessions from southern China (YtRR).
  • This shows that a large amount of genetic variation in both subgenomes has been lost during cotton domestication, especially for the Dt. Compared with other major crops, cotton possesses narrow genetic diversity even within wild cotton accessions (Supplementary Table 6).
  • This reveals large population divergence between the Chinese group and the Wild group.

Selection signals during cotton domestication

  • Millennia of domestication have brought many morphological transformations to cotton, including an annualized growth cycle, photoperiod insensitivity, loss of seed dormancy, and superior spinnable white fiber [7, 8].
  • The authors investigated nucleotide diversity of genes residing in the 25 QTL hotspots to identify putative loci with selection signals underlying these domestication-related traits.
  • Strikingly, 19 of 25 QTL hotspots with 327 genes were located in the Dt. Fiber quality improvement has been one of the most important breeding goals during cotton domestication.
  • Among these associations, 16 signals were previously uncharacterized.
  • The authors also identified a GWAS signal associated with fiber elongation rate on chromosome D04 (Fig. 2g), where a gibberellin response gene is located.

Asymmetric subgenome domestication for long white fiber

  • Most fiber characteristics in wild Upland cotton were probably inherited directly from its wild A-genome diploid ancestor post-allopolyploidization [30], while fiber color is similar to that of its D-genome diploid ancestor.
  • The development of the long white fiber trait in cultivated Upland cotton is the result of millennia of strong directional selection from its wild counterpart.
  • The genetic basis of this developmental change remains largely unknown.
  • To understand the relative contributions of the co-existing At and Dt genomes during domestication, the authors constructed ancestral pseudochromosomes to address this question at the subgenome level.
  • By comparing overlaps with domestication signals, the authors identified 620 homoeologous pairs that have been subject to domestication selection in the At or Dt (192 in the At and 428 in the Dt), and only 34 homoeologous pairs with selection signals in both subgenomes (Supplementary Fig. 6).
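
The asymmetry in these counts can be made concrete with a little arithmetic. The sketch below takes the two counts quoted above (192 and 428) and computes each subgenome's share, plus an exact two-sided binomial test against an even split. The test is an illustrative assumption added here, not an analysis reported in the source, and the function names are hypothetical.

```python
from fractions import Fraction
from math import comb

# Counts quoted in the text: homoeologous pairs with a selection signal
# in only one of the two subgenomes.
AT_ONLY = 192
DT_ONLY = 428


def subgenome_shares(at_only: int, dt_only: int) -> tuple[float, float]:
    """Return the fraction of single-subgenome signals found in At and in Dt."""
    total = at_only + dt_only
    return at_only / total, dt_only / total


def two_sided_binomial_p(k: int, n: int) -> float:
    """Exact two-sided P value for k successes in n trials under p = 0.5.

    Assumption for illustration only: the null hypothesis is that a selected
    pair is equally likely to carry its signal in either subgenome.
    """
    smaller = min(k, n - k)
    tail = sum(comb(n, i) for i in range(smaller + 1))
    # The null distribution is symmetric, so double one tail and cap at 1.
    return float(min(Fraction(1), Fraction(2 * tail, 2 ** n)))


at_share, dt_share = subgenome_shares(AT_ONLY, DT_ONLY)
p_value = two_sided_binomial_p(DT_ONLY, AT_ONLY + DT_ONLY)
print(f"At share = {at_share:.3f}, Dt share = {dt_share:.3f}, P = {p_value:.2e}")
```

Under these assumptions roughly 69% of the single-subgenome signals fall in the Dt, which is the imbalance the section title refers to.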

Effects of domestication on cis-regulatory elements in promoters

  • Human selection of desirable agronomic traits not only affects the organization of functional genes, but may also reshape the gene regulatory landscape.
  • Specifically, intergenic non-coding variants can affect the activity of cis-regulatory elements (CREs) [39–41], and can contribute to differential gene expression patterns between populations (Supplementary Fig. 7).
  • As predicted, the patterns of chromatin modification marks in cotton are different between genic and TE regions (Supplementary Fig. 10).
  • To investigate how variants in promoter DHSs might influence the expression of genes, the authors looked for associations between variants and transcription factor binding motifs.
  • The authors found that some well-known transcription factor binding motifs were under purifying selection in the cultivated groups, and some were under positive selection (Fig. 4i and Supplementary Table 14).

Genome variation underlies distant regulatory divergence

  • A range of high-throughput methods, such as high-throughput chromosome conformation capture (Hi-C) and chromatin interaction analysis by paired-end tag sequencing (ChIA-PET), have been developed to understand 3D genome architecture in the eukaryotic nucleus [47, 48].
  • The authors generated 1.1 billion Hi-C paired-end reads, of which ca. 322 million were valid interaction reads (Supplementary Table 15).
  • The authors found that chromatin interactions are significantly enriched at promoters, distal DHSs such as enhancers and at regions marked by the active chromatin mark H3K4me3, but are less frequent at regions marked by H3K9me2 (Fig. 5d).
  • DNase I digestion of chromatin on a representative wild cotton accession revealed that more than 94% of enhancers are shared in wild and domesticated cottons (Fig. 5j), suggesting that domestication has had a limited effect on qualitative changes to enhancers.

DISCUSSION

  • Genome re-sequencing of 352 accessions of Upland cotton has provided new insights into the genetic history of this important crop.
  • Interestingly, the authors found no obvious population divergence between geographic groups in China, probably because of frequent migration of accessions for improvement breeding within a short period after introduction.
  • The authors primarily characterized some key molecular signatures of selection responsible for spinnable fine white fiber, of which some candidates were further identified by a GWAS analysis.

METHODS

  • The horizontal grey dashed lines in b-d and f-i show the significance threshold of GWAS (1/n; 6.3).
  • The other significant associations are presented in Supplementary Table 10.

Plant materials and re-sequencing

  • Based on the population structure analysis, a core germplasm set of 282 accessions was determined (Supplementary Table 1).
  • Cotton plants were cultivated in the greenhouse in Wuhan, China.
  • Young leaves were collected 4 weeks after planting and immediately frozen in liquid nitrogen until use.
  • Genomic DNA was extracted from leaves using the CTAB method [56].
  • Paired-end sequencing (PE 150-bp reads) of each library was performed on the Illumina HiSeq X Ten system.

Mapping and variation calling

  • The allotetraploid cotton genome (Gossypium hirsutum L. acc. TM-1) and its annotation [9] were downloaded from the Internet (see URLs).
  • Scaffolds with lengths less than 1000 bp were excluded from further analysis.
  • Paired-end re-sequencing reads were mapped to the TM-1 genome using BWA software with the default parameters.
  • To obtain high-quality SNPs and indels, only variation detected by both software tools with sequencing depth of at least 8 was retained for further analysis.
  • Indels in exons were classified according to whether they lead to a frame-shift effect.
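
The "detected by both tools, depth of at least 8" rule can be sketched as a set intersection followed by a depth filter. The two callers are not named in this summary, so the inputs below are generic, and the variant key, the function name and the choice to use the lower of the two reported depths are all assumptions made for illustration.

```python
from typing import Dict, Tuple

# Hypothetical variant key: (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]

MIN_DEPTH = 8  # depth cutoff stated in the text


def consensus_variants(
    calls_a: Dict[Variant, int],
    calls_b: Dict[Variant, int],
    min_depth: int = MIN_DEPTH,
) -> Dict[Variant, int]:
    """Keep variants reported by both callers whose depth passes the cutoff.

    Each input maps a variant key to the read depth that caller reported.
    The lower of the two depths is used, which is a conservative assumption.
    """
    shared = calls_a.keys() & calls_b.keys()
    kept: Dict[Variant, int] = {}
    for variant in sorted(shared):
        depth = min(calls_a[variant], calls_b[variant])
        if depth >= min_depth:
            kept[variant] = depth
    return kept


# Toy call sets (made-up values, for illustration only).
caller_a = {("A01", 100, "A", "G"): 12, ("A01", 200, "C", "T"): 5, ("D04", 50, "G", "GA"): 9}
caller_b = {("A01", 100, "A", "G"): 10, ("A01", 200, "C", "T"): 9}
high_quality = consensus_variants(caller_a, caller_b)
```

In this toy input only the first variant survives: the second is seen by both callers but falls below the depth cutoff in one of them, and the third is reported by a single caller.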

Prediction of structural variation

  • Structural variations (SVs) were identified using three software tools: Breakdancer (version 1.3.6) [61], Delly (version 2) [62] and laSV (version 1.0.3) [63], which integrate most existing methods (read-depth, read-pair, split-reads and de novo assembly of sequencing reads) for SV discovery.
  • Breakdancer was run on all cotton accessions using the BWA alignment with the parameters (-q 20 -y 30).
  • Delly, which uses paired-end mapping and a split-read method to discover SVs in the genome, was run separately for each sample using default settings.
  • LaSV, which first performs a reference-free de novo assembly of the sequencing reads and then compares the assembled contigs with the reference genome to identify SVs, was run separately for each sample using parameters (-k 75 -l 150 -s 20).
  • SVs (deletion, duplication, insertion and inversion) were retained if supported by at least two methods with a mapping depth of more than 10×.
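
The final retention rule (support from at least two methods, depth above 10×) could be sketched as below. The source does not say how calls from different tools were matched, so the reciprocal-overlap criterion and its 0.5 cutoff are assumptions, as are the `SV` record layout and function names. The sketch also applies the depth filter only to the call being evaluated, and zero-length calls such as point insertions would need a separate matching rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SV:
    caller: str    # e.g. "breakdancer", "delly", "lasv"
    sv_type: str   # "DEL", "DUP", "INS" or "INV"
    chrom: str
    start: int
    end: int
    depth: float   # mapping depth supporting the call


def reciprocal_overlap(a: SV, b: SV) -> float:
    """Overlap length divided by the longer of the two intervals."""
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return overlap / max(a.end - a.start, b.end - b.start)


def supported_svs(calls: List[SV], min_callers: int = 2,
                  min_depth: float = 10.0, min_overlap: float = 0.5) -> List[SV]:
    """Return every call that is matched by enough distinct callers.

    Matching calls are returned individually; merging them into one
    consensus interval is left out of this sketch.
    """
    kept = []
    for sv in calls:
        if sv.depth <= min_depth:  # the text requires a depth of more than 10x
            continue
        callers = {sv.caller}
        for other in calls:
            if (other.caller != sv.caller
                    and other.sv_type == sv.sv_type
                    and other.chrom == sv.chrom
                    and reciprocal_overlap(sv, other) >= min_overlap):
                callers.add(other.caller)
        if len(callers) >= min_callers:
            kept.append(sv)
    return kept


# Toy calls (made-up coordinates, for illustration only).
toy_calls = [
    SV("breakdancer", "DEL", "A01", 1000, 2000, 15.0),
    SV("delly", "DEL", "A01", 1100, 2050, 12.0),
    SV("lasv", "INV", "D05", 500, 900, 20.0),
    SV("delly", "DEL", "A01", 5000, 6000, 8.0),
]
confirmed = supported_svs(toy_calls)
```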

Population-genetic analyses

  • To conduct the phylogenetic analysis, SNPs of all accessions were filtered at a minor allele frequency (MAF) threshold of 0.05.
  • These SNPs were used to construct a neighbour-joining tree using PHYLIP software [64] and visualized using the online tool iTOL (see URLs).
  • Principal component analysis (PCA) was performed using this SNP set with the smartpca program embedded in the EIGENSOFT package [65].
  • Population structure was analyzed using the Structure program, which infers the population structure by identifying different numbers of clusters (K) [66].
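
As a minimal sketch of the MAF filter mentioned in the first bullet: genotypes are assumed to be coded as alternate-allele counts (0, 1 or 2) with `None` for missing calls, and SNPs are kept when MAF is at least 0.05. The coding scheme, the direction of the comparison and the function names are assumptions, since the summary gives only the 0.05 value.

```python
from typing import Dict, List, Optional

MAF_CUTOFF = 0.05  # value given in the text; the comparison direction is assumed


def minor_allele_frequency(genotypes: List[Optional[int]]) -> Optional[float]:
    """MAF from diploid genotypes coded as alt-allele counts (0, 1, 2).

    Missing calls are given as None and ignored. Returns None when no
    genotype is called.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def filter_by_maf(snps: Dict[str, List[Optional[int]]],
                  cutoff: float = MAF_CUTOFF) -> Dict[str, float]:
    """Return {snp_id: maf} for SNPs whose MAF is at least the cutoff."""
    kept: Dict[str, float] = {}
    for snp_id, genotypes in snps.items():
        maf = minor_allele_frequency(genotypes)
        if maf is not None and maf >= cutoff:
            kept[snp_id] = maf
    return kept


# Toy genotype matrix (made-up values, for illustration only).
toy_snps = {
    "snp1": [0, 0, 1, 2, None],  # alt frequency 3/8
    "snp2": [0] * 20,            # monomorphic, removed
    "snp3": [2, 2, 2, 1],        # alt frequency 7/8, so MAF is 1/8
}
common_snps = filter_by_maf(toy_snps)
```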

Identification of domestication sweeps

  • For domestication sweep analysis, the authors combined cultivated cotton groups (ABI and Chinese groups) into a single group to exclude the potential effect of genetic drift.
  • The genetic diversity in the wild group was compared with that in the cultivated group. Windows exceeding an empirical F_ST cutoff (top 5%) were regarded as highly differentiated regions.
  • These regions were compared with the analysis of domestication sweeps.
  • Genes with nonsynonymous SNPs in these regions were selected as under selective pressure across groups.
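
The empirical top-5% cutoff can be sketched as a simple ranking of per-window F_ST values. How F_ST was estimated and what window size was used are not given in this summary, so the input here is just a precomputed mapping from a hypothetical window key to its F_ST value.

```python
import math
from typing import Dict, List, Tuple

Window = Tuple[str, int, int]  # hypothetical key: (chromosome, start, end)


def top_fraction_windows(fst_by_window: Dict[Window, float],
                         fraction: float = 0.05) -> List[Window]:
    """Return the windows whose F_ST ranks in the top `fraction` genome-wide."""
    if not fst_by_window:
        return []
    # Small epsilon guards against floating-point noise in fraction * n.
    n_keep = max(1, math.ceil(fraction * len(fst_by_window) - 1e-9))
    ranked = sorted(fst_by_window, key=fst_by_window.get, reverse=True)
    return sorted(ranked[:n_keep])


# Toy data: 40 windows with increasing F_ST (made-up values).
toy_fst = {("A01", i * 1000, i * 1000 + 999): i / 100 for i in range(40)}
differentiated = top_fraction_windows(toy_fst)
```

With 40 windows, the top 5% corresponds to the two highest-F_ST windows. An empirical cutoff of this kind always flags the same proportion of the genome, so it ranks candidates rather than testing them.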

Genome-wide association studies for fiber quality-related traits

  • The traits include fiber length, fiber strength, micronaire value, fiber uniformity and fiber elongation rate.
  • The significant association threshold was set as 1/n (n, total SNP number).
  • The significant association regions were manually checked from the aligned re-sequencing reads against the TM-1 genome using SAMtools [59].
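
The 1/n threshold translates directly into a line on a Manhattan plot. In the sketch below, the SNP count of 2,000,000 is purely illustrative (the number of SNPs used for GWAS is not stated in this summary); it is chosen because it gives a value near the 6.3 quoted in the figure legend, assuming that 6.3 is on the -log10(P) scale. The strict `<` comparison is also an assumption.

```python
import math


def gwas_threshold(n_snps: int) -> float:
    """P-value threshold defined as 1/n, where n is the total SNP number."""
    if n_snps <= 0:
        raise ValueError("n_snps must be positive")
    return 1.0 / n_snps


def neg_log10(p_value: float) -> float:
    """Convert a P value to the -log10 scale used on Manhattan plots."""
    return -math.log10(p_value)


def is_significant(p_value: float, n_snps: int) -> bool:
    """True when the association P value is below the 1/n threshold."""
    return p_value < gwas_threshold(n_snps)


# Illustrative SNP count only; not a number taken from the source.
ILLUSTRATIVE_N = 2_000_000
threshold_line = neg_log10(gwas_threshold(ILLUSTRATIVE_N))
print(f"threshold P = {gwas_threshold(ILLUSTRATIVE_N):.1e}, -log10 = {threshold_line:.2f}")
```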

Construction of ancestral karyotypes

  • To analyze selection signals at the subgenome level, the authors constructed the ancestral karyotype for each of the 13 chromosomes in putative diploid ancestors.
  • Homoeologous synteny blocks were identified in the 13 chromosome pairs between the At and the Dt subgenomes using MCScanX with default settings [70].
  • Syntenic gene pairs were identified in these syntenic blocks containing more than five aligned genes.
  • A reciprocal blastp was run using gene sequences from the At and Dt subgenomes.
  • Genomic sequences consisting of gene regions and their flanking 2 kb sequences were ordered based on the Dt subgenome and concatenated to construct ancestral karyotypes.
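
Two of the steps above lend themselves to a short sketch: keeping syntenic blocks with more than five aligned genes, and pairing homoeologs by reciprocal best hit. The summary says only that a reciprocal blastp was run, so treating it as a reciprocal-best-hit filter is an assumption, and the data structures and gene names below are hypothetical stand-ins for parsed MCScanX and BLAST output.

```python
from typing import Dict, List, Tuple

MIN_BLOCK_GENES = 5  # blocks must contain more than this many aligned genes

GenePairs = List[Tuple[str, str]]  # (At gene, Dt gene) pairs within one block


def large_blocks(blocks: Dict[str, GenePairs],
                 min_genes: int = MIN_BLOCK_GENES) -> Dict[str, GenePairs]:
    """Keep syntenic blocks that contain more than `min_genes` aligned genes."""
    return {bid: pairs for bid, pairs in blocks.items() if len(pairs) > min_genes}


def reciprocal_best_hits(at_to_dt: Dict[str, Dict[str, float]],
                         dt_to_at: Dict[str, Dict[str, float]]) -> GenePairs:
    """Pairs (At gene, Dt gene) that are each other's top-scoring hit.

    Each input maps a query gene to {subject gene: alignment score}.
    """
    pairs = []
    for at_gene, hits in at_to_dt.items():
        if not hits:
            continue
        best_dt = max(hits, key=hits.get)
        back = dt_to_at.get(best_dt, {})
        if back and max(back, key=back.get) == at_gene:
            pairs.append((at_gene, best_dt))
    return sorted(pairs)


# Toy inputs (made-up gene names and scores, for illustration only).
toy_blocks = {
    "block1": [(f"At_g{i}", f"Dt_g{i}") for i in range(6)],
    "block2": [(f"At_h{i}", f"Dt_h{i}") for i in range(5)],
}
toy_at_to_dt = {"At_g1": {"Dt_g1": 950.0, "Dt_g2": 400.0}, "At_g2": {"Dt_g1": 500.0}}
toy_dt_to_at = {"Dt_g1": {"At_g1": 940.0, "At_g2": 510.0}}

kept_blocks = large_blocks(toy_blocks)
homoeolog_pairs = reciprocal_best_hits(toy_at_to_dt, toy_dt_to_at)
```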

RNA-seq and data analysis

  • Cotton leaves were sampled for gene expression analysis at the same developmental stage as for DNA re-sequencing.
  • A total of 2 µg RNA was used for library construction using the Illumina TruSeq RNA Kit (Illumina, San Diego, CA, USA) following the manufacturer's instructions.
  • RNA sequencing was performed on the Illumina HiSeq 3000 system (paired-end 150-bp reads).
  • The expression level of each gene was determined using Cufflinks (version 2.2.1) with a multi-read and fragment bias correction method [73].

Bisulfite-treated DNA sequencing data analysis

  • The authors downloaded bisulfite-treated DNA sequencing data for leaf and fiber of TM-1 from the National Center for Biotechnology Information (NCBI) Sequence Read Archive collection (SRX710548-SRX710553).
  • Trimmomatic software was applied to clip sequencing adapters and filter low-quality reads [74].
  • The Bismark methylation extractor program was run to extract potentially methylated cytosines.
  • Cytosines in CG, CHG and CHH contexts covered by at least three sequencing reads were retained for a binomial test (P-value cutoff 1e-5).
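
The coverage filter and binomial test in the last bullet can be sketched with the standard library. The null probability for the test (typically a bisulfite non-conversion or sequencing error rate) is not given in this summary, so `ASSUMED_ERROR_RATE` is a placeholder value, and the one-sided upper-tail form of the test is likewise an assumption.

```python
from math import comb
from typing import Optional

MIN_COVERAGE = 3           # minimum reads per cytosine, as stated in the text
P_CUTOFF = 1e-5            # P-value cutoff stated in the text
ASSUMED_ERROR_RATE = 0.005 # hypothetical null rate; not taken from the source


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


def is_methylated(meth_reads: int, total_reads: int,
                  error_rate: float = ASSUMED_ERROR_RATE) -> Optional[bool]:
    """Call a cytosine methylated when the binomial P value is below the cutoff.

    Returns None when coverage is below MIN_COVERAGE, meaning the site is
    not tested at all.
    """
    if total_reads < MIN_COVERAGE:
        return None
    return binomial_upper_tail(meth_reads, total_reads, error_rate) < P_CUTOFF
```

For example, under the assumed error rate a site with 8 methylated reads out of 10 is called methylated, a site with 1 out of 10 is not, and a site covered by only 2 reads is skipped.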

DNase-seq and DHS identification

  • Purified DNA fragments of between 100 bp and 200 bp following DNase I digestion were isolated with a Pippin HT (Sage Science, Beverly, MA, USA).
  • A total of 10 ng of the isolated fragments was used for library construction using the Illumina TruSeq Sample Prep Kit.
  • Libraries were sequenced using the Illumina HiSeq 2000 system (paired-end 100-bp reads).
  • The unique mapping data were processed to identify DNase I hypersensitive sites (DHSs).
  • MACS (version 1.4.2) [80], another peak-calling algorithm, was also run to identify DHSs.

Motif discovery

  • The promoter DHSs were screened for transcription factor (TF) binding motifs using the findMotifsGenome.pl program in HOMER software (see URLs) [82], with the parameters '-size given -len 8,10,12 -chopify -mset plants'.
  • The 2 kb upstream sequences of genes were used for motif discovery by the Patch 1.0 program, which searches the TRANSFAC Public 6.0 database (see URLs), with the following parameters: 1) the minimum length of sites was 8; 2) the maximum number of mismatches was 1; 3) the mismatch penalty was 100; 4) the lower score boundary was 87.5.

ChIP-Seq and data analysis

  • For each sample, a total of 10 ng ChIP DNA and Input control DNA were used for library construction using the Illumina TruSeq Sample Prep Kit, according to the manufacturer's instructions.
  • ChIP libraries were sequenced on the Illumina HiSeq 3000 system (paired-end 150-bp reads).
  • After removing PCR duplicates and multi-mapping reads, the unique mapping data were used to call histone modification peaks using MACS software (version 2.1.0) [80].
  • The Input DNA sequencing data was used as a control.

Hi-C experiments and sequencing

  • The clean samples were ground to powder in liquid nitrogen.
  • DNA ends were labelled with biotin, incubated at 37°C for 45 min, and enzyme was inactivated with 20% SDS solution.
  • After ligation, proteinase K was added to reverse cross-linking by incubation at 65°C overnight.
  • Hi-C libraries were sequenced on the Illumina HiSeq 3000 system.
  • The Hi-C experiment was carried out as two biological replicates.

Hi-C data analysis

  • Raw Hi-C data were processed to filter low-quality reads and trim adapters using Trimmomatic (version 0.32) [74].
  • Read pairs that did not map close to a restriction site, or were not within the expected fragment size following shearing, were first filtered.
  • The remaining valid read pairs were divided into intra-chromosomal pairs and inter-chromosomal pairs.
  • Results from the second pass after an initial fit were used for further analysis.
  • Fragments overlapping with intergenic DHSs or promoters were extracted to construct a regulatory interactome.
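
The division of valid read pairs into intra- and inter-chromosomal sets is a simple comparison of the two mate chromosomes. The sketch below assumes read pairs have already been filtered and are available as hypothetical `(chrom1, pos1, chrom2, pos2)` tuples; it also reproduces the valid-pair fraction implied by the approximate totals quoted earlier (about 322 million valid out of 1.1 billion).

```python
from collections import Counter
from typing import Iterable, Tuple

ReadPair = Tuple[str, int, str, int]  # hypothetical layout: (chrom1, pos1, chrom2, pos2)


def split_pairs(pairs: Iterable[ReadPair]) -> Counter:
    """Count valid pairs whose mates map to the same or different chromosomes."""
    counts: Counter = Counter()
    for chrom1, _pos1, chrom2, _pos2 in pairs:
        counts["intra" if chrom1 == chrom2 else "inter"] += 1
    return counts


# Toy read pairs (made-up coordinates, for illustration only).
toy_pairs = [
    ("A01", 1_000, "A01", 250_000),
    ("A01", 5_000, "D01", 7_500),
    ("D05", 10_000, "D05", 12_000),
]
pair_counts = split_pairs(toy_pairs)

# Fraction of valid interaction reads implied by the approximate totals in the text.
valid_fraction = 322e6 / 1.1e9
print(pair_counts, f"valid fraction = {valid_fraction:.3f}")
```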


Durham Research Online
Deposited in DRO:
07 March 2017
Version of attached file:
Accepted Version
Peer-review status of attached file:
Peer-reviewed
Citation for published item:
Wang, M. and Tu, L. and Lin, M. and Lin, Z. and Wang, P. and Yang, Q. and Ye, Z. and Shen, C. and Zhou,
X. and Zhang, L. and Li, J. and Nie, X. and Li, Z. and Guo, K. and Ma, Y. and Jin, S. and Zhu, L. and Yang,
X. and Min, L. and Zhang, Q. and Lindsey, K. and Zhang, X. (2017) 'Asymmetric subgenome selection and
cis-regulatory divergence during cotton domestication.', Nature genetics., 49 (4). pp. 579-587.
Further information on publisher's website:
https://doi.org/10.1038/ng.3807
Publisher's copyright statement:
Additional information:
Use policy
The full-text may be used and/or reproduced, and given to third parties in any format or medium, without prior permission or charge, for
personal research or study, educational, or not-for-profit purposes provided that:
a full bibliographic reference is made to the original source
a link is made to the metadata record in DRO
the full-text is not changed in any way
The full-text must not be sold in any format or medium without the formal permission of the copyright holders.
Please consult the full DRO policy for further details.
Durham University Library, Stockton Road, Durham DH1 3LY, United Kingdom
Tel : +44 (0)191 334 3042 | Fax : +44 (0)191 334 2971
https://dro.dur.ac.uk

Asymmetric subgenome selection and cis-regulatory divergence during cotton domestication

Maojun Wang¹, Lili Tu¹, Min Lin¹,², Zhongxu Lin¹, Pengcheng Wang¹, Qingyong Yang¹,², Lin Zhang¹, Zhengxiu Ye¹, Chao Shen¹, Jianying Li¹, Kai Guo¹, Xiaolin Zhou¹, Xinhui Nie³, Zhonghua Li¹, Yizan Ma¹, Cong Huang¹, Shuangxia Jin¹, Longfu Zhu¹, Xiyan Yang⁴, Ling Min⁴, Daojun Yuan⁴, Qinghua Zhang¹, Keith Lindsey⁵ & Xianlong Zhang¹

¹ National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, Hubei, China.
² Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan 430070, Hubei, China.
³ Key Laboratory of Oasis Eco-agriculture of the Xinjiang Production and Construction Corps, College of Agronomy, Shihezi University, Shihezi, Xinjiang, China.
⁴ College of Plant Science and Technology, Huazhong Agricultural University, Wuhan 430070, Hubei, China.
⁵ Department of Biosciences, Durham University, Durham DH1 3LE, United Kingdom.

Correspondence should be addressed to X.Z. (xlzhang@mail.hzau.edu.cn) or K.L. (keith.lindsey@durham.ac.uk)

Tel: +86-27-87280510
Fax: +86-27-87280196

Comparative population genomics offers an excellent opportunity for unravelling the genetic history of crop domestication. Upland cotton (Gossypium hirsutum) has long been an important economic crop, but a genome-wide and evolutionary understanding of the effects of human selection is largely unresolved. Here, we describe an integrated variation map for 352 wild and domesticated cotton accessions. This has allowed us to scan 93 domestication sweeps and identify 19 candidate loci for fiber quality-related traits by a genome-wide association study. We provide evidence to show asymmetric subgenome domestication for directional selection of long white fibers. Global analyses of DNase I-hypersensitive sites and 3-dimensional genome architecture, linking functional variants to gene transcription, reveal the effects of domestication on cis-regulatory divergence. This study provides new insights into the evolution of gene organization, regulation and adaptation in a major crop, and represents a rich resource for genome-based cotton improvement.

Early human domestication of wild plants represented the first step in the development of modern crop varieties, and migration and differential directional selection over millennia has contributed to the adaptation of species in different environments for improved yield and quality traits [1]. In the current genomic era, high-throughput ‘omics’ technologies provide significant opportunities for a detailed analysis of genetic change through domestication and for new, targeted and precise genome-based crop breeding strategies [2,3].

Cotton is one of the most important economic crops in the world, both as a source of natural and renewable fiber for textiles, and as a source of seed oil and protein [4]. Allotetraploid Upland cotton is formed from an inter-genomic hybridization event approximately 1–2 million years ago [5]. Originally native to the Yucatan peninsula in Mesoamerica, it was first domesticated at least 4,000 to 5,000 years ago, with subsequent directional selection [6]. Modern varieties of cultivated cotton produce spinnable fine white fiber, which is preferable to the sparser, coarse brown fiber of wild cotton. Previous molecular studies have shown that domestication has dramatically rewired the transcriptome during fiber development [7,8]. What remains largely unknown, however, is the effect of human selection on the organization of the cotton genome and its gene regulatory landscape. Using as a comparator the recently published genome sequence of Texas Marker-1 (TM-1) [9,10], we can address this question through a comprehensive population genome analysis of multiple wild and cultivated cotton genotypes.

RESULTS
A genome variation map for cotton
To construct an integrated variation map of Upland cotton, we collected a total of 352 diverse accessions for genomic sequence analysis [11]. These included 31 wild accessions and 321 cultivated accessions from around the world (Fig. 1a and Supplementary Table 1). A total of 6.1 Tb of sequence data were integrated, with an average depth of 6.9× (Supplementary Table 1). These data were mapped against the TM-1 genome [9] to identify genomic variants. We detected a total of 7,497,568 SNPs, 351,013 small indels (shorter than 10 bp) and 93,786 structural variants (SVs) (Table 1, Supplementary Fig. 1 and Supplementary Tables 2-4). The accuracy of SNPs was estimated to be 98.2%, determined by Sanger sequencing of 300 randomly selected SNPs in 3 individual accessions. In addition, we selected 50 representative accessions (10 wild and 40 cultivated cottons) from the 352 accessions for RNA sequencing (Supplementary Table 5), and generated 78,728 SNPs, of which more than 93.6% overlapped with SNPs from re-sequencing data. This integrated variation data set represents a new resource for cotton genetics and breeding.

Cotton population properties and linkage disequilibrium
We explored the phylogenetic relationship between the 352 cotton accessions using a whole-genome SNP analysis. These cottons can be divided into 3 groups (Fig. 1b and Supplementary Fig. 2), as supported by a principal component analysis (PCA; Fig. 1c). Wild cotton accessions cluster together (Group-I; the Wild group) except for a few accessions which cluster into a second group (Group-II; the ABI group), which mainly comprises cottons from America, Brazil and India. The third group (Group-III; the Chinese group) mostly consists of cotton cultivars in China, which were collected from the major Chinese cotton cultivation regions: the Northwestern Inland Region (NIR), the Northern Specific Early Maturation Region (NSEMR), the Yellow River Region (YRR) and the Yangtze River Region (YtRR) [12]. This group could be further classified into two subclades (Group-III-1 and Group-III-2; Fig. 1b), which exhibit different geographic distribution patterns. The subclade Group-III-1 is represented by cotton accessions from northern China (NIR and NSEMR), while Group-III-2 includes the majority of accessions from southern China (YtRR). We observed that a few cotton accessions, which were collected from North America, clustered into Group-III, which might be due to the introduction of Upland cotton to China from America during the first thirty years of the 20th century [13].
Crop species may experience population bottlenecks during domestication [14]. To examine this possibility in cotton, genetic diversity for each group was measured by calculating π values. We found that genetic diversity decreased from the Wild cotton group (π = 1.32 × 10⁻³; the A-subgenome (At, the lower case t denotes tetraploid), 1.36 × 10⁻³; the D-subgenome (Dt), 1.25 × 10⁻³) to the ABI group (π = 0.88 × 10⁻³; At, 0.96 × 10⁻³; Dt, 0.66 × 10⁻³) and to the Chinese group (π = 0.67 × 10⁻³; At, 0.72 × 10⁻³; Dt, 0.56 × 10⁻³) (Fig. 1d and Supplementary Fig. 3). This shows that a large amount of genetic variation in both subgenomes has been lost during cotton domestication, especially for the Dt. Compared with other major crops, cotton possesses narrow genetic diversity even within wild cotton accessions (Supplementary Table 6). To investigate population divergence, we calculated the population fixation statistics (F_ST) among groups (Fig. 1d). This reveals large population divergence between the Chinese group and the Wild group. Population divergence between the Chinese group and the ABI group was observed, suggesting that Upland cottons in China have undergone population divergence after their introduction.
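
The statement that diversity loss is greater in the Dt follows from the π values quoted in this paragraph. The short sketch below computes the proportional reduction relative to the Wild group for each cultivated group; the π values are those reported in the text, while the dictionary layout and function name are illustrative choices.

```python
# Nucleotide diversity (pi) values quoted in the paragraph above.
PI = {
    "Wild":    {"all": 1.32e-3, "At": 1.36e-3, "Dt": 1.25e-3},
    "ABI":     {"all": 0.88e-3, "At": 0.96e-3, "Dt": 0.66e-3},
    "Chinese": {"all": 0.67e-3, "At": 0.72e-3, "Dt": 0.56e-3},
}


def diversity_loss(group: str, baseline: str = "Wild") -> dict:
    """Fraction of the baseline group's pi that is absent in `group`."""
    return {key: 1.0 - PI[group][key] / PI[baseline][key] for key in PI[group]}


for name in ("ABI", "Chinese"):
    loss = diversity_loss(name)
    print(name, {key: f"{value:.1%}" for key, value in loss.items()})
```

On these figures the ABI group retains about 71% of wild At diversity but only about 53% of wild Dt diversity, and the Chinese group shows the same direction of asymmetry (about 47% lost in the At against about 55% in the Dt).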
Linkage disequilibrium (LD; indicated by r²) was found to drop with physical distance between SNPs in all cotton groups (Fig. 1e). The LD extent for each group was measured as the chromosomal distance when LD dropped to half of its maximum

References
Journal ArticleDOI
TL;DR: SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant calling and alignment viewing, and thus provides universal tools for processing read alignments.
Abstract: Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments. Availability: http://samtools.sourceforge.net

45,957 citations

Journal ArticleDOI
TL;DR: Burrows-Wheeler Alignment tool (BWA) is implemented, a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps.
Abstract: Motivation: The enormous amount of short reads generated by the new DNA sequencing technologies call for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for alignment of longer reads where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals. Results: We implemented Burrows-Wheeler Alignment tool (BWA), a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. Evaluations on both simulated and real data suggest that BWA is ~10–20× faster than MAQ, while achieving similar accuracy. In addition, BWA outputs alignment in the new standard SAM (Sequence Alignment/Map) format. Variant calling and other downstream analyses after the alignment can be achieved with the open source SAMtools software package. Availability: http://maq.sourceforge.net

43,862 citations

Journal ArticleDOI
TL;DR: Trimmomatic is developed as a more flexible and efficient preprocessing tool that correctly handles paired-end data, and is shown to produce output that is at least competitive with, and in many cases superior to, that produced by other tools in all scenarios tested.
Abstract: Motivation: Although many next-generation sequencing (NGS) read preprocessing tools already existed, we could not find any tool or combination of tools that met our requirements in terms of flexibility, correct handling of paired-end data and high performance. We have developed Trimmomatic as a more flexible and efficient preprocessing tool, which could correctly handle paired-end data. Results: The value of NGS read preprocessing is demonstrated for both reference-based and reference-free tasks. Trimmomatic is shown to produce output that is at least competitive with, and in many cases superior to, that produced by other tools, in all scenarios tested. Availability and implementation: Trimmomatic is licensed under GPL V3. It is cross-platform (Java 1.5+ required) and available at http://www.usadellab.org/cms/index.php?page=trimmomatic Supplementary information: Supplementary data are available at Bioinformatics online.

39,291 citations

Journal ArticleDOI
TL;DR: Bowtie 2 combines the strengths of the full-text minute index with the flexibility and speed of hardware-accelerated dynamic programming algorithms to achieve a combination of high speed, sensitivity and accuracy.
Abstract: As the rate of sequencing increases, greater throughput is demanded from read aligners. The full-text minute index is often used to make alignment very fast and memory-efficient, but the approach is ill-suited to finding longer, gapped alignments. Bowtie 2 combines the strengths of the full-text minute index with the flexibility and speed of hardware-accelerated dynamic programming algorithms to achieve a combination of high speed, sensitivity and accuracy.

37,898 citations

Journal ArticleDOI
TL;DR: This work introduces PLINK, an open-source C/C++ WGAS tool set, and describes its five main domains of function: data management, summary statistics, population stratification, association analysis, and identity-by-descent estimation. It focuses on the estimation and use of identity-by-state and identity-by-descent information in the context of population-based whole-genome studies.
Abstract: Whole-genome association studies (WGAS) bring new computational, as well as analytic, challenges to researchers. Many existing genetic-analysis tools are not designed to handle such large data sets in a convenient manner and do not necessarily exploit the new opportunities that whole-genome data bring. To address these issues, we developed PLINK, an open-source C/C++ WGAS tool set. With PLINK, large data sets comprising hundreds of thousands of markers genotyped for thousands of individuals can be rapidly manipulated and analyzed in their entirety. As well as providing tools to make the basic analytic steps computationally efficient, PLINK also supports some novel approaches to whole-genome data that take advantage of whole-genome coverage. We introduce PLINK and describe the five main domains of function: data management, summary statistics, population stratification, association analysis, and identity-by-descent estimation. In particular, we focus on the estimation and use of identity-by-state and identity-by-descent information in the context of population-based whole-genome studies. This information can be used to detect and correct for population stratification and to identify extended chromosomal segments that are shared identical by descent between very distantly related individuals. Analysis of the patterns of segmental sharing has the potential to map disease loci that contain multiple rare variants in a population-based linkage analysis.

26,280 citations

Frequently Asked Questions (1)
Q1. What are the contributions in this paper?

Wang et al. describe a variation map for 352 wild and domesticated cotton accessions and provide evidence for asymmetric subgenome selection and cis-regulatory divergence during cotton domestication.