A high-quality genome assembly and annotation of the humpback grouper Cromileptes altivelas

doi:10.1101/2020.06.22.164277

A chromosome-level genome assembly and annotation of the humpback grouper Cromileptes altivelas

Yun Sun (a,1), Dongdong Zhang (a,1), Jianzhi Shi (c,1), Guisen Chen (a), Ying Wu (a), Yang Shen (a), Zhenjie Cao (a), Linlin Zhang (b,*), Yongcan Zhou (a,*)

(a) State Key Laboratory of Marine Resource Utilization in South China Sea, Hainan University, Haikou 570228, China

(b) Center for Ocean Mega-Science, The Key Laboratory of Experimental Marine Biology, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, China

(c) Novogene Bioinformatics Institute, Beijing, 100083, China

(1) These authors contributed equally to this work.

* To whom correspondence should be addressed.

Mailing address:

College of Marine Sciences

Hainan University

58 Renmin Avenue

Haikou 570228

PR China

Phone and Fax: 86-898-66256125

Email: zychnu@163.com (Yongcan Zhou)

linlinzhang@qdio.ac.cn (Linlin Zhang)

bioRxiv preprint, doi: https://doi.org/10.1101/2020.06.22.164277; this version posted June 23, 2020.

Abstract

Cromileptes altivelis, which belongs to the family Serranidae in the order Perciformes, is widely distributed throughout the tropical waters of the Indo-West Pacific region. Owing to its excellent food quality and abundant nutrients, it has become a popular marine food fish with a high market value. Here, we report a chromosome-level genome assembly and annotation of the humpback grouper using more than 103X coverage of PacBio long reads together with high-throughput chromosome conformation capture (Hi-C) technology. The final assembly is 1.07 Gb with a contig N50 of 4.14 Mb and a scaffold N50 of 44.78 Mb, and 99.24% of the scaffold sequences were anchored onto 24 chromosomes. The high-quality genome assembly also showed high gene completeness, with 27,067 protein-coding genes and 3,710 ncRNAs. This highly accurate genome assembly and annotation will not only provide an essential genomic resource for C. altivelis breeding and restocking, but will also serve as a key resource for studying fish genomics and genetics.

Keywords: humpback grouper; genome assembly; evolution; PacBio; Hi-C

Data Description

Background & Summary

The humpback grouper Cromileptes altivelis (order Perciformes, family Serranidae, subfamily Epinephelinae) inhabits the tropical waters of the Indo-West Pacific [1]. C. altivelis is increasingly attracting attention as a high-value food fish for its delicious flavor and high nutritional value, and it also has great ornamental value due to its unique body shape and beautiful color [1-3] (Fig. 1). However, the wild population of C. altivelis is increasingly exploited. Meanwhile, C. altivelis farming is limited by its slow growth, low survival rate, and various pathogenic diseases [4-5]. Obtaining high-quality genomic sequences is the foundation for developing genomic selection to improve the performance of C. altivelis. Genome information is also critical for exploring the genetic mechanisms of its unique traits, immune system and evolutionary adaptation. Genome sequences of seven grouper species are currently available, most of which belong to the genus Epinephelus; few genome sequences exist for groupers from other genera. The humpback grouper is the only species of the genus Cromileptes.

Here, combining PacBio long-read sequencing and high-throughput chromosome conformation capture (Hi-C) technologies, we sequenced the humpback grouper C. altivelis genome and obtained an assembly of 1.07 Gb. The scaffold N50 of the final genome assembly reached 44.78 Mb, and 99.24% of the scaffold sequences were anchored onto 24 chromosomes. Based on the high-quality assembly, we annotated the protein-coding genes and ncRNAs. The high-quality genome assembly and annotation will not only provide an essential genomic resource for exploring the economic value of C. altivelis breeding and restocking, but will also serve as a key resource for studying fish genomics and genetics.

Methods

Sample collection, library construction and sequencing

We sampled a single female individual of C. altivelis from Hainan, China, for genome sequencing (Fig. 1). Total genomic DNA was extracted from muscle tissue using SDS lysis followed by magnetic bead isolation.

We applied a strategy combining four technologies for library construction and sequencing: the PacBio Sequel System (for genome assembly), the Illumina HiSeq 4000 System (for genome survey), 10X Genomics linked-reads (for scaffold construction), and Hi-C optical maps (for chromosome construction). First, two paired-end Illumina sequencing libraries were constructed with an insert size of 350 bp, and sequencing was carried out on the Illumina HiSeq 4000 platform. A total of 79.18 Gb (71.98X coverage) of paired-end 150 bp reads were produced. Raw sequence data generated by the Illumina platform were filtered by removing reads containing adapters, reads with more than 10% N bases, and reads in which more than 50% of the bases were of low quality (quality ≤5). Second, a total of 113.49 Gb of polymerase read data were generated using the PacBio Sequel platform, and 106.3 Gb (103X coverage) of subreads were obtained after removing adaptors and filtering with the default parameters. The average and N50 lengths of the subreads were 8.04 kb and 13.26 kb, respectively. Third, one 10X Genomics linked-read library was constructed and sequenced on the Illumina HiSeq 4000 platform, producing 129.1 Gb (117.4X coverage) of data. Finally, an optical map was also constructed from Hi-C, for which 119.2 Gb (108.4X coverage) of data were generated. All sequence data are summarized in Table 1.
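The three Illumina filtering criteria described above can be expressed as a short predicate. The following is a minimal sketch only, not the pipeline actually used: the function name `passes_filter`, the default adapter string (a commonly used Illumina adapter prefix), and the representation of qualities as a list of integer Phred scores are all assumptions introduced for illustration.

```python
from typing import Sequence


def passes_filter(
    seq: str,
    quals: Sequence[int],
    adapter: str = "AGATCGGAAGAGC",  # assumed adapter prefix, not taken from the paper
    max_n_frac: float = 0.10,        # discard reads with more than 10% N bases
    low_q: int = 5,                  # a base is "low quality" if its score is <= 5
    max_low_q_frac: float = 0.50,    # discard reads with more than 50% low-quality bases
) -> bool:
    """Return True if a read survives the three filtering criteria."""
    if not seq or len(seq) != len(quals):
        raise ValueError("sequence and quality list must be non-empty and equal in length")
    seq = seq.upper()
    # Criterion 1: the read contains adapter sequence.
    if adapter and adapter in seq:
        return False
    # Criterion 2: too many undetermined (N) bases.
    if seq.count("N") / len(seq) > max_n_frac:
        return False
    # Criterion 3: too many low-quality bases.
    low_quality_bases = sum(1 for q in quals if q <= low_q)
    if low_quality_bases / len(quals) > max_low_q_frac:
        return False
    return True
```

For paired-end data, a pair would typically be discarded if either mate fails such a check; whether that rule was applied here is not stated in the text.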

Genome size estimation

The genome size of C. altivelis was first estimated from the k-mer spectrum with Jellyfish [6] (v2.1.3). The distribution of 17-mers showed a major peak at a depth of 57 (Figure S1). Based on the total number of k-mers (63,765,804,944) and the corresponding k-mer depth of 57, the C. altivelis genome size was estimated to be 1118.70 Mb using the formula: Genome size = kmer_Number / Peak_Depth. The revised genome size was 1104.81 Mb, the genome heterozygosity was 0.16%, and the repeat content was 46.38%.
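As a worked check of the formula above, the following minimal sketch reproduces the initial 1118.70 Mb estimate from the reported k-mer count and peak depth. The function name `estimate_genome_size_mb` is hypothetical. The revised value of 1104.81 Mb is not reproduced, because the correction procedure is not described in the text.

```python
def estimate_genome_size_mb(kmer_number: int, peak_depth: int) -> float:
    """Genome size (in Mb) = total number of k-mers / k-mer depth at the main peak."""
    if peak_depth <= 0:
        raise ValueError("peak depth must be positive")
    return kmer_number / peak_depth / 1e6


# Values reported in the text for the 17-mer spectrum.
size_mb = estimate_genome_size_mb(63_765_804_944, 57)
print(f"{size_mb:.2f} Mb")  # 1118.70 Mb
```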

De novo assembly of the C. altivelis genome

The contig assembly of the C. altivelis genome was carried out using the FALCON assembler [7], followed by two rounds of polishing with Quiver [8]. FALCON implements a hierarchical assembly process that includes the following steps: (1) subread error correction by aligning all reads to each other using daligner [9] and processing the overlap data to generate error-corrected consensus reads, which yielded 28 Gb (35X coverage) of error-corrected reads; (2) a second round of overlap detection using the error-corrected reads; (3) construction of a directed string graph from the overlap data; and (4) resolution of contig paths from the string graph. After FALCON assembly, the genome was polished with Quiver. Initial assembly of the PacBio data resulted in a contig N50 (the length of the shortest contig among the longest contigs that together account for half of the haploid genome size) of 4.14 Mb. The PacBio contigs were then scaffolded using optical map data, and the resulting scaffolds were further connected into super-scaffolds with 10X Genomics linked-read data using the fragScaff software [10]. Finally, we used Illumina-derived short reads to correct any remaining errors with Pilon [11]. The final genome assembly of C. altivelis had a total length of 1.07 Gb, a contig N50 of 4.14 Mb, and a scaffold N50 of 44.78 Mb (Table 2).
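The N50 statistic reported above can be computed from a list of contig or scaffold lengths as sketched below. This is an illustrative sketch under the common convention that N50 is taken relative to the total assembly length; a variant normalized by the estimated genome size would substitute that value for `total`. The function name `n50` and the toy lengths are hypothetical and are not data from this assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the length L such that sequences of length >= L cover at least half the total."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first sequence that brings the cumulative sum to half of the total.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for a non-empty list of positive lengths")


# Toy example with made-up lengths: total = 32, and 8 + 8 = 16 reaches half of it.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # 8
```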

Hi-C technology was further used for chromosome construction. We performed quality control of the Hi-C raw data using HiCUP (version 3.0). We then aligned the reads to the draft assembly with Bowtie2 (version 2.2.2) and filtered out low-quality reads to build raw intrachromosomal contact maps. Based on the high-quality Hi-C data, we anchored and oriented the primary scaffolds onto 24 chromosomes (Fig. 2), which together covered 99.24% of the whole genome sequence.

Repetitive sequences annotation

The repetitive elements in the C. altivelis genome were identified by a combination of evidence-based and ab initio approaches. We first used RepeatMasker (RepeatMasker, RRID:SCR_012954) [12] and RepeatProteinMask to search against Repbase. We then constructed a de novo repetitive element library using RepeatModeler and used this library for a second round of searching with RepeatMasker. In addition, we used Tandem Repeats Finder [13], LTR_FINDER (LTR_FINDER, RRID:SCR_015247) [14], PILER [15], and RepeatScout (RepeatScout, RRID:SCR_014653) [16] with default parameters for further annotation of repetitive elements. Overall, we found 473,252,116 bp of repeat sequences, accounting for 44.35% of the C. altivelis genome (Table 3A), including 3.8% tandem repeats. Among transposable elements (TEs), there were 17.28% DNA transposons, 24.07% retroelements (including LINEs, SINEs and LTRs), and 3.74% unclassified elements (Table 3B).

Protein-coding gene prediction and functional annotation

To obtain a fully annotated C. altivelis genome, three approaches were combined to predict protein-coding genes: homology-based prediction, ab initio prediction, and transcriptome-based prediction. First, homology-based prediction was performed with TBLASTN (TBLASTN, RRID:SCR_011822) [17] using protein repertoires of nine common chordate species including Branchiostoma floridae (Bfl, GCA_000003815.1), Cynoglossus semilaevis (Cse, GCA_000523025.1), Danio rerio (Dre, GCF_000002035.6), Gasterosteus aculeatus (Gac, GCA_000180675.1), Larimichthys crocea (Lcr, GCA_000972845.1), Oryzias latipes (Ola, GCA_002234675.1), Oreochromis niloticus (Oni, GCF_001858045.1), and Takifugu rubripes (Tru, GCF_000180615.1). The Basic Local Alignment Search Tool (BLAST) hits were then conjoined with the Solar software [18]. GeneWise (GeneWise, RRID:SCR_015054) [19] was then used to predict the exact gene structure in the genomic region corresponding to each BLAST hit. Homology predictions were denoted as “Homology-set”.

Second, to provide further evidence for evaluating the predicted gene models, we assembled 38.67 Gb of RNA-sequencing (RNA-seq) data derived from five different tissues by both de novo and reference-guided approaches. De novo RNA-seq assembly was performed with the Trinity pipeline [20], resulting in 370,688 contigs with an average length of 909 bp (Trinity-set). For the reference-guided approach, short reads were mapped directly to the genome using TopHat (TopHat, RRID:SCR_013035) [21] to identify putative exon regions and splice junctions. Cufflinks (Cufflinks, RRID:SCR_014597) [22] and Cuffmerge were then used to assemble the mapped reads into gene models (Cufflinks-set). The assembled Trinity-set and Cufflinks-set were then aligned against the C. altivelis genome with the Program to Assemble Spliced Alignments (PASA) [23]. Valid transcript alignments were clustered based on genome mapping location and assembled into gene structures. Gene models created by PASA were denoted as “Transcripts-set”.

Third, ab initio prediction was performed on the repeat-masked C. altivelis genome using Augustus (Augustus, RRID:SCR_008417) [24], GeneID [25], GeneScan [26], GlimmerHMM (GlimmerHMM, RRID:SCR_002654) [27] and SNAP [28]. Of these, Augustus, SNAP, and GlimmerHMM were trained with PASA-H-set gene models. Finally, the three sets of predicted gene models were integrated with EvidenceModeler [29]. Weights for each type of evidence were set as follows: Transdecoder > GeneWise = Cufflinks-set > Augustus > GeneID = SNAP = GlimmerHMM = GeneScan. The gene models were further updated with PASA2 to add untranslated regions and alternative splicing variation information. In total, 27,242 protein-coding genes were obtained, with a mean of 8.7 exons per gene (Table 4). The lengths of genes, coding sequences, introns, and exons in C. altivelis were comparable to those of closely related genomes (Supplementary Table S1).

Gene functions of protein-coding genes were annotated by searching for functional motifs, domains, and possible biological processes against known databases such as SwissProt [30], Pfam [31], the NR database (from NCBI), Gene Ontology [32], and the Kyoto Encyclopedia of Genes and Genomes [33]. A total of 27,067 protein-coding genes (99.4%) were successfully annotated with at least one functional term (Supplementary Table S2).
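The reported annotation rate follows directly from the two gene counts given in this section (27,242 predicted genes, of which 27,067 received at least one functional term). A minimal arithmetic sketch, with the hypothetical helper name `annotated_percentage`:

```python
def annotated_percentage(annotated: int, predicted: int) -> float:
    """Percentage of predicted genes that received at least one functional annotation."""
    if predicted <= 0 or not 0 <= annotated <= predicted:
        raise ValueError("counts must satisfy 0 <= annotated <= predicted and predicted > 0")
    return 100.0 * annotated / predicted


# Counts reported in the text.
print(f"{annotated_percentage(27_067, 27_242):.1f}%")  # 99.4%
```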

Non-coding gene prediction

We also predicted noncoding RNA genes in the C. altivelis genome. The rRNA fragments were predicted by searching against a human rRNA database using BLAST with an E-value cutoff of 1E-10. The tRNA genes were identified with the tRNAscan-SE (tRNAscan-SE, RRID:SCR_010835) software [34].


References

Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement

OrthoMCL: identification of ortholog groups for eukaryotic genomes.

PAML: a program package for phylogenetic analysis by maximum likelihood

A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data

The Pfam protein families database: towards a more sustainable future
