
A high-quality genome assembly and annotation of the humpback grouper Cromileptes altivelas


A chromosome-level genome assembly and annotation of the humpback
grouper Cromileptes altivelas
Yun Sun a,1, Dongdong Zhang a,1, Jianzhi Shi c,1, Guisen Chen a, Ying Wu a, Yang Shen a, Zhenjie Cao a, Linlin Zhang b,*, Yongcan Zhou a,*
a State Key Laboratory of Marine Resource Utilization in South China Sea, Hainan University, Haikou 570228, China
b Center for Ocean Mega-Science, The Key Laboratory of Experimental Marine Biology, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, China
c Novogene Bioinformatics Institute, Beijing, 100083, China
1 These authors contributed equally to this work.
* To whom correspondence should be addressed.
Mailing address:
College of Marine Sciences
Hainan University
58 Renmin Avenue
Haikou 570228
PR China
Phone and Fax: 86-898-66256125
Email: zychnu@163.com (Yongcan Zhou)
linlinzhang@qdio.ac.cn (Linlin Zhang)
bioRxiv preprint doi: https://doi.org/10.1101/2020.06.22.164277; this version posted June 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Abstract
Cromileptes altivelis, which belongs to the family Serranidae in the order Perciformes, is widely distributed throughout the tropical waters of the Indo-West Pacific region. Owing to its excellent food quality and abundant nutrients, it has become a popular marine food fish with high market value. Here, we report a chromosome-level genome assembly and annotation of the humpback grouper using more than 103X PacBio long reads and high-throughput chromosome conformation capture (Hi-C) technologies. The contig N50 of the assembly is 4.14 Mb, the final assembly is 1.07 Gb with a scaffold N50 of 44.78 Mb, and 99.24% of the scaffold sequences were anchored into 24 chromosomes. The high-quality genome assembly also showed high gene completeness, with 27,067 protein-coding genes and 3,710 ncRNAs. This highly accurate genome assembly and annotation will not only provide an essential genomic resource for C. altivelis breeding and restocking, but will also serve as a key resource for studying fish genomics and genetics.
Keywords: humpback grouper; genome assembly; evolution; PacBio; Hi-C

Data Description
Background & Summary
The humpback grouper Cromileptes altivelis (order Perciformes, family Serranidae, subfamily Epinephelinae) inhabits the tropical waters of the Indo-West Pacific [1]. C. altivelis is increasingly attracting attention as a high-value food fish for its delicious flavor and high nutritional value, and it also has great ornamental value due to its unique body shape and beautiful color [1-3] (Fig. 1). However, the wild population of C. altivelis is increasingly exploited. Meanwhile, C. altivelis farming is limited by its slow growth, low survival rate, and various pathogenic diseases [4-5]. Obtaining high-quality genomic sequences is the foundation for developing genomic selection to improve the performance of C. altivelis. Genome information is also critical for exploring the genetic mechanisms of its unique traits, immune system and evolutionary adaptation. Recently, genome sequences of seven grouper species have become available. Most of these species belong to the genus Epinephelus, and few genome sequences are available for groupers from other genera. The humpback grouper is the only species of the genus Cromileptes.
Here, combining PacBio long-read sequencing and high-throughput chromosome conformation capture (Hi-C) technologies, we sequenced the genome of the humpback grouper C. altivelis, with an estimated size of 1.07 Gb. The scaffold N50 of the final genome assembly reached 44.78 Mb, and 99.24% of the scaffold sequences were anchored into 24 chromosomes. Based on the high-quality assembly, we annotated the protein-coding genes and ncRNAs. The high-quality genome assembly and annotation will not only provide an essential genomic resource for exploring the economic value of C. altivelis breeding and restocking, but will also serve as a key resource for studying fish genomics and genetics.
Methods
Sample collection, library construction and sequencing
We sampled a single female C. altivelis individual from Hainan, China, for genome sequencing (Fig. 1). Total genomic DNA was extracted from muscle tissue using SDS lysis followed by magnetic-bead isolation.
We applied a strategy combining four technologies for library construction and sequencing: the PacBio Sequel System (for genome assembly), the Illumina HiSeq 4000 System (for genome survey), 10X Genomics linked-reads (for scaffold construction), and Hi-C optical maps (for chromosome construction). First, two paired-end Illumina sequencing libraries were constructed with an insert size of 350 bp, and sequencing was carried out on the Illumina HiSeq 4000 platform. A total of 79.18 Gb (71.98X coverage) of paired-end 150 bp reads was produced. Raw Illumina reads were discarded if they contained adapter sequence, had more than 10% N bases, or had more than 50% low-quality bases (quality score ≤5). Second, a total of 113.49 Gb of polymerase reads was generated on the PacBio Sequel platform, and 106.3 Gb (103X coverage) of subreads was obtained after adapter removal and filtering with the default parameters. The mean and N50 subread lengths were 8.04 kb and 13.26 kb, respectively. Third, one 10X Genomics linked-read library was constructed and sequenced on the Illumina HiSeq 4000 platform, producing 129.1 Gb (117.4X coverage). Finally, an optical map was constructed from Hi-C, for which 119.2 Gb (108.4X coverage) of data was generated. All sequence data are summarized in Table 1.
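The coverage figures quoted above are total yield divided by genome size. A minimal sketch of that arithmetic follows, assuming a genome size of 1.10 Gb; this value is an assumption back-calculated from the reported Illumina numbers, since the text does not state which size estimate was used. The function name and the dictionary of yields are illustrative only.

```python
def coverage(total_bases_gb: float, genome_size_gb: float) -> float:
    """Sequencing depth (X) = total sequenced bases / genome size."""
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    return total_bases_gb / genome_size_gb


# Assumption: 1.10 Gb, inferred from 79.18 Gb / 71.98X; not stated in the text.
ASSUMED_GENOME_GB = 1.10

# Yields (Gb) as reported in the paragraph above for three of the libraries.
yields_gb = {"Illumina PE150": 79.18, "10X Genomics": 129.1, "Hi-C": 119.2}

for library, gb in yields_gb.items():
    print(f"{library}: {coverage(gb, ASSUMED_GENOME_GB):.1f}X")
```

Under this assumed genome size, the three computed depths round to the coverage values reported for those libraries.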

Genome size estimation
The genome size of C. altivelis was first estimated from the k-mer spectrum computed with Jellyfish [6] (v2.1.3). The distribution of 17-mers showed a major peak at a depth of 57 (Figure S1). Based on the total number of k-mers (63,765,804,944) and the peak k-mer depth of 57, the C. altivelis genome size was estimated to be 1118.70 Mb using the formula: Genome size = kmer_Number / Peak_Depth. After correction, the estimated genome size was 1104.81 Mb, the genome heterozygosity was 0.16%, and the repeat content was 46.38%.
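The k-mer formula can be checked directly from the two numbers reported in the text. The sketch below is only that arithmetic; the function name is illustrative, and it does not reproduce the correction step that gives 1104.81 Mb, which the text does not describe.

```python
def kmer_genome_size(kmer_number: int, peak_depth: int) -> float:
    """Genome size (bp) = total k-mer count / depth of the main k-mer peak."""
    if peak_depth <= 0:
        raise ValueError("peak depth must be positive")
    return kmer_number / peak_depth


# Values reported in the text for the 17-mer spectrum.
size_bp = kmer_genome_size(63_765_804_944, 57)
print(f"Estimated genome size: {size_bp / 1e6:.2f} Mb")  # prints 1118.70 Mb
```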
De novo assembly of the C. altivelis genome
The contig assembly of the C. altivelis genome was carried out using the FALCON assembler [7], followed by two rounds of polishing with Quiver [8]. FALCON implements a hierarchical assembly process that includes the following steps: (1) subread error correction by aligning all reads to each other using daligner [9] and processing the overlap data to generate error-corrected consensus reads, which yielded 28 Gb (35X coverage) of error-corrected reads; (2) a second round of overlap detection using the error-corrected reads; (3) construction of a directed string graph from the overlap data; and (4) resolution of contig paths from the string graph. After FALCON assembly, the genome was polished with Quiver. The initial assembly of the PacBio data resulted in a contig N50 (the minimum contig length such that contigs of at least that length account for half of the haploid genome size) of 4.14 Mb. The PacBio contigs were then scaffolded using optical map data, and the resulting scaffolds were further connected into super-scaffolds with 10X Genomics linked-read data using the fragScaff software [10]. Finally, we used Illumina-derived short reads to correct any remaining errors with Pilon [11]. The final genome assembly of C. altivelis had a total length of 1.07 Gb, a contig N50 of 4.14 Mb, and a scaffold N50 of 44.78 Mb (Table 2).
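The N50 statistic used for contigs and scaffolds can be computed as sketched below. This is a generic illustration on toy lengths, not the authors' script; the optional genome_size argument is a hypothetical parameter added here to cover the definition relative to haploid genome size, whereas the default uses the total assembled length.

```python
from typing import Iterable, Optional


def n50(lengths: Iterable[int], genome_size: Optional[int] = None) -> int:
    """Return the length L such that sequences of length >= L cover half the target.

    The target is half of genome_size if given, otherwise half of the
    summed input lengths. Returns 0 if the target is never reached.
    """
    ordered = sorted(lengths, reverse=True)
    total = genome_size if genome_size is not None else sum(ordered)
    target = total / 2
    running = 0
    for length in ordered:
        running += length
        if running >= target:
            return length
    return 0


# Toy example: total length 20, half is 10; 6 + 5 = 11 reaches it at length 5.
toy_contigs = [2, 3, 4, 5, 6]
print(n50(toy_contigs))  # prints 5
```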
Hi-C technology was further used for chromosome construction. We performed quality control of the raw Hi-C data using HiCUP (version 3.0). We then aligned the raw reads to the draft assembly with Bowtie2 (version 2.2.2) and filtered out low-quality reads to build raw intrachromosomal contact maps. Based on the high-quality Hi-C data, we anchored and oriented the primary scaffolds into 24 chromosomes (Fig. 2), which together covered 99.24% of the whole genome sequence.
Repetitive sequences annotation
The repetitive elements in the C. altivelis genome were identified by a combination of evidence-based and ab initio approaches. We first used RepeatMasker (RepeatMasker, RRID:SCR_012954) [12] and RepeatProteinMask to search against Repbase. We then constructed a de novo repetitive element library using RepeatModeler and used this library for a second round of searching with RepeatMasker. In addition, we used Tandem Repeats Finder [13], LTR_FINDER (LTR_FINDER, RRID:SCR_015247) [14], PILER [15], and RepeatScout (RepeatScout, RRID:SCR_014653) [16] with default parameters for further annotation of repetitive elements. Overall, we found 473,252,116 bp of repeat sequences, accounting for 44.35% of the C. altivelis genome (Table 3A), including 3.8% tandem repeats. Among transposable elements (TEs), there are 17.28% DNA transposons, 24.07% retroelements (including LINEs, SINEs and LTRs), and 3.74% unclassified elements (Table 3B).
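As a consistency check, the repeat length and repeat percentage reported above imply a total sequence length, which can be compared with the 1.07 Gb assembly size. The sketch below only performs that division; the function name is illustrative.

```python
def implied_total_length(repeat_bp: int, repeat_percent: float) -> float:
    """Total length (bp) implied by a repeat length and its percentage share."""
    if not 0 < repeat_percent <= 100:
        raise ValueError("repeat_percent must be in (0, 100]")
    return repeat_bp / (repeat_percent / 100)


# Values reported in the text: 473,252,116 bp of repeats, 44.35% of the genome.
total_bp = implied_total_length(473_252_116, 44.35)
print(f"Implied total length: {total_bp / 1e9:.2f} Gb")  # prints 1.07 Gb
```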

Protein-coding gene prediction and functional annotation
To obtain a fully annotated C. altivelis genome, three approaches were combined to predict protein-coding genes: homology-based prediction, ab initio prediction, and transcriptome-based prediction. First, homology-based prediction was performed with TBLASTN (TBLASTN, RRID:SCR_011822) [17] using protein repertoires of nine common vertebrates including Branchiostoma floridae (Bfl, GCA_000003815.1), Cynoglossus semilaevis (Cse, GCA_000523025.1), Danio rerio (Dre, GCF_000002035.6), Gasterosteus aculeatus (Gac, GCA_000180675.1), Larimichthys crocea (Lcr, GCA_000972845.1), Oryzias latipes (Ola, GCA_002234675.1), Oreochromis niloticus (Oni, GCF_001858045.1), and Takifugu rubripes (Tru, GCF_000180615.1). The Basic Local Alignment Search Tool (BLAST) hits were then conjoined with the Solar software [18]. GeneWise (GeneWise, RRID:SCR_015054) [19] was then used to predict the exact gene structure of the genomic region corresponding to each BLAST hit. Homology predictions were denoted as “Homology-set”.
Second, to provide further evidence for evaluating the predicted gene models, we assembled 38.67 Gb of RNA-sequencing (RNA-seq) data derived from five different tissues using both de novo and reference-guided approaches. The de novo RNA-seq assembly was performed with the Trinity pipeline [20], resulting in 370,688 contigs with an average length of 909 bp (Trinity-set). For the reference-guided approach, short reads were mapped directly to the genome using Tophat (Tophat, RRID:SCR_013035) [21] to identify putative exon regions and splice junctions. Cufflinks (Cufflinks, RRID:SCR_014597) [22] and cuffmerge were then used to assemble the mapped reads into gene models (Cufflinks-set). The assembled Trinity-set and Cufflinks-set were then aligned against the C. altivelis genome with the Program to Assemble Spliced Alignments (PASA). Valid transcript alignments were clustered based on genome mapping location and assembled into gene structures. Gene models created by PASA [23] were denoted as “Transcripts-set”.
Third, ab initio prediction was performed on the repeat-masked C. altivelis genome using Augustus (Augustus, RRID:SCR_008417) [24], GeneID [25], GeneScan [26], GlimmerHMM (GlimmerHMM, RRID:SCR_002654) [27] and SNAP [28]. Of these, Augustus, SNAP, and GlimmerHMM were trained with PASA-H-set gene models. Finally, the gene models predicted by the three approaches were integrated with EvidenceModeler [29]. Weights for each type of evidence were set as follows: Transdecoder > GeneWise = Cufflinks-set > Augustus > GeneID = SNAP = GlimmerHMM = GeneScan. The gene models were further updated with PASA2 to add untranslated regions and alternative splicing variants. In total, 27,242 protein-coding genes were obtained, with a mean of 8.7 exons per gene (Table 4). The lengths of genes, coding sequences, introns, and exons in C. altivelis were comparable to those of closely related genomes (Supplementary Table S1).
Gene functions of protein-coding genes were annotated by searching functional motifs, domains, and possible biological processes against known databases such as SwissProt [30], Pfam [31], the NR database (from NCBI), Gene Ontology [32], and the Kyoto Encyclopedia of Genes and Genomes [33]. A total of 27,067 protein-coding genes (99.4%) were successfully annotated with at least one functional term (Supplementary Table S2).
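The 99.4% annotation rate follows from the two gene counts given in this section: 27,067 functionally annotated genes out of 27,242 predicted protein-coding genes. A minimal sketch of that calculation, with an illustrative function name:

```python
def annotation_rate(annotated: int, predicted: int) -> float:
    """Percentage of predicted genes that received at least one functional term."""
    if predicted <= 0:
        raise ValueError("predicted gene count must be positive")
    return 100 * annotated / predicted


# Gene counts reported in the text.
rate = annotation_rate(annotated=27_067, predicted=27_242)
print(f"Annotated: {rate:.1f}%")  # prints 99.4%
```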
Non-coding gene prediction
We also predicted noncoding RNA genes in the C. altivelis genome. The rRNA fragments were predicted by searching against a human rRNA database using BLAST with an E-value cutoff of 1E-10. The tRNA genes were identified with the tRNAscan-SE (tRNAscan-SE, RRID:SCR_010835) software [34].
