
A high-quality genome assembly and annotation of the humpback grouper Cromileptes altivelas

TL;DR: In this article, a chromosome-level genome assembly and annotation of the humpback grouper genome using more than 103X PacBio long-reads and high-throughput chromosome conformation capture (Hi-C) technologies was reported.
Abstract: Cromileptes altivelis, which belongs to the family Serranidae in the order Perciformes, is widely distributed throughout the tropical waters of the Indo-West Pacific region. Owing to its excellent food quality and abundant nutrients, it has become a popular marine food fish with high market value. Here, we report a chromosome-level genome assembly and annotation of the humpback grouper using more than 103X PacBio long reads and high-throughput chromosome conformation capture (Hi-C) technologies. The contig N50 of the assembly is 4.14 Mbp, the final assembly is 1.07 Gb with a scaffold N50 of 44.78 Mb, and 99.24% of the scaffold sequences were anchored into 24 chromosomes. The high-quality genome assembly also showed high gene completeness, with 27,067 protein-coding genes and 3,710 ncRNAs. This highly accurate genome assembly and annotation will not only provide an essential genome resource for C. altivelis breeding and restocking, but will also serve as a key resource for studying fish genomics and genetics.

Summary (1 min read)

Data Description Background & Summary

  • The humpback grouper Cromileptes altivelis (order Perciformes, family Serranidae, subfamily Epinephelinae) inhabits the tropical waters of the Indo-West Pacific [1].
  • Obtaining high-quality genomic sequences is the foundation for developing genomic selection to improve the performance of C. altivelis.
  • There are few genome sequences of grouper species from genera other than Epinephelus.
  • One 10X Genomics linked-read library was constructed and sequenced on the Illumina HiSeq 4000 platform, producing 129.1 Gb (coverage of 117.4X).

Genome size estimation

  • The genome size of C. altivelis was first estimated from the k-mer spectrum using Jellyfish [6] (v2.1.3).
  • Hi-C technology was further used for chromosome construction.
  • Based on high-quality Hi-C data, the authors anchored and oriented the primary scaffolds into 24 chromosomes (Fig. 2), which together covered 99.24% of the whole genome sequence.

Repetitive sequences annotation

  • The authors first used RepeatMasker (RepeatMasker, RRID:SCR_012954) [12] and RepeatProteinMask to search against Repbase.
  • In addition, the authors used Tandem Repeats Finder [13], LTR_FINDER (LTR_FINDER, RRID:SCR_015247) [14], PILER [15], and RepeatScout (RepeatScout, RRID:SCR_014653) [16] with default parameters for further annotation of repetitive elements.
  • Gene models created by PASA [23] were denoted as “Transcripts-set”.
  • The lengths of genes, coding sequences, introns, and exons in C. altivelis were comparable to those of closely related genomes (Supplementary Table S1).
  • A total of 27,067 protein-coding genes (99.4%) were successfully annotated with at least one functional term (Supplementary Table S2).

Non-coding gene prediction

  • The authors also predicted noncoding RNA genes in the C. altivelis genome.
  • The rRNA fragments were predicted by searching against a human rRNA database using BLAST with an E-value cutoff of 1E-10.
  • The tRNA genes were identified with tRNAscan-SE (tRNAscan-SE, RRID:SCR_010835) [34].
  • The miRNA and snRNA genes were predicted with INFERNAL (INFERNAL, RRID:SCR_011809) [35] using the Rfam database [36].

Genome evolution analysis

  • To trace the evolutionary position of C. altivelis, nucleotide and protein datasets containing 1082 single-copy genes from the 16 species were used for phylogenetic tree reconstruction and divergence time estimation.
  • There were 1,045 gene families and 1,584 genes in C. altivelis without significant homologous hits to L. crocea, L. oculatus and D. rerio.
  • A Markov chain Monte Carlo analysis was run for 20,000 generations using a burn-in of 1,000 iterations.
  • These phylogenetic analyses indicated that C. altivelis diverged from its common ancestor with G. aculeatus approximately 50.5 million years ago (Fig. 3).

Code availability

  • No specific code was developed in this work.
  • The data analyses were performed according to the manuals and protocols provided by the developers of the corresponding bioinformatics tools that are described in the Methods section.


A chromosome-level genome assembly and annotation of the humpback
grouper Cromileptes altivelas
Yun Sun (a,1), Dongdong Zhang (a,1), Jianzhi Shi (c,1), Guisen Chen (a), Ying Wu (a), Yang Shen (a), Zhenjie Cao (a), Linlin Zhang (b,*), Yongcan Zhou (a,*)
(a) State Key Laboratory of Marine Resource Utilization in South China Sea, Hainan University, Haikou 570228, China
(b) Center for Ocean Mega-Science, The Key Laboratory of Experimental Marine Biology, Institute of Oceanology, Chinese Academy of Sciences, Qingdao, China
(c) Novogene Bioinformatics Institute, Beijing, 100083, China
(1) These authors contributed equally to this work.
(*) To whom correspondence should be addressed.
Mailing address:
College of Marine Sciences
Hainan University
58 Renmin Avenue
Haikou 570228
PR China
Phone and Fax: 86-898-66256125
Email: zychnu@163.com (Yongcan Zhou); linlinzhang@qdio.ac.cn (Linlin Zhang)
bioRxiv preprint doi: https://doi.org/10.1101/2020.06.22.164277; this version posted June 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Abstract
Cromileptes altivelis, which belongs to the family Serranidae in the order Perciformes, is widely distributed throughout the tropical waters of the Indo-West Pacific region. Owing to its excellent food quality and abundant nutrients, it has become a popular marine food fish with high market value. Here, we report a chromosome-level genome assembly and annotation of the humpback grouper using more than 103X PacBio long reads and high-throughput chromosome conformation capture (Hi-C) technologies. The contig N50 of the assembly is 4.14 Mbp, the final assembly is 1.07 Gb with a scaffold N50 of 44.78 Mb, and 99.24% of the scaffold sequences were anchored into 24 chromosomes. The high-quality genome assembly also showed high gene completeness, with 27,067 protein-coding genes and 3,710 ncRNAs. This highly accurate genome assembly and annotation will not only provide an essential genome resource for C. altivelis breeding and restocking, but will also serve as a key resource for studying fish genomics and genetics.
Keywords: humpback grouper; genome assembly; evolution; PacBio; Hi-C

Data Description
Background & Summary
The humpback grouper Cromileptes altivelis (order Perciformes, family Serranidae, subfamily Epinephelinae) inhabits the tropical waters of the Indo-West Pacific [1]. C. altivelis is increasingly attracting attention as a high-value food fish for its delicious flavor and high nutritional value, and it also has great ornamental value owing to its unique body shape and beautiful coloration [1-3] (Fig. 1). However, the wild population of C. altivelis is increasingly exploited. Meanwhile, C. altivelis farming is limited by its slow growth, low survival rate, and various pathogenic diseases [4-5]. Obtaining high-quality genomic sequences is the foundation for developing genomic selection to improve the performance of C. altivelis. Genome information is also critical for exploring the genetic mechanisms of its unique traits, immune system, and evolutionary adaptation. Genome sequences of seven grouper species have recently become available. Most of these species belong to the genus Epinephelus, and few genome sequences are available for groupers from other genera. The humpback grouper is the only species of the genus Cromileptes.
Here, combining PacBio long-read sequencing and high-throughput chromosome conformation capture (Hi-C) technologies, we sequenced the humpback grouper C. altivelis genome, with an estimated size of 1.07 Gb. The scaffold N50 of the final genome assembly reached 44.78 Mb, and 99.24% of the scaffold sequences were anchored into 24 chromosomes. Based on this high-quality assembly, we annotated the protein-coding genes and ncRNAs. The high-quality genome assembly and annotation will not only provide an essential genome resource for exploring the economic value of C. altivelis breeding and restocking, but will also serve as a key resource for studying fish genomics and genetics.
Methods
Sample collection, library construction and sequencing
We sampled a single female C. altivelis individual from Hainan, China, for genome sequencing (Fig. 1). Total genomic DNA was extracted from muscle tissue using SDS lysis followed by magnetic bead isolation.
We applied a strategy combining four technologies for library construction and sequencing: the PacBio Sequel System (for genome assembly), the Illumina HiSeq 4000 System (for the genome survey), 10X Genomics linked reads (for scaffold construction), and Hi-C optical maps (for chromosome construction). First, two paired-end Illumina libraries were constructed with an insert size of 350 bp and sequenced on the Illumina HiSeq 4000 platform. A total of 79.18 Gb (coverage of 71.98X) of paired-end 150 bp reads were produced. Raw Illumina reads were filtered by removing reads containing adapters, reads with more than 10% N bases, and reads with more than 50% low-quality bases (5). Second, a total of 113.49 Gb of polymerase read data were generated on the PacBio Sequel platform, and 106.3 Gb (coverage of 103X) of subreads were obtained after removing adaptors and filtering with the default parameters. The mean and N50 subread lengths were 8.04 kb and 13.26 kb, respectively. Third, one 10X Genomics linked-read library was constructed and sequenced on the Illumina HiSeq 4000 platform, producing 129.1 Gb (coverage of 117.4X). Finally, an optical map was also constructed from Hi-C, for which 119.2 Gb (coverage of 108.4X) of data were generated. All sequence data are summarized in Table 1.
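The coverage figures quoted above follow from dividing the number of sequenced bases by the genome size. The following is a minimal Python sketch of that arithmetic. The helper name `fold_coverage` is hypothetical, and the genome size of 1.10 Gb is an assumption made here for illustration (it is close to the revised k-mer estimate reported in the next section); the text does not state which value was actually used.

```python
def fold_coverage(bases_gb: float, genome_size_gb: float) -> float:
    """Sequencing depth: total sequenced bases divided by genome size."""
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    return bases_gb / genome_size_gb


# Assumption for illustration only; not stated in the text.
ASSUMED_GENOME_GB = 1.10

# Data volumes (Gb) as reported in the text.
datasets = {"Illumina": 79.18, "10X Genomics": 129.1, "Hi-C": 119.2}

for name, gigabases in datasets.items():
    depth = fold_coverage(gigabases, ASSUMED_GENOME_GB)
    print(f"{name}: {depth:.2f}X")
```

Under this assumed genome size, the three values come out close to the reported 71.98X, 117.4X, and 108.4X.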

Genome size estimation
The genome size of C. altivelis was first estimated from the k-mer spectrum using Jellyfish [6] (v2.1.3). The distribution of 17-mers showed a major peak at a depth of 57 (Figure S1). Based on the total number of k-mers (63,765,804,944) and the corresponding k-mer depth of 57, the C. altivelis genome size was estimated to be 1118.70 Mb using the formula: Genome size = kmer_Number / Peak_Depth. The revised genome size was 1104.81 Mb, the genome heterozygosity was 0.16%, and the repeat content was 46.38%.
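The estimate above can be reproduced directly from the two reported numbers. The sketch below is illustrative only: the function names, the toy histogram, and the `min_depth` cutoff used to skip low-depth error k-mers are assumptions introduced here, not part of the authors' pipeline.

```python
def peak_depth(histogram: dict, min_depth: int = 5) -> int:
    """Return the depth with the most k-mers, ignoring depths below min_depth.

    min_depth is an assumed cutoff that skips the low-depth tail, which is
    typically dominated by sequencing errors.
    """
    candidates = {d: c for d, c in histogram.items() if d >= min_depth}
    if not candidates:
        raise ValueError("no depths at or above min_depth")
    return max(candidates, key=candidates.get)


def estimate_genome_size(total_kmers: int, depth: int) -> float:
    """Genome size = kmer_Number / Peak_Depth, as given in the text."""
    return total_kmers / depth


# Toy histogram (depth -> number of k-mers), invented for illustration.
toy_histogram = {1: 900, 2: 300, 56: 40, 57: 55, 58: 42}
print(peak_depth(toy_histogram))

# Values reported in the text.
size_bp = estimate_genome_size(63_765_804_944, 57)
print(f"{size_bp / 1e6:.2f} Mb")
```

With the reported k-mer total and peak depth, the formula gives the 1118.70 Mb stated in the text.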
De novo assembly of the C. altivelis genome
The contig assembly of the C. altivelis genome was carried out using the FALCON assembler [7], followed by two rounds of polishing with Quiver [8]. FALCON implements a hierarchical assembly process that includes the following steps: (1) subread error correction by aligning all reads to each other using daligner [9], after which the overlap data were processed to generate error-corrected consensus reads, yielding 28 Gb (35X coverage) of error-corrected reads; (2) a second round of overlap detection using the error-corrected reads; (3) construction of a directed string graph from the overlap data; and (4) resolution of contig paths from the string graph. After FALCON assembly, the genome was polished with Quiver. The initial assembly of the PacBio data resulted in a contig N50 (the minimum length of contigs accounting for half of the haploid genome size) of 4.14 Mb. The PacBio contigs were then scaffolded using optical map data, and the resulting scaffolds were further connected into super-scaffolds with 10X Genomics linked-read data using the fragScaff software [10]. Finally, we used Illumina short reads to correct any remaining errors with Pilon [11]. The final genome assembly of C. altivelis had a total length of 1.07 Gb, a contig N50 of 4.14 Mb, and a scaffold N50 of 44.78 Mb (Table 2).
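Because N50 is the headline statistic here, a short sketch of how it is computed may help. This is a generic illustration, not the authors' script: the function name `n50` and the toy contig lengths are invented, and the optional `genome_size` argument is an assumption added to cover the definition used in the text (half of the haploid genome size) alongside the more common definition based on half of the total assembled length.

```python
from typing import Iterable, Optional


def n50(lengths: Iterable[int], genome_size: Optional[int] = None) -> int:
    """Length of the contig at which the running total, taken from the
    longest contig downward, first reaches half of the reference size.

    The reference size is genome_size if given, else the summed lengths.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    reference = genome_size if genome_size is not None else sum(ordered)
    target = reference / 2
    running = 0
    for length in ordered:
        running += length
        if running >= target:
            return length
    # Reached only when the assembly is shorter than half of genome_size.
    raise ValueError("assembled length is less than half of genome_size")


# Toy contig lengths, invented for illustration.
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_contigs))                  # based on total assembled length
print(n50(toy_contigs, genome_size=60))  # based on an assumed genome size
```

The two calls differ only in the denominator, which is why a contig N50 can shift slightly depending on whether assembly length or estimated genome size is used.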
Hi-C technology was further used for chromosome construction. We performed quality control of the raw Hi-C data using HiCUP (version 3.0). We then aligned the raw reads to the draft assembly with Bowtie2 (version 2.2.2) and filtered out low-quality reads to build raw intrachromosomal contact maps. Based on the high-quality Hi-C data, we anchored and oriented the primary scaffolds into 24 chromosomes (Fig. 2), which together covered 99.24% of the whole genome sequence.
Repetitive sequences annotation
The repetitive elements in the C. altivelis genome were identified by a combination of evidence-based and ab initio approaches. We first used RepeatMasker (RepeatMasker, RRID:SCR_012954) [12] and RepeatProteinMask to search against Repbase. We then constructed a de novo repetitive element library using RepeatModeler and used this library for a second round of searching with RepeatMasker. In addition, we used Tandem Repeats Finder [13], LTR_FINDER (LTR_FINDER, RRID:SCR_015247) [14], PILER [15], and RepeatScout (RepeatScout, RRID:SCR_014653) [16] with default parameters for further annotation of repetitive elements. Overall, we found 473,252,116 bp of repeat sequences, accounting for 44.35% of the C. altivelis genome (Table 3A), including 3.8% tandem repeats. Among transposable elements (TEs), there were 17.28% DNA transposons, 24.07% retroelements (including LINEs, SINEs, and LTRs), and 3.74% unclassified elements (Table 3B).
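As a quick consistency check, the repeat total and its percentage imply a genome length that can be compared with the reported 1.07 Gb assembly. The sketch below is only arithmetic on the two figures from the text; the function name `implied_total_bp` is hypothetical.

```python
def implied_total_bp(part_bp: int, fraction: float) -> float:
    """Total length implied by a partial length and the fraction it represents."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return part_bp / fraction


# Figures reported in the text.
REPEAT_BP = 473_252_116
REPEAT_FRACTION = 0.4435  # 44.35%

total_bp = implied_total_bp(REPEAT_BP, REPEAT_FRACTION)
print(f"Implied genome length: {total_bp / 1e9:.2f} Gb")
```

The implied length rounds to 1.07 Gb, which agrees with the assembly size given in Table 2.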

Protein-coding gene prediction and functional annotation
To obtain a fully annotated C. altivelis genome, three approaches were combined to predict protein-coding genes: homology-based prediction, ab initio prediction, and transcriptome-based prediction. First, homology-based prediction was performed with TBLASTN (TBLASTN, RRID:SCR_011822) [17] using the protein repertoires of nine common vertebrates, including Branchiostoma floridae (Bfl, GCA_000003815.1), Cynoglossus semilaevis (Cse, GCA_000523025.1), Danio rerio (Dre, GCF_000002035.6), Gasterosteus aculeatus (Gac, GCA_000180675.1), Larimichthys crocea (Lcr, GCA_000972845.1), Oryzias latipes (Ola, GCA_002234675.1), Oreochromis niloticus (Oni, GCF_001858045.1), and Takifugu rubripes (Tru, GCF_000180615.1). The Basic Local Alignment Search Tool (BLAST) hits were then conjoined with the Solar software [18]. GeneWise (GeneWise, RRID:SCR_015054) [19] was then used to predict the exact gene structure of the corresponding genomic region for each BLAST hit. Homology predictions were denoted as “Homology-set”.
Second, to provide further evidence for evaluating the predicted gene models, we assembled 38.67 Gb of RNA-sequencing (RNA-seq) data derived from five different tissues using both de novo and reference-guided approaches. De novo RNA-seq assembly was performed with the Trinity pipeline [20], resulting in 370,688 contigs with an average length of 909 bp (Trinity-set). For the reference-guided approach, short reads were mapped directly to the genome using TopHat (TopHat, RRID:SCR_013035) [21] to identify putative exon regions and splice junctions. Cufflinks (Cufflinks, RRID:SCR_014597) [22] and Cuffmerge were then used to assemble the mapped reads into gene models (Cufflinks-set). The assembled Trinity-set and Cufflinks-set were then aligned against the C. altivelis genome with the Program to Assemble Spliced Alignments (PASA). Valid transcript alignments were clustered based on genome mapping location and assembled into gene structures. Gene models created by PASA [23] were denoted as “Transcripts-set”.
Third, ab initio prediction was performed on the repeat-masked C. altivelis genome using Augustus (Augustus, RRID:SCR_008417) [24], GeneID [25], GeneScan [26], GlimmerHMM (GlimmerHMM, RRID:SCR_002654) [27], and SNAP [28]. Of these, Augustus, SNAP, and GlimmerHMM were trained with the PASA-H-set gene models. Finally, the three sets of predicted gene models were integrated with EvidenceModeler [29]. Weights for each type of evidence were set as follows: Transdecoder > GeneWise = Cufflinks-set > Augustus > GeneID = SNAP = GlimmerHMM = GeneScan. The gene models were further updated with PASA2 to add untranslated regions and alternative splicing information. In total, 27,242 protein-coding genes were obtained, with a mean of 8.7 exons per gene (Table 4). The lengths of genes, coding sequences, introns, and exons in C. altivelis were comparable to those of closely related genomes (Supplementary Table S1).
Gene functions of the protein-coding genes were annotated by searching for functional motifs, domains, and possible biological processes against known databases such as SwissProt [30], Pfam [31], the NR database (from NCBI), Gene Ontology [32], and the Kyoto Encyclopedia of Genes and Genomes [33]. A total of 27,067 protein-coding genes (99.4%) were successfully annotated with at least one functional term (Supplementary Table S2).
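The 99.4% figure relates the annotated genes to the full predicted gene set from the previous paragraph. A minimal sketch of that calculation, using only the two counts given in the text (the function name `percent_annotated` is hypothetical):

```python
def percent_annotated(annotated: int, total: int) -> float:
    """Share of predicted genes carrying at least one functional term, in percent."""
    if total <= 0 or not 0 <= annotated <= total:
        raise ValueError("counts must satisfy 0 <= annotated <= total and total > 0")
    return 100.0 * annotated / total


# Counts reported in the text.
PREDICTED_GENES = 27_242
ANNOTATED_GENES = 27_067

print(f"{percent_annotated(ANNOTATED_GENES, PREDICTED_GENES):.1f}%")
```

This makes explicit that 27,067 is the annotated subset of the 27,242 predicted protein-coding genes.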
Non-coding gene prediction
We also predicted noncoding RNA genes in the C. altivelis genome. rRNA fragments were predicted by searching against a human rRNA database using BLAST with an E-value cutoff of 1E-10. tRNA genes were identified with tRNAscan-SE (tRNAscan-SE, RRID:SCR_010835) [34].
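To illustrate how an E-value cutoff such as 1E-10 is applied, the sketch below filters BLAST hits in the default 12-column tabular layout, where the E-value is the eleventh column. It is an assumption that the hits were stored in this layout and that the cutoff is inclusive; the function name `filter_hits` and the two example rows are invented for illustration.

```python
import csv
import io

# 0-based index of the "evalue" field in the default 12-column tabular layout:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
EVALUE_COLUMN = 10


def filter_hits(tabular_text: str, max_evalue: float = 1e-10) -> list:
    """Keep rows whose E-value is at or below max_evalue."""
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    kept = []
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if float(row[EVALUE_COLUMN]) <= max_evalue:
            kept.append(row)
    return kept


# Two invented hits: one strong, one weak.
example = "\n".join([
    "\t".join(["rRNA_18S", "chr1", "98.5", "200", "3", "0",
               "1", "200", "5000", "5199", "1e-95", "350"]),
    "\t".join(["rRNA_5S", "chr7", "85.0", "60", "9", "0",
               "1", "60", "120", "179", "2e-04", "48.1"]),
])

for hit in filter_hits(example):
    print(hit[0], hit[1], hit[EVALUE_COLUMN])
```

Only the first example row passes the 1E-10 threshold; the second is discarded as too weak.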

