A high-quality genome assembly and annotation of the humpback grouper Cromileptes altivelas

23 Jun 2020 - bioRxiv

Abstract: Cromileptes altivelis, which belongs to the family Serranidae in the order Perciformes, is widely distributed throughout the tropical waters of the Indo-West Pacific region. Owing to its excellent food quality and abundant nutrients, it has become a popular marine food fish with high market value. Here, we report a chromosome-level genome assembly and annotation of the humpback grouper using more than 103X PacBio long reads and high-throughput chromosome conformation capture (Hi-C) technologies. The contig N50 of the assembly is 4.14 Mb; the final assembly is 1.07 Gb with a scaffold N50 of 44.78 Mb, and 99.24% of the scaffold sequences were anchored into 24 chromosomes. The high-quality genome assembly also showed high gene completeness, with 27,067 protein-coding genes and 3,710 ncRNAs. This highly accurate genome assembly and annotation will not only provide an essential genome resource for C. altivelis breeding and restocking, but will also serve as a key resource for studying fish genomics and genetics.

Summary (1 min read)

Data Description Background & Summary

  • The humpback grouper Cromileptes altivelis (order Perciformes, family Serranidae, subfamily Epinephelinae) inhabits the tropical waters of the Indo-West Pacific [1].
  • Obtaining high-quality genomic sequences is the foundation for developing genomic selection to improve the performance of C. altivelis.
  • There are few genome sequences of grouper fish species from other genera.
  • One 10X Genomics linked-read library was constructed and sequenced on the Illumina HiSeq 4000 platform, producing 129.1 Gb of data (117.4X coverage).

Genome size estimation

  • The genome size of C. altivelis was first estimated from the k-mer spectrum with Jellyfish [6] (v2.1.3).
  • Hi-C technology was further used for chromosome construction.
  • Based on high-quality Hi-C data, the authors anchored and oriented the primary scaffolds into 24 chromosomes (Fig. 2), which together covered 99.24% of the whole genome sequence.

Repetitive sequences annotation

  • The authors first used RepeatMasker (RepeatMasker, RRID:SCR_012954) [12] and RepeatProteinMask to search against Repbase.
  • In addition, the authors used Tandem Repeats Finder [13], LTR_FINDER (LTR_FINDER, RRID:SCR_015247) [14], PILER [15], and RepeatScout (RepeatScout, RRID:SCR_014653) [16] with default parameters for further annotation of repetitive elements.
  • Gene models created by PASA23 were denoted as “Transcripts-set”.
  • The lengths of genes, coding sequence, introns, and exons in C. altivelis were comparable to those of closely related genomes (Supplementary Table S1).
  • A total of 27,067 protein-coding genes (99.4%) were successfully annotated with at least one functional term (Supplementary Table S2).

Non-coding gene prediction

  • The authors also predicted noncoding RNA genes in the C. altivelis genome.
  • The rRNA fragments were predicted by searching against the human rRNA database using BLAST with an E-value of 1E-10.
  • The tRNA genes were identified with tRNAscan-SE (tRNAscan-SE, RRID:SCR_010835) [34].
  • The miRNA and snRNA genes were predicted with INFERNAL (INFERNAL, RRID:SCR_011809) [35] using the Rfam database [36].

Genome evolution analysis

  • To trace the evolutionary position of C. altivelis, nucleotide and protein datasets containing 1,082 single-copy genes from the 16 species were used for phylogenetic tree reconstruction and divergence time estimation.
  • There were 1,045 gene families and 1,584 genes in C. altivelis without significant homologous hits to L. crocea, L. oculatus and D. rerio.
  • A Markov chain Monte Carlo analysis was run for 20,000 generations using a burn-in of 1,000 iterations.
  • These phylogenetic analyses indicated that C. altivelis diverged from its common ancestor with G. aculeatus approximately 50.5 million years ago (Fig. 3).

Code availability

  • No specific code was developed in this work.
  • The data analyses were performed according to the manuals and protocols provided by the developers of the corresponding bioinformatics tools that are described in the Methods section.


A chromosome-level genome assembly and annotation of the humpback
grouper Cromileptes altivelas
Yun Sun (a,1), Dongdong Zhang (a,1), Jianzhi Shi (c,1), Guisen Chen (a), Ying Wu (a), Yang Shen (a),
Zhenjie Cao (a), Linlin Zhang (b,*), Yongcan Zhou (a,*)
(a) State Key Laboratory of Marine Resource Utilization in South China Sea, Hainan University,
Haikou 570228, China
(b) Center for Ocean Mega-Science, The Key Laboratory of Experimental Marine Biology, Institute
of Oceanology, Chinese Academy of Sciences, Qingdao, China
(c) Novogene Bioinformatics Institute, Beijing 100083, China
(1) These authors contributed equally to this work.
(*) To whom correspondence should be addressed.
Mailing address: College of Marine Sciences, Hainan University, 58 Renmin Avenue, Haikou 570228, PR China
Phone and Fax: 86-898-66256125
Email: zychnu@163.com (Yongcan Zhou); linlinzhang@qdio.ac.cn (Linlin Zhang)
bioRxiv preprint doi: https://doi.org/10.1101/2020.06.22.164277; this version posted June 23, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Abstract
Cromileptes altivelis, which belongs to the family Serranidae in the order Perciformes, is widely
distributed throughout the tropical waters of the Indo-West Pacific region. Owing to
its excellent food quality and abundant nutrients, it has become a popular marine
food fish with high market value. Here, we report a chromosome-level genome
assembly and annotation of the humpback grouper using more than 103X
PacBio long reads and high-throughput chromosome conformation capture (Hi-C)
technologies. The contig N50 of the assembly is 4.14 Mb; the final
assembly is 1.07 Gb with a scaffold N50 of 44.78 Mb, and 99.24% of the scaffold
sequences were anchored into 24 chromosomes. The high-quality genome assembly
also showed high gene completeness, with 27,067 protein-coding genes and 3,710
ncRNAs. This highly accurate genome assembly and annotation will not only provide
an essential genome resource for C. altivelis breeding and restocking, but will also
serve as a key resource for studying fish genomics and genetics.
Keywords: humpback grouper; genome assembly; evolution; PacBio; Hi-C

Data Description
Background & Summary
The humpback grouper Cromileptes altivelis (order Perciformes, family Serranidae, subfamily
Epinephelinae) inhabits the tropical waters of the Indo-West Pacific [1]. C. altivelis is
increasingly attracting attention as a high-value food fish for its delicious flavor and high
nutritional value, and it also has great ornamental value due to its unique body shape and
beautiful color [1-3] (Fig. 1). However, the wild population of C. altivelis is increasingly
exploited. Meanwhile, C. altivelis farming is limited by its slow growth, low survival rate, and
various pathogenic diseases [4-5]. Obtaining high-quality genomic sequences is the foundation for
developing genomic selection to improve the performance of C. altivelis. Genome information is
also critical for exploring the genetic mechanisms of its unique traits, immune system and
evolutionary adaptation. Recently, genome sequences of seven grouper species have become
available. Most of these species belong to the genus Epinephelus; few genome sequences are
available for groupers from other genera. The humpback grouper is the only species of the genus
Cromileptes.
Here, combining PacBio long-read sequencing and high-throughput chromosome conformation capture
(Hi-C) technologies, we sequenced the genome of the humpback grouper C. altivelis, with an
estimated size of 1.07 Gb. The scaffold N50 of the final genome assembly reached 44.78 Mb, and
99.24% of the scaffold sequences were anchored into 24 chromosomes. Based on the high-quality
assembly, we annotated the protein-coding genes and ncRNAs. The high-quality genome assembly and
annotation will not only provide an essential genome resource for exploring the economic value of
C. altivelis breeding and restocking, but will also serve as a key resource for studying fish
genomics and genetics.
Methods
Sample collection, library construction and sequencing
We sampled a single female C. altivelis from Hainan, China, for genome sequencing (Fig. 1). Total
genomic DNA was extracted from muscle tissue using SDS lysis and magnetic bead isolation.
We applied a strategy combining four technologies for library construction and sequencing: the
PacBio Sequel System (for genome assembly), the Illumina HiSeq 4000 System (for genome survey),
10X Genomics linked-reads (for scaffold construction), and Hi-C optical maps (for chromosome
construction). First, two paired-end Illumina sequencing libraries were constructed with an insert
size of 350 bp, and sequencing was carried out on the Illumina HiSeq 4000 platform. A total of
79.18 Gb (71.98X coverage) of paired-end 150 bp reads were produced. Raw sequence data generated
by the Illumina platform were filtered by removing reads that contained adapters, reads with more
than 10% N bases, and reads in which low-quality bases (quality ≤ 5) made up more than 50% of the
read. Second, a total of 113.49 Gb of polymerase read data were generated using the PacBio Sequel
platform, and a total of 106.3 Gb (103X coverage) of subreads were obtained after removing
adaptors and filtering with the default parameters. The average and N50 lengths of the subreads
reached 8.04 kb and 13.26 kb, respectively. Third, one 10X Genomics linked-read library was
constructed and sequenced on the Illumina HiSeq 4000 platform, which produced 129.1 Gb (117.4X
coverage). Finally, an optical map was also constructed from Hi-C, for which 119.2 Gb (108.4X
coverage) of data were generated. All sequence data are summarized in Table 1.
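The three Illumina read-filtering criteria described above can be sketched as a single predicate. This is a minimal illustration and not the authors' pipeline: the function name `passes_filter`, the adapter string, and the reading of the low-quality threshold as a quality score of 5 or below are all assumptions made for the example.

```python
from typing import Optional, Sequence


def passes_filter(
    seq: str,
    quals: Sequence[int],
    adapter: Optional[str] = None,
    max_n_frac: float = 0.10,
    low_q: int = 5,
    max_low_q_frac: float = 0.50,
) -> bool:
    """Return True if a read survives the three filters (hypothetical helper)."""
    if not seq or len(seq) != len(quals):
        return False
    # 1) Discard reads that contain the adapter sequence.
    if adapter and adapter in seq:
        return False
    # 2) Discard reads with more than 10% N bases.
    if seq.upper().count("N") / len(seq) > max_n_frac:
        return False
    # 3) Discard reads in which more than 50% of bases are low quality.
    low_quality = sum(1 for q in quals if q <= low_q)
    return low_quality / len(quals) <= max_low_q_frac


# Toy 40 bp reads (made-up data, not from the study).
ADAPTER = "AGATCGGAAGAGC"  # illustrative adapter prefix, an assumption
good = passes_filter("ACGT" * 10, [30] * 40, adapter=ADAPTER)
too_many_n = passes_filter("N" * 5 + "A" * 35, [30] * 40, adapter=ADAPTER)
low_qual = passes_filter("ACGT" * 10, [2] * 21 + [30] * 19, adapter=ADAPTER)
print(good, too_many_n, low_qual)
```

The text does not say how read pairs are handled; a common choice is to drop a pair when either mate fails, which would mean calling this predicate on both mates.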

Genome size estimation
The genome size of C. altivelis was first estimated from the k-mer spectrum with Jellyfish [6]
(v2.1.3). The distribution of 17-mers showed a major peak at a depth of 57 (Figure S1). Based on
the total number of k-mers (63,765,804,944) and the corresponding k-mer depth of 57, the
C. altivelis genome size was estimated to be 1118.70 Mb using the formula:
Genome size = kmer_Number / Peak_Depth. The revised genome size was 1104.81 Mb, the genome
heterozygosity was 0.16%, and the repeat content was 46.38%.
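The k-mer formula above can be checked directly with the numbers reported in the text. The snippet below is only a worked example of that one division; the function name is hypothetical, and it does not reproduce the subsequent revision to 1104.81 Mb, whose procedure is not described here.

```python
def genome_size_from_kmers(kmer_number: int, peak_depth: int) -> float:
    """Genome size (bp) = total k-mer count / depth of the main k-mer peak."""
    if peak_depth <= 0:
        raise ValueError("peak depth must be positive")
    return kmer_number / peak_depth


# Values reported for the 17-mer spectrum.
KMER_NUMBER = 63_765_804_944
PEAK_DEPTH = 57

size_bp = genome_size_from_kmers(KMER_NUMBER, PEAK_DEPTH)
size_mb = size_bp / 1e6
print(f"Estimated genome size: {size_mb:.2f} Mb")  # Estimated genome size: 1118.70 Mb
```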
De novo assembly of the C. altivelis genome
The contig assembly of the C. altivelis genome was carried out using the FALCON assembler [7],
followed by two rounds of polishing with Quiver [8]. FALCON implements a hierarchical assembly
process that includes the following steps: (1) subread error correction, in which all reads are
aligned to each other using daligner [9] and the overlap data are processed to generate
error-corrected consensus reads (after error correction, we obtained 28 Gb, or 35X coverage, of
error-corrected reads); (2) a second round of overlap detection using the error-corrected reads;
(3) construction of a directed string graph from the overlap data; and (4) resolution of contig
paths from the string graph. After FALCON assembly, the genome was polished with Quiver. The
initial assembly of the PacBio data resulted in a contig N50 (the minimum length of contigs
accounting for half of the haploid genome size) of 4.14 Mb. The PacBio contigs were then
scaffolded using optical map data, and the resulting scaffolds were further connected into
super-scaffolds with 10X Genomics linked-read data using the fragScaff software [10]. Finally, we
used Illumina-derived short reads to correct any remaining errors with Pilon [11]. The final
genome assembly of C. altivelis had a total length of 1.07 Gb, a contig N50 of 4.14 Mb, and a
scaffold N50 of 44.78 Mb (Table 2).
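The N50 statistic quoted above can be computed from a list of contig or scaffold lengths as follows. This is a generic sketch on toy data, not the script used for Table 2. By default it uses half of the total assembled length as the target, which is the usual N50 convention; the optional `reference_size` argument (a name chosen for this example) instead uses half of an estimated genome size, matching the wording of the definition in the text.

```python
from typing import Iterable, Optional


def n50(lengths: Iterable[int], reference_size: Optional[int] = None) -> int:
    """Length of the shortest sequence needed to reach half of the target size.

    Sequences are taken from longest to shortest. The target is the summed
    length of all sequences unless reference_size is given.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    total = reference_size if reference_size is not None else sum(ordered)
    target = total / 2
    running = 0
    for length in ordered:
        running += length
        if running >= target:
            return length
    raise ValueError("sequences do not cover half of the reference size")


# Toy contig lengths (arbitrary units, made-up data).
toy_contigs = [6, 5, 4, 3, 2]
print(n50(toy_contigs))                     # 5: 6 + 5 = 11 reaches half of 20
print(n50(toy_contigs, reference_size=30))  # 4: 6 + 5 + 4 = 15 reaches half of 30
```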
Hi-C technology was then used for chromosome construction. We performed quality control of the
raw Hi-C data using HiCUP (version 3.0). We then aligned the reads to the draft assembly with
Bowtie2 (version 2.2.2) and filtered out low-quality reads to build raw intrachromosomal contact
maps. Based on the high-quality Hi-C data, we anchored and oriented the primary scaffolds into 24
chromosomes (Fig. 2), which together covered 99.24% of the whole genome sequence.
Repetitive sequences annotation
The repetitive elements in the C. altivelis genome were identified by a combination of
evidence-based and ab initio approaches. We first used RepeatMasker (RepeatMasker,
RRID:SCR_012954) [12] and RepeatProteinMask to search against Repbase. We then constructed a de
novo repetitive element library using RepeatModeler and used this library for a second round of
searching with RepeatMasker. In addition, we used Tandem Repeats Finder [13], LTR_FINDER
(LTR_FINDER, RRID:SCR_015247) [14], PILER [15], and RepeatScout (RepeatScout, RRID:SCR_014653)
[16] with default parameters for further annotation of repetitive elements. Overall, we found
473,252,116 bp of repeat sequences, accounting for 44.35% of the C. altivelis genome (Table 3A),
including 3.8% tandem repeats. Among transposable elements (TEs), there were 17.28% DNA
transposons, 24.07% retroelements (including LINEs, SINEs and LTRs), and 3.74% unclassified
elements (Table 3B).
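As a quick consistency check on the repeat totals above, the reported repeat length and repeat percentage can be combined to recover the implied assembly length. This is only arithmetic on the two figures given in the text; the variable names are illustrative.

```python
# Figures reported for the repeat annotation.
REPEAT_BP = 473_252_116
REPEAT_FRACTION = 0.4435  # 44.35% of the genome

# Assembly length implied by those two numbers.
implied_assembly_bp = REPEAT_BP / REPEAT_FRACTION
print(f"Implied assembly length: {implied_assembly_bp / 1e9:.2f} Gb")  # 1.07 Gb
```

The result agrees with the 1.07 Gb total assembly length reported for the final assembly.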

Protein-coding gene prediction and functional annotation
To obtain a fully annotated C. altivelis genome, three approaches were combined to predict
protein-coding genes: homology-based prediction, ab initio prediction, and transcriptome-based
prediction. First, homology-based prediction was performed with TBLASTN (TBLASTN,
RRID:SCR_011822) [17] using protein repertoires of nine common vertebrates including
Branchiostoma floridae (Bfl, GCA_000003815.1), Cynoglossus semilaevis (Cse, GCA_000523025.1),
Danio rerio (Dre, GCF_000002035.6), Gasterosteus aculeatus (Gac, GCA_000180675.1), Larimichthys
crocea (Lcr, GCA_000972845.1), Oryzias latipes (Ola, GCA_002234675.1), Oreochromis niloticus
(Oni, GCF_001858045.1), and Takifugu rubripes (Tru, GCF_000180615.1). The Basic Local Alignment
Search Tool (BLAST) hits were then conjoined with the Solar software [18]. GeneWise (GeneWise,
RRID:SCR_015054) [19] was then used to predict the exact gene structure of the corresponding
genomic region for each BLAST hit. Homology predictions were denoted as “Homology-set”.
Second, to provide further evidence for evaluating the predicted gene models, we assembled
38.67 Gb of RNA-sequencing (RNA-seq) data derived from five different tissues by both de novo and
reference-guided approaches. De novo RNA-seq assembly was performed with the Trinity pipeline
[20], resulting in 370,688 contigs with an average length of 909 bp (Trinity-set). For the
reference-guided approach, short reads were mapped directly to the genome using Tophat (Tophat,
RRID:SCR_013035) [21] to identify putative exon regions and splice junctions. Cufflinks
(Cufflinks, RRID:SCR_014597) [22] and cuffmerge were then used to assemble the mapped reads into
gene models (Cufflinks-set). The assembled Trinity-set and Cufflinks-set were then aligned
against the C. altivelis genome with the Program to Assemble Spliced Alignments (PASA). Valid
transcript alignments were clustered based on genome mapping location and assembled into gene
structures. Gene models created by PASA [23] were denoted as “Transcripts-set”.
Third, ab initio prediction was performed on the repeat-masked C. altivelis genome using Augustus
(Augustus, RRID:SCR_008417) [24], GeneID [25], GeneScan [26], GlimmerHMM (GlimmerHMM,
RRID:SCR_002654) [27] and SNAP [28]. Of these, Augustus, SNAP, and GlimmerHMM were trained on
PASA-H-set gene models. Finally, the three sets of predicted gene models were integrated with
EvidenceModeler [29]. Weights for each type of evidence were set as follows: Transdecoder >
GeneWise = Cufflinks-set > Augustus > GeneID = SNAP = GlimmerHMM = GeneScan. The gene models were
further updated by PASA2 to generate untranslated regions and alternative splicing variation
information. In total, 27,242 protein-coding genes were obtained, with a mean of 8.7 exons per
gene (Table 4). The lengths of genes, coding sequences, introns, and exons in C. altivelis were
comparable to those of closely related genomes (Supplementary Table S1).
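The evidence ordering given above can be written down as an EVidenceModeler-style weights file, which is a tab-delimited table of evidence class, source label, and weight. The sketch below is illustrative only: the text reports the relative ordering but not the numeric weights, so the values here are hypothetical, and the source labels are placeholders that would have to match the source column of the actual GFF3 evidence files.

```python
# (evidence class, source label, weight)
# Numeric weights are hypothetical; only their ordering follows the text.
WEIGHTS = [
    ("OTHER_PREDICTION", "transdecoder", 10),
    ("PROTEIN", "GeneWise", 5),
    ("TRANSCRIPT", "Cufflinks", 5),
    ("ABINITIO_PREDICTION", "Augustus", 2),
    ("ABINITIO_PREDICTION", "GeneID", 1),
    ("ABINITIO_PREDICTION", "SNAP", 1),
    ("ABINITIO_PREDICTION", "GlimmerHMM", 1),
    ("ABINITIO_PREDICTION", "GeneScan", 1),
]


def render_weights(rows) -> str:
    """Render rows as tab-delimited lines, one evidence source per line."""
    return "".join(f"{cls}\t{source}\t{weight}\n" for cls, source, weight in rows)


weights_text = render_weights(WEIGHTS)
by_source = {source: weight for _, source, weight in WEIGHTS}
print(weights_text, end="")
```

Keeping the weights in one table makes it easy to confirm that any numeric choice still respects the stated ordering before the file is written out.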
Gene functions of the protein-coding genes were annotated by searching for functional motifs,
domains, and possible biological processes against known databases such as SwissProt [30], Pfam
[31], the NR database (from NCBI), Gene Ontology [32], and the Kyoto Encyclopedia of Genes and
Genomes [33]. A total of 27,067 protein-coding genes (99.4%) were successfully annotated with at
least one functional term (Supplementary Table S2).
Non-coding gene prediction
We also predicted noncoding RNA genes in the C. altivelis genome. The rRNA fragments were
predicted by searching against the human rRNA database using BLAST with an E-value of 1E-10.
The tRNA genes were identified with tRNAscan-SE (tRNAscan-SE, RRID:SCR_010835) [34].
