Draft genome of the Northern snakehead, Channa argus

doi:10.1093/GIGASCIENCE/GIX011

Giga Science, 6, 2017, 1–5

doi: 10.1093/gigascience/gix011

Advance Access Publication Date: 2 March 2017

Data Note

DATA NOTE

Draft genome of the Northern snakehead, Channa

argus

Jian Xu

1,†

, Chao Bian

2,3,4,†

, Kunci Chen

5,†

, Guiming Liu

6

, Yanliang Jiang

1

,

Qing Luo

5

, Xinxin You

2,3

, Wenzhu Peng

1,7

,JiaLi

3

,YuHuang

3

, Yunhai Yi

3

,

Chuanju Dong

1,8

,HuaDeng

9

, Songhao Zhang

1

, Hanyuan Zhang

1

,

Qiong Shi

2,3,10,∗

and Peng Xu

1,7,∗

1

Key Laboratory of Aquatic Genomics, Ministry of Agriculture, CAFS Key Laboratory of Aquatic Genomics and

Beijing Key Laboratory of Fishery Biotechnology, Chinese Academy of Fishery Sciences, Fengtai, Beijing,

100141, China,

2

BGI Research Center for Aquatic Genomics, Chinese Academy of Fishery Sciences, Shenzhen,

Guangdong, 518083, China,

3

Shenzhen Key Lab of Marine Genomics, Guangdong Provincial Key Lab of

Molecular Breeding in Marine Economic Animals, BGI, Shenzhen, Guangdong, 518083, China,

4

Centre of

Reproduction, Development and Aging, Faculty of Health Sciences, University of Macau, Taipa, Macau, China,

5

Pearl River Fisheries Research Institute, Chinese Academy of Fishery Sciences, Guangzhou, Guangdong,

510380, China,

6

CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics,

Chinese Academy of Sciences, Chaoyang, Beijing, 100029, China,

7

Fujian Collaborative Innovation Center for

Exploitation and Utilization of Marine Biological Resources, Xiamen University, Xiamen, Fujian, 361102, China,

8

College of Fishery, Henan Normal University, Xinxiang, Henan, 453007, China,

9

Research Institute of Forestry

Policy and Information,Chinese Academy of Forestry, Haidian, Beijing, 100091, China and

10

Laboratory of

Aquatic Genomics, College of Ecology and Evolution, School of Life Sciences, Sun Yat-Sen University,

Guangzhou, Guangdong, 510275, China

∗

Correspondence address. Qiong Shi: BGI Research Center for Aquatic Genomics, Chinese Academy of Fishery Sciences, Shenzhen, Guangdong, 518083,

China. E-mail: shiqiong@genomics.cn; Peng Xu: Key Laboratory of Aquatic Genomics, Ministry of Agriculture, CAFS Key Laboratory of Aquatic Genomics

and Beijing Key Laboratory of Fishery Biotechnology, Chinese Academy of Fishery Sciences, Fengtai, Beijing, 100141, China.

E-mail: xupeng77@xmu.edu.cn

†

Contributed equally to this work.

Abstract

Background: The Northern snakehead (Channa argus), a member of the Channidae family of the Perciformes, is an

economically important freshwater sh native to East Asia. In North America, it has become notorious as an intentionally

released invasive species. Its ability to breathe air with gills and migrate short distances over land makes it a good model

for bimodal breath research. Therefore, recent research has focused on the identication of relevant candidate genes. Here,

we performed whole genome sequencing of C. argus to construct its draft genome, aiming to offer useful information for

Received: 17 August 2016; Revised: 5 January 2017; Accepted: 25 February 2017

C



The Author 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons

Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium,

provided the original work is properly cited.

1

Downloaded from https://academic.oup.com/gigascience/article/6/4/gix011/3059652 by guest on 16 August 2022

2 Xu et al.

further functional studies and identication of target genes related to its unusual facultative air breathing. Findings: We

assembled the C. argus genome with a total of 140.3 Gb of raw reads, which were sequenced using the Illumina HiSeq2000

platform. The nal draft genome assembly was approximately 615.3 Mb, with a contig N50 of 81.4 kb and scaffold N50 of 4.5

Mb. The identied repeat sequences account for 18.9% of the whole genome. The 19 877 protein-coding genes were

predicted from the genome assembly, with an average of 10.5 exons per gene. Conclusion: We generated a high-quality

draft genome of C. argus, which will provide a valuable genetic resource for further biomedical investigations of this

economically important teleost sh.

Keywords: Channa argus; genome assembly; annotation; gene prediction

Data description

Introduction of C. argus

The Northern snakehead (Channa argus) is a special snakehead

sh cultivated mainly in Asia and Africa for food, especially in

China with an annual production of about 510 000 tons (worth

∼1.6 billion US dollars) (Fig. 1). Genetic degradation caused by in-

breeding of C. argus cultivation has led to higher susceptibility to

diseases. Furthermore, C. argus is considered a serious invasive

species in North America, due to its wide-range diet, parental

care, and rapid colonization and expansion [1]. C. argus has a

specialized aerial breathing organ, the suprabranchial chamber,

which facilitates its aquatic–aerial bimodal breathing. Because

of its aggressive status in ecosystem of rivers, lakes, and ponds,

and little consumption of the C. argus in America for food, this

leads to threats to the balance of ecosystems. For both economic

and ecological consideration, it is vital to develop genomic

resources for further genetic breeding studies or ecological re-

search. So far, the genome sequence of C. argus has not been

reported, and hence in our current study we performed genome

sequencing, assembly, and annotation of this teleost species.

C. argus genome sequencing on the Illumina platform

Genomic DNA was extracted from blood sample of a single fe-

male C. argus (Fishbase ID: 4799) using Qiagen GenomicTip100

(Qiagen). The sh was obtained from the Pearl River Fish-

eries Research Institute, Chinese Academy of Fishery Sciences,

Guangzhou, China. A whole-genome shotgun sequencing strat-

egy was applied, and short-insert libraries (180, 500, and 800

bp) and long-insert libraries (3 and 5 kb) were constructed us-

ing the standard protocol provided by Illumina (San Diego, CA,

USA). Paired-end sequencing with a 2 × 100-bp read length was

Figure: 1: the Northern snakehead sh, Channa argus.

performed on the short-insert and long-insert libraries using the

Illumina HiSeq2000 platform. In total, we generated about 140.3

Gb of raw reads, including 33.0, 36.9, 17.4, 26.5, and 26.5 Gb of

reads from the 180-, 500-, 800-, 3-, and 5-kb libraries. After re-

moval of low-quality and redundant reads, we obtained about

138.2 Gb of clean data for further de novo assembling of the C.

argus genome.

Estimation of C. argus genome size and sequencing

coverage

All the cleaned reads were subjected to 17-mer frequency dis-

tribution analysis [2]. As the total number of k-mers was about

5.90 × 10

10

and the peak of k-mers at a depth of 88, the genome

size of C. argus was calculated to be 670.4 Mb using the following

formula: genome size = k-mer

number / peak depth. Therefore,

the sequencing coverage was found to be ∼124.5 × based on the

estimated genome size.

De novo genome assembly and quality assessment

For whole genome assembly, SOAPdenovo2 [3] was used with

optimized parameters (-K 75) to construct contigs and original

scaffolds by using the reads from short-insert libraries. All reads

were then mapped onto contigs for scaffold construction by uti-

lizing the paired-end information of long-insert libraries. Some

intra-scaffold gaps were lled by local software using read-pairs

in which one end uniquely mapped to a contig and the other

end was located within a gap. Finally, a draft C. argus genome of

615.3 Mb was assembled, with a contig N50 size of 81.4 kb and a

scaffold N50 size of 4.5 Mb (Table 1).

Subsequently, the Core Eukaryotic Genes Mapping Approach

software [4] (version 2.3) with 248 conserved Core Eukaryotic

Downloaded from https://academic.oup.com/gigascience/article/6/4/gix011/3059652 by guest on 16 August 2022

Draft genome of the Northern snakehead 3

Table 1: summary of the Channa argus genome assembly and

annotation

Genome assembly

Contig N50 size (kb) 81

Contig number (>100 bp) 29 146

Scaffold N50 size (Mb) 4.5

Scaffold number (>100 bp) 5297

Total length (Mb) 615.3

Genome coverage (X) 224.6

The longest scaffold (bp) 18 736 006

Genome annotation

Protein-coding gene number 19 877

Mean transcript length (kb) 16.5

Mean exons per gene 10.5

Mean exon length (bp) 175.0

Mean intron length (bp) 1537.3

Genes was utilized to evaluate completeness of genes. Our

results demonstrated that the generated genome assembly

covered 242 of the 248 Core Eukaryotic Gene sequences, sug-

gesting a high level of completeness within the genome assem-

bly. Alongside this, we also used BUSCO (version 1.22) [5](the

representative vertebrate gene set containing 3023 single-copy

genes that are highly conserved in vertebrates) software to as-

sess the quality of the generated genome assembly. The assess-

ment demonstrated that the BUSCO value is 82.9%, containing

C: 66% [D: 1.4%], F: 16%, M: 17%, n: 3023 (C: complete [D: dupli-

cated], F: fragmented, M: missed, n: genes), suggesting a high

quality of the generated assembly.

Repeat sequence within the C. argus genome assembly

To analyze the C. argus genome, we employed Tandem Repeats

Finder [6] (version 4.04) with core parameters set as “Match = 2,

Mismatch = 7, Delta = 7, PM = 80, PI = 10, Minscore = 50, and

MaxPerid = 2000” to identify tandem repeats. Simultaneously,

RepeatModeler (version 1.04) and LTR

FINDER [7] were utilized

to construct a de novo repeat library with default parameters.

Subsequently, we used RepeatMasker [8] (version 3.2.9) to map

our assembled sequences on the Repbase TE (version 14.04) [9]

and the de novo repeat libraries to search for known and novel

transposable elements (TEs). In addition, the TE-related proteins

were annotated by using RepeatProteinMask software [8](ver-

sion 3.2.2). In summary, the total identied repeat sequences

accounted for 18.94% of the C. argus genome (Table 2). Among

them, long interspersed nuclear elements were the most abun-

dant type of repeat sequences and occupy 8.92% of the whole

genome.

Gene annotation

Gene annotation of the C. argus genome was conducted using

several approaches, including transcriptome-based prediction,

de novo prediction, and homology-based prediction. RNA-seq

datasets of pooled 13 tissues were obtained from our previous

work [10]. We mapped these RNA reads onto our genome assem-

bly using TopHat1.2 software [11], and then we employed Cuf-

inks (version 2.2.1) [12] to predict the gene structures. Further-

more, we performed Augustus (version 2.5.5) [13], GlimmerHMM

(version 3.0.1) [14], and GenScan (version 1.0) [15] softwares for

de novo prediction on the repeat-masked C. argus genome assem-

bly. The protein sequences of zebrash (Danio rerio)[16], Japanese

puffer (Fugu rubripes)[17], medaka (Oryzias latipes)[18], spotted

green puffersh (Tetraodon nigroviridis)[19] (the above 5 species

were downloaded from Ensembl release 75), blue spotted mud-

skipper (Boleophthalmus boddarti)[20], and golden arowana (Scle-

ropages formosus)[21] were mapped on the C. argus genome us-

ing TblastN with e-value ≤ 1e-5. Subsequently, Genewise2.2.0

software [22] was employed to predict the potential gene struc-

tures on all alignments. Finally, the above three datasets were

integrated to yield a comprehensive and nonredundant gene set

using GLEAN (https://sourceforge.net/projects/glean-gene/)[23]

with several lter steps (removing partial sequences or genes

shorter than 150 bp or prematurely terminated/frame-shifted

genes). The nal total gene set was composed of 19 877 genes,

with an average of 10.5 exons per gene (Table 1).

Construction of gene families and phylogenetic tree

We downloaded the protein sequences of zebrash [17],

Japanese puffer [18], stickleback (Gasterosteus aculeatus)[24],

spotted green puffersh [20], and medaka [19] from the Ensembl

Core database (release 75), and we also obtained the protein se-

quences of Asian seabass (Lates calcarifer)[25], blue spotted mud-

skipper [21], and golden arowana [22] from their correspond-

ing ftp websites, respectively. The consensus proteome set of

the above eight species and snakehead sh was ltered to re-

move those protein sequences <50 amino acids and resulted

in a dataset of 190 566 protein sequences, which was used as

the input le for OrthoMCL [26] to construct gene families. A to-

tal of 17 954 OrthoMCL families were built utilizing an effective

database size of 190 566 sequences for all-to-all BLASTP strategy

with an E-value of 1e-5 and a Markov Chain Clustering default

ination parameter. We further identied 24 gene families that

were specic in the snakehead sh (Fig. 2a).

Subsequently, we selected 1918 single-copy (only one gene

from each species) orthogroups from the above-mentioned 9

teleost species. We used MUSCLE (version 3.8.31) [27] to align the

Table 2: the detailed classication of repeat sequences of Channa argus

Repbase TEs TE protiens De novo Combined TEs

Type Length (bp) % in genome Length (bp) % in genome Length (bp) % in genome Length (bp) % in genome

DNA 17 984 515 2.92 6 784 728 1.10 25 663 752 4.17 35 435 946 5.76

LINE 16 799 343 2.73 17 563 763 2.85 54 890 557 8.92 60 651 866 9.86

SINE 4 512 385 0.73 0 0 6 672 552 1.08 9 026 285 1.47

LTR 4 421 728 0.72 3 031 607 0.49 24 144 657 3.92 26 983 318 4.39

Other 8125 0.001 0 0 0 0 8125 0.001

Unknown 0 0 0 0 9 413 375 1.53 9 413 375 1.53

Total 41 585 442 6.76 27 363 267 4.45 103 162 115 16.77 116 545 270 18.94

Downloaded from https://academic.oup.com/gigascience/article/6/4/gix011/3059652 by guest on 16 August 2022

4 Xu et al.

Figure: 2: genome evolution. (a) Orthologous gene families across ve sh genomes (Snakehead sh, Zebrash, Asian seabass, Mudskipper, and Arowana). (b) Phylogeny

of ray-nned shes (the arowana as the outgroup species).

protein sequences from the 1918 orthogroups, respectively. We

also converted protein alignments to their corresponding cod-

ing DNA sequence alignments using an in-house perl script. All

the translated coding DNA sequence sequences were then com-

bined into one “supergene” for each species. Nondegenerated

sites (4D) extracted from the supergenes were then joined into

new sequence of each species to construct a phylogenetic tree

(Fig. 2b) using MrBayes [28] (Version 3.2, with the GTR+gamma

model).

Conclusion

We report the rst whole genome sequencing, assembly, and

annotation of the Northern snakehead (Channa argus). The -

nal draft genome assembly is approximately 615.3 Mb, account-

ing for 91.8% of the estimated genome size (670.4 Mb). We

also predicted 19 877 protein-coding genes from the generated

assembly.

The draft genome assembly will be valuable resource for

genetic breeding, environmental DNA detection of invasive

species, and biological studies on this economically important

teleost sh. Based on these genomic data, researchers will be

able to develop genetic markers for further quantitative trait

locus and genome-wide association studies on growth traits.

These markers will also be very useful for DNA barcoding in

screening invasive C. argus for ecological protection.

Availability of supporting data

The raw sequencing reads of all libraries have been deposited at

NCBI (SRP078899). Further supporting data are available in the

GigaScience database, GigaDB [29].

Downloaded from https://academic.oup.com/gigascience/article/6/4/gix011/3059652 by guest on 16 August 2022

Draft genome of the Northern snakehead 5

Abbreviation

TE: transposable element.

Author contributions

PX designed the study. JX, CB, GL, JL, HD, YH, YX, and QS as-

sembled and annotated the genome. CB and YY performed the

evolution analysis. JX, YJ, XY, QL, and HZ analyzed the data. WP,

CD, SZ, and KC collected the sample and prepared the quality

control. JX, CB, QS, and PX wrote the manuscript. QS and PX par-

ticipated in discussions and provided advice. All authors read

and approved the nal manuscript.

Acknowledgements

This work was supported by Central Public-interest Scientic In-

stitution Basal Research Fund, CAFS (No. 2015C005, No. 2016HY-

JC0301), the National Natural Science Foundation of China (No.

31422057, No.31402291), the National Infrastructure of Fishery

Germplasm Resources of China (No. 2017DKA30470), Special

Project on the Integration of Industry, Education and Research

of Guangdong Province (No. 2013B090800017), Quality Inspec-

tion Programs of Scientic Research Project (No. 2015IK246), and

Shenzhen Special Program for Future Industrial Development

(No. JSGG20141020113728803).

Competing interests

The authors declare that they have no competing interests.

References

1. Jiang Y, Feng S, Xu J et al. Comparative transcriptome anal-

ysis between aquatic and aerial breathing organs of Channa

argus to reveal the genetic basis underlying bimodal respira-

tion. Mar Genomics 29:89–96.

2. Marcais G, Kingsford C. A fast, lock-free approach for ef-

cient parallel counting of occurrences of k-mers. Bioinfor-

matics 2011;27(6):764–70.

3. Luo R, Liu B, Xie Y et al. SOAPdenovo2: an empirically

improved memory-efcient short-read de novo assembler.

GigaScience 2012;1(1):18.

4. Parra G, Bradnam K, Korf I. CEGMA: a pipeline to accurately

annotate core genes in eukaryotic genomes. Bioinformatics

2007;23(9):1061–67.

5. Simao FA, Waterhouse RM, Ioannidis P et al. BUSCO:

assessing genome assembly and annotation complete-

ness with single-copy orthologs. Bioinformatics 2015;31(19):

3210–12.

6. Benson G. Tandem repeats nder: a program to analyze DNA

sequences. Nucleic acids research 1999;27(2):573–80.

7. Xu Z, Wang H. LTR˙FINDER: an efcient tool for the predic-

tion of full-length LTR retrotransposons. Nucleic Acids Res

2007;35(Web Server issue):W265–68.

8. Tarailo-Graovac M, Chen N. Using RepeatMasker to identify

repetitive elements in genomic sequences. In: Editoral board,

Baxevanis Andreas D et al. (eds.), Current Protocols in Bioin-

formatics 2009, Chapter 4:Unit 4 10.

9. Jurka J, Kapitonov VV, Pavlicek A et al. Repbase Update,

a database of eukaryotic repetitive elements. Cytogenetic

Genome Res 2005;110(1–4):462–67.

10. Jiang Y, Feng S, Xu J et al. Comparative transcrip-

tome analysis between aquatic and aerial breathing or-

gans of Channa argus to reveal the genetic basis un-

derlying bimodal respiration. Mar Genomics 2016: DOI:

10.1016/j.margen.2016.1006.1002.

11. Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice

junctions with RNA-Seq. Bioinformatics 2009;25(9):1105–11.

12. Trapnell C, Williams BA, Pertea G et al. Transcript assem-

bly and quantication by RNA-Seq reveals unannotated tran-

scripts and isoform switching during cell differentiation. Nat

Biotechnol 2010;28(5):511–15.

13. Stanke M, Steinkamp R, Waack S et al. AUGUSTUS: a web

server for gene nding in eukaryotes. Nucleic Acids Res

2004;32(Web Server issue):W309–12.

14. Majoros WH, Pertea M, Salzberg SL. TigrScan and Glim-

merHMM: two open source ab initio eukaryotic gene-nders.

Bioinformatics 2004;20(16):2878–79.

15. Cai Y, Gonzalez JV, Liu Z et al. Computational systems

biology methods in molecular biology, chemistry biology,

molecular biomedicine, and biopharmacy. BioMed Res Int

2014;2014:746814.

16. Howe K, Clark MD, Torroja CF et al. The zebrash refer-

ence genome sequence and its relationship to the human

genome. Nature 2013;496(7446):498–503.

17. Aparicio S, Chapman J, Stupka E et al. Whole-genome shot-

gun assembly and analysis of the genome of Fugu rubripes.

Science 2002;297(5585):1301–10.

18. Kasahara M, Naruse K, Sasaki S et al. The medaka draft

genome and insights into vertebrate genome evolution.

Nature 2007;447(7145):714–19.

19. Jaillon O, Aury JM, Brunet F et al. Genome duplication in the

teleost sh Tetraodon nigroviridis reveals the early verte-

brate proto-karyotype. Nature 2004;431(7011):946–57.

20. You X, Bian C, Zan Q et al. Mudskipper genomes provide in-

sights into the terrestrial adaptation of amphibious shes.

Nat Commun 2014;5:5594.

21. Bian C, Hu Y, Ravi V et al. The Asian arowana (Scleropages

formosus) genome provides new insights into the evolution

of an early lineage of teleosts. Sci Rep 2016;6:24501.

22. Birney E, Clamp M, Durbin R. GeneWise and Genomewise.

Genome Res 2004;14(5):988–95.

23. Elsik CG, Mackey AJ, Reese JT et al. Creating a honey bee con-

sensus gene set. Genome Biol 2007;8(1):R13.

24. Jones FC, Grabherr MG, Chan YF et al. The genomic ba-

sis of adaptive evolution in threespine sticklebacks. Nature

2012;484(7392):55–61.

25. Vij S, Kuhl H, Kuznetsova IS et al. Chromosomal-level

assembly of the Asian seabass genome using long se-

quence reads and multi-layered scaffolding. PLoS genetics

2016;12(4):e1005954.

26. Li L, Stoeckert CJ, Jr, Roos DS. OrthoMCL: identication

of ortholog groups for eukaryotic genomes. Genome Res

2003;13(9):2178–89.

27. Edgar RC. MUSCLE: multiple sequence alignment with

high accuracy and high throughput. Nucleic Acids Res

2004;32(5):1792–97.

28. Ronquist F, Teslenko M, van der Mark P et al. MrBayes 3.2:

efcient Bayesian phylogenetic inference and model choice

across a large model space. Systematic Biol 2012;61(3):539–

42.

29. Xu J, Bian C, Chen K et al. Supporting data for the draft

genome of the Northern snakehead, Channa argus. Giga-

Science Database. 2017; http://dx.doi.org/10.5524/100279.

Downloaded from https://academic.oup.com/gigascience/article/6/4/gix011/3059652 by guest on 16 August 2022

Draft genome of the Northern snakehead, Channa argus

Citations

Research advances in the genomics and applications for molecular breeding of aquaculture animals

Effects of single or conjoint administration of lactic acid bacteria as potential probiotics on growth, immune response and disease resistance of snakehead fish (Channa argus).

An NGS-based approach for the identification of sex-specific markers in snakehead (Channa argus).

Transcriptome analysis of immune-related gene expression in hybrid snakehead (Channa maculata ♀ × Channa argus ♂) after challenge with Nocardia seriolae.

The First Highly Contiguous Genome Assembly of Pikeperch (Sander lucioperca), an Emerging Aquaculture Species in Europe

References

MUSCLE: multiple sequence alignment with high accuracy and high throughput

MrBayes 3.2: Efficient Bayesian Phylogenetic Inference and Model Choice across a Large Model Space

Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation

TopHat: discovering splice junctions with RNA-Seq

BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs

Related Papers (5)

Genome sequencing of the sweetpotato whitefly Bemisia tabaci MED/Q

BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs

Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype

The Atlantic salmon genome provides insights into rediploidization

Whole-genome sequence of a flatfish provides insights into ZW sex chromosome evolution and adaptation to a benthic lifestyle