
High-quality genome assembly of channel catfish, Ictalurus punctatus

TL;DR: A high-quality genome assembly is reported for a channel catfish from a breeding stock that was originally imported from North America and then inbred in China for more than three generations; the assembly is comparable to the recently reported “Coco” channel catfish genome.
Abstract: The channel catfish (Ictalurus punctatus), a species native to North America, is one of the most important commercial freshwater fish in the world, especially in the United States’ aquaculture industry. Since its introduction into China in 1984, both cultivation area and yield of this species have been dramatically increased such that China is now the leading producer of channel catfish. To aid genomic research in this species, data sets such as genetic linkage groups, long-insert libraries, physical maps, bacterial artificial clones (BAC) end sequences (BES), transcriptome assemblies, and reference genome sequences have been generated. Here, using diverse assembly methods, we provide a comparable high-quality genome assembly for a channel catfish from a breeding stock inbred in China for more than three generations, which was originally imported to China from North America. Approximately 201.6 gigabases (Gb) of genome reads were sequenced by the Illumina HiSeq 2000 platform. Subsequently, we generated high quality, cost-effective and easily assembled sequences of the channel catfish genome with a scaffold N50 of 7.2 Mb and 95.6 % completeness. We also predicted that the channel catfish genome contains 21,556 protein-coding genes and 275.3 Mb (megabase pairs) of repetitive sequences. We report a high-quality genome assembly of the channel catfish, which is comparable to a recent report of the “Coco” channel catfish. These generated genome data could be used as an initial platform for molecular breeding to obtain novel catfish varieties using genomic approaches.


DATA NOTE Open Access
High-quality genome assembly of channel catfish, Ictalurus punctatus
Xiaohui Chen (1,2), Liqiang Zhong (1,2), Chao Bian (3), Pao Xu (4), Ying Qiu (3), Xinxin You (3), Shiyong Zhang (1,2), Yu Huang (3), Jia Li (3), Minghua Wang (1,2), Qin Qin (1,2), Xiaohua Zhu (1,2), Chao Peng (3), Alex Wong (5), Zhifei Zhu (6,7), Min Wang (3,6,7), Ruobo Gu (4,6), Junmin Xu (3,6,7)*, Qiong Shi (3,6,7,8,9)* and Wenji Bian (1,2)*
Abstract
Background: The channel catfish (Ictalurus punctatus), a species native to North America, is one of the most important commercial freshwater fish in the world, especially in the United States’ aquaculture industry. Since its introduction into China in 1984, both cultivation area and yield of this species have been dramatically increased such that China is now the leading producer of channel catfish. To aid genomic research in this species, data sets such as genetic linkage groups, long-insert libraries, physical maps, bacterial artificial clones (BAC) end sequences (BES), transcriptome assemblies, and reference genome sequences have been generated. Here, using diverse assembly methods, we provide a comparable high-quality genome assembly for a channel catfish from a breeding stock inbred in China for more than three generations, which was originally imported to China from North America.
Findings: Approximately 201.6 gigabases (Gb) of genome reads were sequenced by the Illumina HiSeq 2000 platform. Subsequently, we generated high quality, cost-effective and easily assembled sequences of the channel catfish genome with a scaffold N50 of 7.2 Mb and 95.6 % completeness. We also predicted that the channel catfish genome contains 21,556 protein-coding genes and 275.3 Mb (megabase pairs) of repetitive sequences.
Conclusions: We report a high-quality genome assembly of the channel catfish, which is comparable to a recent report of the “Coco” channel catfish. These generated genome data could be used as an initial platform for molecular breeding to obtain novel catfish varieties using genomic approaches.
Keywords: Channel catfish, Whole genome sequencing, Assembly, Gene prediction, Repetitive sequence
Data description
Library construction, read sequencing and filtering
To generate genome sequence data, genomic DNA from mixed tissues (including muscle and skin) of channel catfish was extracted from a chosen individual cultured at a local base of the Freshwater Fisheries Research Institute (Jiangsu Province, Nanjing, China) using Qiagen GenomicTip100 (Qiagen, Hilden, DE) as per standard protocols. Isolated genomic DNA was subsequently used to construct short-insert libraries (250, 500 and 800 bp) and long-insert libraries (2, 5, 10 and 20 kb) with the standard protocol provided by Illumina (San Diego, USA). Paired-end sequencing was performed using the Illumina HiSeq 2000 platform to generate 125-bp reads using a whole genome shotgun sequencing (WGS) strategy [1].
To improve the quality of the sequenced reads, we trimmed 4 bases from the edges of reads from both the short-insert and long-insert libraries, discarded duplicated reads from the long-insert libraries, and removed reads containing 10 or more Ns as well as reads with low-quality bases. In total, 201.6 Gb of clean reads were retained for genome assembly.
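The filtering rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, trimming is assumed to apply to both ends of each read, and the Phred threshold (20) and the allowed fraction of low-quality bases (0.4) are placeholder values, because the text specifies only the 4-base trim and the 10-N limit.

```python
def trim_edges(seq, qual, n=4):
    """Trim n bases (and their quality scores) from both ends of a read."""
    return seq[n:len(seq) - n], qual[n:len(qual) - n]


def keep_read(seq, qual, max_n=10, min_phred=20, max_low_frac=0.4):
    """Return True if a read passes the N-count and base-quality filters.

    max_n follows the text (reads with 10 or more Ns are removed);
    min_phred and max_low_frac are illustrative assumptions.
    """
    if not seq:
        return False
    if seq.upper().count("N") >= max_n:
        return False
    low_quality = sum(1 for q in qual if q < min_phred)
    return low_quality / len(qual) <= max_low_frac


def filter_reads(reads):
    """Yield trimmed reads that pass the filters.

    reads: iterable of (sequence, list of Phred scores) pairs.
    """
    for seq, qual in reads:
        seq, qual = trim_edges(seq, qual)
        if keep_read(seq, qual):
            yield seq, qual
```

Duplicate removal for the long-insert libraries is left out of the sketch because it operates on read pairs rather than on single reads.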
* Correspondence: xujunmin@genomics.cn; shiqiong@genomics.cn; js6060@sina.com
Equal contributors
Full list of author information is available at the end of the article
© 2016 The Author(s). Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Chen et al. GigaScience (2016) 5:39
DOI 10.1186/s13742-016-0142-5
Genome assembly and quality assessments
First, we estimated the channel catfish genome size using k-mer analysis [2] with the formula $G = N \times (L - 17 + 1) / K_{depth}$, where N is the total number of reads, L is the read length, 17 is the k-mer length, and K_depth is the k-mer depth at the main peak of the k-mer frequency distribution. The calculated genome size is 0.839 Gb, which is smaller than the 1 Gb given in a 2016 report of an American-native channel catfish [3].
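The k-mer formula can be written as a small helper to make the role of each term explicit. This is a minimal sketch: the function name and argument names are hypothetical, and the numbers in the usage line are invented for illustration, not taken from the paper.

```python
def estimate_genome_size(n_reads, read_len, kmer_depth, k=17):
    """Estimate genome size in bases as G = N * (L - k + 1) / K_depth.

    n_reads    -- total number of reads (N)
    read_len   -- read length in bases (L)
    kmer_depth -- k-mer depth at the main peak of the frequency distribution
    k          -- k-mer length (17 in the text)
    """
    kmers_per_read = read_len - k + 1
    if kmers_per_read <= 0:
        raise ValueError("read length must be at least k")
    if kmer_depth <= 0:
        raise ValueError("k-mer depth must be positive")
    return n_reads * kmers_per_read / kmer_depth


# Hypothetical inputs, chosen only to show the call.
size_bp = estimate_genome_size(n_reads=400_000_000, read_len=125, kmer_depth=52)
print(f"{size_bp / 1e9:.3f} Gb")
```

Each read of length L contributes L - k + 1 k-mers, so the numerator is the total k-mer count. Dividing by the peak depth gives the number of distinct genomic positions.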
Simultaneously, we employed SOAPdenovo2 (version 2.04.4) software [4] with optimized parameters (pregraph -K 27 -d 1; contig -M 1; scaff -F -b 1.5 -p 16) to assemble the sequenced reads into contigs and initial scaffolds. All reads were then aligned onto the contigs for scaffold construction, using the long-insert paired-end information to link contigs into scaffolds in a step-wise manner. Gaps were closed using approximately 480 million Illumina paired-end reads generated from the three libraries with insert sizes of 250, 500 and 800 bp as the input for GapCloser (v1.12-r6, default parameters and -p set to 25) [2]. A final genome assembly of 0.845 Gb in length was obtained (Table 1), which is slightly shorter than that (0.942 Gb) of a recently reported American-native channel catfish genome [3]. The calculated contig N50 was 48.5 kilobases (kb), and the scaffold N50 was 7.2 Mb (Table 1). These values are also comparable to those in [3] (see details in Table 2).
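For readers unfamiliar with the N50 statistic quoted above, the following sketch computes it from a list of contig or scaffold lengths. It uses the common definition (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly); the function name is hypothetical, and the paper does not state which tool produced its values.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example: total length is 54, so the threshold is 27.
# 10 + 9 + 8 = 27 reaches it, and the N50 is 8.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Because N50 is weighted by length, a few long scaffolds dominate it. This is why the scaffold N50 (7.2 Mb) is far larger than the contig N50 (48.5 kb) for the same assembly.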
Two typical methods were then used to assess the quality and completeness of the generated assembly. First, transcriptome evaluation was used to assess the completeness of gene regions in the genome assembly. We carried out de novo assembly of the RNA sequences of skin and muscle tissues using Trinity software [5]. The assembled fragments were then aligned to the genome assembly with BLAT [6] (E-value = 10e-6, identity = 90 % and coverage >90 %). Our results indicate that the catfish genome assembly covered more than 90 % of gene-coding regions. Subsequently, Core Eukaryotic Genes Mapping Approach (CEGMA) software (version 2.3) [7] was employed with 248 conserved core eukaryotic genes (CEGs) to assess the gene space completeness within the generated genome assembly. These results demonstrate that the genome assembly covered more than 95 % of the CEG sequences, suggesting a high level of completeness.
Transcriptome sequencing
Total RNA was extracted from muscle and skin tissues of a channel catfish (the same individual used for the above-mentioned genome sequencing) using TRIzol reagent (Invitrogen, USA). After purification using RNeasy Animal Mini Kit (Qiagen, USA), equal amounts of total RNA from each tissue were subjected to transcriptome sequencing (RNA-seq) on the HiSeq 2000 platform.
Genome annotation
Repeat annotation
First, RepeatModeler (version 1.04) and LTR_FINDER [8] were used to build a de novo repeat library with default parameters. Subsequently, RepeatMasker [9] (version 3.2.9) was utilized to map our sequences against the Repbase [10] transposable element (TE) library (version 14.04) and the de novo repeat library, so as to search for known and novel TEs. Next, we annotated tandem repeats using Tandem Repeat Finder [11] (version 4.04) with core parameters set as Match = 2, Mismatch = 7, Delta = 7, PM = 80, PI = 10, Minscore = 50, and MaxPeriod = 2000. Furthermore, TE-relevant proteins were identified in our assembly using RepeatProteinMask software [9] (version 3.2.2). These identified repeat sequences accounted for 32.56 % of the channel catfish genome, of which the single largest class of TEs (representing 9.35 % of the whole genome) was the Tc1-mariner family.
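Because several tools annotate repeats independently, their hits can overlap, and a genome-wide repeat percentage is only meaningful once overlapping intervals are merged. The sketch below shows one way to do this. It is an assumption about how such a figure can be derived, not a description of the authors' scripts: the function names are hypothetical, and coordinates are taken to be half-open (start, end) pairs on a single sequence.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def repeat_fraction(intervals, genome_length):
    """Fraction of the genome covered by at least one repeat annotation."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / genome_length


# Toy example: two overlapping hits and one separate hit on a 100-bp sequence.
print(repeat_fraction([(0, 10), (5, 15), (20, 30)], 100))
```

As a consistency check on the reported figures, 32.56 % of the 845.4-Mb assembly is about 275.3 Mb, which matches the repeat content given in the abstract.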
Annotations of gene structure and function
Table 1 Catfish genome assembly and annotation statistics
Genome assembly
  Contig N50 size (kb)           48.5
  Contig number (>100 bp)        66,332
  Scaffold N50 size (Mb)         7.2
  Scaffold number (>100 bp)      31,979
  Total length (Mb)              845.4
  Genome coverage (X)            201.6
  Longest scaffold (bp)          26,612,498
Genome annotation
  Protein-coding gene number     21,556
  Mean transcript length (kb)    16.1
  Mean exons per gene            8.7
  Mean exon length (bp)          190.2
  Mean intron length (bp)        1872.4

The channel catfish genome assembly was annotated using three independent approaches: homology, de novo and RNA-seq annotations. For homology annotation, the protein sequences from zebrafish, Japanese fugu, spotted green pufferfish, Japanese medaka (Ensembl release 75), blue spotted mudskipper [1] and golden arowana [12] were mapped onto the channel catfish genome using TblastN with an e-value cutoff of 1E-5. Genewise 2.2.0 software [13] was then employed to predict the potential gene structures of all alignments. Short genes (fewer than 150 bp) and prematurely terminated or frame-shifted genes were discarded. Next, de novo annotation was used to predict gene structures from the genome assembly. We randomly selected 1000 complete genes from the homology annotation set to train the parameters for AUGUSTUS 2.5 [14]. Simultaneously, all repetitive regions in the channel catfish genome were replaced with N to reduce the rate of pseudogene annotations. Subsequently, we utilized AUGUSTUS 2.5 and GENSCAN 1.0 [15] for de novo prediction on the repeat-masked genome sequences. The filtering steps applied to the de novo annotation were the same as those used for homology prediction. Simultaneously, the RNA-seq annotation pipeline was also used to detect gene regions. We employed TopHat 1.2 software [16] to map the RNA reads from the skin and muscle transcriptomes onto the channel catfish genome sequences. We then sorted and integrated the TopHat alignments, and used Cufflinks software [17] to analyze potential gene structures. Results from all three of the above-mentioned annotation pipelines were merged to produce a comprehensive and non-redundant gene set using GLEAN [18]. This gene set contained 21,556 genes with an average of 8.7 exons per gene (Table 1). Because different annotation pipelines were applied, the total gene number predicted here is lower than the 26,661 reported in the American-native channel catfish genome [3]. The Cuffdiff package [17] of Cufflinks software (version 2.0.2.Linux_x86_64) with core parameters (FDR 0.05, geometric-norm TRUE, compatible-hits-norm TRUE) was utilized to calculate expression levels according to the GLEAN gene set and TopHat alignments. About 93.4 % of genes were predicted from at least two types of evidence, and approximately 78 % of the genes showed expression activity (fragments per kilobase of exon model per million mapped reads >0) in the skin and muscle tissues.
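The expression criterion used above (fragments per kilobase of exon model per million mapped reads, FPKM, greater than zero) can be illustrated with the textbook definition of FPKM. This sketch is not how Cuffdiff computes its estimates, since Cuffdiff fits a statistical model and applies the normalization options listed in the text. The function names are hypothetical and the input values are invented.

```python
def fpkm(fragments, exon_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon model per million mapped fragments."""
    if exon_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("lengths and totals must be positive")
    return fragments * 1e9 / (exon_length_bp * total_mapped_fragments)


def expressed_fraction(fpkm_values):
    """Fraction of genes with FPKM > 0, the criterion used in the text."""
    values = list(fpkm_values)
    if not values:
        raise ValueError("no FPKM values given")
    return sum(1 for v in values if v > 0) / len(values)


# Invented example: 100 fragments on a 2-kb gene, 1 million mapped fragments.
print(fpkm(100, 2000, 1_000_000))
print(expressed_fraction([0.0, 1.5, 0.0, 2.0]))
```

The factor of 1e9 combines the per-kilobase (1e3) and per-million (1e6) scalings, so the two normalizations do not need to be applied separately.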
Simultaneously, all protein sequences from the GLEAN results were mapped to the SwissProt and TrEMBL [19] (UniProt release 2011.06) databases using BlastP [20] with an E-value cutoff of 1e-5 to find the best hit for each protein. We also used InterProScan 4.7 software [21] to align the protein sequences against public databases, including Pfam [22], PRINTS [23], ProDom [24] and SMART [25], to examine the known motifs and domains in our sequences. Over 94.5 % of these predicted genes possessed at least one related functional assignment from other public databases (SwissProt [19], InterPro [21], TrEMBL and KEGG [26]). In addition, the gene structures (including exon length, intron regions and mRNAs) and exon number distributions (Table 1) were similar to those of other representative teleost species such as zebrafish and medaka.
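Selecting the best hit per protein from a BlastP run can be sketched as follows. The example assumes the standard 12-column tabular BLAST output (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score) and ranks hits by bit score. Both the output format and the ranking rule are assumptions, since the paper does not describe them, and the function name and identifiers are hypothetical.

```python
def best_hits(lines, max_evalue=1e-5):
    """Return {query: (subject, evalue, bitscore)} for the top hit per query.

    lines: iterable of tab-separated BLAST tabular records.
    Hits with an E-value above max_evalue are ignored.
    """
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Invented records: gene1 has two acceptable hits, gene2 has none.
records = [
    "gene1\tP1\t90.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t200",
    "gene1\tP2\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-60\t250",
    "gene2\tP3\t40.0\t50\t30\t0\t1\t50\t1\t50\t0.01\t30",
]
print(best_hits(records))
```

Genes with no hit below the cutoff are absent from the returned dictionary, and the fraction of genes present gives a figure comparable in spirit to the proportion with a functional assignment.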
Conclusion
We generated a channel catfish genome assembly with high quality and comparable structures to other published fish genomes, especially the “Coco” catfish genome [3]. This new assembly is a valuable resource and reference for further construction of high-density genetic linkage maps and identification of quantitative trait loci for molecular breeding of catfishes.
Availability of supporting data
Supporting data are available in the GigaDB database
[27]. Raw whole genome sequencing and transcriptome
data are deposited in the SRA under bioproject number
PRJNA319455.
Abbreviations
BAC, bacterial artificial clone; BES, BAC end sequences/sequencing; CEG, core
eukaryotic genes; CEGMA, core eukaryotic genes mapping approach; Gb,
gigabases; kb, kilobases; Mb, megabases; TE, transposable element; WGS,
whole genome shotgun
Table 2 Comparison of genome assembly in sequenced fishes
Species                     Sequencing platform   Assembled genome size (Mb)   Scaffold N50 (kb)   Contig N50 (kb)
catfish (BGI)               Illumina              845                          7248                48.5
catfish (Liu’s study [3])   Illumina, Pacbio      942                          7726                77.2
zebrafish                   Illumina, Sanger      1412                         1551                25.0
Atlantic herring            Illumina              808                          1840                21.3
greenpuffer                 Sanger                342                          100                 16.0
medaka                      Sanger                700                          1410                9.8
stickleback                 Sanger, Illumina      463                          10,800              83.2
fugu                        Sanger                332                          unknown             16.5
cod                         454                   753                          459                 2.8
platyfish                   454, Illumina         669                          1102                21.0
lamprey                     454, Illumina         816                          173                 unknown
lancelets                   Illumina              520                          unknown             unknown
tuna                        454, Illumina         800                          136                 7.6
mudskipper                  Illumina              983                          2309                20.0

Funding
This study was supported by the National Key Technology R&D Program of
China (No. 2012BAD26B03), Fund for Independent Innovation of Agricultural
Science and Technology of Jiangsu Province (No. CX(15)1013), Human
Resources and Social Security of Jiangsu Province (No. 2014-NY-008), Three-
Side Innovation Projects for Aquaculture in Jiangsu Province (No. Y2014-25
&Y2015-12), Shenzhen Special Program for Future Industrial Development
(No. JSGG20141020113728803), and Zhenjiang Leading Talent Program for
Innovation and Entrepreneurship.
Authors’ contributions
XC, QS, CB, JX and WB conceived the project. LZ, YH, SZ, MHW, QQ, XY, CP, AW,
ZZ, MW and RG collected the samples and extracted the genomic DNA. CB, YQ,
JL and YH performed the genome assembly and data analysis. CB, QS, XC, LZ,
XP, XZ and WB wrote the paper and all authors read and approved the final
manuscript.
Competing interests
The authors declare that they have no competing interests.
Author details
1 Freshwater Fisheries Research Institute of Jiangsu Province, Nanjing 210017, China.
2 The Jiangsu Provincial Platform for Conservation and Utilization of Agricultural Germplasm, Nanjing 210017, China.
3 Shenzhen Key Laboratory of Marine Genomics, Guangdong Provincial Key Lab of Molecular Breeding in Marine Economic Animals, Shenzhen 518083, China.
4 Freshwater Fisheries Research Center, Chinese Academy of Fishery Sciences, Wuxi 214081, China.
5 BGI-Hong Kong, Hong Kong 999077, China.
6 BGI Zhenjiang Institute of Hydrobiology, Zhenjiang 212000, China.
7 BGI Zhenjiang Fisheries Science and Technology Industrial Co. Ltd, Zhenjiang 212000, China.
8 Laboratory of Aquatic Genomics, College of Ecology and Evolution, School of Life Sciences, Sun Yat-Sen University, Guangzhou 510275, China.
9 Center for Marine Research, College of Life Sciences and Oceanography, Shenzhen University, Shenzhen 518060, China.
Received: 20 May 2016 Accepted: 3 August 2016
References
1. You X, Bian C, Zan Q, Xu X, Liu X, Chen J, et al. Mudskipper genomes provide insights into the terrestrial adaptation of amphibious fishes. Nat Commun. 2014;5:5594.
2. Li R, Fan W, Tian G, Zhu H, He L, Cai J, et al. The sequence and de novo assembly of the giant panda genome. Nature. 2010;463(7279):311–7.
3. Liu Z, Liu S, Yao J, Bao L, Zhang J, Li Y, et al. The channel catfish genome sequence provides insights into the evolution of scale formation in teleosts. Nat Commun. 2016;7:11757.
4. Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience. 2012;1:12.
5. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nat Biotechnol. 2011;29(7):644–52.
6. Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002;12(4):656–64.
7. Parra G, Bradnam K, Korf I. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics. 2007;23(9):1061–7.
8. Xu Z, Wang H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 2007;35(Web Server issue):W265–8.
9. Tarailo-Graovac M, Chen N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr Protoc Bioinformatics. 2009;Chapter 4:Unit 4.10.
10. Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J. Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res. 2005;110(1–4):462–7.
11. Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999;27(2):573–80.
12. Bian C, Hu Y, Ravi V, Kuznetsova IS, Shen X, Mu X, et al. The Asian arowana (Scleropages formosus) genome provides new insights into the evolution of an early lineage of teleosts. Sci Rep. 2016;6:24501.
13. Birney E, Clamp M, Durbin R. GeneWise and Genomewise. Genome Res. 2004;14(5):988–95.
14. Stanke M, Keller O, Gunduz I, Hayes A, Waack S, Morgenstern B. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 2006;34(Web Server issue):W435–9.
15. Burge C, Karlin S. Prediction of complete gene structures in human genomic DNA. J Mol Biol. 1997;268(1):78–94.
16. Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009;25(9):1105–11.
17. Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol. 2013;31(1):46–53.
18. Elsik CG, Mackey AJ, Reese JT, Milshina NV, Roos DS, Weinstock GM. Creating a honey bee consensus gene set. Genome Biol. 2007;8(1):R13.
19. Bairoch A, Apweiler R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 2000;28(1):45–8.
20. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402.
21. Zdobnov EM, Apweiler R. InterProScan - an integration platform for the signature-recognition methods in InterPro. Bioinformatics. 2001;17(9):847–8.
22. Bateman A, Birney E, Durbin R, Eddy SR, Howe KL, Sonnhammer EL. The Pfam protein families database. Nucleic Acids Res. 2000;28(1):263–6.
23. Attwood TK, Croning MD, Flower DR, Lewis AP, Mabey JE, Scordis P, et al. PRINTS-S: the database formerly known as PRINTS. Nucleic Acids Res. 2000;28(1):225–7.
24. Corpet F, Gouzy J, Kahn D. Recent improvements of the ProDom database of protein domain families. Nucleic Acids Res. 1999;27(1):263–7.
25. Schultz J, Copley RR, Doerks T, Ponting CP, Bork P. SMART: a web-based tool for the study of genetically mobile domains. Nucleic Acids Res. 2000;28(1):231–4.
26. Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 1999;27(1):29–34.
27. Chen X, Zhong L, Bian C, Xu P, Qiu Y, You X, Zhang S, Yu H, Li J, Wang M, Qin Q, Zhu X, Peng C, Wong A, Zhu Z, Wang M, Ruobo G, Xu J, Shi Q, Bian W. Supporting data for “High-quality genome assembly of channel catfish, Ictalurus punctatus”. 2016. GigaScience Database, http://dx.doi.org/10.5524/100212.
Citations
More filters
Journal ArticleDOI
TL;DR: Direct genotyping by sequencing (GBS) techniques have underpinned many of the advances in aquaculture genetics and breeding to date, and have been extensively applied to generate population‐level SNP genotype data.
Abstract: Selective breeding is increasingly recognized as a key component of sustainable production of aquaculture species. The uptake of genomic technology in aquaculture breeding has traditionally lagged behind terrestrial farmed animals. However, the rapid development and application of sequencing technologies has allowed aquaculture to narrow the gap, leading to substantial genomic resources for all major aquaculture species. While high-density single-nucleotide polymorphism (SNP) arrays for some species have been developed recently, direct genotyping by sequencing (GBS) techniques have underpinned many of the advances in aquaculture genetics and breeding to date. In particular, restriction-site associated DNA sequencing (RAD-Seq) and subsequent variations have been extensively applied to generate population-level SNP genotype data. These GBS techniques are not dependent on prior genomic information such as a reference genome assembly for the species of interest. As such, they have been widely utilized by researchers and companies focussing on nonmodel aquaculture species with relatively small research communities. Applications of RAD-Seq techniques have included generation of genetic linkage maps, performing genome-wide association studies, improvements of reference genome assemblies and, more recently, genomic selection for traits of interest to aquaculture like growth, sex determination or disease resistance. In this review, we briefly discuss the history of GBS, the nuances of the various GBS techniques, bioinformatics approaches and application of these techniques to various aquaculture species.

186 citations


Cites background from "High-quality genome assembly of cha..."

  • ...…Lates calcarifer, Vij et al. 2016; Mediterranean mussel, Mytilus galloprovincialis, Murgarella et al. 2016; turbot, Scophthalmus maximus, Figueras et al. 2016; Atlantic salmon, Lien et al. 2016; channel catfish, Chen et al. 2016), and new sequencing data will improve genome quality and annotation....

    [...]

Journal ArticleDOI
TL;DR: It is shown that intermuscular bone is formed in the more basal teleosts by intramembranous ossification and may be involved in muscle contractibility and coordinating cellular events in M. amblycephala.
Abstract: The blunt snout bream Megalobrama amblycephala is the economically most important cyprinid fish species As an herbivore, it can be grown by eco-friendly and resource-conserving aquaculture However, the large number of intermuscular bones in the trunk musculature is adverse to fish meat processing and consumption As a first towards optimizing this aquatic livestock, we present a 1116-Gb draft genome of M amblycephala, with 77954 Mb anchored on 24 linkage groups Integrating spatiotemporal transcriptome analyses, we show that intermuscular bone is formed in the more basal teleosts by intramembranous ossification and may be involved in muscle contractibility and coordinating cellular events Comparative analysis revealed that olfactory receptor genes, especially of the beta type, underwent an extensive expansion in herbivorous cyprinids, whereas the gene for the umami receptor T1R1 was specifically lost in M amblycephala The composition of gut microflora, which contributes to the herbivorous adaptation of M amblycephala, was found to be similar to that of other herbivores As a valuable resource for the improvement of M amblycephala livestock, the draft genome sequence offers new insights into the development of intermuscular bone and herbivorous adaptation

78 citations

Journal ArticleDOI
TL;DR: This review assesses the availability of complete genomes of aquaculture animals and then briefly discusses the sequencing technologies and SNP array for SNPs genotyping, and summarizes the current status of genetic linkage map construction, QTL mapping, GWAS, and GS in aquatic animals.

70 citations

Journal ArticleDOI
TL;DR: The high-density genetic linkage map is constructed and the sex-linked marker in channel catfish is developed, which are important genetic resources for future marker-assisted selection (MAS) of this economically important teleost.
Abstract: A high-density genetic linkage map is of particular importance in the fine mapping for important economic traits and whole genome assembly in aquaculture species. The channel catfish (Ictalurus punctatus), a species native to North America, is one of the most important commercial freshwater fish in the world. Outside of the United States, China has become the major producer and consumer of channel catfish after experiencing rapid development in the past three decades. In this study, based on restriction site associated DNA sequencing (RAD-seq), a high-density genetic linkage map of channel catfish was constructed by using single nucleotide polymorphisms (SNPs) in a F1 family composed of 156 offspring and their two parental individuals. A total of 4,768 SNPs were assigned to 29 linkage groups (LGs), and the length of the linkage map reached 2,480.25 centiMorgans (cM) with an average distance of 0.55 cM between loci. Based on this genetic linkage map, 223 genomic scaffolds were anchored to the 29 LGs of channel catfish, and a total length of 704.66 Mb was assembled. Quantitative trait locus (QTL) mapping and genome-wide association analysis identified 10 QTLs of sex-related and six QTLs of growth-related traits at LG17 and LG28, respectively. Candidate genes associated with sex dimorphism, including spata2, spata5, sf3, zbtb38, and fox, were identified within QTL intervals on the LG17. A sex-linked marker with simple sequence repeats (SSR) in zbtb38 gene of the LG17 was validated for practical verification of sex in the channel catfish. Thus, the LG17 was considered as a sex-related LG. Potential growth-related genes were also identified, including important regulators such as megf9, npffr1, and gas1. In a word, we constructed the high-density genetic linkage map and developed the sex-linked marker in channel catfish, which are important genetic resources for future marker-assisted selection (MAS) of this economically important teleost.

35 citations

Journal ArticleDOI
14 Feb 2018-Genes
TL;DR: The power of synergy of cytogenetics and genomics in fish cytogenomics, its potential to understand the complexity of genome evolution in vertebrates, is highlighted, also linked to clinical applications and the chromosomal backgrounds of speciation.
Abstract: To understand the cytogenomic evolution of vertebrates, we must first unravel the complex genomes of fishes, which were the first vertebrates to evolve and were ancestors to all other vertebrates. We must not forget the immense time span during which the fish genomes had to evolve. Fish cytogenomics is endowed with unique features which offer irreplaceable insights into the evolution of the vertebrate genome. Due to the general DNA base compositional homogeneity of fish genomes, fish cytogenomics is largely based on mapping DNA repeats that still represent serious obstacles in genome sequencing and assembling, even in model species. Localization of repeats on chromosomes of hundreds of fish species and populations originating from diversified environments have revealed the biological importance of this genomic fraction. Ribosomal genes (rDNA) belong to the most informative repeats and in fish, they are subject to a more relaxed regulation than in higher vertebrates. This can result in formation of a literal 'rDNAome' consisting of more than 20,000 copies with their high proportion employed in extra-coding functions. Because rDNA has high rates of transcription and recombination, it contributes to genome diversification and can form reproductive barrier. Our overall knowledge of fish cytogenomics grows rapidly by a continuously increasing number of fish genomes sequenced and by use of novel sequencing methods improving genome assembly. The recently revealed exceptional compositional heterogeneity in an ancient fish lineage (gars) sheds new light on the compositional genome evolution in vertebrates generally. We highlight the power of synergy of cytogenetics and genomics in fish cytogenomics, its potential to understand the complexity of genome evolution in vertebrates, which is also linked to clinical applications and the chromosomal backgrounds of speciation. We also summarize the current knowledge on fish cytogenomics and outline its main future avenues.

26 citations


Cites background from "High-quality genome assembly of cha..."

  • ...8 PB, I [164,165] Kryptolebias marmoratus Cyprinodontiformes Rivulidae 680....

    [...]

References
More filters
Journal ArticleDOI
TL;DR: A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original.
Abstract: The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSIBLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.

70,111 citations

Journal ArticleDOI
TL;DR: The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a knowledge base for systematic analysis of gene functions in terms of the networks of genes and molecules.
Abstract: Kyoto Encyclopedia of Genes and Genomes (KEGG) is a knowledge base for systematic analysis of gene functions in terms of the networks of genes and molecules. The major component of KEGG is the PATHWAY database that consists of graphical diagrams of biochemical pathways including most of the known metabolic pathways and some of the known regulatory pathways. The pathway information is also represented by the ortholog group tables summarizing orthologous and paralogous gene groups among different organisms. KEGG maintains the GENES database for the gene catalogs of all organisms with complete genomes and selected organisms with partial genomes, which are continuously re-annotated, as well as the LIGAND database for chemical compounds and enzymes. Each gene catalog is associated with the graphical genome map for chromosomal locations that is represented by a Java applet. In addition to the data collection efforts, KEGG develops and provides various computational tools, such as for reconstructing biochemical pathways from the complete genome sequence and for predicting gene regulatory networks from the gene expression profiles. The KEGG databases are updated daily and made freely available (http://www.genome.ad.jp/kegg/).

24,024 citations

Journal ArticleDOI
TL;DR: The Trinity method for de novo assembly of full-length transcripts is presented and evaluated on samples from fission yeast, mouse and whitefly, whose reference genome is not yet available, providing a unified solution for transcriptome reconstruction in any sample.
Abstract: Massively parallel sequencing of cDNA has enabled deep and efficient probing of transcriptomes. Current approaches for transcript reconstruction from such data often rely on aligning reads to a reference genome, and are thus unsuitable for samples with a partial or missing reference genome. Here we present the Trinity method for de novo assembly of full-length transcripts and evaluate it on samples from fission yeast, mouse and whitefly, whose reference genome is not yet available. By efficiently constructing and analyzing sets of de Bruijn graphs, Trinity fully reconstructs a large fraction of transcripts, including alternatively spliced isoforms and transcripts from recently duplicated genes. Compared with other de novo transcriptome assemblers, Trinity recovers more full-length transcripts across a broad range of expression levels, with a sensitivity similar to methods that rely on genome alignments. Our approach provides a unified solution for transcriptome reconstruction in any sample, especially in the absence of a reference genome.

15,665 citations

Journal ArticleDOI
TL;DR: The definition and use of family-specific, manually curated gathering thresholds are explained and some of the features of domains of unknown function (also known as DUFs) are discussed, which constitute a rapidly growing class of families within Pfam.
Abstract: Pfam is a widely used database of protein families and domains. This article describes a set of major updates that we have implemented in the latest release (version 24.0). The most important change is that we now use HMMER3, the latest version of the popular profile hidden Markov model package. This software is approximately 100 times faster than HMMER2 and is more sensitive due to the routine use of the forward algorithm. The move to HMMER3 has necessitated numerous changes to Pfam that are described in detail. Pfam release 24.0 contains 11,912 families, of which a large number have been significantly updated during the past two years. Pfam is available via servers in the UK (http://pfam.sanger.ac.uk/), the USA (http://pfam.janelia.org/) and Sweden (http://pfam.sbc.su.se/).

14,075 citations


"High-quality genome assembly of cha..." refers to methods in this paper

  • ...7 software [21] to align the protein sequences against public databases, including Pfam [22], PRINTS [23], ProDom [24] and SMART [25], to examine the known motifs and domains in our sequences....

Journal ArticleDOI
TL;DR: The TopHat pipeline is much faster than previous systems, mapping nearly 2.2 million reads per CPU hour, which is sufficient to process an entire RNA-Seq experiment in less than a day on a standard desktop computer.
Abstract: Motivation: A new protocol for sequencing the messenger RNA in a cell, known as RNA-Seq, generates millions of short sequence fragments in a single run. These fragments, or ‘reads’, can be used to measure levels of gene expression and to identify novel splice variants of genes. However, current software for aligning RNA-Seq data to a genome relies on known splice junctions and cannot identify novel ones. TopHat is an efficient read-mapping algorithm designed to align reads from an RNA-Seq experiment to a reference genome without relying on known splice sites. Results: We mapped the RNA-Seq reads from a recent mammalian RNA-Seq experiment and recovered more than 72% of the splice junctions reported by the annotation-based software from that study, along with nearly 20 000 previously unreported junctions. The TopHat pipeline is much faster than previous systems, mapping nearly 2.2 million reads per CPU hour, which is sufficient to process an entire RNA-Seq experiment in less than a day on a standard desktop computer. We describe several challenges unique to ab initio splice site discovery from RNA-Seq reads that will require further algorithm development. Availability: TopHat is free, open-source software available from http://tophat.cbcb.umd.edu. Contact: cole@cs.umd.edu. Supplementary information: Supplementary data are available at Bioinformatics online.

11,473 citations


"High-quality genome assembly of cha..." refers to methods in this paper

  • ...2 software [16] to map the RNA reads extracted from the skin and muscle transcriptomes onto the channel catfish genome sequences....

Related Papers (5)
18 Sep 2014-Nature