Journal ArticleDOI

Chromosome genome assembly and annotation of the yellowbelly pufferfish with PacBio and Hi-C sequencing data.

TL;DR: The genome resource in this work will be used for the conservation and population genetics of the yellowbelly pufferfish, as well as in vertebrate chromosome evolution studies.
Abstract: Pufferfish are ideal models for vertebrate chromosome evolution studies. The yellowbelly pufferfish, Takifugu flavidus, is an important marine fish species in the aquaculture industry and ecology of East Asia. The chromosome assembly of the species could facilitate the study of chromosome evolution and functional gene mapping. To this end, 44, 27 and 50 Gb reads were generated for genome assembly using Illumina, PacBio and Hi-C sequencing technologies, respectively. More than 13 Gb full-length transcripts were sequenced on the PacBio platform. A 366 Mb genome was obtained with contig and scaffold N50 lengths of 4.4 Mb and 15.7 Mb, respectively. A total of 266 contigs were reliably assembled into 22 chromosomes, representing 95.9% of the total genome. A total of 29,416 protein-coding genes were predicted and 28,071 genes were functionally annotated. More than 97.7% of the BUSCO genes were successfully detected in the genome. The genome resource in this work will be used for the conservation and population genetics of the yellowbelly pufferfish, as well as in vertebrate chromosome evolution studies. Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.10008740
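The abstract reports assembly contiguity as contig and scaffold N50 lengths. As a minimal sketch of how that statistic is commonly computed (this is not the authors' pipeline; the function name `n50` and the toy lengths are illustrative assumptions), the N50 is the length of the sequence at which the running total of lengths, sorted from longest to shortest, first reaches half of the total assembly size:

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    # Sort from longest to shortest sequence.
    ordered = sorted(lengths, reverse=True)
    # Half of the total assembled length.
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first sequence that pushes the running total to at
        # least half of the assembly defines the N50.
        if running >= half:
            return length


# Toy example: total length is 54, so half is 27; 10 + 9 + 8 = 27.
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_lengths))
```

Applied to real data, `lengths` would be the contig or scaffold lengths read from the assembly FASTA; the values above are made up for the example.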


Citations
Journal ArticleDOI
TL;DR: This review assesses the availability of complete genomes of aquaculture animals and then briefly discusses the sequencing technologies and SNP array for SNPs genotyping, and summarizes the current status of genetic linkage map construction, QTL mapping, GWAS, and GS in aquatic animals.

70 citations

Journal ArticleDOI
TL;DR: In this article, a haplotype-resolved genome assembly of a zig-zag eel (Mastacembelus armatus) at the chromosomal scale is presented.
Abstract: The origin of sex chromosomes requires the establishment of recombination suppression between the proto-sex chromosomes. In many fish species, the sex chromosome pair is homomorphic with a recent origin, providing species for studying how and why recombination suppression evolved in the initial stages of sex chromosome differentiation, but this requires accurate sequence assembly of the X and Y (or Z and W) chromosomes, which may be difficult if they are recently diverged. Here we produce a haplotype-resolved genome assembly of zig-zag eel (Mastacembelus armatus), an aquaculture fish, at the chromosomal scale. The diploid assembly is nearly gap-free, and in most chromosomes, we resolve the centromeric and subtelomeric heterochromatic sequences. In particular, the Y chromosome, including its highly repetitive short arm, has zero gaps. Using resequencing data, we identify a ~7 Mb fully sex-linked region (SLR), spanning the sex chromosome centromere and almost entirely embedded in the pericentromeric heterochromatin. The SLRs on the X and Y chromosomes are almost identical in sequence and gene content, but both are repetitive and heterochromatic, consistent with zero or low recombination. We further identify an HMG-domain containing gene HMGN6 in the SLR as a candidate sex-determining gene that is expressed at the onset of testis development. Our study supports the idea that preexisting regions of low recombination, such as pericentromeric regions, can give rise to SLR in the absence of structural variations between the proto-sex chromosomes.

26 citations

Journal ArticleDOI
TL;DR: This genome assembly can serve as a valuable genetic resource for exploring fugu‐specific compact genome characteristics, and will provide essential genomic information for understanding molecular adaptations to salinity fluctuations and the evolution of osmoregulatory mechanisms.
Abstract: The Tetraodontidae family are known to have relatively small and compact genomes compared to other vertebrates. The obscure puffer fish Takifugu obscurus is an anadromous species that migrates to freshwater from the sea for spawning. Thus the euryhaline characteristics of T. obscurus have been investigated to gain understanding of their survival ability, osmoregulation, and other homeostatic mechanisms in both freshwater and seawater. In this study, a high quality chromosome-level reference genome for T. obscurus was constructed using long-read Pacific Biosciences (PacBio) Sequel sequencing and a Hi-C-based chromatin contact map platform. The final genome assembly of T. obscurus is 381 Mb, with a contig N50 length of 3,296 kb and longest length of 10.7 Mb, from a total of 62 Gb of raw reads generated using single-molecule real-time sequencing technology from a PacBio Sequel platform. The PacBio data were further clustered into chromosome-scale scaffolds using a Hi-C approach, resulting in a 373 Mb genome assembly with a contig N50 length of 15.2 Mb and longest length of 28 Mb. When we directly compared the 22 longest scaffolds of T. obscurus to the 22 chromosomes of the tiger puffer Takifugu rubripes, a clear one-to-one orthologous relationship was observed between the two species, supporting the chromosome-level assembly of T. obscurus. This genome assembly can serve as a valuable genetic resource for exploring fugu-specific compact genome characteristics, and will provide essential genomic information for understanding molecular adaptations to salinity fluctuations and the evolution of osmoregulatory mechanisms.

20 citations


Cites background from "Chromosome genome assembly and anno..."

  • ...Recently, high quality genomes were published from T. bimaculatus (Zhou et al., 2019b) and T. flavidus (Zhou et al., 2019a)....

    [...]

  • ...To the best of our knowledge in teleosts, only a few fish species including Asian seabass (1.19 Mb; Vij et al., 2016), Nile tilapia (3.09 Mb; Conte, Gammerdinger, Bartie, Penman, & Kocher, 2017), Tibetan endemic fish (Yang et al., 2019), orange clownfish (1.86 Mb; Lehmann et al., 2019), two-spotted puffer (1.31 Mb; Zhou et al., 2019b), and yellowbelly puffer (4.4 Mb; Zhou et al., 2019a) showed contig N50 longer than 1 Mb when their genomes were analyzed with long-read PacBio sequencing platforms....

Journal ArticleDOI
TL;DR: In this article, the authors used long-read sequencing to improve the contiguity of the threespine stickleback fish (Gasterosteus aculeatus) genome, a prominent genetic model species.
Abstract: While the cost and time for assembling a genome has drastically decreased, it still remains a challenge to assemble a highly contiguous genome. These challenges are rapidly being overcome by the integration of long-read sequencing technologies. Here, we use long-read sequencing to improve the contiguity of the threespine stickleback fish (Gasterosteus aculeatus) genome, a prominent genetic model species. Using Pacific Biosciences sequencing, we assembled a highly contiguous genome of a freshwater fish from Paxton Lake. Using contigs from this genome, we were able to fill over 76.7% of the gaps in the existing reference genome assembly, improving contiguity over fivefold. Our gap filling approach was highly accurate, validated by 10X Genomics long-distance linked-reads. In addition to closing a majority of gaps, we were able to assemble segments of telomeres and centromeres throughout the genome. This highlights the power of using long sequencing reads to assemble highly repetitive and difficult to assemble regions of genomes. This latest genome build has been released through a newly designed community genome browser that aims to consolidate the growing number of genomics datasets available for the threespine stickleback fish.

17 citations

Journal ArticleDOI
TL;DR: Adding more data to genome assemblies does not always result in better assemblies, so it is important to understand the nuances of genomic data integration explained here, in order to obtain cost-effective value for money when sequencing genomes.
Abstract: Background: Whilst much sequencing effort has focused on key mammalian model organisms such as mouse and human, little is known about the relationship between genome sequencing techniques for non-model mammals and genome assembly quality. This is especially relevant to non-model mammals, where the samples to be sequenced are often degraded and of low quality. A key aspect when planning a genome project is the choice of sequencing data to generate. This decision is driven by several factors, including the biological questions being asked, the quality of DNA available, and the availability of funds. Cutting-edge sequencing technologies now make it possible to achieve highly contiguous, chromosome-level genome assemblies, but rely on high-quality high molecular weight DNA. However, funding is often insufficient for many independent research groups to use these techniques. Here we use a range of different genomic technologies generated from a roadkill European polecat (Mustela putorius) to assess various assembly techniques on this low-quality sample. We evaluated different approaches for de novo assemblies and discuss their value in relation to biological analyses. Results: Generally, assemblies containing more data types achieved better scores in our ranking system. However, when accounting for misassemblies, this was not always the case for Bionano and low-coverage 10x Genomics (for scaffolding only). We also find that the extra cost associated with combining multiple data types is not necessarily associated with better genome assemblies. Conclusions: The high degree of variability between each de novo assembly method (assessed from the 7 key metrics) highlights the importance of carefully devising the sequencing strategy to be able to carry out the desired analysis. Adding more data to genome assemblies does not always result in better assemblies, so it is important to understand the nuances of genomic data integration explained here, in order to obtain cost-effective value for money when sequencing genomes.

16 citations

References
Journal ArticleDOI
TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.

88,255 citations

Journal ArticleDOI
TL;DR: The TopHat pipeline is much faster than previous systems, mapping nearly 2.2 million reads per CPU hour, which is sufficient to process an entire RNA-Seq experiment in less than a day on a standard desktop computer.
Abstract: Motivation: A new protocol for sequencing the messenger RNA in a cell, known as RNA-Seq, generates millions of short sequence fragments in a single run. These fragments, or ‘reads’, can be used to measure levels of gene expression and to identify novel splice variants of genes. However, current software for aligning RNA-Seq data to a genome relies on known splice junctions and cannot identify novel ones. TopHat is an efficient read-mapping algorithm designed to align reads from an RNA-Seq experiment to a reference genome without relying on known splice sites. Results: We mapped the RNA-Seq reads from a recent mammalian RNA-Seq experiment and recovered more than 72% of the splice junctions reported by the annotation-based software from that study, along with nearly 20 000 previously unreported junctions. The TopHat pipeline is much faster than previous systems, mapping nearly 2.2 million reads per CPU hour, which is sufficient to process an entire RNA-Seq experiment in less than a day on a standard desktop computer. We describe several challenges unique to ab initio splice site discovery from RNA-Seq reads that will require further algorithm development. Availability: TopHat is free, open-source software available from http://tophat.cbcb.umd.edu Contact: cole@cs.umd.edu Supplementary information: Supplementary data are available at Bioinformatics online.

11,473 citations

Journal ArticleDOI
TL;DR: Blast2GO (B2G), a research tool designed with the main purpose of enabling Gene Ontology (GO) based data mining on sequence data for which no GO annotation is yet available, is presented.
Abstract: Summary: We present here Blast2GO (B2G), a research tool designed with the main purpose of enabling Gene Ontology (GO) based data mining on sequence data for which no GO annotation is yet available. B2G joins in one application GO annotation based on similarity searches with statistical analysis and highlighted visualization on directed acyclic graphs. This tool offers a suitable platform for functional genomics research in non-model species. B2G is an intuitive and interactive desktop application that allows monitoring and comprehension of the whole annotation and analysis process. Availability: Blast2GO is freely available via Java Web Start at http://www.blast2go.de Supplementary material: http://www.blast2go.de -> Evaluation

10,092 citations

Journal ArticleDOI
TL;DR: Zdobnov et al. as discussed by the authors proposed a measure for quantitative assessment of genome assembly and annotation completeness based on evolutionarily informed expectations of gene content, and implemented the assessment procedure in open-source software, with sets of Benchmarking Universal Single-Copy Orthologs.
Abstract: Motivation: Genomics has revolutionized biological research, but quality assessment of the resulting assembled sequences is complicated and remains mostly limited to technical measures like N50. Results: We propose a measure for quantitative assessment of genome assembly and annotation completeness based on evolutionarily informed expectations of gene content. We implemented the assessment procedure in open-source software, with sets of Benchmarking Universal Single-Copy Orthologs, named BUSCO. Availability and implementation: Software implemented in Python and datasets available for download from http://busco.ezlab.org. Contact: evgeny.zdobnov@unige.ch Supplementary information: Supplementary data are available at Bioinformatics online.

7,747 citations

Journal ArticleDOI
TL;DR: A new algorithm for finding tandem repeats which works without the need to specify either the pattern or pattern size is presented and its ability to detect tandem repeats that have undergone extensive mutational change is demonstrated.
Abstract: A tandem repeat in DNA is two or more contiguous, approximate copies of a pattern of nucleotides. Tandem repeats have been shown to cause human disease, may play a variety of regulatory and evolutionary roles and are important laboratory and analytic tools. Extensive knowledge about pattern size, copy number, mutational history, etc. for tandem repeats has been limited by the inability to easily detect them in genomic sequence data. In this paper, we present a new algorithm for finding tandem repeats which works without the need to specify either the pattern or pattern size. We model tandem repeats by percent identity and frequency of indels between adjacent pattern copies and use statistically based recognition criteria. We demonstrate the algorithm’s speed and its ability to detect tandem repeats that have undergone extensive mutational change by analyzing four sequences: the human frataxin gene, the human β T cell receptor locus sequence and two yeast chromosomes. These sequences range in size from 3 kb up to 700 kb. A World Wide Web server interface at c3.biomath.mssm.edu/trf.html has been established for automated use of the program.

6,577 citations