Showing papers on "Genomics" published in 2019


Journal Article
TL;DR: This work presents a method named HISAT2 (hierarchical indexing for spliced alignment of transcripts 2) that can align both DNA and RNA sequences using a graph Ferragina Manzini index, and uses it to represent and search an expanded model of the human reference genome.
Abstract: The human reference genome represents only a small number of individuals, which limits its usefulness for genotyping. We present a method named HISAT2 (hierarchical indexing for spliced alignment of transcripts 2) that can align both DNA and RNA sequences using a graph Ferragina Manzini index. We use HISAT2 to represent and search an expanded model of the human reference genome in which over 14.5 million genomic variants in combination with haplotypes are incorporated into the data structure used for searching and alignment. We benchmark HISAT2 using simulated and real datasets to demonstrate that our strategy of representing a population of genomes, together with a fast, memory-efficient search algorithm, provides more detailed and accurate variant analyses than other methods. We apply HISAT2 for HLA typing and DNA fingerprinting; both applications form part of the HISAT-genotype software that enables analysis of haplotype-resolved genes or genomic regions. HISAT-genotype outperforms other computational methods and matches or exceeds the performance of laboratory-based assays. A graph-based genome indexing scheme enables variant-aware alignment of sequences with very low memory requirements.

4,855 citations


Journal Article
TL;DR: eggNOG as discussed by the authors is a public database of orthology relationships, gene evolutionary histories and functional annotations, with a major update of the underlying genome sets, which have been expanded to 4445 representative bacteria and 168 archaea derived from 25 038 genomes.
Abstract: eggNOG is a public database of orthology relationships, gene evolutionary histories and functional annotations. Here, we present version 5.0, featuring a major update of the underlying genome sets, which have been expanded to 4445 representative bacteria and 168 archaea derived from 25 038 genomes, as well as 477 eukaryotic organisms and 2502 viral proteomes that were selected for diversity and filtered by genome quality. In total, 4.4M orthologous groups (OGs) distributed across 379 taxonomic levels were computed together with their associated sequence alignments, phylogenies, HMM models and functional descriptors. Precomputed evolutionary analysis provides fine-grained resolution of duplication/speciation events within each OG. Our benchmarks show that, despite doubling the amount of genomes, the quality of orthology assignments and functional annotations (80% coverage) has persisted without significant changes across this update. Finally, we improved eggNOG online services for fast functional annotation and orthology prediction of custom genomics or metagenomics datasets. All precomputed data are publicly available for downloading or via API queries at http://eggnog.embl.de.

1,971 citations


Journal Article
TL;DR: The latest understanding of long-range enhancer–promoter crosstalk is discussed, including target-gene specificity, interaction dynamics, protein and RNA architects of interactions, roles of 3D genome organization and the pathological consequences of regulatory rewiring.
Abstract: Spatiotemporal gene expression programmes are orchestrated by transcriptional enhancers, which are key regulatory DNA elements that engage in physical contacts with their target-gene promoters, often bridging considerable genomic distances. Recent progress in genomics, genome editing and microscopy methodologies has enabled the genome-wide mapping of enhancer-promoter contacts and their functional dissection. In this Review, we discuss novel concepts on how enhancer-promoter interactions are established and maintained, how the 3D architecture of mammalian genomes both facilitates and constrains enhancer-promoter contacts, and the role they play in gene expression control during normal development and disease.

646 citations


Journal Article
TL;DR: This update features a major scaling up of the resource coverage, sampling the genomic diversity of 1271 eukaryotes, 6013 prokaryotes and 6488 viruses, and picking up the best sequenced and annotated representatives for each species or operational taxonomic unit.
Abstract: OrthoDB (https://www.orthodb.org) provides evolutionary and functional annotations of orthologs. This update features a major scaling up of the resource coverage, sampling the genomic diversity of 1271 eukaryotes, 6013 prokaryotes and 6488 viruses. These include putative orthologs among 448 metazoan, 117 plant, 549 fungal, 148 protist, 5609 bacterial, and 404 archaeal genomes, picking up the best sequenced and annotated representatives for each species or operational taxonomic unit. OrthoDB relies on a concept of hierarchy of levels-of-orthology to enable more finely resolved gene orthologies for more closely related species. Since orthologs are the most likely candidates to retain functions of their ancestor gene, OrthoDB is aimed at narrowing down hypotheses about gene functions and enabling comparative evolutionary studies. Optional registered-user sessions allow on-line BUSCO assessments of gene set completeness and mapping of the uploaded data to OrthoDB to enable further interactive exploration of related annotations and generation of comparative charts. The accelerating expansion of genomics data continues to add valuable information, and OrthoDB strives to provide orthologs from the broadest coverage of species, as well as to extensively collate available functional annotations and to compute evolutionary annotations. The data can be browsed online, downloaded or assessed via REST API or SPARQL RDF compatible with both UniProt and Ensembl.

608 citations


Journal Article
Mark Chaisson, Ashley D. Sanders, Xuefang Zhao, Ankit Malhotra, David Porubsky, Tobias Rausch, Eugene J. Gardner, Oscar L. Rodriguez, Li Guo, Ryan L. Collins, Xian Fan, Jia Wen, Robert E. Handsaker, Susan Fairley, Zev N. Kronenberg, Xiangmeng Kong, Fereydoun Hormozdiari, Dillon Lee, Aaron M. Wenger, Alex Hastie, Danny Antaki, Thomas Anantharaman, Peter A. Audano, Harrison Brand, Stuart Cantsilieris, Han Cao, Eliza Cerveira, Chong Chen, Xintong Chen, Chen-Shan Chin, Zechen Chong, Nelson T. Chuang, Christine C. Lambert, Deanna M. Church, Laura Clarke, Andrew Farrell, Joey Flores, Timur R. Galeev, David U. Gorkin, Madhusudan Gujral, Victor Guryev, William Haynes Heaton, Jonas Korlach, Sushant Kumar, Jee Young Kwon, Ernest T. Lam, Jong Eun Lee, Joyce V. Lee, Wan-Ping Lee, Sau Peng Lee, Shantao Li, Patrick Marks, Karine A. Viaud-Martinez, Sascha Meiers, Katherine M. Munson, Fabio C. P. Navarro, Bradley J. Nelson, Conor Nodzak, Amina Noor, Sofia Kyriazopoulou-Panagiotopoulou, Andy Wing Chun Pang, Yunjiang Qiu, Gabriel Rosanio, Mallory Ryan, Adrian M. Stütz, Diana C.J. Spierings, Alistair Ward, Anne Marie E. Welch, Ming Xiao, Wei Xu, Chengsheng Zhang, Qihui Zhu, Xiangqun Zheng-Bradley, Ernesto Lowy, Sergei Yakneen, Steven A. McCarroll, Goo Jun, Li Ding, Chong-Lek Koh, Bing Ren, Paul Flicek, Ken Chen, Mark Gerstein, Pui-Yan Kwok, Peter M. Lansdorp, Gabor T. Marth, Jonathan Sebat, Xinghua Shi, Ali Bashir, Kai Ye, Scott E. Devine, Michael E. Talkowski, Ryan E. Mills, Tobias Marschall, Jan O. Korbel, Evan E. Eichler, Charles Lee
TL;DR: A suite of long-read, short-read, strand-specific sequencing technologies, optical mapping, and variant discovery algorithms is applied to comprehensively analyze three trios to define the full spectrum of human genetic variation in a haplotype-resolved manner.
Abstract: The incomplete identification of structural variants (SVs) from whole-genome sequencing data limits studies of human genetic diversity and disease association. Here, we apply a suite of long-read, short-read, strand-specific sequencing technologies, optical mapping, and variant discovery algorithms to comprehensively analyze three trios to define the full spectrum of human genetic variation in a haplotype-resolved manner. We identify 818,054 indel variants (<50 bp) and 27,622 SVs (≥50 bp) per genome. We also discover 156 inversions per genome and 58 of the inversions intersect with the critical regions of recurrent microdeletion and microduplication syndromes. Taken together, our SV callsets represent a three to sevenfold increase in SV detection compared to most standard high-throughput sequencing studies, including those from the 1000 Genomes Project. The methods and the dataset presented serve as a gold standard for the scientific community allowing us to make recommendations for maximizing structural variation sensitivity for future genome sequencing studies.

606 citations


Journal Article
TL;DR: Ultra-sensitive cell-free DNA (cfDNA) sequencing uncovers clonal hematopoiesis as a major source of somatic cfDNA variants in healthy individuals and patients with cancer, and underscores the importance of matched white blood cell DNA sequencing in liquid biopsy procedures.
Abstract: Accurate identification of tumor-derived somatic variants in plasma circulating cell-free DNA (cfDNA) requires understanding of the various biological compartments contributing to the cfDNA pool. We sought to define the technical feasibility of a high-intensity sequencing assay of cfDNA and matched white blood cell DNA covering a large genomic region (508 genes; 2 megabases; >60,000× raw depth) in a prospective study of 124 patients with metastatic cancer, with contemporaneous matched tumor tissue biopsies, and 47 controls without cancer. The assay displayed high sensitivity and specificity, allowing for de novo detection of tumor-derived mutations and inference of tumor mutational burden, microsatellite instability, mutational signatures and sources of somatic mutations identified in cfDNA. The vast majority of cfDNA mutations (81.6% in controls and 53.2% in patients with cancer) had features consistent with clonal hematopoiesis. This cfDNA sequencing approach revealed that clonal hematopoiesis constitutes a pervasive biological phenomenon, emphasizing the importance of matched cfDNA–white blood cell sequencing for accurate variant interpretation.

448 citations


Journal Article
13 Mar 2019-Nature
TL;DR: Draft prokaryotic genomes from faecal metagenomes of diverse human populations enrich the understanding of the human gut microbiome by identifying over two thousand new species-level taxa that have numerous disease associations.
Abstract: The genome sequences of many species of the human gut microbiome remain unknown, largely owing to challenges in cultivating microorganisms under laboratory conditions. Here we address this problem by reconstructing 60,664 draft prokaryotic genomes from 3,810 faecal metagenomes, from geographically and phenotypically diverse humans. These genomes provide reference points for 2,058 newly identified species-level operational taxonomic units (OTUs), which represents a 50% increase over the previously known phylogenetic diversity of sequenced gut bacteria. On average, the newly identified OTUs comprise 33% of richness and 28% of species abundance per individual, and are enriched in humans from rural populations. A meta-analysis of clinical gut-microbiome studies pinpointed numerous disease associations for the newly identified OTUs, which have the potential to improve predictive models. Finally, our analysis revealed that uncultured gut species have undergone genome reduction that has resulted in the loss of certain biosynthetic pathways, which may offer clues for improving cultivation strategies in the future.

438 citations


Journal Article
TL;DR: An overview of the theoretical models of tumour evolution is provided and what to consider when inferring evolutionary dynamics from genomic data is discussed.
Abstract: To a large extent, cancer conforms to evolutionary rules defined by the rates at which clones mutate, adapt and grow. Next-generation sequencing has provided a snapshot of the genetic landscape of most cancer types, and cancer genomics approaches are driving new insights into cancer evolutionary patterns in time and space. In contrast to species evolution, cancer is a particular case owing to the vast size of tumour cell populations, chromosomal instability and its potential for phenotypic plasticity. Nevertheless, an evolutionary framework is a powerful aid to understand cancer progression and therapy failure. Indeed, such a framework could be applied to predict individual tumour behaviour and support treatment strategies.

400 citations


Journal Article
TL;DR: A collection of 1,520 nonredundant, high-quality draft genomes generated from >6,000 bacteria cultivated from fecal samples of healthy humans, chosen to cover all major bacterial phyla and genera in the human gut.
Abstract: Reference genomes are essential for metagenomic analyses and functional characterization of the human gut microbiota. We present the Culturable Genome Reference (CGR), a collection of 1,520 nonredundant, high-quality draft genomes generated from >6,000 bacteria cultivated from fecal samples of healthy humans. Of the 1,520 genomes, which were chosen to cover all major bacterial phyla and genera in the human gut, 264 are not represented in existing reference genome catalogs. We show that this increase in the number of reference bacterial genomes improves the rate of mapping metagenomic sequencing reads from 50% to >70%, enabling higher-resolution descriptions of the human gut microbiome. We use the CGR genomes to annotate functions of 338 bacterial species, showing the utility of this resource for functional studies. We also carry out a pan-genome analysis of 38 important human gut species, which reveals the diversity and specificity of functional enrichment between their core and dispensable genomes.

343 citations


Journal Article
07 Mar 2019-Cell
TL;DR: This work shows that somatic mutations in mtDNA can be tracked by single-cell RNA or assay for transposase accessible chromatin (ATAC) sequencing and leverages somatic mtDNA mutations as natural genetic barcodes and demonstrates their utility as highly accurate clonal markers to infer cellular relationships.

302 citations


Journal Article
01 Nov 2019-Science
TL;DR: Tests to distinguish incomplete lineage sorting from introgression indicate that gene flow has obscured several ancient phylogenetic relationships in this group over large swathes of the genome, and a hitherto unknown inversion that traps a color pattern switch locus is identified.
Abstract: We used 20 de novo genome assemblies to probe the speciation history and architecture of gene flow in rapidly radiating Heliconius butterflies. Our tests to distinguish incomplete lineage sorting from introgression indicate that gene flow has obscured several ancient phylogenetic relationships in this group over large swathes of the genome. Introgressed loci are underrepresented in low-recombination and gene-rich regions, consistent with the purging of foreign alleles more tightly linked to incompatibility loci. Here, we identify a hitherto unknown inversion that traps a color pattern switch locus. We infer that this inversion was transferred between lineages by introgression and is convergent with a similar rearrangement in another part of the genus. These multiple de novo genome sequences enable improved understanding of the importance of introgression and selective processes in adaptive radiation.

Journal Article
TL;DR: The first annotated chromosome-level reference genome assembly for pea, Gregor Mendel’s original genetic model, provides insights into legume genome evolution and the molecular basis of agricultural traits for pea improvement.
Abstract: We report the first annotated chromosome-level reference genome assembly for pea, Gregor Mendel’s original genetic model. Phylogenetics and paleogenomics show genomic rearrangements across legumes and suggest a major role for repetitive elements in pea genome evolution. Compared to other sequenced Leguminosae genomes, the pea genome shows intense gene dynamics, most likely associated with genome size expansion when the Fabeae diverged from its sister tribes. During Pisum evolution, translocation and transposition differentially occurred across lineages. This reference sequence will accelerate our understanding of the molecular basis of agronomically important traits and support crop improvement.

Journal Article
TL;DR: A deeply sequenced dataset of 910 individuals, all of African descent, is used to construct a set of DNA sequences that is present in these individuals but missing from the reference human genome, demonstrating that the African pan-genome contains ~10% more DNA than the current human reference genome.
Abstract: We used a deeply sequenced dataset of 910 individuals, all of African descent, to construct a set of DNA sequences that is present in these individuals but missing from the reference human genome. We aligned 1.19 trillion reads from the 910 individuals to the reference genome (GRCh38), collected all reads that failed to align, and assembled these reads into contiguous sequences (contigs). We then compared all contigs to one another to identify a set of unique sequences representing regions of the African pan-genome missing from the reference genome. Our analysis revealed 296,485,284 bp in 125,715 distinct contigs present in the populations of African descent, demonstrating that the African pan-genome contains ~10% more DNA than the current human reference genome. Although the functional significance of nearly all of this sequence is unknown, 387 of the novel contigs fall within 315 distinct protein-coding genes, and the rest appear to be intergenic. Assembly of a pan-genome from 910 humans of African descent identifies 296.5 Mb of novel DNA mapping to 125,715 distinct contigs. This African pan-genome contains ~10% more DNA than the current human reference genome.
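The "~10% more DNA" figure in the abstract above can be sanity-checked with simple arithmetic. The sketch below is only illustrative: the novel-sequence totals come from the abstract, but the reference genome length (`ASSUMED_REFERENCE_BP`, roughly 3.1 Gb for GRCh38) is an assumption that the abstract does not state, and the function names are hypothetical.

```python
# Figures quoted in the abstract above.
NOVEL_BP = 296_485_284      # total novel sequence, in base pairs
NOVEL_CONTIGS = 125_715     # number of distinct novel contigs

# Assumption (not stated in the abstract): approximate total length of GRCh38.
ASSUMED_REFERENCE_BP = 3_100_000_000


def extra_fraction(novel_bp: int, reference_bp: int) -> float:
    """Novel sequence expressed as a fraction of the reference length."""
    return novel_bp / reference_bp


def mean_contig_length(total_bp: int, n_contigs: int) -> float:
    """Average contig length in base pairs."""
    return total_bp / n_contigs


if __name__ == "__main__":
    frac = extra_fraction(NOVEL_BP, ASSUMED_REFERENCE_BP)
    mean_len = mean_contig_length(NOVEL_BP, NOVEL_CONTIGS)
    print(f"novel sequence relative to reference: {frac:.1%}")
    print(f"mean novel contig length: {mean_len:.0f} bp")
```

Under the assumed reference length, the ratio comes out a little under 10%, which is consistent with the rounded figure in the abstract. The mean contig length of a few kilobases follows directly from the two quoted totals.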

Journal Article
TL;DR: Some of the common themes emerging from initial studies of single-cell RNA sequencing in cancer are discussed and challenges in cancer biology are highlighted for which emerging single-cell genomics methods may provide a compelling approach.

Journal Article
TL;DR: Mapping long-range chromatin interactions in 27 human cell/tissue types identifies candidate target genes of 70,329 putative regulatory elements and suggests potential regulatory function for 27,325 noncoding sequence variants associated with 2,117 physiological traits and diseases.
Abstract: A large number of putative cis-regulatory sequences have been annotated in the human genome, but the genes they control remain poorly defined. To bridge this gap, we generate maps of long-range chromatin interactions centered on 18,943 well-annotated promoters for protein-coding genes in 27 human cell/tissue types. We use this information to infer the target genes of 70,329 candidate regulatory elements and suggest potential regulatory function for 27,325 noncoding sequence variants associated with 2,117 physiological traits and diseases. Integrative analysis of these promoter-centered interactome maps reveals widespread enhancer-like promoters involved in gene regulation and common molecular pathways underlying distinct groups of human traits and diseases.

Journal Article
21 Jun 2019-Science
TL;DR: The controversies in the ruminant phylogeny are resolved and the genetic basis underpinning the evolutionary innovations in ruminants is revealed, demonstrating the power of using comparative phylogenomic approaches in resolving the deep branches of phylogeny that result from rapid radiations.
Abstract: The ruminants are one of the most successful mammalian lineages, exhibiting morphological and habitat diversity and containing several key livestock species. To better understand their evolution, we generated and analyzed de novo assembled genomes of 44 ruminant species, representing all six Ruminantia families. We used these genomes to create a time-calibrated phylogeny to resolve topological controversies, overcoming the challenges of incomplete lineage sorting. Population dynamic analyses show that population declines commenced between 100,000 and 50,000 years ago, which is concomitant with expansion in human populations. We also reveal genes and regulatory elements that possibly contribute to the evolution of the digestive system, cranial appendages, immune system, metabolism, body size, cursorial locomotion, and dentition of the ruminants.

Journal Article
TL;DR: The treasure trove of disease resistance genes present in wild relatives of domesticated crops is rapidly discovered using association genetics and enrichment sequencing.
Abstract: Disease resistance (R) genes from wild relatives could be used to engineer broad-spectrum resistance in domesticated crops. We combined association genetics with R gene enrichment sequencing (AgRenSeq) to exploit pan-genome variation in wild diploid wheat and rapidly clone four stem rust resistance genes. AgRenSeq enables R gene cloning in any crop that has a diverse germplasm panel.

Journal Article
TL;DR: Interestingly, a long terminal repeat (LTR) retrotransposon insertion upstream of MdMYB1, a core transcriptional activator of anthocyanin biosynthesis, is associated with red-skinned phenotype and provides insights into the molecular mechanisms underlying red fruit coloration.
Abstract: A complete and accurate genome sequence provides a fundamental tool for functional genomics and DNA-informed breeding. Here, we assemble a high-quality genome (contig N50 of 6.99 Mb) of the apple anther-derived homozygous line HFTH1, including 22 telomere sequences, using a combination of PacBio single-molecule real-time (SMRT) sequencing, chromosome conformation capture (Hi-C) sequencing, and optical mapping. In comparison to the Golden Delicious reference genome, we identify 18,047 deletions, 12,101 insertions and 14 large inversions. We reveal that these extensive genomic variations are largely attributable to activity of transposable elements. Interestingly, we find that a long terminal repeat (LTR) retrotransposon insertion upstream of MdMYB1, a core transcriptional activator of anthocyanin biosynthesis, is associated with red-skinned phenotype. This finding provides insights into the molecular mechanisms underlying red fruit coloration, and highlights the utility of this high-quality genome assembly in deciphering agriculturally important traits in apple. Existing apple genome assemblies all derive from Golden Delicious. Here, the authors combine different sequencing technologies to assemble a high quality genome of an anther-derived homozygous genotype HFTH1 and find the association of a retrotransposon and red fruit colour.
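Several abstracts on this page report a "contig N50" as the headline measure of assembly contiguity. As a minimal sketch of the usual definition (the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembly), the function below computes it from a list of contig lengths. The function name `n50` and the toy lengths are illustrative only and are not taken from any of the papers listed here.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches at least half of
    the summed length of all contigs.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("n50() requires at least one contig length")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Compare 2 * running with total to stay in integer arithmetic.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable: the running total always reaches the total")


if __name__ == "__main__":
    toy_contigs = [2, 3, 4, 5, 6]  # hypothetical contig lengths
    print(n50(toy_contigs))
```

For the toy input the summed length is 20, and the two longest contigs (6 and 5) together reach 11, so the N50 is 5. A larger N50 therefore means that half of the assembled sequence sits in longer contigs.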

Journal Article
TL;DR: The results identify variants of strong impact associated with 16 phenotypes, including body weight variation which, when combined with existing data, explain greater than 90% of body size variation in dogs.
Abstract: Domestic dog breeds are characterized by an unrivaled diversity of morphologic traits and breed-associated behaviors resulting from human selective pressures. To identify the genetic underpinnings of such traits, we analyze 722 canine whole genome sequences (WGS), documenting over 91 million single nucleotide and small indels, creating a large catalog of genomic variation for a companion animal species. We undertake both selective sweep analyses and genome wide association studies (GWAS) inclusive of over 144 modern breeds, 54 wild canids and a hundred village dogs. Our results identify variants of strong impact associated with 16 phenotypes, including body weight variation which, when combined with existing data, explain greater than 90% of body size variation in dogs. We thus demonstrate that GWAS and selection scans performed with WGS are powerful complementary methods for expanding the utility of companion animal systems for the study of mammalian growth and biology.

Journal Article
TL;DR: It is argued that standardization of WGS-AST by challenge with consistently phenotyped strain sets of defined genetic diversity is necessary to compare the efficacy of methods of prediction of antibiotic resistance based on genome sequences.
Abstract: Clinical microbiology has long relied on growing bacteria in culture to determine antimicrobial susceptibility profiles, but the use of whole-genome sequencing for antibiotic susceptibility testing (WGS-AST) is now a powerful alternative. This review discusses the technologies that made this possible and presents results from recent studies to predict resistance based on genome sequences. We examine differences between calling antibiotic resistance profiles by the simple presence or absence of previously known genes and single-nucleotide polymorphisms (SNPs) against approaches that deploy machine learning and statistical models. Often, the limitations to genome-based prediction arise from limitations of accuracy of culture-based AST in addition to an incomplete knowledge of the genetic basis of resistance. However, we need to maintain phenotypic testing even as genome-based prediction becomes more widespread to ensure that the results do not diverge over time. We argue that standardization of WGS-AST by challenge with consistently phenotyped strain sets of defined genetic diversity is necessary to compare the efficacy of methods of prediction of antibiotic resistance based on genome sequences.

Journal Article
TL;DR: While the genome sequencing revolution has led to the sequencing and assembly of many thousands of new genomes, genome annotation still uses very nearly the same technology that has been used for the past two decades.
Abstract: While the genome sequencing revolution has led to the sequencing and assembly of many thousands of new genomes, genome annotation still uses very nearly the same technology that we have used for the past two decades. The sheer number of genomes necessitates the use of fully automated procedures for annotation, but errors in annotation are just as prevalent as they were in the past, if not more so. How are we to solve this growing problem?

Journal Article
TL;DR: AnnoTree is introduced—an interactive, functionally annotated bacterial tree of life that integrates taxonomic, phylogenetic and functional annotation data from over 27 000 bacterial and 1500 archaeal genomes, and is expected to be a valuable resource for exploring prokaryotic gene histories.
Abstract: Bacterial genomics has revolutionized our understanding of the microbial tree of life; however, mapping and visualizing the distribution of functional traits across bacteria remains a challenge. Here, we introduce AnnoTree, an interactive, functionally annotated bacterial tree of life that integrates taxonomic, phylogenetic and functional annotation data from over 27 000 bacterial and 1500 archaeal genomes. AnnoTree enables visualization of millions of precomputed genome annotations across the bacterial and archaeal phylogenies, thereby allowing users to explore gene distributions as well as patterns of gene gain and loss in prokaryotes. Using AnnoTree, we examined the phylogenomic distributions of 28 311 gene/protein families, and measured their phylogenetic conservation, patchiness, and lineage-specificity within bacteria. Our analyses revealed widespread phylogenetic patchiness among bacterial gene families, reflecting the dynamic evolution of prokaryotic genomes. Genes involved in phage infection/defense, mobile elements, and antibiotic resistance dominated the list of most patchy traits, as well as numerous intriguing metabolic enzymes that appear to have undergone frequent horizontal transfer. We anticipate that AnnoTree will be a valuable resource for exploring prokaryotic gene histories, and will act as a catalyst for biological and evolutionary hypothesis generation. AnnoTree is freely available at http://annotree.uwaterloo.ca.

Journal Article
TL;DR: A review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses, to raise the awareness level within the community of database users and alert scientists working in the underlying workflow of database creation.
Abstract: The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with 'ready-to-use' deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others.

Journal Article
TL;DR: A high-quality reference genome of the maize SK inbred line and analyses between the tropical SK line and two other maize genomes, B73 and Mo17, provide insights into structural variation and crop improvement.
Abstract: Maize is one of the most important crops globally, and it shows remarkable genetic diversity. Knowledge of this diversity could help in crop improvement; however, gold-standard genomes have been elucidated only for modern temperate varieties. Here, we present a high-quality reference genome (contig N50 of 15.78 megabases) of the maize small-kernel inbred line, which is derived from a tropical landrace. Using haplotype maps derived from B73, Mo17 and SK, we identified 80,614 polymorphic structural variants across 521 diverse lines. Approximately 22% of these variants could not be detected by traditional single-nucleotide-polymorphism-based approaches, and some of them could affect gene expression and trait performance. To illustrate the utility of the diverse SK line, we used it to perform map-based cloning of a major effect quantitative trait locus controlling kernel weight—a key trait selected during maize improvement. The underlying candidate gene ZmBARELY ANY MERISTEM1d provides a target for increasing crop yields.

Journal ArticleDOI
TL;DR: TRITEX, an open-source computational workflow that combines paired-end, mate-pair, 10X Genomics linked-read with chromosome conformation capture sequencing data to construct sequence scaffolds with megabase-scale contiguity ordered into chromosomal pseudomolecules is presented.
Abstract: Chromosome-scale genome sequence assemblies underpin pan-genomic studies. Recent genome assembly efforts in the large-genome Triticeae crops wheat and barley have relied on the commercial closed-source assembly algorithm DeNovoMagic. We present TRITEX, an open-source computational workflow that combines paired-end, mate-pair, 10X Genomics linked-read with chromosome conformation capture sequencing data to construct sequence scaffolds with megabase-scale contiguity ordered into chromosomal pseudomolecules. We evaluate the performance of TRITEX on publicly available sequence data of tetraploid wild emmer and hexaploid bread wheat, and construct an improved annotated reference genome sequence assembly of the barley cultivar Morex as a community resource.

Journal ArticleDOI
TL;DR: stLFR represents an easily automatable solution that enables high-quality sequencing, phasing, SV detection, scaffolding, cost-effective diploid de novo genome assembly, and other long DNA sequencing applications.
Abstract: Here, we describe single-tube long fragment read (stLFR), a technology that enables sequencing of data from long DNA molecules using economical second-generation sequencing technology. It is based on adding the same barcode sequence to subfragments of the original long DNA molecule (DNA cobarcoding). To achieve this efficiently, stLFR uses the surface of microbeads to create millions of miniaturized barcoding reactions in a single tube. Using a combinatorial process, up to 3.6 billion unique barcode sequences were generated on beads, enabling practically nonredundant cobarcoding with 50 million barcodes per sample. Using stLFR, we demonstrate efficient unique cobarcoding of more than 8 million 20- to 300-kb genomic DNA fragments. Analysis of the human genome NA12878 with stLFR demonstrated high-quality variant calling and phase block lengths up to N50 34 Mb. We also demonstrate detection of complex structural variants and complete diploid de novo assembly of NA12878. These analyses were all performed using single stLFR libraries, and their construction did not significantly add to the time or cost of whole-genome sequencing (WGS) library preparation. stLFR represents an easily automatable solution that enables high-quality sequencing, phasing, SV detection, scaffolding, cost-effective diploid de novo genome assembly, and other long DNA sequencing applications.

Journal ArticleDOI
TL;DR: The Graph Genome Pipeline as discussed by the authors is a graph reference genome implementation that enables read alignment across 2,800 diploid genomes encompassing 12.6 million SNPs and 4.0 million insertions and deletions (indels).
Abstract: The human reference genome serves as the foundation for genomics by providing a scaffold for alignment of sequencing reads, but currently only reflects a single consensus haplotype, thus impairing analysis accuracy. Here we present a graph reference genome implementation that enables read alignment across 2,800 diploid genomes encompassing 12.6 million SNPs and 4.0 million insertions and deletions (indels). The pipeline processes one whole-genome sequencing sample in 6.5 h using a system with 36 CPU cores. We show that using a graph genome reference improves read mapping sensitivity and produces a 0.5% increase in variant calling recall, with unaffected specificity. Structural variations incorporated into a graph genome can be genotyped accurately under a unified framework. Finally, we show that iterative augmentation of graph genomes yields incremental gains in variant calling accuracy. Our implementation is an important advance toward fulfilling the promise of graph genomes to radically enhance the scalability and accuracy of genomic analyses. Graph Genome Pipeline is a read-alignment and variant-calling pipeline based on graph genomes that offers improved read-mapping and variant-calling accuracy while achieving speed comparable to those of linear reference genome pipelines.

Journal ArticleDOI
07 May 2019 - Mycology
TL;DR: Here, it is proposed to separate the fungal types into a physical type based on a specimen, a genome DNA (gDNA) type based on the complete genome sequence of culturable and unculturable fungal specimens, and a digital type based on environmental DNA sequence data.
Abstract: The global biodiversity of fungi has been extensively investigated and their species number has been estimated. Notably, the development of molecular phylogeny has revealed an unexpected fungal diversity, and the use of culture-independent approaches, including high-throughput amplicon sequencing, has dramatically increased the number of fungal operational taxonomic units. A number of novel taxa, including new divisions, classes, orders and families, have been established in the last decade. Many cryptic species were identified by molecular phylogeny. Based on recently generated data from culture-dependent and -independent surveys of the same samples, the number of fungal species on Earth was estimated at 12 (11.7-13.2) million, compared with the 2.2-3.8 million species recently estimated by a variety of estimation techniques. Moreover, it has been speculated that the current use of high-throughput sequencing techniques would reveal an even higher diversity than this estimate. Recently, the formal classification of environmental sequences and the acceptance of DNA sequence data as types for fungal names were proposed but strongly opposed by the mycological community. Surveys of fungi in unusual niches have indicated that many fungi previously regarded as "unculturable" can be cultured on certain substrates under specific conditions. Moreover, high-throughput amplicon sequencing, shotgun metagenomics and single-cell genomics could be powerful means of detecting novel taxa. Here, we propose to separate the fungal types into a physical type based on a specimen, a genome DNA (gDNA) type based on the complete genome sequence of culturable and unculturable fungal specimens, and a digital type based on environmental DNA sequence data. The physical and gDNA types should have priority, while the digital type can serve as a temporary supplement until the physical and gDNA types become available. The fungal name based on the "digital type" could be assigned as the "clade" name + species name. The "clade" name could be the name of the genus, family, order, etc. with which the sequence of the digital type is affiliated. Efforts to facilitate future cultivation should be encouraged. Also, as knowledge of fungi inhabiting various environments advances, mostly because of the rapid development of new detection technologies, more information on fungal diversity on our planet should be expected.

Journal ArticleDOI
TL;DR: A review of the genetics, genomics and breeding of cowpea is presented, and several informative markers associated with quantitative trait loci related to desirable attributes of cowpea were generated.
Abstract: (Communicated by: C. Ojiewo) Cowpea, Vigna unguiculata (L.), is an important grain legume grown in the tropics, where it constitutes a valuable source of protein in the diets of millions of people. Some abiotic and biotic stresses adversely affect its productivity. A review of the genetics, genomics and breeding of cowpea is presented in this article. Cowpea breeding programmes have intensively studied the qualitative and quantitative genetics of the crop to enhance its improvement. A number of initiatives, including the Tropical Legumes projects, have contributed to the development of cowpea genomic resources. Recent progress in the development of a consensus genetic map containing 37,372 SNPs mapped to 3,280 bins will strengthen the cowpea trait discovery pipeline. Several informative markers associated with quantitative trait loci (QTL) related to desirable attributes of cowpea were generated. Cowpea genetic improvement activities aim at the development of drought-tolerant, phosphorus-use-efficient, and bacterial blight and virus resistant lines by exploiting available genetic resources and deploying modern breeding tools that will enhance genetic gain when the lines are grown by farmers in sub-Saharan Africa.

Journal ArticleDOI
TL;DR: The University of California Santa Cruz Genome Browser website enters its 20th year of providing high-quality genomics data visualization and genome annotations to the research community, with new datasets including Tabula Muris single-cell expression data.
Abstract: The University of California Santa Cruz Genome Browser website (https://genome.ucsc.edu) enters its 20th year of providing high-quality genomics data visualization and genome annotations to the research community. In the past year, we have added a new option to our web BLAT tool that allows search against all genomes, a single-cell expression viewer (https://cells.ucsc.edu), a 'lollipop' plot display mode for high-density variation data, a RESTful API for data extraction and a custom-track backup feature. New datasets include Tabula Muris single-cell expression data, GeneHancer regulatory annotations, The Cancer Genome Atlas Pan-Cancer variants, Genome Reference Consortium Patch sequences, new ENCODE transcription factor binding site peaks and clusters, the Database of Genomic Variants Gold Standard Variants, Genomenon Mastermind variants and three new multi-species alignment tracks.
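The RESTful API mentioned in this abstract can be queried over plain HTTPS. The sketch below is a hedged illustration, not an official client: it assumes the public endpoint root `https://api.genome.ucsc.edu` and the `getData/sequence` function with `genome`, `chrom`, `start` and `end` parameters separated by semicolons, as recalled from the UCSC help pages. The helper names `sequence_url` and `fetch_json` are hypothetical, and the parameter conventions should be checked against the current documentation before use.

```python
import json
import urllib.request

# Assumed public root of the UCSC Genome Browser REST API.
API_ROOT = "https://api.genome.ucsc.edu"


def sequence_url(genome, chrom, start, end):
    """Build a getData/sequence request URL (hypothetical helper).

    The start and end coordinates are passed through unchanged; which
    coordinate convention the server applies is an assumption to verify.
    """
    if start < 0 or end <= start:
        raise ValueError("require 0 <= start < end")
    return (
        f"{API_ROOT}/getData/sequence?"
        f"genome={genome};chrom={chrom};start={start};end={end}"
    )


def fetch_json(url, timeout=30):
    """Fetch a URL and decode the JSON body (needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Building the URL needs no network access.
print(sequence_url("hg38", "chrM", 4321, 5678))
```

Calling `fetch_json(sequence_url("hg38", "chrM", 4321, 5678))` would perform the actual request; the structure of the returned JSON is not assumed here.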