
Showing papers on "Genome" published in 2019


Journal ArticleDOI
TL;DR: This work presents a method named HISAT2 (hierarchical indexing for spliced alignment of transcripts 2) that can align both DNA and RNA sequences using a graph Ferragina Manzini index, and uses it to represent and search an expanded model of the human reference genome.
Abstract: The human reference genome represents only a small number of individuals, which limits its usefulness for genotyping. We present a method named HISAT2 (hierarchical indexing for spliced alignment of transcripts 2) that can align both DNA and RNA sequences using a graph Ferragina Manzini index. We use HISAT2 to represent and search an expanded model of the human reference genome in which over 14.5 million genomic variants in combination with haplotypes are incorporated into the data structure used for searching and alignment. We benchmark HISAT2 using simulated and real datasets to demonstrate that our strategy of representing a population of genomes, together with a fast, memory-efficient search algorithm, provides more detailed and accurate variant analyses than other methods. We apply HISAT2 for HLA typing and DNA fingerprinting; both applications form part of the HISAT-genotype software that enables analysis of haplotype-resolved genes or genomic regions. HISAT-genotype outperforms other computational methods and matches or exceeds the performance of laboratory-based assays. A graph-based genome indexing scheme enables variant-aware alignment of sequences with very low memory requirements.

4,855 citations
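The search idea behind a Ferragina Manzini (FM) index can be sketched on a single linear string. The block below is a minimal illustration of FM-index backward search, not HISAT2's graph index: it ignores variants, haplotypes and hierarchical indexing, and it builds the suffix array naively. The function names `build_fm_index` and `locate` are hypothetical.

```python
def build_fm_index(text):
    """Build the suffix array, C table and occurrence table for text.

    Illustrative only: a plain FM index over one string, assuming the
    text does not already contain the '$' terminator.
    """
    text += "$"
    # Naive suffix array construction; fine for a short example string.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # c_table[ch] = number of characters in text strictly smaller than ch.
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[ch][i] = number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return sa, c_table, occ


def locate(pattern, sa, c_table, occ):
    """Return sorted start positions of pattern using backward search."""
    lo, hi = 0, len(sa)
    # Extend the match one character at a time, from the pattern's end.
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


sa, c_table, occ = build_fm_index("GATTACAGATTACA")
print(locate("ATTA", sa, c_table, occ))
```

Each step narrows a half-open range of suffix-array rows, so the number of steps depends on the pattern length rather than on the text length. A production index would replace the full occurrence table with sampled checkpoints to save memory.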


Journal ArticleDOI
TL;DR: The accuracy of the GTDB-Tk taxonomic assignments is demonstrated by evaluating its performance on a phylogenetically diverse set of 10 156 bacterial and archaeal metagenome-assembled genomes.
Abstract: The Genome Taxonomy Database Toolkit (GTDB-Tk) provides objective taxonomic assignments for bacterial and archaeal genomes based on the GTDB. GTDB-Tk is computationally efficient and able to classify thousands of draft genomes in parallel. Here we demonstrate the accuracy of the GTDB-Tk taxonomic assignments by evaluating its performance on a phylogenetically diverse set of 10 156 bacterial and archaeal metagenome-assembled genomes.

2,053 citations


Journal ArticleDOI
TL;DR: The phylogenetic analysis complemented with synteny analyses suggests that Bmp2, -4 and -16 are remnants of a gene quartet that originated during the two rounds of whole-genome duplication (2R-WGD) early in vertebrate evolution.
Abstract: The vertebrate gene repertoire is characterized by “cryptic” genes whose identification has been hampered by their absence from the genomes of well-studied species. One example is the Bmp16 gene, a paralog of the developmental key genes Bmp2 and -4. We focus on the Bmp2/4/16 group of genes to study the evolutionary dynamics following gen(om)e duplications with special emphasis on the poorly studied Bmp16 gene. We reveal the presence of Bmp16 in chondrichthyans in addition to previously reported teleost fishes and reptiles. Using comprehensive, vertebrate-wide gene sampling, our phylogenetic analysis complemented with synteny analyses suggests that Bmp2, -4 and -16 are remnants of a gene quartet that originated during the two rounds of whole-genome duplication (2R-WGD) early in vertebrate evolution. We confirm that Bmp16 genes were lost independently in at least three lineages (mammals, archelosaurs and amphibians) and report that they have elevated rates of sequence evolution. This finding agrees with their more “flexible” deployment during development; while Bmp16 has limited embryonic expression domains in the cloudy catshark, it is broadly expressed in the green anole lizard. Our study illustrates the dynamics of gene family evolution by integrating insights from sequence diversification, gene repertoire changes, and shuffling of expression domains.

1,376 citations


Journal ArticleDOI
TL;DR: This work introduces TYGS, the Type (Strain) Genome Server, a user-friendly high-throughput web server for genome-based prokaryote taxonomy and analysis that is connected to a large, continuously growing database of genomic, taxonomic and nomenclatural information.
Abstract: Microbial taxonomy is increasingly influenced by genome-based computational methods. Yet such analyses can be complex and require expert knowledge. Here we introduce TYGS, the Type (Strain) Genome Server, a user-friendly high-throughput web server for genome-based prokaryote taxonomy, connected to a large, continuously growing database of genomic, taxonomic and nomenclatural information. It infers genome-scale phylogenies and state-of-the-art estimates for species and subspecies boundaries from user-defined and automatically determined closest type genome sequences. TYGS also provides comprehensive access to nomenclature, synonymy and associated taxonomic literature. Clinically important examples demonstrate how TYGS can yield new insights into microbial classification, such as evidence for a species-level separation of previously proposed subspecies of Salmonella enterica. TYGS is an integrated approach for the classification of microbes that unlocks novel scientific approaches to microbiologists worldwide and is particularly helpful for the rapidly expanding field of genome-based taxonomic descriptions of new genera, species or subspecies.

1,202 citations


Posted ContentDOI
Konrad J. Karczewski1,2, Laurent C. Francioli1,2, Grace Tiao1,2, Beryl B. Cummings1,2, Jessica Alföldi1,2, Qingbo Wang1,2, Ryan L. Collins1,2, Kristen M. Laricchia1,2, Andrea Ganna1,2,3, Daniel P. Birnbaum1, Laura D. Gauthier1, Harrison Brand1,2, Matthew Solomonson1,2, Nicholas A. Watts1,2, Daniel R. Rhodes4, Moriel Singer-Berk1, Eleanor G. Seaby1,2, Jack A. Kosmicki1,2, Raymond K. Walters1,2, Katherine Tashman1,2, Yossi Farjoun1, Eric Banks1, Timothy Poterba1,2, Arcturus Wang1,2, Cotton Seed1,2, Nicola Whiffin1,5, Jessica X. Chong6, Kaitlin E. Samocha7, Emma Pierce-Hoffman1, Zachary Zappala1,8, Anne H. O’Donnell-Luria1,2,9, Eric Vallabh Minikel1, Ben Weisburd1, Monkol Lek1,10, James S. Ware1,5, Christopher Vittal1,2, Irina M. Armean1,2,11, Louis Bergelson1, Kristian Cibulskis1, Kristen M. Connolly1, Miguel Covarrubias1, Stacey Donnelly1, Steven Ferriera1, Stacey Gabriel1, Jeff Gentry1, Namrata Gupta1, Thibault Jeandet1, Diane Kaplan1, Christopher Llanwarne1, Ruchi Munshi1, Sam Novod1, Nikelle Petrillo1, David Roazen1, Valentin Ruano-Rubio1, Andrea Saltzman1, Molly Schleicher1, Jose Soto1, Kathleen Tibbetts1, Charlotte Tolonen1, Gordon Wade1, Michael E. Talkowski1,2, Benjamin M. Neale1,2, Mark J. Daly1, Daniel G. MacArthur1,2
30 Jan 2019-bioRxiv
TL;DR: Using an improved human mutation rate model, this work classifies human protein-coding genes along a spectrum representing tolerance to inactivation, validates this classification using data from model organisms and engineered human cells, and shows that it can be used to improve gene discovery power for both common and rare diseases.
Abstract: Genetic variants that inactivate protein-coding genes are a powerful source of information about the phenotypic consequences of gene disruption: genes critical for an organism’s function will be depleted for such variants in natural populations, while non-essential genes will tolerate their accumulation. However, predicted loss-of-function (pLoF) variants are enriched for annotation errors, and tend to be found at extremely low frequencies, so their analysis requires careful variant annotation and very large sample sizes. Here, we describe the aggregation of 125,748 exomes and 15,708 genomes from human sequencing studies into the Genome Aggregation Database (gnomAD). We identify 443,769 high-confidence pLoF variants in this cohort after filtering for sequencing and annotation artifacts. Using an improved model of human mutation, we classify human protein-coding genes along a spectrum representing intolerance to inactivation, validate this classification using data from model organisms and engineered human cells, and show that it can be used to improve gene discovery power for both common and rare diseases.

1,128 citations


Journal ArticleDOI
TL;DR: This major update of CHOPCHOP introduces functionality for targeting RNA with Cas13, which includes support for alternative transcript isoforms and RNA accessibility predictions, and incorporates new DNA targeting modes, including CRISPR activation/repression, targeted enrichment of loci for long-read sequencing, and prediction of Cas9 repair outcomes.
Abstract: The CRISPR-Cas system is a powerful genome editing tool that functions in a diverse array of organisms and cell types. The technology was initially developed to induce targeted mutations in DNA, but CRISPR-Cas has now been adapted to target nucleic acids for a range of purposes. CHOPCHOP is a web tool for identifying CRISPR-Cas single guide RNA (sgRNA) targets. In this major update of CHOPCHOP, we expand our toolbox beyond knockouts. We introduce functionality for targeting RNA with Cas13, which includes support for alternative transcript isoforms and RNA accessibility predictions. We incorporate new DNA targeting modes, including CRISPR activation/repression, targeted enrichment of loci for long-read sequencing, and prediction of Cas9 repair outcomes. Finally, we expand our results page visualization to reveal alternative isoforms and downstream ATG sites, which will aid users in avoiding the expression of truncated proteins. The CHOPCHOP web tool now supports over 200 genomes and we have released a command-line script for running larger jobs and handling unsupported genomes. CHOPCHOP v3 can be found at https://chopchop.cbu.uib.no.

879 citations


Journal ArticleDOI
TL;DR: This work defines the ENCODE blacklist, a comprehensive set of regions in the human, mouse, worm, and fly genomes that have anomalous, unstructured, or high signal in next-generation sequencing experiments independent of cell line or experiment.
Abstract: Functional genomics assays based on high-throughput sequencing greatly expand our ability to understand the genome. Here, we define the ENCODE blacklist, a comprehensive set of regions in the human, mouse, worm, and fly genomes that have anomalous, unstructured, or high signal in next-generation sequencing experiments independent of cell line or experiment. The removal of the ENCODE blacklist is an essential quality measure when analyzing functional genomics data.

850 citations
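Removing blacklisted regions from a set of called intervals amounts to an overlap filter. The sketch below assumes BED-style, 0-based half-open intervals given as `(chrom, start, end)` tuples. The function names and all coordinates are hypothetical and are not taken from the ENCODE blacklist files.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """Half-open intervals overlap when each starts before the other ends."""
    return a_start < b_end and b_start < a_end


def remove_blacklisted(peaks, blacklist):
    """Return the peaks that do not overlap any blacklisted region."""
    # Group blacklist regions by chromosome to avoid cross-chromosome checks.
    by_chrom = {}
    for chrom, start, end in blacklist:
        by_chrom.setdefault(chrom, []).append((start, end))
    kept = []
    for chrom, start, end in peaks:
        regions = by_chrom.get(chrom, [])
        if not any(overlaps(start, end, s, e) for s, e in regions):
            kept.append((chrom, start, end))
    return kept


# Made-up coordinates for illustration only.
peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
blacklist = [("chr1", 150, 300), ("chr2", 200, 400)]
print(remove_blacklisted(peaks, blacklist))
```

The linear scan per peak is adequate for a short blacklist. For large interval sets, sorting the regions and using binary search or an interval tree would be the usual refinement.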


Posted ContentDOI
Daniel Taliun1, Daniel N. Harris2, Michael D. Kessler2, Jedidiah Carlson3, +191 more (61 institutions)
06 Mar 2019-bioRxiv
TL;DR: This work describes the goals and design of the TOPMed program along with resources and early insights from its sequence data; the nearly complete catalog of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and non-coding sequence variants to phenotypic variation.
Abstract: The Trans-Omics for Precision Medicine (TOPMed) program seeks to elucidate the genetic architecture and disease biology of heart, lung, blood, and sleep disorders, with the ultimate goal of improving diagnosis, treatment, and prevention. The initial phases of the program focus on whole genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here, we describe TOPMed goals and design as well as resources and early insights from the sequence data. The resources include a variant browser, a genotype imputation panel, and sharing of genomic and phenotypic data via dbGaP. In 53,581 TOPMed samples, >400 million single-nucleotide and insertion/deletion variants were detected by alignment with the reference genome. Additional novel variants are detectable through assembly of unmapped reads and customized analysis in highly variable loci. Among the >400 million variants detected, 97% have frequency

662 citations


Journal ArticleDOI
TL;DR: This update to the UCSC Genome Browser adds a new tool that lets users interactively arrange existing graphing tracks into new groups, along with new annotations that include a 30-way primate alignment on the human genome.
Abstract: The UCSC Genome Browser (https://genome.ucsc.edu) is a graphical viewer for exploring genome annotations. For almost two decades, the Browser has provided visualization tools for genetics and molecular biology and continues to add new data and features. This year, we added a new tool that lets users interactively arrange existing graphing tracks into new groups. Other software additions include new formats for chromosome interactions, a ChIP-Seq peak display for track hubs and improved support for HGVS. On the annotation side, we have added gnomAD, TCGA expression, RefSeq Functional elements, GTEx eQTLs, CRISPR Guides, SNPpedia and created a 30-way primate alignment on the human genome. Nine assemblies now have RefSeq-mapped gene models.

649 citations


Journal ArticleDOI
Mark Chaisson1,2, Ashley D. Sanders, Xuefang Zhao3,4, Ankit Malhotra, David Porubsky5,6, Tobias Rausch, Eugene J. Gardner7, Oscar L. Rodriguez8, Li Guo9, Ryan L. Collins4, Xian Fan10, Jia Wen11, Robert E. Handsaker4,12, Susan Fairley13, Zev N. Kronenberg2, Xiangmeng Kong14, Fereydoun Hormozdiari15, Dillon Lee16, Aaron M. Wenger17, Alex Hastie, Danny Antaki18, Thomas Anantharaman, Peter A. Audano2, Harrison Brand4, Stuart Cantsilieris2, Han Cao, Eliza Cerveira, Chong Chen10, Xintong Chen7, Chen-Shan Chin17, Zechen Chong10, Nelson T. Chuang7, Christine C. Lambert17, Deanna M. Church, Laura Clarke13, Andrew Farrell16, Joey Flores19, Timur R. Galeev14, David U. Gorkin18,20, Madhusudan Gujral18, Victor Guryev6, William Haynes Heaton, Jonas Korlach17, Sushant Kumar14, Jee Young Kwon21, Ernest T. Lam, Jong Eun Lee, Joyce V. Lee, Wan-Ping Lee, Sau Peng Lee, Shantao Li14, Patrick Marks, Karine A. Viaud-Martinez19, Sascha Meiers, Katherine M. Munson2, Fabio C. P. Navarro14, Bradley J. Nelson2, Conor Nodzak11, Amina Noor18, Sofia Kyriazopoulou-Panagiotopoulou, Andy Wing Chun Pang, Yunjiang Qiu18,20, Gabriel Rosanio18, Mallory Ryan, Adrian M. Stütz, Diana C.J. Spierings6, Alistair Ward16, Anne Marie E. Welch2, Ming Xiao22, Wei Xu, Chengsheng Zhang, Qihui Zhu, Xiangqun Zheng-Bradley13, Ernesto Lowy13, Sergei Yakneen, Steven A. McCarroll4,12, Goo Jun23, Li Ding24, Chong-Lek Koh25, Bing Ren18,20, Paul Flicek13, Ken Chen10, Mark Gerstein, Pui-Yan Kwok26, Peter M. Lansdorp6,27,28, Gabor T. Marth16, Jonathan Sebat18, Xinghua Shi11, Ali Bashir8, Kai Ye9, Scott E. Devine7, Michael E. Talkowski4,12, Ryan E. Mills3, Tobias Marschall5, Jan O. Korbel13, Evan E. Eichler2, Charles Lee21
TL;DR: A suite of long-read, short-read, and strand-specific sequencing technologies, optical mapping, and variant discovery algorithms is applied to comprehensively analyze three trios and define the full spectrum of human genetic variation in a haplotype-resolved manner.
Abstract: The incomplete identification of structural variants (SVs) from whole-genome sequencing data limits studies of human genetic diversity and disease association. Here, we apply a suite of long-read, short-read, strand-specific sequencing technologies, optical mapping, and variant discovery algorithms to comprehensively analyze three trios to define the full spectrum of human genetic variation in a haplotype-resolved manner. We identify 818,054 indel variants (<50 bp) and 27,622 SVs (≥50 bp) per genome. We also discover 156 inversions per genome and 58 of the inversions intersect with the critical regions of recurrent microdeletion and microduplication syndromes. Taken together, our SV callsets represent a three to sevenfold increase in SV detection compared to most standard high-throughput sequencing studies, including those from the 1000 Genomes Project. The methods and the dataset presented serve as a gold standard for the scientific community allowing us to make recommendations for maximizing structural variation sensitivity for future genome sequencing studies.

606 citations


Journal ArticleDOI
TL;DR: Significant enhancements to MGD are described, including two new graphical user interfaces: the Multi Genome Viewer for exploring the genomes of multiple mouse strains and the Phenotype-Gene Expression matrix which was developed in collaboration with the Gene Expression Database (GXD) and allows researchers to compare gene expression and phenotype annotations for mouse genes.
Abstract: The Mouse Genome Database (MGD; http://www.informatics.jax.org) is the community model organism genetic and genome resource for the laboratory mouse. MGD is the authoritative source for biological reference data sets related to mouse genes, gene functions, phenotypes, and mouse models of human disease. MGD is the primary outlet for official gene, allele and mouse strain nomenclature based on the guidelines set by the International Committee on Standardized Nomenclature for Mice. In this report we describe significant enhancements to MGD, including two new graphical user interfaces: (i) the Multi Genome Viewer for exploring the genomes of multiple mouse strains and (ii) the Phenotype-Gene Expression matrix which was developed in collaboration with the Gene Expression Database (GXD) and allows researchers to compare gene expression and phenotype annotations for mouse genes. Other recent improvements include enhanced efficiency of our literature curation processes and the incorporation of Transcriptional Start Site (TSS) annotations from RIKEN's FANTOM 5 initiative.

Journal ArticleDOI
TL;DR: A simple activity-by-contact model substantially outperformed previous methods at predicting the complex connections in the CRISPR dataset and allows systematic mapping of enhancer–gene connections in a given cell type, on the basis of chromatin-state measurements.
Abstract: Enhancer elements in the human genome control how genes are expressed in specific cell types and harbor thousands of genetic variants that influence risk for common diseases [1-4]. Yet, we still do not know how enhancers regulate specific genes, and we lack general rules to predict enhancer-gene connections across cell types [5,6]. We developed an experimental approach, CRISPRi-FlowFISH, to perturb enhancers in the genome, and we applied it to test >3,500 potential enhancer-gene connections for 30 genes. We found that a simple activity-by-contact model substantially outperformed previous methods at predicting the complex connections in our CRISPR dataset. This activity-by-contact model allows us to construct genome-wide maps of enhancer-gene connections in a given cell type, on the basis of chromatin state measurements. Together, CRISPRi-FlowFISH and the activity-by-contact model provide a systematic approach to map and predict which enhancers regulate which genes, and will help to interpret the functions of the thousands of disease risk variants in the noncoding genome.
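One way to read "activity-by-contact" is as a per-gene score in which each candidate element's activity is multiplied by its contact frequency with the gene's promoter, then normalized by the sum of those products over all candidate elements. The abstract does not spell out the formula, so the normalization, the function name `abc_scores` and the input values below are assumptions made for illustration.

```python
def abc_scores(elements):
    """Compute activity-by-contact shares for one gene.

    elements maps a (hypothetical) element name to an (activity, contact)
    pair. Each score is that element's activity * contact divided by the
    sum of activity * contact over all elements supplied.
    """
    products = {name: activity * contact
                for name, (activity, contact) in elements.items()}
    total = sum(products.values())
    if total == 0:
        # No element has any activity-weighted contact with this gene.
        return {name: 0.0 for name in products}
    return {name: value / total for name, value in products.items()}


# Made-up activity and contact values for three candidate elements.
candidates = {"e1": (10.0, 0.5), "e2": (5.0, 1.0), "e3": (0.0, 2.0)}
print(abc_scores(candidates))
```

Under this reading, an element with strong contact but no activity (like `e3` above) contributes nothing, and scores for a gene sum to 1 whenever at least one element has a non-zero product.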

Journal ArticleDOI
TL;DR: A comprehensive landscape of different modes of gene duplication across the plant kingdom is identified by comparing 141 genomes, which provides a solid foundation for further investigation of the dynamic evolution of duplicate genes.
Abstract: The sharp increase of plant genome and transcriptome data provides valuable resources to investigate evolutionary consequences of gene duplication in a range of taxa, and unravel common principles underlying duplicate gene retention. We survey 141 sequenced plant genomes to elucidate consequences of gene and genome duplication, processes central to the evolution of biodiversity. We develop a pipeline named DupGen_finder to identify different modes of gene duplication in plants. Genes derived from whole-genome, tandem, proximal, transposed, or dispersed duplication differ in abundance, selection pressure, expression divergence, and gene conversion rate among genomes. The number of WGD-derived duplicate genes decreases exponentially with increasing age of duplication events; transposed duplication- and dispersed duplication-derived genes declined in parallel. In contrast, the frequency of tandem and proximal duplications showed no significant decrease over time, providing a continuous supply of variants available for adaptation to continuously changing environments. Moreover, tandem and proximal duplicates experienced stronger selective pressure than genes formed by other modes and evolved toward biased functional roles involved in plant self-defense. The rate of gene conversion among WGD-derived gene pairs declined over time, peaking shortly after polyploidization. To provide a platform for accessing duplicated gene pairs in different plants, we constructed the Plant Duplicate Gene Database. We identify a comprehensive landscape of different modes of gene duplication across the plant kingdom by comparing 141 genomes, which provides a solid foundation for further investigation of the dynamic evolution of duplicate genes.

Journal ArticleDOI
13 Mar 2019-Nature
TL;DR: Draft prokaryotic genomes from faecal metagenomes of diverse human populations enrich the understanding of the human gut microbiome by identifying over two thousand new species-level taxa that have numerous disease associations.
Abstract: The genome sequences of many species of the human gut microbiome remain unknown, largely owing to challenges in cultivating microorganisms under laboratory conditions. Here we address this problem by reconstructing 60,664 draft prokaryotic genomes from 3,810 faecal metagenomes, from geographically and phenotypically diverse humans. These genomes provide reference points for 2,058 newly identified species-level operational taxonomic units (OTUs), which represents a 50% increase over the previously known phylogenetic diversity of sequenced gut bacteria. On average, the newly identified OTUs comprise 33% of richness and 28% of species abundance per individual, and are enriched in humans from rural populations. A meta-analysis of clinical gut-microbiome studies pinpointed numerous disease associations for the newly identified OTUs, which have the potential to improve predictive models. Finally, our analysis revealed that uncultured gut species have undergone genome reduction that has resulted in the loss of certain biosynthetic pathways, which may offer clues for improving cultivation strategies in the future.

Journal ArticleDOI
TL;DR: This work presents vConTACT v.2.0, a network-based application utilizing whole genome gene-sharing profiles for virus taxonomy that integrates distance-based hierarchical clustering and confidence scores for all taxonomic predictions, and applies it to analyze 15,280 Global Ocean Virome genome fragments.
Abstract: Microbiomes from every environment contain a myriad of uncultivated archaeal and bacterial viruses, but studying these viruses is hampered by the lack of a universal, scalable taxonomic framework. We present vConTACT v.2.0, a network-based application utilizing whole genome gene-sharing profiles for virus taxonomy that integrates distance-based hierarchical clustering and confidence scores for all taxonomic predictions. We report near-identical (96%) replication of existing genus-level viral taxonomy assignments from the International Committee on Taxonomy of Viruses for National Center for Biotechnology Information virus RefSeq. Application of vConTACT v.2.0 to 1,364 previously unclassified viruses deposited in virus RefSeq as reference genomes produced automatic, high-confidence genus assignments for 820 of the 1,364. We applied vConTACT v.2.0 to analyze 15,280 Global Ocean Virome genome fragments and were able to provide taxonomic assignments for 31% of these data, which shows that our algorithm is scalable to very large metagenomic datasets. Our taxonomy tool can be automated and applied to metagenomes from any environment for virus classification.

Journal ArticleDOI
Hui Zheng1, Wei Xie1
TL;DR: This Review discusses recent progress in the understanding of the general principles of chromatin folding, its regulation and its functions in mammalian development, and discusses the dynamics of 3D chromatin and genome organization during gametogenesis, embryonic development, lineage commitment and stem cell differentiation.
Abstract: In eukaryotes, the genome does not exist as a linear molecule but instead is hierarchically packaged inside the nucleus. This complex genome organization includes multiscale structural units of chromosome territories, compartments, topologically associating domains, which are often demarcated by architectural proteins such as CTCF and cohesin, and chromatin loops. The 3D organization of chromatin modulates biological processes such as transcription, DNA replication, cell division and meiosis, which are crucial for cell differentiation and animal development. In this Review, we discuss recent progress in our understanding of the general principles of chromatin folding, its regulation and its functions in mammalian development. Specifically, we discuss the dynamics of 3D chromatin and genome organization during gametogenesis, embryonic development, lineage commitment and stem cell differentiation, and focus on the functions of chromatin architecture in transcription regulation. Finally, we discuss the role of 3D genome alterations in the aetiology of developmental disorders and human diseases.

Journal ArticleDOI
05 Jul 2019-Science
TL;DR: This work expands the understanding of the functional diversity of CRISPR-Cas systems and establishes a paradigm for precision DNA insertion.
Abstract: CRISPR-Cas nucleases are powerful tools for manipulating nucleic acids; however, targeted insertion of DNA remains a challenge, as it requires host cell repair machinery. Here we characterize a CRISPR-associated transposase from cyanobacteria Scytonema hofmanni (ShCAST) that consists of Tn7-like transposase subunits and the type V-K CRISPR effector (Cas12k). ShCAST catalyzes RNA-guided DNA transposition by unidirectionally inserting segments of DNA 60 to 66 base pairs downstream of the protospacer. ShCAST integrates DNA into targeted sites in the Escherichia coli genome with frequencies of up to 80% without positive selection. This work expands our understanding of the functional diversity of CRISPR-Cas systems and establishes a paradigm for precision DNA insertion.
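The reported 60 to 66 base pair offset can be turned into a predicted insertion window for a given protospacer. The sketch below assumes 1-based inclusive coordinates and assumes the offset is measured from the protospacer's downstream end on the strand it lies on; the abstract does not state the exact reference point, and the function name and coordinates are hypothetical.

```python
def shcast_insertion_window(protospacer_start, protospacer_end, strand,
                            min_offset=60, max_offset=66):
    """Return (low, high) coordinates of the expected insertion window.

    Coordinates are 1-based and inclusive, with start <= end. The offsets
    default to the 60 to 66 bp range reported in the abstract; measuring
    them from the protospacer's downstream end is an assumption.
    """
    if strand == "+":
        # Downstream means increasing coordinates on the plus strand.
        return protospacer_end + min_offset, protospacer_end + max_offset
    if strand == "-":
        # Downstream means decreasing coordinates on the minus strand.
        return protospacer_start - max_offset, protospacer_start - min_offset
    raise ValueError("strand must be '+' or '-'")


# Made-up protospacer coordinates for illustration only.
print(shcast_insertion_window(1000, 1022, "+"))
print(shcast_insertion_window(1000, 1022, "-"))
```

Returning a window rather than a single position reflects the range given in the abstract; any real design tool would also need to account for insertion orientation and off-target sites, which this sketch does not model.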

Journal ArticleDOI
10 Jan 2019-Cell
TL;DR: This work presents a multiplex, expression quantitative trait locus (eQTL)-inspired framework for mapping enhancer-gene pairs by introducing random combinations of CRISPR/Cas9-mediated perturbations to each of many cells, followed by single-cell RNA sequencing (RNA-seq).

Journal ArticleDOI
TL;DR: The main features of topologically associating domains across species are depicted and the relation between chromatin structure, genome activity, and epigenome is discussed, highlighting mechanistic principles of TAD formation.
Abstract: Understanding the mechanisms that underlie chromosome folding within cell nuclei is essential to determine the relationship between genome structure and function. The recent application of "chromosome conformation capture" techniques has revealed that the genome of many species is organized into domains of preferential internal chromatin interactions called "topologically associating domains" (TADs). This chromosome folding has emerged as a key feature of higher-order genome organization and function through evolution. Although TADs have now been described in a wide range of organisms, they appear to have specific characteristics in terms of size, structure, and proteins involved in their formation. Here, we depict the main features of these domains across species and discuss the relation between chromatin structure, genome activity, and epigenome, highlighting mechanistic principles of TAD formation. We also consider the potential influence of TADs in genome evolution.

Journal ArticleDOI
TL;DR: The improved resource of gastrointestinal bacterial reference sequences circumvents dependence on de novo assembly of metagenomes and enables accurate and cost-effective shotgun metagenomic analyses of human gastrointestinal microbiota.
Abstract: Understanding gut microbiome functions requires cultivated bacteria for experimental validation and reference bacterial genome sequences to interpret metagenome datasets and guide functional analyses. We present the Human Gastrointestinal Bacteria Culture Collection (HBC), a comprehensive set of 737 whole-genome-sequenced bacterial isolates, representing 273 species (105 novel species) from 31 families found in the human gastrointestinal microbiota. The HBC increases the number of bacterial genomes derived from human gastrointestinal microbiota by 37%. The resulting global Human Gastrointestinal Bacteria Genome Collection (HGG) classifies 83% of genera by abundance across 13,490 shotgun-sequenced metagenomic samples, improves taxonomic classification by 61% compared to the Human Microbiome Project (HMP) genome collection and achieves subspecies-level classification for almost 50% of sequences. The improved resource of gastrointestinal bacterial reference sequences circumvents dependence on de novo assembly of metagenomes and enables accurate and cost-effective shotgun metagenomic analyses of human gastrointestinal microbiota.

Journal ArticleDOI
TL;DR: The genome sequence of segmental allotetraploid peanut is reported and suggests that diversity generated by genetic deletions and homeologous recombination helped to favor the domestication of Arachis hypogaea over its diploid relatives.
Abstract: Like many other crops, the cultivated peanut (Arachis hypogaea L.) is of hybrid origin and has a polyploid genome that contains essentially complete sets of chromosomes from two ancestral species. Here we report the genome sequence of peanut and show that after its polyploid origin, the genome has evolved through mobile-element activity, deletions and by the flow of genetic information between corresponding ancestral chromosomes (that is, homeologous recombination). Uniformity of patterns of homeologous recombination at the ends of chromosomes favors a single origin for cultivated peanut and its wild counterpart A. monticola. However, through much of the genome, homeologous recombination has created diversity. Using new polyploid hybrids made from the ancestral species, we show how this can generate phenotypic changes such as spontaneous changes in the color of the flowers. We suggest that diversity generated by these genetic mechanisms helped to favor the domestication of the polyploid A. hypogaea over other diploid Arachis species cultivated by humans.

Journal ArticleDOI
12 Jun 2019-Nature
TL;DR: A programmable transposase integrates donor DNA at user-defined genomic target sites with high fidelity, revealing a new approach for genetic engineering that obviates the need for DNA double-strand breaks and homologous recombination.
Abstract: Conventional CRISPR-Cas systems maintain genomic integrity by leveraging guide RNAs for the nuclease-dependent degradation of mobile genetic elements, including plasmids and viruses. Here we describe a notable inversion of this paradigm, in which bacterial Tn7-like transposons have co-opted nuclease-deficient CRISPR-Cas systems to catalyse RNA-guided integration of mobile genetic elements into the genome. Programmable transposition of Vibrio cholerae Tn6677 in Escherichia coli requires CRISPR- and transposon-associated molecular machineries, including a co-complex between the DNA-targeting complex Cascade and the transposition protein TniQ. Integration of donor DNA occurs in one of two possible orientations at a fixed distance downstream of target DNA sequences, and can accommodate variable length genetic payloads. Deep-sequencing experiments reveal highly specific, genome-wide DNA insertion across dozens of unique target sites. This discovery of a fully programmable, RNA-guided integrase lays the foundation for genomic manipulations that obviate the requirements for double-strand breaks and homology-directed repair.

Journal ArticleDOI
TL;DR: A tomato pan-genome constructed using genome sequences of 725 phylogenetically and geographically representative accessions captures 4,873 genes absent from the reference genome and identifies a rare allele of TomLoxC regulating fruit flavor.
Abstract: Modern tomatoes have narrow genetic diversity limiting their improvement potential. We present a tomato pan-genome constructed using genome sequences of 725 phylogenetically and geographically representative accessions, revealing 4,873 genes absent from the reference genome. Presence/absence variation analyses reveal substantial gene loss and intense negative selection of genes and promoters during tomato domestication and improvement. Lost or negatively selected genes are enriched for important traits, especially disease resistance. We identify a rare allele in the TomLoxC promoter selected against during domestication. Quantitative trait locus mapping and analysis of transgenic plants reveal a role for TomLoxC in apocarotenoid production, which contributes to desirable tomato flavor. In orange-stage fruit, accessions harboring both the rare and common TomLoxC alleles (heterozygotes) have higher TomLoxC expression than those homozygous for either and are resurgent in modern tomatoes. The tomato pan-genome adds depth and completeness to the reference genome, and is useful for future biological discovery and breeding.

Journal ArticleDOI
TL;DR: Improved genome assemblies of allotetraploid cotton species Gossypium hirsutum and Gossypium barbadense provide insights into cotton evolution and inform the construction of introgression lines used to identify loci associated with fiber quality.
Abstract: Allotetraploid cotton species (Gossypium hirsutum and Gossypium barbadense) have long been cultivated worldwide for natural renewable textile fibers. The draft genome sequences of both species are available but they are highly fragmented and incomplete [1-4]. Here we report reference-grade genome assemblies and annotations for G. hirsutum accession Texas Marker-1 (TM-1) and G. barbadense accession 3-79 by integrating single-molecule real-time sequencing, BioNano optical mapping and high-throughput chromosome conformation capture techniques. Compared with previous assembled draft genomes [1,3], these genome sequences show considerable improvements in contiguity and completeness for regions with high content of repeats such as centromeres. Comparative genomics analyses identify extensive structural variations that probably occurred after polyploidization, highlighted by large paracentric/pericentric inversions in 14 chromosomes. We constructed an introgression line population to introduce favorable chromosome segments from G. barbadense to G. hirsutum, allowing us to identify 13 quantitative trait loci associated with superior fiber quality. These resources will accelerate evolutionary and functional genomic studies in cotton and inform future breeding programs for fiber improvement.

Journal ArticleDOI
TL;DR: An analysis of the glycyl radical enzyme superfamily found in the human gut microbiome demonstrates that SwissProt annotations are not always correct, large-scale genome context analyses allow the prediction of novel metabolic pathways, and metagenome abundance can be used to identify/prioritize uncharacterized proteins for functional investigation.
Abstract: The assignment of functions to uncharacterized proteins discovered in genome projects requires easily accessible tools and computational resources for large-scale, user-friendly leveraging of the protein, genome, and metagenome databases by experimentalists. This article describes the web resource developed by the Enzyme Function Initiative (EFI; accessed at https://efi.igb.illinois.edu/) that provides “genomic enzymology” tools (“web tools”) for (1) generating sequence similarity networks (SSNs) for protein families (EFI-EST); (2) analyzing and visualizing genome context of the proteins in clusters in SSNs (in genome neighborhood networks, GNNs, and genome neighborhood diagrams, GNDs) (EFI-GNT); and (3) prioritizing uncharacterized SSN clusters for functional assignment based on metagenome abundance (chemically guided functional profiling, CGFP) (EFI-CGFP). The SSNs generated by EFI-EST are used as the input for EFI-GNT and EFI-CGFP, enabling easy transfer of information among the tools. The networks are...

Journal ArticleDOI
TL;DR: A collection of 1,520 nonredundant, high-quality draft genomes generated from >6,000 bacteria cultivated from fecal samples of healthy humans, chosen to cover all major bacterial phyla and genera in the human gut.
Abstract: Reference genomes are essential for metagenomic analyses and functional characterization of the human gut microbiota. We present the Culturable Genome Reference (CGR), a collection of 1,520 nonredundant, high-quality draft genomes generated from >6,000 bacteria cultivated from fecal samples of healthy humans. Of the 1,520 genomes, which were chosen to cover all major bacterial phyla and genera in the human gut, 264 are not represented in existing reference genome catalogs. We show that this increase in the number of reference bacterial genomes improves the rate of mapping metagenomic sequencing reads from 50% to >70%, enabling higher-resolution descriptions of the human gut microbiome. We use the CGR genomes to annotate functions of 338 bacterial species, showing the utility of this resource for functional studies. We also carry out a pan-genome analysis of 38 important human gut species, which reveals the diversity and specificity of functional enrichment between their core and dispensable genomes.

Journal ArticleDOI
24 Jan 2019-Cell
TL;DR: A ninefold SV bias toward the last 5 Mbp of human chromosomes is reported with nearly 55% of all VNTRs (variable number of tandem repeats) mapping to this portion of the genome.

Journal ArticleDOI
07 Mar 2019-Cell
TL;DR: High-throughput optical mapping of several hundred intra-chromosomal interactions in individual human fibroblasts demonstrates low association frequencies, which are determined by genomic distance, higher-order chromatin architecture, and chromatin environment.

Journal ArticleDOI
15 May 2019-Nature
TL;DR: Three-dimensional genome architecture has important roles in the regulation of gene expression and is therefore a key determinant of cell identity in normal development and in disease states.
Abstract: How cells adopt different identities has long fascinated biologists. Signal transduction in response to environmental cues results in the activation of transcription factors that determine the gene-expression program characteristic of each cell type. Technological advances in the study of 3D chromatin folding are bringing the role of genome conformation in transcriptional regulation to the fore. Characterizing this role of genome architecture has profound implications, not only for differentiation and development but also for diseases including developmental malformations and cancer. Here we review recent studies indicating that the interplay between transcription and genome conformation is a driving force for cell-fate decisions.

Journal ArticleDOI
TL;DR: High-quality genome sequence of cultivated peanut provides insights into genome evolution and the genetic mechanisms underlying seed size and leaf resistance in peanut, providing a cornerstone for functional genomics and peanut improvement.
Abstract: High oil and protein content make tetraploid peanut a leading oil and food legume. Here we report a high-quality peanut genome sequence, comprising 2.54 Gb with 20 pseudomolecules and 83,709 protein-coding gene models. We characterize gene functional groups implicated in seed size evolution, seed oil content, disease resistance and symbiotic nitrogen fixation. The peanut B subgenome has more genes and general expression dominance, temporally associated with long-terminal-repeat expansion in the A subgenome that also raises questions about the A-genome progenitor. The polyploid genome provided insights into the evolution of Arachis hypogaea and other legume chromosomes. Resequencing of 52 accessions suggests that independent domestications formed peanut ecotypes. Whereas 0.42–0.47 million years ago (Ma) polyploidy constrained genetic variation, the peanut genome sequence aids mapping and candidate-gene discovery for traits such as seed size and color, foliar disease resistance and others, also providing a cornerstone for functional genomics and peanut improvement.