Showing papers on "Genome" published in 2019


Journal ArticleDOI
TL;DR: This work presents a method named HISAT2 (hierarchical indexing for spliced alignment of transcripts 2) that can align both DNA and RNA sequences using a graph Ferragina Manzini index, and uses it to represent and search an expanded model of the human reference genome.
Abstract: The human reference genome represents only a small number of individuals, which limits its usefulness for genotyping. We present a method named HISAT2 (hierarchical indexing for spliced alignment of transcripts 2) that can align both DNA and RNA sequences using a graph Ferragina Manzini index. We use HISAT2 to represent and search an expanded model of the human reference genome in which over 14.5 million genomic variants in combination with haplotypes are incorporated into the data structure used for searching and alignment. We benchmark HISAT2 using simulated and real datasets to demonstrate that our strategy of representing a population of genomes, together with a fast, memory-efficient search algorithm, provides more detailed and accurate variant analyses than other methods. We apply HISAT2 for HLA typing and DNA fingerprinting; both applications form part of the HISAT-genotype software that enables analysis of haplotype-resolved genes or genomic regions. HISAT-genotype outperforms other computational methods and matches or exceeds the performance of laboratory-based assays. A graph-based genome indexing scheme enables variant-aware alignment of sequences with very low memory requirements.

4,855 citations
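The HISAT2 abstract above relies on a Ferragina Manzini (FM) index for fast, memory-efficient search. As a rough illustration only, the sketch below builds a plain FM index over a single linear string and runs backward search on it. It is not the graph FM index described in the paper and does not model variants or haplotypes; the function names `build_fm_index` and `find` and the example sequence are hypothetical.

```python
def build_fm_index(text):
    """Build a toy FM index (suffix array, C table, Occ table) for one string."""
    text += "$"  # sentinel; "$" sorts before the letters A, C, G, T
    # Naive suffix array: fine for a short demo string, too slow for genomes.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # c_table[ch] = number of characters in text strictly smaller than ch.
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[ch][i] = number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return sa, c_table, occ


def find(pattern, index):
    """Return sorted start positions of pattern, using backward search."""
    sa, c_table, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


index = build_fm_index("GATTACAGATTACA")
print(find("TTA", index))  # [2, 9]
```

Each pattern character narrows the suffix-array range with two table lookups, so query time depends on the pattern length rather than on the text length.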


Journal ArticleDOI
21 Oct 2019-Nature
TL;DR: A new DNA-editing technique called prime editing offers improved versatility and efficiency with reduced byproducts compared with existing techniques, and shows potential for correcting disease-associated mutations.
Abstract: Most genetic variants that contribute to disease are challenging to correct efficiently and without excess byproducts. Here we describe prime editing, a versatile and precise genome editing method that directly writes new genetic information into a specified DNA site using a catalytically impaired Cas9 endonuclease fused to an engineered reverse transcriptase, programmed with a prime editing guide RNA (pegRNA) that both specifies the target site and encodes the desired edit. We performed more than 175 edits in human cells, including targeted insertions, deletions, and all 12 types of point mutation, without requiring double-strand breaks or donor DNA templates. We used prime editing in human cells to correct, efficiently and with few byproducts, the primary genetic causes of sickle cell disease (requiring a transversion in HBB) and Tay-Sachs disease (requiring a deletion in HEXA); to install a protective transversion in PRNP; and to insert various tags and epitopes precisely into target loci. Four human cell lines and primary post-mitotic mouse cortical neurons support prime editing with varying efficiencies. Prime editing shows higher or similar efficiency and fewer byproducts than homology-directed repair, has complementary strengths and weaknesses compared to base editing, and induces much lower off-target editing than Cas9 nuclease at known Cas9 off-target sites. Prime editing substantially expands the scope and capabilities of genome editing, and in principle could correct up to 89% of known genetic variants associated with human diseases.

2,260 citations


Journal ArticleDOI
TL;DR: The accuracy of the GTDB-Tk taxonomic assignments is demonstrated by evaluating its performance on a phylogenetically diverse set of 10 156 bacterial and archaeal metagenome-assembled genomes.
Abstract: The Genome Taxonomy Database Toolkit (GTDB-Tk) provides objective taxonomic assignments for bacterial and archaeal genomes based on the GTDB. GTDB-Tk is computationally efficient and able to classify thousands of draft genomes in parallel. Here we demonstrate the accuracy of the GTDB-Tk taxonomic assignments by evaluating its performance on a phylogenetically diverse set of 10 156 bacterial and archaeal metagenome-assembled genomes.

2,053 citations


Journal ArticleDOI
TL;DR: The phylogenetic analysis complemented with synteny analyses suggests that Bmp2, -4 and -16 are remnants of a gene quartet that originated during the two rounds of whole-genome duplication (2R-WGD) early in vertebrate evolution.
Abstract: The vertebrate gene repertoire is characterized by “cryptic” genes whose identification has been hampered by their absence from the genomes of well-studied species. One example is the Bmp16 gene, a paralog of the developmental key genes Bmp2 and -4. We focus on the Bmp2/4/16 group of genes to study the evolutionary dynamics following gen(om)e duplications with special emphasis on the poorly studied Bmp16 gene. We reveal the presence of Bmp16 in chondrichthyans in addition to previously reported teleost fishes and reptiles. Using comprehensive, vertebrate-wide gene sampling, our phylogenetic analysis complemented with synteny analyses suggests that Bmp2, -4 and -16 are remnants of a gene quartet that originated during the two rounds of whole-genome duplication (2R-WGD) early in vertebrate evolution. We confirm that Bmp16 genes were lost independently in at least three lineages (mammals, archelosaurs and amphibians) and report that they have elevated rates of sequence evolution. This finding agrees with their more “flexible” deployment during development; while Bmp16 has limited embryonic expression domains in the cloudy catshark, it is broadly expressed in the green anole lizard. Our study illustrates the dynamics of gene family evolution by integrating insights from sequence diversification, gene repertoire changes, and shuffling of expression domains.

1,376 citations


Journal ArticleDOI
TL;DR: TYGS, the Type (Strain) Genome Server, a user-friendly high-throughput web server for genome-based prokaryote taxonomy and analysis connected to a large, continuously growing database of genomic, taxonomic and nomenclatural information.
Abstract: Microbial taxonomy is increasingly influenced by genome-based computational methods. Yet such analyses can be complex and require expert knowledge. Here we introduce TYGS, the Type (Strain) Genome Server, a user-friendly high-throughput web server for genome-based prokaryote taxonomy, connected to a large, continuously growing database of genomic, taxonomic and nomenclatural information. It infers genome-scale phylogenies and state-of-the-art estimates for species and subspecies boundaries from user-defined and automatically determined closest type genome sequences. TYGS also provides comprehensive access to nomenclature, synonymy and associated taxonomic literature. Clinically important examples demonstrate how TYGS can yield new insights into microbial classification, such as evidence for a species-level separation of previously proposed subspecies of Salmonella enterica. TYGS is an integrated approach for the classification of microbes that unlocks novel scientific approaches to microbiologists worldwide and is particularly helpful for the rapidly expanding field of genome-based taxonomic descriptions of new genera, species or subspecies.

1,202 citations


Posted ContentDOI
Konrad J. Karczewski, Laurent C. Francioli, Grace Tiao, Beryl B. Cummings, Jessica Alföldi, Qingbo Wang, Ryan L. Collins, Kristen M. Laricchia, Andrea Ganna, Daniel P. Birnbaum, Laura D. Gauthier, Harrison Brand, Matthew Solomonson, Nicholas A. Watts, Daniel R. Rhodes, Moriel Singer-Berk, Eleanor G. Seaby, Jack A. Kosmicki, Raymond K. Walters, Katherine Tashman, Yossi Farjoun, Eric Banks, Timothy Poterba, Arcturus Wang, Cotton Seed, Nicola Whiffin, Jessica X. Chong, Kaitlin E. Samocha, Emma Pierce-Hoffman, Zachary Zappala, Anne H. O’Donnell-Luria, Eric Vallabh Minikel, Ben Weisburd, Monkol Lek, James S. Ware, Christopher Vittal, Irina M. Armean, Louis Bergelson, Kristian Cibulskis, Kristen M. Connolly, Miguel Covarrubias, Stacey Donnelly, Steven Ferriera, Stacey Gabriel, Jeff Gentry, Namrata Gupta, Thibault Jeandet, Diane Kaplan, Christopher Llanwarne, Ruchi Munshi, Sam Novod, Nikelle Petrillo, David Roazen, Valentin Ruano-Rubio, Andrea Saltzman, Molly Schleicher, Jose Soto, Kathleen Tibbetts, Charlotte Tolonen, Gordon Wade, Michael E. Talkowski, Benjamin M. Neale, Mark J. Daly, Daniel G. MacArthur
30 Jan 2019-bioRxiv
TL;DR: Using an improved human mutation rate model, human protein-coding genes are classified along a spectrum representing tolerance to inactivation; the classification is validated using data from model organisms and engineered human cells, and is shown to improve gene discovery power for both common and rare diseases.
Abstract: Genetic variants that inactivate protein-coding genes are a powerful source of information about the phenotypic consequences of gene disruption: genes critical for an organism’s function will be depleted for such variants in natural populations, while non-essential genes will tolerate their accumulation. However, predicted loss-of-function (pLoF) variants are enriched for annotation errors, and tend to be found at extremely low frequencies, so their analysis requires careful variant annotation and very large sample sizes. Here, we describe the aggregation of 125,748 exomes and 15,708 genomes from human sequencing studies into the Genome Aggregation Database (gnomAD). We identify 443,769 high-confidence pLoF variants in this cohort after filtering for sequencing and annotation artifacts. Using an improved model of human mutation, we classify human protein-coding genes along a spectrum representing intolerance to inactivation, validate this classification using data from model organisms and engineered human cells, and show that it can be used to improve gene discovery power for both common and rare diseases.

1,128 citations


Journal ArticleDOI
Bo Liu, Dandan Zheng, Qi Jin, Lihong Chen, Jian Yang
TL;DR: An integrated and automatic pipeline, VFanalyzer, is introduced to VFDB to systematically identify known/potential VFs in complete/draft bacterial genomes through a context-based data refinement process for VFs encoded by gene clusters that can achieve relatively high specificity and sensitivity without manual curation.
Abstract: The virulence factor database (VFDB, http://www.mgc.ac.cn/VFs/) is devoted to providing the scientific community with a comprehensive warehouse and online platform for deciphering bacterial pathogenesis. The various combinations, organizations and expressions of virulence factors (VFs) are responsible for the diverse clinical symptoms of pathogen infections. Currently, whole-genome sequencing is widely used to decode potential novel or variant pathogens both in emergent outbreaks and in routine clinical practice. However, the efficient characterization of pathogenomic compositions remains a challenge for microbiologists or physicians with limited bioinformatics skills. Therefore, we introduced to VFDB an integrated and automatic pipeline, VFanalyzer, to systematically identify known/potential VFs in complete/draft bacterial genomes. VFanalyzer first constructs orthologous groups within the query genome and preanalyzed reference genomes from VFDB to avoid potential false positives due to paralogs. Then, it conducts iterative and exhaustive sequence similarity searches among the hierarchical prebuilt datasets of VFDB to accurately identify potential untypical/strain-specific VFs. Finally, via a context-based data refinement process for VFs encoded by gene clusters, VFanalyzer can achieve relatively high specificity and sensitivity without manual curation. In addition, a thoroughly optimized interactive web interface is introduced to present VFanalyzer reports in comparative pathogenomic style for easy online analysis.

1,008 citations


Journal ArticleDOI
TL;DR: This Review comprehensively assesses the benefits and limitations of GWAS in human populations and discusses the relevance of performing more GWAS, with a focus on the cardiometabolic field.
Abstract: Genome-wide association studies (GWAS) involve testing genetic variants across the genomes of many individuals to identify genotype–phenotype associations. GWAS have revolutionized the field of complex disease genetics over the past decade, providing numerous compelling associations for human complex traits and diseases. Despite clear successes in identifying novel disease susceptibility genes and biological pathways and in translating these findings into clinical care, GWAS have not been without controversy. Prominent criticisms include concerns that GWAS will eventually implicate the entire genome in disease predisposition and that most association signals reflect variants and genes with no direct biological relevance to disease. In this Review, we comprehensively assess the benefits and limitations of GWAS in human populations and discuss the relevance of performing more GWAS. Despite the success of human genome-wide association studies (GWAS) in associating genetic variants and complex diseases or traits, criticisms of the usefulness of this study design remain. This Review assesses the pros and cons of GWAS, with a focus on the cardiometabolic field.

1,002 citations


Journal ArticleDOI
TL;DR: A new version of OGDRAW equipped with a new front end enables the user to easily visualize large sets of organellar genomes spanning entire taxonomic clades.
Abstract: Organellar (plastid and mitochondrial) genomes play an important role in resolving phylogenetic relationships, and next-generation sequencing technologies have led to a burst in their availability. The ongoing massive sequencing efforts require software tools for routine assembly and annotation of organellar genomes as well as their display as physical maps. OrganellarGenomeDRAW (OGDRAW) has become the standard tool to draw graphical maps of plastid and mitochondrial genomes. Here, we present a new version of OGDRAW equipped with a new front end. Besides several new features, OGDRAW now has access to a local copy of the organelle genome database of the NCBI RefSeq project. Together with batch processing of (multi-)GenBank files, this enables the user to easily visualize large sets of organellar genomes spanning entire taxonomic clades. The new OGDRAW server can be accessed at https://chlorobox.mpimp-golm.mpg.de/OGDraw.html.

888 citations


Journal ArticleDOI
TL;DR: This major update of CHOPCHOP introduces functionality for targeting RNA with Cas13, which includes support for alternative transcript isoforms and RNA accessibility predictions, and incorporates new DNA targeting modes, including CRISPR activation/repression, targeted enrichment of loci for long-read sequencing, and prediction of Cas9 repair outcomes.
Abstract: The CRISPR-Cas system is a powerful genome editing tool that functions in a diverse array of organisms and cell types. The technology was initially developed to induce targeted mutations in DNA, but CRISPR-Cas has now been adapted to target nucleic acids for a range of purposes. CHOPCHOP is a web tool for identifying CRISPR-Cas single guide RNA (sgRNA) targets. In this major update of CHOPCHOP, we expand our toolbox beyond knockouts. We introduce functionality for targeting RNA with Cas13, which includes support for alternative transcript isoforms and RNA accessibility predictions. We incorporate new DNA targeting modes, including CRISPR activation/repression, targeted enrichment of loci for long-read sequencing, and prediction of Cas9 repair outcomes. Finally, we expand our results page visualization to reveal alternative isoforms and downstream ATG sites, which will aid users in avoiding the expression of truncated proteins. The CHOPCHOP web tool now supports over 200 genomes and we have released a command-line script for running larger jobs and handling unsupported genomes. CHOPCHOP v3 can be found at https://chopchop.cbu.uib.no.

879 citations


Journal ArticleDOI
TL;DR: The ENCODE blacklist is defined: a comprehensive set of regions in the human, mouse, worm, and fly genomes that have anomalous, unstructured, or high signal in next-generation sequencing experiments independent of cell line or experiment.
Abstract: Functional genomics assays based on high-throughput sequencing greatly expand our ability to understand the genome. Here, we define the ENCODE blacklist: a comprehensive set of regions in the human, mouse, worm, and fly genomes that have anomalous, unstructured, or high signal in next-generation sequencing experiments independent of cell line or experiment. The removal of the ENCODE blacklist is an essential quality measure when analyzing functional genomics data.
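Removing blacklisted regions, as the abstract above recommends, amounts to an interval-overlap filter. The following is a minimal sketch, assuming peaks and blacklist regions are given as BED-style `(chrom, start, end)` tuples with 0-based, half-open coordinates. The function names and all coordinates are made up for illustration and are not taken from the actual ENCODE blacklist files.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def remove_blacklisted(peaks, blacklist):
    """Keep only the peaks that do not overlap any blacklist region."""
    # Group blacklist regions by chromosome so each peak is only compared
    # against regions on its own chromosome.
    by_chrom = {}
    for chrom, start, end in blacklist:
        by_chrom.setdefault(chrom, []).append((start, end))
    kept = []
    for chrom, start, end in peaks:
        regions = by_chrom.get(chrom, [])
        if not any(overlaps(start, end, s, e) for s, e in regions):
            kept.append((chrom, start, end))
    return kept


# Hypothetical coordinates, for illustration only.
blacklist = [("chr1", 1000, 2000), ("chr2", 500, 800)]
peaks = [
    ("chr1", 100, 300),    # clear of the blacklist
    ("chr1", 1900, 2100),  # overlaps chr1:1000-2000
    ("chr2", 800, 900),    # touches the region end but shares no base
    ("chr3", 10, 20),      # chromosome has no blacklist regions
]
print(remove_blacklisted(peaks, blacklist))
```

The linear scan per chromosome keeps the sketch short; for genome-scale peak sets a sorted interval structure or a dedicated tool would be the more usual choice.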

Journal ArticleDOI
TL;DR: This structure illuminates the assembly of the coronavirus core RNA-synthesis machinery, provides key insights into nsp12 polymerase catalysis and fidelity and acts as a template for the design of novel antiviral therapeutics.
Abstract: Recent history is punctuated by the emergence of highly pathogenic coronaviruses such as SARS- and MERS-CoV into human circulation. Upon infecting host cells, coronaviruses assemble a multi-subunit RNA-synthesis complex of viral non-structural proteins (nsp) responsible for the replication and transcription of the viral genome. Here, we present the 3.1 Å resolution structure of the SARS-CoV nsp12 polymerase bound to its essential co-factors, nsp7 and nsp8, using single particle cryo-electron microscopy. nsp12 possesses an architecture common to all viral polymerases as well as a large N-terminal extension containing a kinase-like fold and is bound by two nsp8 co-factors. This structure illuminates the assembly of the coronavirus core RNA-synthesis machinery, provides key insights into nsp12 polymerase catalysis and fidelity and acts as a template for the design of novel antiviral therapeutics. The pathogenic human coronaviruses SARS- and MERS-CoV can cause severe respiratory disease. Here the authors present the 3.1 Å cryo-EM structure of the SARS-CoV RNA polymerase nsp12 bound to its essential co-factors nsp7 and nsp8, which is of interest for antiviral drug development.

Posted ContentDOI
Daniel Taliun, Daniel N. Harris, Michael D. Kessler, Jedidiah Carlson, +191 more authors (61 institutions)
06 Mar 2019-bioRxiv
TL;DR: The nearly complete catalog of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and non-coding sequence variants to phenotypic variation as well as resources and early insights from the sequence data.
Abstract: The Trans-Omics for Precision Medicine (TOPMed) program seeks to elucidate the genetic architecture and disease biology of heart, lung, blood, and sleep disorders, with the ultimate goal of improving diagnosis, treatment, and prevention. The initial phases of the program focus on whole genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here, we describe TOPMed goals and design as well as resources and early insights from the sequence data. The resources include a variant browser, a genotype imputation panel, and sharing of genomic and phenotypic data via dbGaP. In 53,581 TOPMed samples, >400 million single-nucleotide and insertion/deletion variants were detected by alignment with the reference genome. Additional novel variants are detectable through assembly of unmapped reads and customized analysis in highly variable loci. Among the >400 million variants detected, 97% have frequency

Journal ArticleDOI
TL;DR: A new tool is added that lets users interactively arrange existing graphing tracks into new groups and create a 30-way primate alignment on the human genome in the UCSC Genome Browser.
Abstract: The UCSC Genome Browser (https://genome.ucsc.edu) is a graphical viewer for exploring genome annotations. For almost two decades, the Browser has provided visualization tools for genetics and molecular biology and continues to add new data and features. This year, we added a new tool that lets users interactively arrange existing graphing tracks into new groups. Other software additions include new formats for chromosome interactions, a ChIP-Seq peak display for track hubs and improved support for HGVS. On the annotation side, we have added gnomAD, TCGA expression, RefSeq Functional elements, GTEx eQTLs, CRISPR Guides, SNPpedia and created a 30-way primate alignment on the human genome. Nine assemblies now have RefSeq-mapped gene models.

Journal ArticleDOI
Mark Chaisson, Ashley D. Sanders, Xuefang Zhao, Ankit Malhotra, David Porubsky, Tobias Rausch, Eugene J. Gardner, Oscar L. Rodriguez, Li Guo, Ryan L. Collins, Xian Fan, Jia Wen, Robert E. Handsaker, Susan Fairley, Zev N. Kronenberg, Xiangmeng Kong, Fereydoun Hormozdiari, Dillon Lee, Aaron M. Wenger, Alex Hastie, Danny Antaki, Thomas Anantharaman, Peter A. Audano, Harrison Brand, Stuart Cantsilieris, Han Cao, Eliza Cerveira, Chong Chen, Xintong Chen, Chen-Shan Chin, Zechen Chong, Nelson T. Chuang, Christine C. Lambert, Deanna M. Church, Laura Clarke, Andrew Farrell, Joey Flores, Timur R. Galeev, David U. Gorkin, Madhusudan Gujral, Victor Guryev, William Haynes Heaton, Jonas Korlach, Sushant Kumar, Jee Young Kwon, Ernest T. Lam, Jong Eun Lee, Joyce V. Lee, Wan-Ping Lee, Sau Peng Lee, Shantao Li, Patrick Marks, Karine A. Viaud-Martinez, Sascha Meiers, Katherine M. Munson, Fabio C. P. Navarro, Bradley J. Nelson, Conor Nodzak, Amina Noor, Sofia Kyriazopoulou-Panagiotopoulou, Andy Wing Chun Pang, Yunjiang Qiu, Gabriel Rosanio, Mallory Ryan, Adrian M. Stütz, Diana C.J. Spierings, Alistair Ward, Anne Marie E. Welch, Ming Xiao, Wei Xu, Chengsheng Zhang, Qihui Zhu, Xiangqun Zheng-Bradley, Ernesto Lowy, Sergei Yakneen, Steven A. McCarroll, Goo Jun, Li Ding, Chong-Lek Koh, Bing Ren, Paul Flicek, Ken Chen, Mark Gerstein, Pui-Yan Kwok, Peter M. Lansdorp, Gabor T. Marth, Jonathan Sebat, Xinghua Shi, Ali Bashir, Kai Ye, Scott E. Devine, Michael E. Talkowski, Ryan E. Mills, Tobias Marschall, Jan O. Korbel, Evan E. Eichler, Charles Lee
TL;DR: A suite of long-read, short-read, and strand-specific sequencing technologies, optical mapping, and variant discovery algorithms is applied to comprehensively analyze three trios and define the full spectrum of human genetic variation in a haplotype-resolved manner.
Abstract: The incomplete identification of structural variants (SVs) from whole-genome sequencing data limits studies of human genetic diversity and disease association. Here, we apply a suite of long-read, short-read, strand-specific sequencing technologies, optical mapping, and variant discovery algorithms to comprehensively analyze three trios to define the full spectrum of human genetic variation in a haplotype-resolved manner. We identify 818,054 indel variants (<50 bp) and 27,622 SVs (≥50 bp) per genome. We also discover 156 inversions per genome and 58 of the inversions intersect with the critical regions of recurrent microdeletion and microduplication syndromes. Taken together, our SV callsets represent a three to sevenfold increase in SV detection compared to most standard high-throughput sequencing studies, including those from the 1000 Genomes Project. The methods and the dataset presented serve as a gold standard for the scientific community allowing us to make recommendations for maximizing structural variation sensitivity for future genome sequencing studies.

Journal ArticleDOI
TL;DR: Significant enhancements to MGD are described, including two new graphical user interfaces: the Multi Genome Viewer for exploring the genomes of multiple mouse strains and the Phenotype-Gene Expression matrix which was developed in collaboration with the Gene Expression Database (GXD) and allows researchers to compare gene expression and phenotype annotations for mouse genes.
Abstract: The Mouse Genome Database (MGD; http://www.informatics.jax.org) is the community model organism genetic and genome resource for the laboratory mouse. MGD is the authoritative source for biological reference data sets related to mouse genes, gene functions, phenotypes, and mouse models of human disease. MGD is the primary outlet for official gene, allele and mouse strain nomenclature based on the guidelines set by the International Committee on Standardized Nomenclature for Mice. In this report we describe significant enhancements to MGD, including two new graphical user interfaces: (i) the Multi Genome Viewer for exploring the genomes of multiple mouse strains and (ii) the Phenotype-Gene Expression matrix which was developed in collaboration with the Gene Expression Database (GXD) and allows researchers to compare gene expression and phenotype annotations for mouse genes. Other recent improvements include enhanced efficiency of our literature curation processes and the incorporation of Transcriptional Start Site (TSS) annotations from RIKEN's FANTOM 5 initiative.

Journal ArticleDOI
20 Jun 2019-Nature
TL;DR: An approach to evaluate fragmentation patterns of cell-free DNA across the genome was developed, and found that profiles of healthy individuals reflected nucleosomal patterns of white blood cells, whereas patients with cancer had altered fragmentation profiles.
Abstract: Cell-free DNA in the blood provides a non-invasive diagnostic avenue for patients with cancer. However, characteristics of the origins and molecular features of cell-free DNA are poorly understood. Here we developed an approach to evaluate fragmentation patterns of cell-free DNA across the genome, and found that profiles of healthy individuals reflected nucleosomal patterns of white blood cells, whereas patients with cancer had altered fragmentation profiles. We used this method to analyse the fragmentation profiles of 236 patients with breast, colorectal, lung, ovarian, pancreatic, gastric or bile duct cancer and 245 healthy individuals. A machine learning model that incorporated genome-wide fragmentation features had sensitivities of detection ranging from 57% to more than 99% among the seven cancer types at 98% specificity, with an overall area under the curve value of 0.94. Fragmentation profiles could be used to identify the tissue of origin of the cancers to a limited number of sites in 75% of cases. Combining our approach with mutation-based cell-free DNA analyses detected 91% of patients with cancer. The results of these analyses highlight important properties of cell-free DNA and provide a proof-of-principle approach for the screening, early detection and monitoring of human cancer.

Journal ArticleDOI
TL;DR: A simple activity-by-contact model substantially outperformed previous methods at predicting the complex connections in the CRISPR dataset and allows systematic mapping of enhancer–gene connections in a given cell type, on the basis of chromatin-state measurements.
Abstract: Enhancer elements in the human genome control how genes are expressed in specific cell types and harbor thousands of genetic variants that influence risk for common diseases. Yet, we still do not know how enhancers regulate specific genes, and we lack general rules to predict enhancer-gene connections across cell types. We developed an experimental approach, CRISPRi-FlowFISH, to perturb enhancers in the genome, and we applied it to test >3,500 potential enhancer-gene connections for 30 genes. We found that a simple activity-by-contact model substantially outperformed previous methods at predicting the complex connections in our CRISPR dataset. This activity-by-contact model allows us to construct genome-wide maps of enhancer-gene connections in a given cell type, on the basis of chromatin state measurements. Together, CRISPRi-FlowFISH and the activity-by-contact model provide a systematic approach to map and predict which enhancers regulate which genes, and will help to interpret the functions of the thousands of disease risk variants in the noncoding genome.
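The abstract above does not spell out the activity-by-contact formula. The sketch below assumes one common reading: each candidate element's score for a gene is its activity multiplied by its contact frequency with the gene's promoter, divided by the sum of that product over all candidate elements near the gene. The function name, element names and numeric values are hypothetical, and how activity and contact are measured is left out entirely.

```python
def abc_scores(elements):
    """Score candidate elements for one gene.

    elements: list of (name, activity, contact) tuples, where activity and
    contact are non-negative numbers (assumed inputs, not measured here).
    Returns a dict mapping each element name to its share of the total
    activity * contact for this gene.
    """
    total = sum(activity * contact for _, activity, contact in elements)
    if total == 0:
        # No element has both activity and contact: all scores are zero.
        return {name: 0.0 for name, _, _ in elements}
    return {
        name: activity * contact / total
        for name, activity, contact in elements
    }


# Hypothetical values, for illustration only.
candidates = [
    ("enhancer_a", 2.0, 0.5),  # active, moderate contact
    ("enhancer_b", 1.0, 1.0),  # less active, strong contact
    ("enhancer_c", 4.0, 0.0),  # very active, but never contacts the promoter
]
print(abc_scores(candidates))
```

Under this reading the scores for one gene sum to 1, so a highly active element with no promoter contact contributes nothing, and a threshold on the score would be needed to call enhancer-gene connections.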

Journal ArticleDOI
01 Nov 2019-Science
TL;DR: The results highlight that endophytic root microbiomes harbor a wealth of as yet unknown functional traits that, in concert, can protect the plant inside out.
Abstract: Microorganisms living inside plants can promote plant growth and health, but their genomic and functional diversity remain largely elusive. Here, metagenomics and network inference show that fungal infection of plant roots enriched for Chitinophagaceae and Flavobacteriaceae in the root endosphere and for chitinase genes and various unknown biosynthetic gene clusters encoding the production of nonribosomal peptide synthetases (NRPSs) and polyketide synthases (PKSs). After strain-level genome reconstruction, a consortium of Chitinophaga and Flavobacterium was designed that consistently suppressed fungal root disease. Site-directed mutagenesis then revealed that a previously unidentified NRPS-PKS gene cluster from Flavobacterium was essential for disease suppression by the endophytic consortium. Our results highlight that endophytic root microbiomes harbor a wealth of as yet unknown functional traits that, in concert, can protect the plant inside out.

Journal ArticleDOI
TL;DR: The results of two case studies show that CPGAVAS2 annotates better than several other servers, and will likely become an indispensable tool for plastome research.
Abstract: We previously developed a web server CPGAVAS for annotation, visualization and GenBank submission of plastome sequences. Here, we upgrade the server into CPGAVAS2 to address the following challenges: (i) inaccurate annotation in the reference sequence likely causing the propagation of errors; (ii) difficulty in the annotation of small exons of genes petB, petD and rps16 and trans-splicing gene rps12; (iii) lack of annotation for other genome features and their visualization, such as repeat elements; and (iv) lack of modules for diversity analysis of plastomes. In particular, CPGAVAS2 provides two reference datasets for plastome annotation. The first dataset contains 43 plastomes whose annotations have been validated or corrected by RNA-seq data. The second one contains 2544 plastomes curated with sequence alignment. Two new algorithms are also implemented to correctly annotate small exons and trans-splicing genes. Tandem and dispersed repeats are identified, whose results are displayed on a circular map together with the annotated genes. DNA-seq and RNA-seq data can be uploaded for identification of single-nucleotide polymorphism sites and RNA-editing sites. The results of two case studies show that CPGAVAS2 annotates better than several other servers. CPGAVAS2 will likely become an indispensable tool for plastome research and can be accessed from http://www.herbalgenomics.org/cpgavas2.

Journal ArticleDOI
TL;DR: The Cistrome DB has a new Toolkit module with several features that allow users to better utilize the large-scale ChIP-seq, DNase-seq and ATAC-seq data, and the new tools will greatly benefit the biomedical research community.
Abstract: The Cistrome Data Browser (DB) is a resource of human and mouse cis-regulatory information derived from ChIP-seq, DNase-seq and ATAC-seq chromatin profiling assays, which map the genome-wide locations of transcription factor binding sites, histone post-translational modifications and regions of chromatin accessible to endonuclease activity. Currently, the Cistrome DB contains approximately 47,000 human and mouse samples, with about 24,000 newly collected datasets compared to the previous release two years ago. Furthermore, the Cistrome DB has a new Toolkit module with several features that allow users to better utilize the large-scale ChIP-seq, DNase-seq, and ATAC-seq data. First, users can query the factors which are likely to regulate a specific gene of interest. Second, the Cistrome DB Toolkit facilitates searches for factor binding, histone modifications, and chromatin accessibility in any given genomic interval shorter than 2 Mb. Third, the Toolkit can determine the most similar ChIP-seq, DNase-seq, and ATAC-seq samples in terms of genomic interval overlaps with user-provided genomic interval sets. The Cistrome DB is a user-friendly, up-to-date, and well-maintained resource, and the new tools will greatly benefit the biomedical research community. The database is freely available at http://cistrome.org/db, and the Toolkit is at http://dbtoolkit.cistrome.org.
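Illustrative sketch (not part of the abstract): the third Toolkit feature ranks samples by how much their genomic intervals overlap a user-provided interval set. The abstract does not say which similarity measure the Toolkit uses, so the example below assumes a base-pair Jaccard index as a stand-in. The function names, the single-chromosome simplification, and the half-open `(start, end)` convention are all hypothetical choices made for illustration.

```python
# Assumption: intervals are half-open (start, end) pairs on ONE chromosome.
# Assumption: similarity is the Jaccard index of covered bases; the real
# Cistrome DB Toolkit metric is not specified in the abstract.

MAX_QUERY_BP = 2_000_000  # the abstract's "shorter than 2 Mb" query limit


def is_valid_query(start, end):
    """Return True if the interval is non-empty and shorter than 2 Mb."""
    return 0 < end - start < MAX_QUERY_BP


def merge_intervals(intervals):
    """Sort intervals and merge any that overlap or touch."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_bases(intervals):
    """Count the bases covered by at least one interval."""
    return sum(end - start for start, end in merge_intervals(intervals))


def jaccard(a, b):
    """Shared bases divided by bases covered by either interval set."""
    union = covered_bases(list(a) + list(b))
    intersection = covered_bases(a) + covered_bases(b) - union
    return intersection / union if union else 0.0


user_peaks = [(100, 200), (300, 400)]
sample_peaks = [(150, 250), (500, 600)]
score = jaccard(user_peaks, sample_peaks)  # 50 shared bases / 350 in union
```

Ranking every stored sample by such a score against the user's interval set would give a "most similar samples" list in the spirit of the feature described above.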

Journal ArticleDOI
TL;DR: A comprehensive landscape of different modes of gene duplication across the plant kingdom is identified by comparing 141 genomes, which provides a solid foundation for further investigation of the dynamic evolution of duplicate genes.
Abstract: The sharp increase of plant genome and transcriptome data provides valuable resources to investigate evolutionary consequences of gene duplication in a range of taxa, and unravel common principles underlying duplicate gene retention. We survey 141 sequenced plant genomes to elucidate consequences of gene and genome duplication, processes central to the evolution of biodiversity. We develop a pipeline named DupGen_finder to identify different modes of gene duplication in plants. Genes derived from whole-genome, tandem, proximal, transposed, or dispersed duplication differ in abundance, selection pressure, expression divergence, and gene conversion rate among genomes. The number of WGD-derived duplicate genes decreases exponentially with increasing age of duplication events; transposed duplication- and dispersed duplication-derived genes declined in parallel. In contrast, the frequency of tandem and proximal duplications showed no significant decrease over time, providing a continuous supply of variants available for adaptation to continuously changing environments. Moreover, tandem and proximal duplicates experienced stronger selective pressure than genes formed by other modes and evolved toward biased functional roles involved in plant self-defense. The rate of gene conversion among WGD-derived gene pairs declined over time, peaking shortly after polyploidization. To provide a platform for accessing duplicated gene pairs in different plants, we constructed the Plant Duplicate Gene Database. We identify a comprehensive landscape of different modes of gene duplication across the plant kingdom by comparing 141 genomes, which provides a solid foundation for further investigation of the dynamic evolution of duplicate genes.

Journal ArticleDOI
13 Mar 2019-Nature
TL;DR: Draft prokaryotic genomes from faecal metagenomes of diverse human populations enrich the understanding of the human gut microbiome by identifying over two thousand new species-level taxa that have numerous disease associations.
Abstract: The genome sequences of many species of the human gut microbiome remain unknown, largely owing to challenges in cultivating microorganisms under laboratory conditions. Here we address this problem by reconstructing 60,664 draft prokaryotic genomes from 3,810 faecal metagenomes, from geographically and phenotypically diverse humans. These genomes provide reference points for 2,058 newly identified species-level operational taxonomic units (OTUs), which represents a 50% increase over the previously known phylogenetic diversity of sequenced gut bacteria. On average, the newly identified OTUs comprise 33% of richness and 28% of species abundance per individual, and are enriched in humans from rural populations. A meta-analysis of clinical gut-microbiome studies pinpointed numerous disease associations for the newly identified OTUs, which have the potential to improve predictive models. Finally, our analysis revealed that uncultured gut species have undergone genome reduction that has resulted in the loss of certain biosynthetic pathways, which may offer clues for improving cultivation strategies in the future.

Journal ArticleDOI
TL;DR: This work presents vConTACT v.2.0, a network-based application utilizing whole genome gene-sharing profiles for virus taxonomy that integrates distance-based hierarchical clustering and confidence scores for all taxonomic predictions, and applies it to analyze 15,280 Global Ocean Virome genome fragments.
Abstract: Microbiomes from every environment contain a myriad of uncultivated archaeal and bacterial viruses, but studying these viruses is hampered by the lack of a universal, scalable taxonomic framework. We present vConTACT v.2.0, a network-based application utilizing whole genome gene-sharing profiles for virus taxonomy that integrates distance-based hierarchical clustering and confidence scores for all taxonomic predictions. We report near-identical (96%) replication of existing genus-level viral taxonomy assignments from the International Committee on Taxonomy of Viruses for National Center for Biotechnology Information virus RefSeq. Application of vConTACT v.2.0 to 1,364 previously unclassified viruses deposited in virus RefSeq as reference genomes produced automatic, high-confidence genus assignments for 820 of the 1,364. We applied vConTACT v.2.0 to analyze 15,280 Global Ocean Virome genome fragments and were able to provide taxonomic assignments for 31% of these data, which shows that our algorithm is scalable to very large metagenomic datasets. Our taxonomy tool can be automated and applied to metagenomes from any environment for virus classification.

Journal ArticleDOI
05 Jun 2019-Nature
TL;DR: Attractions between heterochromatic regions are essential for phase separation of the active and inactive genome in inverted and conventional nuclei, whereas chromatin–lamina interactions are necessary to build the conventional genomic architecture from these segregated phases.
Abstract: The nucleus of mammalian cells displays a distinct spatial segregation of active euchromatic and inactive heterochromatic regions of the genome [1,2]. In conventional nuclei, microscopy shows that euchromatin is localized in the nuclear interior and heterochromatin at the nuclear periphery [1,2]. Genome-wide chromosome conformation capture (Hi-C) analyses show this segregation as a plaid pattern of contact enrichment within euchromatin and heterochromatin compartments [3], and depletion between them. Many mechanisms for the formation of compartments have been proposed, such as attraction of heterochromatin to the nuclear lamina [2,4], preferential attraction of similar chromatin to each other [1,4–12], higher levels of chromatin mobility in active chromatin [13–15] and transcription-related clustering of euchromatin [16,17]. However, these hypotheses have remained inconclusive, owing to the difficulty of disentangling intra-chromatin and chromatin–lamina interactions in conventional nuclei [18]. The marked reorganization of interphase chromosomes in the inverted nuclei of rods in nocturnal mammals [19,20] provides an opportunity to elucidate the mechanisms that underlie spatial compartmentalization. Here we combine Hi-C analysis of inverted rod nuclei with microscopy and polymer simulations. We find that attractions between heterochromatic regions are crucial for establishing both compartmentalization and the concentric shells of pericentromeric heterochromatin, facultative heterochromatin and euchromatin in the inverted nucleus. When interactions between heterochromatin and the lamina are added, the same model recreates the conventional nuclear organization. In addition, our models allow us to rule out mechanisms of compartmentalization that involve strong euchromatin interactions. Together, our experiments and modelling suggest that attractions between heterochromatic regions are essential for the phase separation of the active and inactive genome in inverted and conventional nuclei, whereas interactions of the chromatin with the lamina are necessary to build the conventional architecture from these segregated phases.

Journal ArticleDOI
Hui Zheng, Wei Xie
TL;DR: This Review discusses recent progress in understanding of the general principles of chromatin folding, its regulation and its functions in mammalian development, and discusses the dynamics of 3D chromatin and genome organization during gametogenesis, embryonic development, lineage commitment and stem cell differentiation.
Abstract: In eukaryotes, the genome does not exist as a linear molecule but instead is hierarchically packaged inside the nucleus. This complex genome organization includes multiscale structural units of chromosome territories, compartments, topologically associating domains, which are often demarcated by architectural proteins such as CTCF and cohesin, and chromatin loops. The 3D organization of chromatin modulates biological processes such as transcription, DNA replication, cell division and meiosis, which are crucial for cell differentiation and animal development. In this Review, we discuss recent progress in our understanding of the general principles of chromatin folding, its regulation and its functions in mammalian development. Specifically, we discuss the dynamics of 3D chromatin and genome organization during gametogenesis, embryonic development, lineage commitment and stem cell differentiation, and focus on the functions of chromatin architecture in transcription regulation. Finally, we discuss the role of 3D genome alterations in the aetiology of developmental disorders and human diseases.

Journal ArticleDOI
05 Jul 2019-Science
TL;DR: This work expands the understanding of the functional diversity of CRISPR-Cas systems and establishes a paradigm for precision DNA insertion.
Abstract: CRISPR-Cas nucleases are powerful tools for manipulating nucleic acids; however, targeted insertion of DNA remains a challenge, as it requires host cell repair machinery. Here we characterize a CRISPR-associated transposase from cyanobacteria Scytonema hofmanni (ShCAST) that consists of Tn7-like transposase subunits and the type V-K CRISPR effector (Cas12k). ShCAST catalyzes RNA-guided DNA transposition by unidirectionally inserting segments of DNA 60 to 66 base pairs downstream of the protospacer. ShCAST integrates DNA into targeted sites in the Escherichia coli genome with frequencies of up to 80% without positive selection. This work expands our understanding of the functional diversity of CRISPR-Cas systems and establishes a paradigm for precision DNA insertion.
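Illustrative sketch (not part of the abstract): the reported insertion position is simple coordinate arithmetic, 60 to 66 base pairs downstream of the protospacer. The helper below is hypothetical. It assumes the caller supplies the genomic coordinate of the protospacer's downstream end and the strand the protospacer lies on, and it treats "downstream" as increasing coordinates on the plus strand and decreasing coordinates on the minus strand.

```python
# Assumption: protospacer_end is the coordinate of the protospacer's
# downstream end; the coordinate convention is chosen only for illustration.

def insertion_window(protospacer_end, strand="+", min_offset=60, max_offset=66):
    """Return the (low, high) coordinates of the expected insertion window."""
    if strand == "+":
        return (protospacer_end + min_offset, protospacer_end + max_offset)
    if strand == "-":
        return (protospacer_end - max_offset, protospacer_end - min_offset)
    raise ValueError("strand must be '+' or '-'")


plus_window = insertion_window(1000)        # (1060, 1066)
minus_window = insertion_window(1000, "-")  # (934, 940)
```

Checking whether an observed insertion coordinate falls inside this window is one simple way to separate on-target from off-target integration events when analysing sequencing reads.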

Journal ArticleDOI
10 Jan 2019-Cell
TL;DR: A multiplex, expression quantitative trait locus (eQTL)-inspired framework for mapping enhancer-gene pairs by introducing random combinations of CRISPR/Cas9-mediated perturbations to each of many cells, followed by single-cell RNA sequencing (RNA-seq).

Journal ArticleDOI
TL;DR: The main features of topologically associating domains across species are depicted and the relation between chromatin structure, genome activity, and epigenome is discussed, highlighting mechanistic principles of TAD formation.
Abstract: Understanding the mechanisms that underlie chromosome folding within cell nuclei is essential to determine the relationship between genome structure and function. The recent application of "chromosome conformation capture" techniques has revealed that the genome of many species is organized into domains of preferential internal chromatin interactions called "topologically associating domains" (TADs). This chromosome folding has emerged as a key feature of higher-order genome organization and function through evolution. Although TADs have now been described in a wide range of organisms, they appear to have specific characteristics in terms of size, structure, and proteins involved in their formation. Here, we depict the main features of these domains across species and discuss the relation between chromatin structure, genome activity, and epigenome, highlighting mechanistic principles of TAD formation. We also consider the potential influence of TADs in genome evolution.

Journal ArticleDOI
TL;DR: The improved resource of gastrointestinal bacterial reference sequences circumvents dependence on de novo assembly of metagenomes and enables accurate and cost-effective shotgun metagenomic analyses of human gastrointestinal microbiota.
Abstract: Understanding gut microbiome functions requires cultivated bacteria for experimental validation and reference bacterial genome sequences to interpret metagenome datasets and guide functional analyses. We present the Human Gastrointestinal Bacteria Culture Collection (HBC), a comprehensive set of 737 whole-genome-sequenced bacterial isolates, representing 273 species (105 novel species) from 31 families found in the human gastrointestinal microbiota. The HBC increases the number of bacterial genomes derived from human gastrointestinal microbiota by 37%. The resulting global Human Gastrointestinal Bacteria Genome Collection (HGG) classifies 83% of genera by abundance across 13,490 shotgun-sequenced metagenomic samples, improves taxonomic classification by 61% compared to the Human Microbiome Project (HMP) genome collection and achieves subspecies-level classification for almost 50% of sequences. The improved resource of gastrointestinal bacterial reference sequences circumvents dependence on de novo assembly of metagenomes and enables accurate and cost-effective shotgun metagenomic analyses of human gastrointestinal microbiota.