scispace - formally typeset
Search or ask a question

Showing papers in "Genome Research in 2017"


Journal ArticleDOI
TL;DR: Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences, is presented, demonstrating that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either Pacific Biosciences or Oxford Nanopore technologies.
Abstract: Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either Pacific Biosciences (PacBio) or Oxford Nanopore technologies and achieves a contig NG50 of >21 Mbp on both human and Drosophila melanogaster PacBio data sets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes.

4,806 citations


Journal ArticleDOI
TL;DR: MetaSPAdes as mentioned in this paper addresses various challenges of metagenomic assembly by capitalizing on computational ideas that proved to be useful in assemblies of single cells and highly polymorphic diploid genomes.
Abstract: While metagenomics has emerged as a technology of choice for analyzing bacterial populations, the assembly of metagenomic data remains challenging, thus stifling biological discoveries. Moreover, recent studies revealed that complex bacterial populations may be composed from dozens of related strains, thus further amplifying the challenge of metagenomic assembly. metaSPAdes addresses various challenges of metagenomic assembly by capitalizing on computational ideas that proved to be useful in assemblies of single cells and highly polymorphic diploid genomes. We benchmark metaSPAdes against other state-of-the-art metagenome assemblers and demonstrate that it results in high-quality assemblies across diverse data sets.

2,295 citations


Journal ArticleDOI
TL;DR: It is shown that the error-correction step can be omitted and that high-quality consensus sequences can be generated efficiently with a SIMD-accelerated, partial-order alignment-based, stand-alone consensus module called Racon.
Abstract: The assembly of long reads from Pacific Biosciences and Oxford Nanopore Technologies typically requires resource-intensive error-correction and consensus-generation steps to obtain high-quality assemblies. We show that the error-correction step can be omitted and that high-quality consensus sequences can be generated efficiently with a SIMD-accelerated, partial-order alignment-based, stand-alone consensus module called Racon. Based on tests with PacBio and Oxford Nanopore data sets, we show that Racon coupled with miniasm enables consensus genomes with similar or better quality than state-of-the-art methods while being an order of magnitude faster.

1,716 citations


Journal ArticleDOI
TL;DR: It is shown that errors in the UMI sequence are common and network-based methods to account for these errors when identifying PCR duplicates are introduced, demonstrating the value of properly accounting for errors in UMIs.
Abstract: Unique Molecular Identifiers (UMIs) are random oligonucleotide barcodes that are increasingly used in high-throughput sequencing experiments. Through a UMI, identical copies arising from distinct molecules can be distinguished from those arising through PCR amplification of the same molecule. However, bioinformatic methods to leverage the information from UMIs have yet to be formalized. In particular, sequencing errors in the UMI sequence are often ignored or else resolved in an ad hoc manner. We show that errors in the UMI sequence are common and introduce network-based methods to account for these errors when identifying PCR duplicates. Using these methods, we demonstrate improved quantification accuracy both under simulated conditions and real iCLIP and single-cell RNA-seq data sets. Reproducibility between iCLIP replicates and single-cell RNA-seq clustering are both improved using our proposed network-based method, demonstrating the value of properly accounting for errors in UMIs. These methods are implemented in the open source UMI-tools software package.

1,147 citations


Journal ArticleDOI
TL;DR: This work demonstrates a straightforward and low-cost method for creating true diploid de novo assemblies, and provides a scalable capability for determining the actual diploids genome sequence in a sample, opening the door to new approaches in genomic biology and medicine.
Abstract: Determining the genome sequence of an organism is challenging, yet fundamental to understanding its biology. Over the past decade, thousands of human genomes have been sequenced, contributing deeply to biomedical research. In the vast majority of cases, these have been analyzed by aligning sequence reads to a single reference genome, biasing the resulting analyses, and in general, failing to capture sequences novel to a given genome. Some de novo assemblies have been constructed free of reference bias, but nearly all were constructed by merging homologous loci into single "consensus" sequences, generally absent from nature. These assemblies do not correctly represent the diploid biology of an individual. In exactly two cases, true diploid de novo assemblies have been made, at great expense. One was generated using Sanger sequencing, and one using thousands of clone pools. Here, we demonstrate a straightforward and low-cost method for creating true diploid de novo assemblies. We make a single library from ∼1 ng of high molecular weight DNA, using the 10x Genomics microfluidic platform to partition the genome. We applied this technique to seven human samples, generating low-cost HiSeq X data, then assembled these using a new "pushbutton" algorithm, Supernova. Each computation took 2 d on a single server. Each yielded contigs longer than 100 kb, phase blocks longer than 2.5 Mb, and scaffolds longer than 15 Mb. Our method provides a scalable capability for determining the actual diploid genome sequence in a sample, opening the door to new approaches in genomic biology and medicine.

707 citations


Journal ArticleDOI
TL;DR: It is asserted that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote the understanding of human biology and advance the efforts to improve health.
Abstract: The human reference genome assembly plays a central role in nearly all aspects of today's basic and clinical research. GRCh38 is the first coordinate-changing assembly update since 2009; it reflects the resolution of roughly 1000 issues and encompasses modifications ranging from thousands of single base changes to megabase-scale path reorganizations, gap closures, and localization of previously orphaned sequences. We developed a new approach to sequence generation for targeted base updates and used data from new genome mapping technologies and single haplotype resources to identify and resolve larger assembly issues. For the first time, the reference assembly contains sequence-based representations for the centromeres. We also expanded the number of alternate loci to create a reference that provides a more robust representation of human population variation. We demonstrate that the updates render the reference an improved annotation substrate, alter read alignments in unchanged regions, and impact variant interpretation at clinically relevant loci. We additionally evaluated a collection of new de novo long-read haploid assemblies and conclude that although the new assemblies compare favorably to the reference with respect to continuity, error rate, and gene completeness, the reference still provides the best representation for complex genomic regions and coding sequences. We assert that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote our understanding of human biology and advance our efforts to improve health.

643 citations


Journal ArticleDOI
TL;DR: StrainPhlAn is introduced, a novel metagenomic strain identification approach, and applied to characterize the genetic structure of thousands of strains from more than 125 species in more than 1500 gut metagenomes drawn from populations spanning North and South American, European, Asian, and African countries, providing a comprehensive strain-level genetic overview of the gut microbial diversity.
Abstract: Among the human health conditions linked to microbial communities, phenotypes are often associated with only a subset of strains within causal microbial groups. Although it has been critical for decades in microbial physiology to characterize individual strains, this has been challenging when using culture-independent high-throughput metagenomics. We introduce StrainPhlAn, a novel metagenomic strain identification approach, and apply it to characterize the genetic structure of thousands of strains from more than 125 species in more than 1500 gut metagenomes drawn from populations spanning North and South American, European, Asian, and African countries. The method relies on per-sample dominant sequence variant reconstruction within species-specific marker genes. It identified primarily subject-specific strain variants ( 70% of species). Microbial population structure was correlated in several distinct ways with the geographic structure of the host population. In some cases, discrete subspecies (e.g., for Eubacterium rectale and Prevotella copri) or continuous microbial genetic variations (e.g., for Faecalibacterium prausnitzii) were associated with geographically distinct human populations, whereas few strains occurred in multiple unrelated cohorts. We further estimated the genetic variability of gut microbes, with Bacteroides species appearing remarkably consistent (0.45% median number of nucleotide variants between strains), whereas P. copri was among the most plastic gut colonizers. We thus characterize here the population genetics of previously inaccessible intestinal microbes, providing a comprehensive strain-level genetic overview of the gut microbial diversity.

489 citations


Journal ArticleDOI
TL;DR: ABySS 2.0 is benchmarked using a Genome in a Bottle data set of 250-bp Illumina paired-end and 6-kbp mate-pair libraries from a single individual and implements algorithms that employ a Bloom filter, a probabilistic data structure, to represent a de Bruijn graph and reduce memory requirements.
Abstract: The assembly of DNA sequences de novo is fundamental to genomics research. It is the first of many steps toward elucidating and characterizing whole genomes. Downstream applications, including analysis of genomic variation between species, between or within individuals critically depend on robustly assembled sequences. In the span of a single decade, the sequence throughput of leading DNA sequencing instruments has increased drastically, and coupled with established and planned large-scale, personalized medicine initiatives to sequence genomes in the thousands and even millions, the development of efficient, scalable and accurate bioinformatics tools for producing high-quality reference draft genomes is timely. With ABySS 1.0, we originally showed that assembling the human genome using short 50-bp sequencing reads was possible by aggregating the half terabyte of compute memory needed over several computers using a standardized message-passing system (MPI). We present here its redesign, which departs from MPI and instead implements algorithms that employ a Bloom filter, a probabilistic data structure, to represent a de Bruijn graph and reduce memory requirements. We benchmarked ABySS 2.0 human genome assembly using a Genome in a Bottle data set of 250-bp Illumina paired-end and 6-kbp mate-pair libraries from a single individual. Our assembly yielded a NG50 (NGA50) scaffold contiguity of 3.5 (3.0) Mbp using <35 GB of RAM. This is a modest memory requirement by today's standards and is often available on a single computer. We also investigate the use of BioNano Genomics and 10x Genomics' Chromium data to further improve the scaffold NG50 (NGA50) of this assembly to 42 (15) Mbp.

458 citations


Journal ArticleDOI
TL;DR: It is revealed that delta cells specifically express receptors that receive and coordinate systemic cues from the leptin, ghrelin, and dopamine signaling pathways implicating them as integrators of central and peripheral metabolic signals into the pancreatic islet.
Abstract: Blood glucose levels are tightly controlled by the coordinated action of at least four cell types constituting pancreatic islets. Changes in the proportion and/or function of these cells are associated with genetic and molecular pathophysiology of monogenic, type 1, and type 2 (T2D) diabetes. Cellular heterogeneity impedes precise understanding of the molecular components of each islet cell type that govern islet (dys)function, particularly the less abundant delta and gamma/pancreatic polypeptide (PP) cells. Here, we report single-cell transcriptomes for 638 cells from nondiabetic (ND) and T2D human islet samples. Analyses of ND single-cell transcriptomes identified distinct alpha, beta, delta, and PP/gamma cell-type signatures. Genes linked to rare and common forms of islet dysfunction and diabetes were expressed in the delta and PP/gamma cell types. Moreover, this study revealed that delta cells specifically express receptors that receive and coordinate systemic cues from the leptin, ghrelin, and dopamine signaling pathways implicating them as integrators of central and peripheral metabolic signals into the pancreatic islet. Finally, single-cell transcriptome profiling revealed genes differentially regulated between T2D and ND alpha, beta, and delta cells that were undetectable in paired whole islet analyses. This study thus identifies fundamental cell-type-specific features of pancreatic islet (dys)function and provides a critical resource for comprehensive understanding of islet biology and diabetes pathogenesis.

399 citations


Journal ArticleDOI
TL;DR: Analysis of 334,652 SNVs revealed that the majority of these variants are de novo and cell-line mutations or reside within previously unidentified duplications and deletions, which are a resource for objective assessment of the accuracy of variant calls throughout genomes.
Abstract: Improvement of variant calling in next-generation sequence data requires a comprehensive, genome-wide catalog of high-confidence variants called in a set of genomes for use as a benchmark. We generated deep, whole-genome sequence data of 17 individuals in a three-generation pedigree and called variants in each genome using a range of currently available algorithms. We used haplotype transmission information to create a phased "Platinum" variant catalog of 4.7 million single-nucleotide variants (SNVs) plus 0.7 million small (1-50 bp) insertions and deletions (indels) that are consistent with the pattern of inheritance in the parents and 11 children of this pedigree. Platinum genotypes are highly concordant with the current catalog of the National Institute of Standards and Technology for both SNVs (>99.99%) and indels (99.92%) and add a validated truth catalog that has 26% more SNVs and 45% more indels. Analysis of 334,652 SNVs that were consistent between informatics pipelines yet inconsistent with haplotype transmission ("nonplatinum") revealed that the majority of these variants are de novo and cell-line mutations or reside within previously unidentified duplications and deletions. The reference materials from this study are a resource for objective assessment of the accuracy of variant calls throughout genomes.

351 citations


Journal ArticleDOI
TL;DR: A new wheat whole-genome shotgun sequence assembly is generated using a combination of optimized data types and an assembly algorithm designed to deal with large and complex genomes.
Abstract: Advances in genome sequencing and assembly technologies are generating many high-quality genome sequences, but assemblies of large, repeat-rich polyploid genomes, such as that of bread wheat, remain fragmented and incomplete. We have generated a new wheat whole-genome shotgun sequence assembly using a combination of optimized data types and an assembly algorithm designed to deal with large and complex genomes. The new assembly represents >78% of the genome with a scaffold N50 of 88.8 kb that has a high fidelity to the input data. Our new annotation combines strand-specific Illumina RNA-seq and Pacific Biosciences (PacBio) full-length cDNAs to identify 104,091 high-confidence protein-coding genes and 10,156 noncoding RNA genes. We confirmed three known and identified one novel genome rearrangements. Our approach enables the rapid and scalable assembly of wheat genomes, the identification of structural variants, and the definition of complete gene models, all powerful resources for trait analysis and breeding of this key global crop.

Journal ArticleDOI
TL;DR: This work describes an algorithm that solves the high error rate problem by combining long, high-error reads with shorter but much more accurate Illumina sequencing reads, whose error rates average <1%.
Abstract: Long sequencing reads generated by single-molecule sequencing technology offer the possibility of dramatically improving the contiguity of genome assemblies. The biggest challenge today is that long reads have relatively high error rates, currently around 15%. The high error rates make it difficult to use this data alone, particularly with highly repetitive plant genomes. Errors in the raw data can lead to insertion or deletion errors (indels) in the consensus genome sequence, which in turn create significant problems for downstream analysis; for example, a single indel may shift the reading frame and incorrectly truncate a protein sequence. Here, we describe an algorithm that solves the high error rate problem by combining long, high-error reads with shorter but much more accurate Illumina sequencing reads, whose error rates average <1%. Our hybrid assembly algorithm combines these two types of reads to construct mega-reads, which are both long and accurate, and then assembles the mega-reads using the CABOG assembler, which was designed for long reads. We apply this technique to a large data set of Illumina and PacBio sequences from the species Aegilops tauschii, a large and extremely repetitive plant genome that has resisted previous attempts at assembly. We show that the resulting assembled contigs are far larger than in any previous assembly, with an N50 contig size of 486,807 nucleotides. We compare the contigs to independently produced optical maps to evaluate their large-scale accuracy, and to a set of high-quality bacterial artificial chromosome (BAC)-based assemblies to evaluate base-level accuracy.

Journal ArticleDOI
TL;DR: Interestingly, when the authors repeat SV detection on a pseudodiploid genome constructed in silico by merging the two haploids, it is found that ∼59% of the heterozygous SVs are no longer detected by SMRT-SV, indicating that haploid resolution of long-read sequencing data will significantly increase sensitivity of SV detection.
Abstract: In an effort to more fully understand the full spectrum of human genetic variation, we generated deep single-molecule, real-time (SMRT) sequencing data from two haploid human genomes. By using an assembly-based approach (SMRT-SV), we systematically assessed each genome independently for structural variants (SVs) and indels resolving the sequence structure of 461,553 genetic variants from 2 bp to 28 kbp in length. We find that >89% of these variants have been missed as part of analysis of the 1000 Genomes Project even after adjusting for more common variants (MAF > 1%). We estimate that this theoretical human diploid differs by as much as ∼16 Mbp with respect to the human reference, with long-read sequencing data providing a fivefold increase in sensitivity for genetic variants ranging in size from 7 bp to 1 kbp compared with short-read sequence data. Although a large fraction of genetic variants were not detected by short-read approaches, once the alternate allele is sequence-resolved, we show that 61% of SVs can be genotyped in short-read sequence data sets with high accuracy. Uncoupling discovery from genotyping thus allows for the majority of this missed common variation to be genotyped in the human population. Interestingly, when we repeat SV detection on a pseudodiploid genome constructed in silico by merging the two haploids, we find that ∼59% of the heterozygous SVs are no longer detected by SMRT-SV. These results indicate that haploid resolution of long-read sequencing data will significantly increase sensitivity of SV detection.

Journal ArticleDOI
TL;DR: A novel similarity measure, the stratum adjusted correlation coefficient (SCC), is introduced, for quantifying the similarity between Hi-C interaction matrices, and consistently shows higher accuracy than existing approaches in distinguishing subtle differences in reproducibility and depicting interrelationships of cell lineages.
Abstract: Hi-C is a powerful technology for studying genome-wide chromatin interactions. However, current methods for assessing Hi-C data reproducibility can produce misleading results because they ignore spatial features in Hi-C data, such as domain structure and distance dependence. We present HiCRep, a framework for assessing the reproducibility of Hi-C data that systematically accounts for these features. In particular, we introduce a novel similarity measure, the stratum adjusted correlation coefficient (SCC), for quantifying the similarity between Hi-C interaction matrices. Not only does it provide a statistically sound and reliable evaluation of reproducibility, SCC can also be used to quantify differences between Hi-C contact matrices and to determine the optimal sequencing depth for a desired resolution. The measure consistently shows higher accuracy than existing approaches in distinguishing subtle differences in reproducibility and depicting interrelationships of cell lineages. The proposed measure is straightforward to interpret and easy to compute, making it well-suited for providing standardized, interpretable, automatable, and scalable quality control. The freely available R package HiCRep implements our approach.

Journal ArticleDOI
TL;DR: VastDB provides readily accessible quantitative information on the inclusion levels and functional associations of AS events detected in RNA-seq data from diverse vertebrate cell and tissue types, as well as developmental stages, and reveals extensive new intergenic and intragenic regulatory relationships among different classes of AS and previously unknown and conserved landscapes of tissue-regulated exons.
Abstract: Alternative splicing (AS) generates remarkable regulatory and proteomic complexity in metazoans. However, the functions of most AS events are not known, and programs of regulated splicing remain to be identified. To address these challenges, we describe the Vertebrate Alternative Splicing and Transcription Database (VastDB), the largest resource of genome-wide, quantitative profiles of AS events assembled to date. VastDB provides readily accessible quantitative information on the inclusion levels and functional associations of AS events detected in RNA-seq data from diverse vertebrate cell and tissue types, as well as developmental stages. The VastDB profiles reveal extensive new intergenic and intragenic regulatory relationships among different classes of AS and previously unknown and conserved landscapes of tissue-regulated exons. Contrary to recent reports concluding that nearly all human genes express a single major isoform, VastDB provides evidence that at least 48% of multiexonic protein-coding genes express multiple splice variants that are highly regulated in a cell/tissue-specific manner, and that >18% of genes simultaneously express multiple major isoforms across diverse cell and tissue types. Isoforms encoded by the latter set of genes are generally coexpressed in the same cells and are often engaged by translating ribosomes. Moreover, they are encoded by genes that are significantly enriched in functions associated with transcriptional control, implying they may have an important and wide-ranging role in controlling cellular activities. VastDB thus provides an unprecedented resource for investigations of AS function and regulation.

Journal ArticleDOI
TL;DR: It is shown that HapCUT2 rapidly assembles haplotypes with best-in-class accuracy for all data types and scales well for high sequencing coverage and rapidly assembled haplotypes for two long-read WGS data sets on which other methods struggled.
Abstract: Many tools have been developed for haplotype assembly-the reconstruction of individual haplotypes using reads mapped to a reference genome sequence. Due to increasing interest in obtaining haplotype-resolved human genomes, a range of new sequencing protocols and technologies have been developed to enable the reconstruction of whole-genome haplotypes. However, existing computational methods designed to handle specific technologies do not scale well on data from different protocols. We describe a new algorithm, HapCUT2, that extends our previous method (HapCUT) to handle multiple sequencing technologies. Using simulations and whole-genome sequencing (WGS) data from multiple different data types-dilution pool sequencing, linked-read sequencing, single molecule real-time (SMRT) sequencing, and proximity ligation (Hi-C) sequencing-we show that HapCUT2 rapidly assembles haplotypes with best-in-class accuracy for all data types. In particular, HapCUT2 scales well for high sequencing coverage and rapidly assembled haplotypes for two long-read WGS data sets on which other methods struggled. Further, HapCUT2 directly models Hi-C specific error modalities, resulting in significant improvements in error rates compared to HapCUT, the only other method that could assemble haplotypes from Hi-C data. Using HapCUT2, haplotype assembly from a 90× coverage whole-genome Hi-C data set yielded high-resolution haplotypes (78.6% of variants phased in a single block) with high pairwise phasing accuracy (∼98% across chromosomes). Our results demonstrate that HapCUT2 is a robust tool for haplotype assembly applicable to data from diverse sequencing technologies.

Journal ArticleDOI
TL;DR: This work surveys various projects underway to build and apply graph-based structures-which it is referred to as genome graphs-and discusses the improvements in read mapping, variant calling, and haplotype determination that genome graphs are expected to produce.
Abstract: The human reference genome is part of the foundation of modern human biology and a monumental scientific achievement. However, because it excludes a great deal of common human variation, it introduces a pervasive reference bias into the field of human genomics. To reduce this bias, it makes sense to draw on representative collections of human genomes, brought together into reference cohorts. There are a number of techniques to represent and organize data gleaned from these cohorts, many using ideas implicitly or explicitly borrowed from graph-based models. Here, we survey various projects underway to build and apply these graph-based structures-which we collectively refer to as genome graphs-and discuss the improvements in read mapping, variant calling, and haplotype determination that genome graphs are expected to produce.

Journal ArticleDOI
TL;DR: This study provides the most comprehensive map of MEIs to date spanning chimpanzees, ancient hominids, and modern humans and reveals new aspects of MEI biology in these lineages and is a robust platform for MEI discovery and analysis in a variety of experimental settings.
Abstract: Mobile element insertions (MEIs) represent ∼25% of all structural variants in human genomes. Moreover, when they disrupt genes, MEIs can influence human traits and diseases. Therefore, MEIs should be fully discovered along with other forms of genetic variation in whole genome sequencing (WGS) projects involving population genetics, human diseases, and clinical genomics. Here, we describe the Mobile Element Locator Tool (MELT), which was developed as part of the 1000 Genomes Project to perform MEI discovery on a population scale. Using both Illumina WGS data and simulations, we demonstrate that MELT outperforms existing MEI discovery tools in terms of speed, scalability, specificity, and sensitivity, while also detecting a broader spectrum of MEI-associated features. Several run modes were developed to perform MEI discovery on local and cloud systems. In addition to using MELT to discover MEIs in modern humans as part of the 1000 Genomes Project, we also used it to discover MEIs in chimpanzees and ancient (Neanderthal and Denisovan) hominids. We detected diverse patterns of MEI stratification across these populations that likely were caused by (1) diverse rates of MEI production from source elements, (2) diverse patterns of MEI inheritance, and (3) the introgression of ancient MEIs into modern human genomes. Overall, our study provides the most comprehensive map of MEIs to date spanning chimpanzees, ancient hominids, and modern humans and reveals new aspects of MEI biology in these lineages. We also demonstrate that MELT is a robust platform for MEI discovery and analysis in a variety of experimental settings.

Journal ArticleDOI
TL;DR: A software tool called ExpansionHunter is developed that, using PCR-free WGS short-read data, can genotype repeats at the locus of interest, even if the expanded repeat is larger than the read length, and provides researchers with a tool that can be used to identify new pathogenic repeat expansions.
Abstract: Identifying large expansions of short tandem repeats (STRs), such as those that cause amyotrophic lateral sclerosis (ALS) and fragile X syndrome, is challenging for short-read whole-genome sequencing (WGS) data. A solution to this problem is an important step toward integrating WGS into precision medicine. We developed a software tool called ExpansionHunter that, using PCR-free WGS short-read data, can genotype repeats at the locus of interest, even if the expanded repeat is larger than the read length. We applied our algorithm to WGS data from 3001 ALS patients who have been tested for the presence of the C9orf72 repeat expansion with repeat-primed PCR (RP-PCR). Compared against this truth data, ExpansionHunter correctly classified all (212/212, 95% CI [0.98, 1.00]) of the expanded samples as either expansions (208) or potential expansions (4). Additionally, 99.9% (2786/2789, 95% CI [0.997, 1.00]) of the wild-type samples were correctly classified as wild type by this method with the remaining three samples identified as possible expansions. We further applied our algorithm to a set of 152 samples in which every sample had one of eight different pathogenic repeat expansions, including those associated with fragile X syndrome, Friedreich's ataxia, and Huntington's disease, and correctly flagged all but one of the known repeat expansions. Thus, ExpansionHunter can be used to accurately detect known pathogenic repeat expansions and provides researchers with a tool that can be used to identify new pathogenic repeat expansions.

Journal ArticleDOI
TL;DR: A novel lentivirus-based massively parallel reporter assay (lentiMPRA) is developed and applied to directly compare the functional activities of 2236 candidate liver enhancers in an episomal versus a chromosomally integrated context and finds that the activities of chromosomeally integrated sequences are substantially different from the activities from the identical sequences assayed on episomes, and furthermore are correlated with different subsets of ENCODE annotations.
Abstract: Candidate enhancers can be identified on the basis of chromatin modifications, the binding of chromatin modifiers and transcription factors and cofactors, or chromatin accessibility. However, validating such candidates as bona fide enhancers requires functional characterization, typically achieved through reporter assays that test whether a sequence can increase expression of a transcriptional reporter via a minimal promoter. A longstanding concern is that reporter assays are mainly implemented on episomes, which are thought to lack physiological chromatin. However, the magnitude and determinants of differences in cis-regulation for regulatory sequences residing in episomes versus chromosomes remain almost completely unknown. To address this systematically, we developed and applied a novel lentivirus-based massively parallel reporter assay (lentiMPRA) to directly compare the functional activities of 2236 candidate liver enhancers in an episomal versus a chromosomally integrated context. We find that the activities of chromosomally integrated sequences are substantially different from the activities of the identical sequences assayed on episomes, and furthermore are correlated with different subsets of ENCODE annotations. The results of chromosomally based reporter assays are also more reproducible and more strongly predictable by both ENCODE annotations and sequence-based models. With a linear model that combines chromatin annotations and sequence information, we achieve a Pearson's R2 of 0.362 for predicting the results of chromosomally integrated reporter assays. This level of prediction is better than with either chromatin annotations or sequence information alone and also outperforms predictive models of episomal assays. Our results have broad implications for how cis-regulatory elements are identified, prioritized and functionally validated.

Journal ArticleDOI
TL;DR: The results support a model in which YY1 acts as an architectural protein to connect developmentally regulated looping interactions; the location of YY 1-mediated interactions may be demarcated in development by a preexisting topological framework created by constitutive CTCF- mediated interactions.
Abstract: CTCF is an architectural protein with a critical role in connecting higher-order chromatin folding in pluripotent stem cells Recent reports have suggested that CTCF binding is more dynamic during development than previously appreciated Here, we set out to understand the extent to which shifts in genome-wide CTCF occupancy contribute to the 3D reconfiguration of fine-scale chromatin folding during early neural lineage commitment Unexpectedly, we observe a sharp decrease in CTCF occupancy during the transition from naive/primed pluripotency to multipotent primary neural progenitor cells (NPCs) Many pluripotency gene-enhancer interactions are anchored by CTCF, and its occupancy is lost in parallel with loop decommissioning during differentiation Conversely, CTCF binding sites in NPCs are largely preexisting in pluripotent stem cells Only a small number of CTCF sites arise de novo in NPCs We identify another zinc finger protein, Yin Yang 1 (YY1), at the base of looping interactions between NPC-specific genes and enhancers Putative NPC-specific enhancers exhibit strong YY1 signal when engaged in 3D contacts and negligible YY1 signal when not in loops Moreover, siRNA knockdown of Yy1 specifically disrupts interactions between key NPC enhancers and their target genes YY1-mediated interactions between NPC regulatory elements are often nested within constitutive loops anchored by CTCF Together, our results support a model in which YY1 acts as an architectural protein to connect developmentally regulated looping interactions; the location of YY1-mediated interactions may be demarcated in development by a preexisting topological framework created by constitutive CTCF-mediated interactions

Journal ArticleDOI
TL;DR: This work studied the cell-cell interactome of fetal placental trophoblast cells and maternal endometrial stromal cells, using single-cell transcriptomics and finds a highly cell-type-specific expression of G-protein-coupled receptors, implying that ligand-receptor profiles could be a reliable tool for cell type identification.
Abstract: Organismal function is, to a great extent, determined by interactions among their fundamental building blocks, the cells. In this work, we studied the cell-cell interactome of fetal placental trophoblast cells and maternal endometrial stromal cells, using single-cell transcriptomics. The placental interface mediates the interaction between two semiallogenic individuals, the mother and the fetus, and is thus the epitome of cell interactions. To study these, we inferred the cell-cell interactome by assessing the gene expression of receptor-ligand pairs across cell types. We find a highly cell-type-specific expression of G-protein-coupled receptors, implying that ligand-receptor profiles could be a reliable tool for cell type identification. Furthermore, we find that uterine decidual cells represent a cell-cell interaction hub with a large number of potential incoming and outgoing signals. Decidual cells differentiate from their precursors, the endometrial stromal fibroblasts, during uterine preparation for pregnancy. We show that decidualization (even in vitro) enhances the ability to communicate with the fetus, as most of the receptors and ligands up-regulated during decidualization have their counterpart expressed in trophoblast cells. Among the signals transmitted, growth factors and immune signals dominate, and suggest a delicate balance of enhancing and suppressive signals. Finally, this study provides a rich resource of gene expression profiles of term intravillous and extravillous trophoblasts, including the transcriptome of the multinucleated syncytiotrophoblast.

Journal ArticleDOI
TL;DR: A new method for detecting rearrangements, GRIDSS (Genome Rearrangement IDentification Software Suite), is a multithreaded structural variant (SV) caller that performs efficient genome-wide break-end assembly prior to variant calling using a novel positional de Bruijn graph-based assembler.
Abstract: The identification of genomic rearrangements with high sensitivity and specificity using massively parallel sequencing remains a major challenge, particularly in precision medicine and cancer research. Here, we describe a new method for detecting rearrangements, GRIDSS (Genome Rearrangement IDentification Software Suite). GRIDSS is a multithreaded structural variant (SV) caller that performs efficient genome-wide break-end assembly prior to variant calling using a novel positional de Bruijn graph-based assembler. By combining assembly, split read, and read pair evidence using a probabilistic scoring, GRIDSS achieves high sensitivity and specificity on simulated, cell line, and patient tumor data, recently winning SV subchallenge #5 of the ICGC-TCGA DREAM8.5 Somatic Mutation Calling Challenge. On human cell line data, GRIDSS halves the false discovery rate compared to other recent methods while matching or exceeding their sensitivity. GRIDSS identifies nontemplate sequence insertions, microhomologies, and large imperfect homologies, estimates a quality score for each breakpoint, stratifies calls into high or low confidence, and supports multisample analysis.

Journal ArticleDOI
TL;DR: The data suggest that the role of microRNAs in normal and pathological conditions has been overestimated due to the frequent overlooking of false positive rates, and that human haplo-insufficient genes tend to bear the most highly conserved microRNA binding sites.
Abstract: According to the current view, each microRNA regulates hundreds of genes. Computational tools aim at identifying microRNA targets, usually selecting evolutionarily conserved microRNA binding sites. While the false positive rates have been evaluated for some prediction programs, that information is rarely put forward in studies making use of their predictions. Here, we provide evidence that such predictions are often biologically irrelevant. Focusing on miR-223-guided repression, we observed that it is often smaller than inter-individual variability in gene expression among wild-type mice, suggesting that most predicted targets are functionally insensitive to that microRNA. Furthermore, we found that human haplo-insufficient genes tend to bear the most highly conserved microRNA binding sites. It thus appears that biological functionality of microRNA binding sites depends on the dose-sensitivity of their host gene and that, conversely, it is unlikely that every predicted microRNA target is dose-sensitive enough to be functionally regulated by microRNAs. We also observed that some mRNAs can efficiently titrate microRNAs, providing a reason for microRNA binding site conservation for inefficiently repressed targets. Finally, many conserved microRNA binding sites are conserved in a microRNA-independent fashion: Sequence elements may be conserved for other reasons, while being fortuitously complementary to microRNAs. Collectively, our data suggest that the role of microRNAs in normal and pathological conditions has been overestimated due to the frequent overlooking of false positive rates.

Journal ArticleDOI
TL;DR: These findings suggest that the frequency of E. coli lineages in invasive disease is driven by negative frequency-dependent selection occurring outside of the hospital, most probably in the commensal niche, and that drug resistance is not a primary determinant of success in this niche.
Abstract: Escherichia coli associated with urinary tract infections and bacteremia has been intensively investigated, including recent work focusing on the virulent, globally disseminated, multidrug-resistant lineage ST131. To contextualize ST131 within the broader E. coli population associated with disease, we used genomics to analyze a systematic 11-yr hospital-based survey of E. coli associated with bacteremia using isolates collected from across England by the British Society for Antimicrobial Chemotherapy and from the Cambridge University Hospitals NHS Foundation Trust. Population dynamics analysis of the most successful lineages identified the emergence of ST131 and ST69 and their establishment as two of the five most common lineages along with ST73, ST95, and ST12. The most frequently identified lineage was ST73. Compared to ST131, ST73 was susceptible to most antibiotics, indicating that multidrug resistance was not the dominant reason for prevalence of E. coli lineages in this population. Temporal phylogenetic analysis of the emergence of ST69 and ST131 identified differences in the dynamics of emergence and showed that expansion of ST131 in this population was not driven by sequential emergence of increasingly resistant subclades. We showed that over time, the E. coli population was only transiently disturbed by the introduction of new lineages before a new equilibrium was rapidly achieved. Together, these findings suggest that the frequency of E. coli lineages in invasive disease is driven by negative frequency-dependent selection occurring outside of the hospital, most probably in the commensal niche, and that drug resistance is not a primary determinant of success in this niche.

Journal ArticleDOI
TL;DR: A highly multiplexed single-cell DNA sequencing approach was developed to trace the metastatic lineages of two CRC patients with matched liver metastases and revealed an unexpected independent tumor lineage that did not metastasize, and early progenitor clones with the "first hit" mutation in APC that subsequently gave rise to both the primary and metastatic tumors.
Abstract: Metastasis is a complex biological process that has been difficult to delineate in human colorectal cancer (CRC) patients. A major obstacle in understanding metastatic lineages is the extensive intra-tumor heterogeneity at the primary and metastatic tumor sites. To address this problem, we developed a highly multiplexed single-cell DNA sequencing approach to trace the metastatic lineages of two CRC patients with matched liver metastases. Single-cell copy number or mutational profiling was performed, in addition to bulk exome and targeted deep-sequencing. In the first patient, we observed monoclonal seeding, in which a single clone evolved a large number of mutations prior to migrating to the liver to establish the metastatic tumor. In the second patient, we observed polyclonal seeding, in which two independent clones seeded the metastatic liver tumor after having diverged at different time points from the primary tumor lineage. The single-cell data also revealed an unexpected independent tumor lineage that did not metastasize, and early progenitor clones with the "first hit" mutation in APC that subsequently gave rise to both the primary and metastatic tumors. Collectively, these data reveal a late-dissemination model of metastasis in two CRC patients and provide an unprecedented view of metastasis at single-cell genomic resolution.

Journal ArticleDOI
TL;DR: The critical role of TEs in primate gene regulation is demonstrated and potential mechanisms underlying evolutionary divergence among the primate species through the noncoding genome are illustrated.
Abstract: Gene regulation shapes the evolution of phenotypic diversity. We investigated the evolution of liver promoters and enhancers in six primate species using ChIP-seq (H3K27ac and H3K4me1) to profile cis-regulatory elements (CREs) and using RNA-seq to characterize gene expression in the same individuals. To quantify regulatory divergence, we compared CRE activity across species by testing differential ChIP-seq read depths directly measured for orthologous sequences. We show that the primate regulatory landscape is largely conserved across the lineage, with 63% of the tested human liver CREs showing similar activity across species. Conserved CRE function is associated with sequence conservation, proximity to coding genes, cell-type specificity, and transcription factor binding. Newly evolved CREs are enriched in immune response and neurodevelopmental functions. We further demonstrate that conserved CREs bind master regulators, suggesting that while CREs contribute to species adaptation to the environment, core functions remain intact. Newly evolved CREs are enriched in young transposable elements (TEs), including Long-Terminal-Repeats (LTRs) and SINE-VNTR-Alus (SVAs), that significantly affect gene expression. Conversely, only 16% of conserved CREs overlap TEs. We tested the cis-regulatory activity of 69 TE subfamilies by luciferase reporter assays, spanning all major TE classes, and showed that 95.6% of tested TEs can function as either transcriptional activators or repressors. In conclusion, we demonstrated the critical role of TEs in primate gene regulation and illustrated potential mechanisms underlying evolutionary divergence among the primate species through the noncoding genome.

Journal ArticleDOI
TL;DR: It is found that lincRNAs promoters are depleted of transcription factor (TF) binding sites, yet enriched for some specific factors such as GATA and FOS relative to mRNA promoters, and H3K9me3-a histone modification typically associated with transcriptional repression-is more enriched at the promoters of active lincRNA loci than at those of active mRNAs.
Abstract: While long intergenic noncoding RNAs (lincRNAs) and mRNAs share similar biogenesis pathways, these transcript classes differ in many regards. LincRNAs are less evolutionarily conserved, less abundant, and more tissue-specific, suggesting that their pre- and post-transcriptional regulation is different from that of mRNAs. Here, we perform an in-depth characterization of the features that contribute to lincRNA regulation in multiple human cell lines. We find that lincRNA promoters are depleted of transcription factor (TF) binding sites, yet enriched for some specific factors such as GATA and FOS relative to mRNA promoters. Surprisingly, we find that H3K9me3-a histone modification typically associated with transcriptional repression-is more enriched at the promoters of active lincRNA loci than at those of active mRNAs. Moreover, H3K9me3-marked lincRNA genes are more tissue-specific. The most discriminant differences between lincRNAs and mRNAs involve splicing. LincRNAs are less efficiently spliced, which cannot be explained by differences in U1 binding or the density of exonic splicing enhancers but may be partially attributed to lower U2AF65 binding and weaker splicing-related motifs. Conversely, the stability of lincRNAs and mRNAs is similar, differing only with regard to the location of stabilizing protein binding sites. Finally, we find that certain transcriptional properties are correlated with higher evolutionary conservation in both DNA and RNA motifs and are enriched in lincRNAs that have been functionally characterized.

Journal ArticleDOI
TL;DR: A novel algorithm named CaTCH is presented that identifies hierarchical trees of chromosomal domains in Hi-C maps, stratified through their reciprocal physical insulation, which is a single and biologically relevant parameter and shows that previously reported folding layers appear at different insulation levels.
Abstract: Understanding how regulatory sequences interact in the context of chromosomal architecture is a central challenge in biology. Chromosome conformation capture revealed that mammalian chromosomes possess a rich hierarchy of structural layers, from multi-megabase compartments to sub-megabase topologically associating domains (TADs) and sub-TAD contact domains. TADs appear to act as regulatory microenvironments by constraining and segregating regulatory interactions across discrete chromosomal regions. However, it is unclear whether other (or all) folding layers share similar properties, or rather TADs constitute a privileged folding scale with maximal impact on the organization of regulatory interactions. Here, we present a novel algorithm named CaTCH that identifies hierarchical trees of chromosomal domains in Hi-C maps, stratified through their reciprocal physical insulation, which is a single and biologically relevant parameter. By applying CaTCH to published Hi-C data sets, we show that previously reported folding layers appear at different insulation levels. We demonstrate that although no structurally privileged folding level exists, TADs emerge as a functionally privileged scale defined by maximal boundary enrichment in CTCF and maximal cell-type conservation. By measuring transcriptional output in embryonic stem cells and neural precursor cells, we show that the likelihood that genes in a domain are coregulated during differentiation is also maximized at the scale of TADs. Finally, we observe that regulatory sequences occur at genomic locations corresponding to optimized mutual interactions at the same scale. Our analysis suggests that the architectural functionality of TADs arises from the interplay between their ability to partition interactions and the specific genomic position of regulatory sequences.

Journal ArticleDOI
TL;DR: A model is generated that predicts the protein expression of the 5' untranslated region (UTR) of mRNAs in the yeast Saccharomyces cerevisiae and it is confirmed experimentally that the great majority of the evolved sequences led to higher protein expression rates than the starting sequences.
Abstract: Our ability to predict protein expression from DNA sequence alone remains poor, reflecting our limited understanding of cis-regulatory grammar and hampering the design of engineered genes for synthetic biology applications. Here, we generate a model that predicts the protein expression of the 5' untranslated region (UTR) of mRNAs in the yeast Saccharomyces cerevisiae. We constructed a library of half a million 50-nucleotide-long random 5' UTRs and assayed their activity in a massively parallel growth selection experiment. The resulting data allow us to quantify the impact on protein expression of Kozak sequence composition, upstream open reading frames (uORFs), and secondary structure. We trained a convolutional neural network (CNN) on the random library and showed that it performs well at predicting the protein expression of both a held-out set of the random 5' UTRs as well as native S. cerevisiae 5' UTRs. The model additionally was used to computationally evolve highly active 5' UTRs. We confirmed experimentally that the great majority of the evolved sequences led to higher protein expression rates than the starting sequences, demonstrating the predictive power of this model.