Showing papers in "Genome Research in 2014"


Journal Article
Sojung Kim, Daesik Kim, Seung Woo Cho, Jung-Eun Kim, Jin-Soo Kim
TL;DR: Purified recombinant Cas9 protein and guide RNA are delivered into cultured human cells, including hard-to-transfect fibroblasts and pluripotent stem cells; the resulting RGEN ribonucleoproteins (RNPs) induce site-specific mutations at frequencies of up to 79% while reducing the off-target mutations associated with plasmid transfection.
Abstract: RNA-guided engineered nucleases (RGENs) derived from the prokaryotic adaptive immune system known as CRISPR (clustered, regularly interspaced, short palindromic repeat)/Cas (CRISPR-associated) enable genome editing in human cell lines, animals, and plants, but are limited by off-target effects and unwanted integration of DNA segments derived from plasmids encoding Cas9 and guide RNA at both on-target and off-target sites in the genome. Here, we deliver purified recombinant Cas9 protein and guide RNA into cultured human cells including hard-to-transfect fibroblasts and pluripotent stem cells. RGEN ribonucleoproteins (RNPs) induce site-specific mutations at frequencies of up to 79%, while reducing off-target mutations associated with plasmid transfection at off-target sites that differ by one or two nucleotides from on-target sites. RGEN RNPs cleave chromosomal DNA almost immediately after delivery and are degraded rapidly in cells, reducing off-target effects. Furthermore, RNP delivery is less stressful to human embryonic stem cells, producing at least twofold more colonies than does plasmid transfection.

1,526 citations


Journal Article
TL;DR: Off-target effects of RGENs can be reduced below the detection limits of deep sequencing by choosing unique target sequences in the genome and modifying both guide RNA and Cas9, and paired nickases induced chromosomal deletions in a targeted manner without causing unwanted translocations.
Abstract: RNA-guided endonucleases (RGENs), derived from the prokaryotic adaptive immune system known as CRISPR/Cas, enable targeted genome engineering in cells and organisms. RGENs are ribonucleoproteins that consist of guide RNA and Cas9, a protein component originated from Streptococcus pyogenes. These enzymes cleave chromosomal DNA, whose sequence is complementary to guide RNA, in a targeted manner, producing site-specific DNA double-strand breaks (DSBs), the repair of which gives rise to targeted genome modifications. Despite broad interest in RGEN-mediated genome editing, these nucleases are limited by off-target mutations and unwanted chromosomal translocations associated with off-target DNA cleavages. Here, we show that off-target effects of RGENs can be reduced below the detection limits of deep sequencing by choosing unique target sequences in the genome and modifying both guide RNA and Cas9. We found that both the composition and structure of guide RNA can affect RGEN activities in cells to reduce off-target effects. RGENs efficiently discriminated on-target sites from off-target sites that differ by two bases. Furthermore, exome sequencing analysis showed that no off-target mutations were induced by two RGENs in four clonal populations of mutant cells. In addition, paired Cas9 nickases, composed of D10A Cas9 and guide RNA, which generate two single-strand breaks (SSBs) or nicks on different DNA strands, were highly specific in human cells, avoiding off-target mutations without sacrificing genome-editing efficiency. Interestingly, paired nickases induced chromosomal deletions in a targeted manner without causing unwanted translocations. Our results highlight the importance of choosing unique target sequences and optimizing guide RNA and Cas9 to avoid or reduce RGEN-induced off-target mutations.

1,332 citations


Journal Article
TL;DR: Platanus provides a novel and efficient approach for the assembly of gigabase-sized highly heterozygous genomes and is an attractive alternative to the existing assemblers designed for genomes of lower heterozygosity.
Abstract: Although many de novo genome assembly projects have recently been conducted using high-throughput sequencers, assembling highly heterozygous diploid genomes is a substantial challenge due to the increased complexity of the de Bruijn graph structure predominantly used. To address the increasing demand for sequencing of nonmodel and/or wild-type samples, in most cases inbred lines or fosmid-based hierarchical sequencing methods are used to overcome such problems. However, these methods are costly and time consuming, forfeiting the advantages of massive parallel sequencing. Here, we describe a novel de novo assembler, Platanus, that can effectively manage high-throughput data from heterozygous samples. Platanus assembles DNA fragments (reads) into contigs by constructing de Bruijn graphs with automatically optimized k-mer sizes followed by the scaffolding of contigs based on paired-end information. The complicated graph structures that result from the heterozygosity are simplified during not only the contig assembly step but also the scaffolding step. We evaluated the assembly results on eukaryotic samples with various levels of heterozygosity. Compared with other assemblers, Platanus yields assembly results that have a larger scaffold NG50 length without any accompanying loss of accuracy in both simulated and real data. In addition, Platanus recorded the largest scaffold NG50 values for two of the three low-heterozygosity species used in the de novo assembly contest, Assemblathon 2. Platanus therefore provides a novel and efficient approach for the assembly of gigabase-sized highly heterozygous genomes and is an attractive alternative to the existing assemblers designed for genomes of lower heterozygosity.

924 citations


Journal Article
TL;DR: This work presents simple and robust procedures for Tn5 transposase production and optimized reaction conditions for tagmentation-based sequencing library construction and shows how molecular crowding agents both modulate library lengths and enable efficient tagmentation from subpicogram amounts of cDNA.
Abstract: Massively parallel DNA sequencing of thousands of samples in a single machine-run is now possible, but the preparation of the individual sequencing libraries is expensive and time-consuming. Tagmentation-based library construction, using the Tn5 transposase, is efficient for generating sequencing libraries but currently relies on undisclosed reagents, which severely limits development of novel applications and the execution of large-scale projects. Here, we present simple and robust procedures for Tn5 transposase production and optimized reaction conditions for tagmentation-based sequencing library construction. We further show how molecular crowding agents both modulate library lengths and enable efficient tagmentation from subpicogram amounts of cDNA. The comparison of single-cell RNA-sequencing libraries generated using produced and commercial Tn5 demonstrated equal performances in terms of gene detection and library characteristics. Finally, because naked Tn5 can be annealed to any oligonucleotide of choice, for example, molecular barcodes in single-cell assays or methylated oligonucleotides for bisulfite sequencing, custom Tn5 production and tagmentation enable innovation in sequencing-based applications.

690 citations


Journal Article
TL;DR: This work shows that simple treatment with cell-penetrating peptide (CPP)-conjugated recombinant Cas9 protein and CPP-complexed guide RNAs leads to endogenous gene disruptions in human cell lines, and envisages that this method will facilitate RGEN-directed genome editing.
Abstract: RNA-guided endonucleases (RGENs) derived from the CRISPR/Cas system represent an efficient tool for genome editing. RGENs consist of two components: Cas9 protein and guide RNA. Plasmid-mediated delivery of these components into cells can result in uncontrolled integration of the plasmid sequence into the host genome, and unwanted immune responses and potential safety problems that can be caused by the bacterial sequences. Furthermore, this delivery method requires transfection tools. Here we show that simple treatment with cell-penetrating peptide (CPP)–conjugated recombinant Cas9 protein and CPP-complexed guide RNAs leads to endogenous gene disruptions in human cell lines. The Cas9 protein was conjugated to CPP via a thioether bond, whereas the guide RNA was complexed with CPP, forming condensed, positively charged nanoparticles. Simultaneous and sequential treatment of human cells, including embryonic stem cells, dermal fibroblasts, HEK293T cells, HeLa cells, and embryonic carcinoma cells, with the modified Cas9 and guide RNA, leads to efficient gene disruptions with reduced off-target mutations relative to plasmid transfections, resulting in the generation of clones containing RGEN-induced mutations. Our CPP-mediated RGEN delivery process provides a plasmid-free and additional transfection reagent–free method to use this tool with reduced off-target effects. We envision that our method will facilitate RGEN-directed genome editing.

654 citations


Journal Article
TL;DR: This work provides a direct window into the regulatory consequences of genetic variation by sequencing RNA from 922 genotyped individuals, and presents a comprehensive description of the distribution of regulatory variation--by the specific expression phenotypes altered, the properties of affected genes, and the genomic characteristics of regulatory variants.
Abstract: Understanding the consequences of regulatory variation in the human genome remains a major challenge, with important implications for understanding gene regulation and interpreting the many disease-risk variants that fall outside of protein-coding regions. Here, we provide a direct window into the regulatory consequences of genetic variation by sequencing RNA from 922 genotyped individuals. We present a comprehensive description of the distribution of regulatory variation--by the specific expression phenotypes altered, the properties of affected genes, and the genomic characteristics of regulatory variants. We detect variants influencing expression of over ten thousand genes, and through the enhanced resolution offered by RNA-sequencing, for the first time we identify thousands of variants associated with specific phenotypes including splicing and allelic expression. Evaluating the effects of both long-range intra-chromosomal and trans (cross-chromosomal) regulation, we observe modularity in the regulatory network, with three-dimensional chromosomal configuration playing a particular role in regulatory modules within each chromosome. We also observe a significant depletion of regulatory variants affecting central and critical genes, along with a trend of reduced effect sizes as variant frequency increases, providing evidence that purifying selection and buffering have limited the deleterious impact of regulatory variation on the cell. Further, generalizing beyond observed variants, we have analyzed the genomic properties of variants associated with expression and splicing and developed a Bayesian model to predict regulatory consequences of genetic variants, applicable to the interpretation of individual genomes and disease studies. Together, these results represent a critical step toward characterizing the complete landscape of human regulatory variation.

577 citations


Journal Article
TL;DR: An integrated genotyping strategy identified 4,853,802 single nucleotide polymorphisms (SNPs) and 1,296,080 non-SNP variants in the DGRP; cytogenetic analysis identified 16 polymorphic inversions, and variation in genome size and many quantitative traits is significantly associated with inversions.
Abstract: The Drosophila melanogaster Genetic Reference Panel (DGRP) is a community resource of 205 sequenced inbred lines, derived to improve our understanding of the effects of naturally occurring genetic variation on molecular and organismal phenotypes. We used an integrated genotyping strategy to identify 4,853,802 single nucleotide polymorphisms (SNPs) and 1,296,080 non-SNP variants. Our molecular population genomic analyses show higher deletion than insertion mutation rates and stronger purifying selection on deletions. Weaker selection on insertions than deletions is consistent with our observed distribution of genome size determined by flow cytometry, which is skewed toward larger genomes. Insertion/deletion and single nucleotide polymorphisms are positively correlated with each other and with local recombination, suggesting that their nonrandom distributions are due to hitchhiking and background selection. Our cytogenetic analysis identified 16 polymorphic inversions in the DGRP. Common inverted and standard karyotypes are genetically divergent and account for most of the variation in relatedness among the DGRP lines. Intriguingly, variation in genome size and many quantitative traits are significantly associated with inversions. Approximately 50% of the DGRP lines are infected with Wolbachia, and four lines have germline insertions of Wolbachia sequences, but effects of Wolbachia infection on quantitative traits are rarely significant. The DGRP complements ongoing efforts to functionally annotate the Drosophila genome. Indeed, 15% of all D. melanogaster genes segregate for potentially damaged proteins in the DGRP, and genome-wide analyses of quantitative traits identify novel candidate genes. The DGRP lines, sequence data, genotypes, quality scores, phenotypes, and analysis and visualization tools are publicly available.

569 citations


Journal Article
TL;DR: CRISPR/Cas9-mediated knock-in of DNA cassettes into the zebrafish genome at a very high rate by homology-independent double-strand break (DSB) repair pathways is reported and the possibility of easily targeting DNA integration at endogenous loci is shown, thus greatly facilitating the creation of reporter and loss-of-function alleles.
Abstract: Sequence-specific nucleases like TALENs and the CRISPR/Cas9 system have greatly expanded the genome editing possibilities in model organisms such as zebrafish. Both systems have recently been used to create knock-out alleles with great efficiency, and TALENs have also been successfully employed in knock-in of DNA cassettes at defined loci via homologous recombination (HR). Here we report CRISPR/Cas9-mediated knock-in of DNA cassettes into the zebrafish genome at a very high rate by homology-independent double-strand break (DSB) repair pathways. After co-injection of a donor plasmid with a short guide RNA (sgRNA) and Cas9 nuclease mRNA, concurrent cleavage of donor plasmid DNA and the selected chromosomal integration site resulted in efficient targeted integration of donor DNA. We successfully employed this approach to convert eGFP into Gal4 transgenic lines, and the same plasmids and sgRNAs can be applied in any species where eGFP lines were generated as part of enhancer and gene trap screens. In addition, we show the possibility of easily targeting DNA integration at endogenous loci, thus greatly facilitating the creation of reporter and loss-of-function alleles. Due to its simplicity, flexibility, and very high efficiency, our method greatly expands the repertoire for genome editing in zebrafish and can be readily adapted to many other organisms.

565 citations


Journal Article
TL;DR: For 515 patients from six tumor sites, RNA-seq data from The Cancer Genome Atlas was used to identify mutations predicted to be immunogenic in that they yielded mutational epitopes presented by the MHC proteins encoded by each patient's autologous HLA-A alleles; these mutational epitopes were associated with increased patient survival.
Abstract: Somatic missense mutations can initiate tumorigenesis and, conversely, anti-tumor cytotoxic T cell (CTL) responses. Tumor genome analysis has revealed extreme heterogeneity among tumor missense mutation profiles, but their relevance to tumor immunology and patient outcomes has awaited comprehensive evaluation. Here, for 515 patients from six tumor sites, we used RNA-seq data from The Cancer Genome Atlas to identify mutations that are predicted to be immunogenic in that they yielded mutational epitopes presented by the MHC proteins encoded by each patient’s autologous HLA-A alleles. Mutational epitopes were associated with increased patient survival. Moreover, the corresponding tumors had higher CTL content, inferred from CD8A gene expression, and elevated expression of the CTL exhaustion markers PDCD1 and CTLA4. Mutational epitopes were very scarce in tumors without evidence of CTL infiltration. These findings suggest that the abundance of predicted immunogenic mutations may be useful for identifying patients likely to benefit from checkpoint blockade and related immunotherapies.

547 citations


Journal Article
TL;DR: It is shown that intron retention acts widely to reduce the levels of transcripts that are less or not required for the physiology of the cell or tissue type in which they are detected, and this "transcriptome tuning" function of IR acts through both nonsense-mediated mRNA decay and nuclear sequestration and turnover of IR transcripts.
Abstract: Alternative splicing (AS) of precursor RNAs is responsible for greatly expanding the regulatory and functional capacity of eukaryotic genomes. Of the different classes of AS, intron retention (IR) is the least well understood. In plants and unicellular eukaryotes, IR is the most common form of AS, whereas in animals, it is thought to represent the least prevalent form. Using high-coverage poly(A)+ RNA-seq data, we observe that IR is surprisingly frequent in mammals, affecting transcripts from as many as three-quarters of multiexonic genes. A highly correlated set of cis features comprising an “IR code” reliably discriminates retained from constitutively spliced introns. We show that IR acts widely to reduce the levels of transcripts that are less or not required for the physiology of the cell or tissue type in which they are detected. This “transcriptome tuning” function of IR acts through both nonsense-mediated mRNA decay and nuclear sequestration and turnover of IR transcripts. We further show that IR is linked to a cross-talk mechanism involving localized stalling of RNA polymerase II (Pol II) and reduced availability of spliceosomal components. Collectively, the results implicate a global checkpoint-type mechanism whereby reduced recruitment of splicing components coupled to Pol II pausing underlies widespread IR-mediated suppression of inappropriately expressed transcripts.

547 citations


Journal Article
TL;DR: The SMART-seq single-cell RNA-seq protocol is applied to study the reference lymphoblastoid cell line GM12878, and it is shown that transcriptomes from small pools of 30–100 cells approach the information content and reproducibility of contemporary RNA-seq from large amounts of input material.
Abstract: Single-cell RNA-seq mammalian transcriptome studies are at an early stage in uncovering cell-to-cell variation in gene expression, transcript processing and editing, and regulatory module activity. Despite great progress recently, substantial challenges remain, including discriminating biological variation from technical noise. Here we apply the SMART-seq single-cell RNA-seq protocol to study the reference lymphoblastoid cell line GM12878. By using spike-in quantification standards, we estimate the absolute number of RNA molecules per cell for each gene and find significant variation in total mRNA content: between 50,000 and 300,000 transcripts per cell. We directly measure technical stochasticity by a pool/split design and find that there are significant differences in expression between individual cells, over and above technical variation. Specific gene coexpression modules were preferentially expressed in subsets of individual cells, including one enriched for mRNA processing and splicing factors. We assess cell-to-cell variation in alternative splicing and allelic bias and report evidence of significant differences in splice site usage that exceed splice variation in the pool/split comparison. Finally, we show that transcriptomes from small pools of 30–100 cells approach the information content and reproducibility of contemporary RNA-seq from large amounts of input material. Together, our results define an experimental and computational path forward for analyzing gene expression in rare cell types and cell states.

Journal Article
TL;DR: It is found that virtually all adenosines within Alu repeats that form double-stranded RNA undergo A-to-I editing, although most sites exhibit editing at only low levels; editing of transcripts resulting from residual antisense expression doubles the number of edited sites in the human genome.
Abstract: RNA molecules transmit the information encoded in the genome and generally reflect its content. Adenosine-to-inosine (A-to-I) RNA editing by ADAR proteins converts a genomically encoded adenosine into inosine. It is known that most RNA editing in human takes place in the primate-specific Alu sequences, but the extent of this phenomenon and its effect on transcriptome diversity are not yet clear. Here, we analyzed large-scale RNA-seq data and detected ∼1.6 million editing sites. As detection sensitivity increases with sequencing coverage, we performed ultradeep sequencing of selected Alu sequences and showed that the scope of editing is much larger than anticipated. We found that virtually all adenosines within Alu repeats that form double-stranded RNA undergo A-to-I editing, although most sites exhibit editing at only low levels (<1%). Moreover, using high coverage sequencing, we observed editing of transcripts resulting from residual antisense expression, doubling the number of edited sites in the human genome. Based on bioinformatic analyses and deep targeted sequencing, we estimate that there are over 100 million human Alu RNA editing sites, located in the majority of human genes. These findings set the stage for exploring how this primate-specific massive diversification of the transcriptome is utilized.

Journal Article
TL;DR: This work describes Fit-Hi-C, a method that assigns statistical confidence estimates to mid-range intra-chromosomal contacts by jointly modeling the random polymer looping effect and previously observed technical biases in Hi-C data sets, and shows that insulators and heterochromatin regions are hubs for high-confidence contacts, while promoters and strong enhancers are involved in fewer contacts.
Abstract: Our current understanding of how DNA is packed in the nucleus is most accurate at the fine scale of individual nucleosomes and at the large scale of chromosome territories. However, accurate modeling of DNA architecture at the intermediate scale of ∼50 kb-10 Mb is crucial for identifying functional interactions among regulatory elements and their target promoters. We describe a method, Fit-Hi-C, that assigns statistical confidence estimates to mid-range intra-chromosomal contacts by jointly modeling the random polymer looping effect and previously observed technical biases in Hi-C data sets. We demonstrate that our proposed approach computes accurate empirical null models of contact probability without any distribution assumption, corrects for binning artifacts, and provides improved statistical power relative to a previously described method. High-confidence contacts identified by Fit-Hi-C preferentially link expressed gene promoters to active enhancers identified by chromatin signatures in human embryonic stem cells (ESCs), capture 77% of RNA polymerase II-mediated enhancer-promoter interactions identified using ChIA-PET in mouse ESCs, and confirm previously validated, cell line-specific interactions in mouse cortex cells. We observe that insulators and heterochromatin regions are hubs for high-confidence contacts, while promoters and strong enhancers are involved in fewer contacts. We also observe that binding peaks of master pluripotency factors such as NANOG and POU5F1 are highly enriched in high-confidence contacts for human ESCs. Furthermore, we show that pairs of loci linked by high-confidence contacts exhibit similar replication timing in human and mouse ESCs and preferentially lie within the boundaries of topological domains for human and mouse cell lines.

Journal Article
TL;DR: This work describes SURPI, a computational pipeline for pathogen identification from complex metagenomic NGS data generated from clinical samples, and demonstrates use of the pipeline in the analysis of 237 clinical samples comprising more than 1.1 billion sequences.
Abstract: Unbiased next-generation sequencing (NGS) approaches enable comprehensive pathogen detection in the clinical microbiology laboratory and have numerous applications for public health surveillance, outbreak investigation, and the diagnosis of infectious diseases. However, practical deployment of the technology is hindered by the bioinformatics challenge of analyzing results accurately and in a clinically relevant timeframe. Here we describe SURPI ("sequence-based ultrarapid pathogen identification"), a computational pipeline for pathogen identification from complex metagenomic NGS data generated from clinical samples, and demonstrate use of the pipeline in the analysis of 237 clinical samples comprising more than 1.1 billion sequences. Deployable on both cloud-based and standalone servers, SURPI leverages two state-of-the-art aligners for accelerated analyses, SNAP and RAPSearch, which are as accurate as existing bioinformatics tools but orders of magnitude faster in performance. In fast mode, SURPI detects viruses and bacteria by scanning data sets of 7-500 million reads in 11 min to 5 h, while in comprehensive mode, all known microorganisms are identified, followed by de novo assembly and protein homology searches for divergent viruses in 50 min to 16 h. SURPI has also directly contributed to real-time microbial diagnosis in acutely ill patients, underscoring its potential key role in the development of unbiased NGS-based clinical assays in infectious diseases that demand rapid turnaround times.

Journal Article
TL;DR: A large operational analysis charting the distribution of gene regulatory activities along the mouse genome, using hundreds of insertions of a regulatory sensor, finds that enhancers distribute their activities along broad regions and not in a gene-centric manner, defining large regulatory domains.
Abstract: Long-range regulatory interactions play an important role in shaping gene-expression programs. However, the genomic features that organize these activities are still poorly characterized. We conducted a large operational analysis to chart the distribution of gene regulatory activities along the mouse genome, using hundreds of insertions of a regulatory sensor. We found that enhancers distribute their activities along broad regions and not in a gene-centric manner, defining large regulatory domains. Remarkably, these domains correlate strongly with the recently described TADs, which partition the genome into distinct self-interacting blocks. Different features, including specific repeats and CTCF-binding sites, correlate with the transition zones separating regulatory domains, and may help to further organize promiscuously distributed regulatory influences within large domains. These findings support a model of genomic organization where TADs confine regulatory activities to specific but large regulatory domains, contributing to the establishment of specific gene expression profiles.

Journal Article
TL;DR: Transposable elements have significantly and continuously shaped gene regulatory networks during mammalian evolution, and are an important driving force for regulatory innovation.
Abstract: Transposable elements (TEs) have been shown to contain functional binding sites for certain transcription factors (TFs). However, the extent to which TEs contribute to the evolution of TF binding sites is not well known. We comprehensively mapped binding sites for 26 pairs of orthologous TFs in two pairs of human and mouse cell lines (representing two cell lineages), along with epigenomic profiles, including DNA methylation and six histone modifications. Overall, we found that 20% of binding sites were embedded within TEs. This number varied across different TFs, ranging from 2% to 40%. We further identified 710 TF–TE relationships in which genomic copies of a TE subfamily contributed a significant number of binding peaks for a TF, and we found that LTR elements dominated these relationships in human. Importantly, TE-derived binding peaks were strongly associated with open and active chromatin signatures, including reduced DNA methylation and increased enhancer-associated histone marks. On average, 66% of TE-derived binding events were cell type-specific with a cell type-specific epigenetic landscape. Most of the binding sites contributed by TEs were species-specific, but we also identified binding sites conserved between human and mouse, the functional relevance of which was supported by a signature of purifying selection on DNA sequences of these TEs. Interestingly, several TFs had significantly expanded binding site landscapes only in one species, which were linked to species-specific gene functions, suggesting that TEs are an important driving force for regulatory innovation. Taken together, our data suggest that TEs have significantly and continuously shaped gene regulatory networks during mammalian evolution.

Journal Article
TL;DR: This study provides an effective approach to correct HBB mutations without leaving any genetic footprint in patient-derived iPSCs, thereby demonstrating a critical step toward the future application of stem cell-based gene therapy to monogenic diseases.
Abstract: β-thalassemia, one of the most common genetic diseases worldwide, is caused by mutations in the human hemoglobin beta (HBB) gene. Creation of human induced pluripotent stem cells (iPSCs) from β-thalassemia patients could offer an approach to cure this disease. Correction of the disease-causing mutations in iPSCs could restore normal function and provide a rich source of cells for transplantation. In this study, we used the latest gene-editing tool, CRISPR/Cas9 technology, combined with the piggyBac transposon to efficiently correct the HBB mutations in patient-derived iPSCs without leaving any residual footprint. No off-target effects were detected in the corrected iPSCs, and the cells retain full pluripotency and exhibit normal karyotypes. When differentiated into erythroblasts using a monolayer culture, gene-corrected iPSCs restored expression of HBB compared to the parental iPSCs line. Our study provides an effective approach to correct HBB mutations without leaving any genetic footprint in patient-derived iPSCs, thereby demonstrating a critical step toward the future application of stem cell-based gene therapy to monogenic diseases.

Journal Article
TL;DR: This work presents a model of "looping" by which HPV integrant-mediated DNA replication and recombination may result in viral-host DNA concatemers, frequently disrupting genes involved in oncogenesis and amplifying HPV oncogenes E6 and E7.
Abstract: Genomic instability is a hallmark of human cancers, including the 5% caused by human papillomavirus (HPV). Here we report a striking association between HPV integration and adjacent host genomic structural variation in human cancer cell lines and primary tumors. Whole-genome sequencing revealed HPV integrants flanking and bridging extensive host genomic amplifications and rearrangements, including deletions, inversions, and chromosomal translocations. We present a model of "looping" by which HPV integrant-mediated DNA replication and recombination may result in viral-host DNA concatemers, frequently disrupting genes involved in oncogenesis and amplifying HPV oncogenes E6 and E7. Our high-resolution results shed new light on a catastrophic process, distinct from chromothripsis and other mutational processes, by which HPV directly promotes genomic instability.

Journal ArticleDOI
TL;DR: It is argued that considering the evolutionary potential of polyploids in light of the environmental and ecological conditions present around the time of polyploidization could mitigate the stark contrast in the proposed evolutionary fates of polyploids.
Abstract: Ancient whole-genome duplications (WGDs), also referred to as paleopolyploidizations, have been reported in most evolutionary lineages. Their attributed role remains a major topic of discussion, ranging from an evolutionary dead end to a road toward evolutionary success, with evidence supporting both fates. Previously, based on dating WGDs in a limited number of plant species, we found a clustering of angiosperm paleopolyploidizations around the Cretaceous-Paleogene (K-Pg) extinction event about 66 million years ago. Here we revisit this finding, which has proven controversial, by combining genome sequence information for many more plant lineages and using more sophisticated analyses. We include 38 full genome sequences and three transcriptome assemblies in a Bayesian evolutionary analysis framework that incorporates uncorrelated relaxed clock methods and fossil uncertainty. In accordance with earlier findings, we demonstrate a strongly nonrandom pattern of genome duplications over time with many WGDs clustering around the K-Pg boundary. We interpret these results in the context of recent studies on invasive polyploid plant species, and suggest that polyploid establishment is promoted during times of environmental stress. We argue that considering the evolutionary potential of polyploids in light of the environmental and ecological conditions present around the time of polyploidization could mitigate the stark contrast in the proposed evolutionary fates of polyploids.

Journal ArticleDOI
TL;DR: A novel probabilistic model is presented, TITAN, to infer CNA and LOH events while accounting for mixtures of cell populations, thereby estimating the proportion of cells harboring each event.
Abstract: The evolution of cancer genomes within a single tumor creates mixed cell populations with divergent somatic mutational landscapes. Inference of tumor subpopulations has been disproportionately focused on the assessment of somatic point mutations, whereas computational methods targeting evolutionary dynamics of copy number alterations (CNA) and loss of heterozygosity (LOH) in whole-genome sequencing data remain underdeveloped. We present a novel probabilistic model, TITAN, to infer CNA and LOH events while accounting for mixtures of cell populations, thereby estimating the proportion of cells harboring each event. We evaluate TITAN on idealized mixtures, simulating clonal populations from whole-genome sequences taken from genomically heterogeneous ovarian tumor sites collected from the same patient. In addition, we show in 23 whole genomes of breast tumors that the inference of CNA and LOH using TITAN critically informs population structure and the nature of the evolving cancer genome. Finally, we experimentally validated subclonal predictions using fluorescence in situ hybridization (FISH) and single-cell sequencing from an ovarian cancer patient sample, thereby recapitulating the key modeling assumptions of TITAN.

Journal ArticleDOI
TL;DR: Evidence is provided that for six common autoimmune disorders, the GWAS association arises from multiple polymorphisms in LD that map to clusters of enhancer elements active in the same cell type, which suggests a "multiple enhancer variant" hypothesis for common traits.
Abstract: DNA variants (SNPs) that predispose to common traits often localize within noncoding regulatory elements such as enhancers. Moreover, loci identified by genome-wide association studies (GWAS) often contain multiple SNPs in linkage disequilibrium (LD), any of which may be causal. Thus, determining the effect of these multiple variant SNPs on target transcript levels has been a major challenge. Here, we provide evidence that for six common autoimmune disorders (rheumatoid arthritis, Crohn's disease, celiac disease, multiple sclerosis, lupus, and ulcerative colitis), the GWAS association arises from multiple polymorphisms in LD that map to clusters of enhancer elements active in the same cell type. This finding suggests a "multiple enhancer variant" hypothesis for common traits, where several variants in LD impact multiple enhancers and cooperatively affect gene expression. Using a novel method to delineate enhancer-gene interactions, we show that multiple enhancer variants within a given locus typically target the same gene. Using available data from HapMap and B lymphoblasts as a model system, we provide evidence at numerous loci that multiple enhancer variants cooperatively contribute to altered expression of their gene targets. The effects on target transcript levels tend to be modest and can be either gain- or loss-of-function. Additionally, the genes associated with multiple enhancer variants encode proteins that are often functionally related and enriched in common pathways. Overall, the multiple enhancer variant hypothesis offers a new paradigm by which noncoding variants can confer susceptibility to common traits.

Journal ArticleDOI
TL;DR: This work improves on previous methods by first implementing a combined correction for sequence mappability and GC content, and second, by applying this procedure to sequence data from the 1000 Genomes Project in order to develop a blacklist of problematic genome regions.
Abstract: Detection of DNA copy number aberrations by shallow whole-genome sequencing (WGS) faces many challenges, including lack of completion and errors in the human reference genome, repetitive sequences, polymorphisms, variable sample quality, and biases in the sequencing procedures. Formalin-fixed paraffin-embedded (FFPE) archival material, the analysis of which is important for studies of cancer, presents particular analytical difficulties due to degradation of the DNA and frequent lack of matched reference samples. We present a robust, cost-effective WGS method for DNA copy number analysis that addresses these challenges more successfully than currently available procedures. In practice, very useful profiles can be obtained with ∼0.1× genome coverage. We improve on previous methods by first implementing a combined correction for sequence mappability and GC content, and second, by applying this procedure to sequence data from the 1000 Genomes Project in order to develop a blacklist of problematic genome regions. A small subset of these blacklisted regions was previously identified by ENCODE, but the vast majority are novel unappreciated problematic regions. Our procedures are implemented in a pipeline called QDNAseq. We have analyzed over 1000 samples, most of which were obtained from the fixed tissue archives of more than 25 institutions. We demonstrate that for most samples our sequencing and analysis procedures yield genome profiles with noise levels near the statistical limit imposed by read counting. The described procedures also provide better correction of artifacts introduced by low DNA quality than prior approaches and better copy number data than high-resolution microarrays at a substantially lower cost.

Journal ArticleDOI
TL;DR: It is found that ∼20% of human lincRNAs are not expressed beyond chimpanzee and are undetectable even in rhesus, while abundant splice-site turnover suggests that exact splice sites are not critical.
Abstract: Long intergenic noncoding RNAs (lincRNAs) play diverse regulatory roles in human development and disease, but little is known about their evolutionary history and constraint. Here, we characterize human lincRNA expression patterns in nine tissues across six mammalian species and multiple individuals. Of the 1898 human lincRNAs expressed in these tissues, we find orthologous transcripts for 80% in chimpanzee, 63% in rhesus, 39% in cow, 38% in mouse, and 35% in rat. Mammalian-expressed lincRNAs show remarkably strong conservation of tissue specificity, suggesting that it is selectively maintained. In contrast, abundant splice-site turnover suggests that exact splice sites are not critical. Relative to evolutionarily young lincRNAs, mammalian-expressed lincRNAs show higher primary sequence conservation in their promoters and exons, increased proximity to protein-coding genes enriched for tissue-specific functions, fewer repeat elements, and more frequent single-exon transcripts. Remarkably, we find that ∼20% of human lincRNAs are not expressed beyond chimpanzee and are undetectable even in rhesus. These hominid-specific lincRNAs are more tissue specific, enriched for testis, and faster evolving within the human lineage.

Journal ArticleDOI
TL;DR: It is proposed that exome sequencing projects should systematically capture clinical phenotypes to take advantage of the strategy presented here and conclude that incorporation of phenotype data can play a vital role in translational bioinformatics.
Abstract: Numerous new disease-gene associations have been identified by whole-exome sequencing studies in the last few years. However, many cases remain unsolved due to the sheer number of candidate variants remaining after common filtering strategies such as removing low quality and common variants and those deemed unlikely to be pathogenic. The observation that each of our genomes contains about 100 genuine loss-of-function variants makes identification of the causative mutation problematic when using these strategies alone. We propose using the wealth of genotype to phenotype data that already exists from model organism studies to assess the potential impact of these exome variants. Here, we introduce PHenotypic Interpretation of Variants in Exomes (PHIVE), an algorithm that integrates the calculation of phenotype similarity between human diseases and genetically modified mouse models with evaluation of the variants according to allele frequency, pathogenicity, and mode of inheritance approaches in our Exomiser tool. Large-scale validation of PHIVE analysis using 100,000 exomes containing known mutations demonstrated a substantial improvement (up to 54.1-fold) over purely variant-based (frequency and pathogenicity) methods with the correct gene recalled as the top hit in up to 83% of samples, corresponding to an area under the ROC curve of >95%. We conclude that incorporation of phenotype data can play a vital role in translational bioinformatics and propose that exome sequencing projects should systematically capture clinical phenotypes to take advantage of the strategy presented here.

Journal ArticleDOI
TL;DR: It is concluded that epigenetic regulation plays multiple crucial roles in sexual reversal of tongue sole fish, and the first clues on the mechanisms behind gene dosage balancing in an organism that undergoes sexual reversal are offered.
Abstract: Environmental sex determination (ESD) occurs in divergent, phylogenetically unrelated taxa, and in some species, co-occurs with genetic sex determination (GSD) mechanisms. Although epigenetic regulation in response to environmental effects has long been proposed to be associated with ESD, a systemic analysis on epigenetic regulation of ESD is still lacking. Using half-smooth tongue sole (Cynoglossus semilaevis) as a model (a marine fish that has both ZW chromosomal GSD and temperature-dependent ESD), we investigated the role of DNA methylation in transition from GSD to ESD. Comparative analysis of the gonadal DNA methylomes of pseudomale, female, and normal male fish revealed that genes in the sex determination pathways are the major targets of substantial methylation modification during sexual reversal. Methylation modification in pseudomales is globally inherited in their ZW offspring, which can naturally develop into pseudomales without temperature incubation. Transcriptome analysis revealed that dosage compensation occurs in a restricted, methylated cytosine enriched Z chromosomal region in pseudomale testes, achieving equal expression level in normal male testes. In contrast, female-specific W chromosomal genes are suppressed in pseudomales by methylation regulation. We conclude that epigenetic regulation plays multiple crucial roles in sexual reversal of tongue sole fish. We also offer the first clues on the mechanisms behind gene dosage balancing in an organism that undergoes sexual reversal. Finally, we suggest a causal link between the biased sex chromosome assortment in the offspring of a pseudomale family and the transgenerational epigenetic inheritance of sexual reversal in tongue sole fish.

Journal ArticleDOI
TL;DR: This study profiles the shifting RNA landscape of gliomas during progression and reveals PTPRZ1-MET (ZM) as a novel, recurrent fusion transcript in sGBMs; genomic analysis showed that the fusion arose from translocation events involving introns 3 or 8 of PTPRZ and intron 1 of MET.
Abstract: Studies of gene rearrangements and the consequent oncogenic fusion proteins have laid the foundation for targeted cancer therapy. To identify oncogenic fusions associated with glioma progression, we catalogued fusion transcripts by RNA-seq of 272 gliomas. Fusion transcripts were more frequently found in high-grade gliomas, in the classical subtype of gliomas, and in gliomas treated with radiation/temozolomide. Sixty-seven in-frame fusion transcripts were identified, including three recurrent fusion transcripts: FGFR3-TACC3, RNF213-SLC26A11, and PTPRZ1-MET (ZM). Interestingly, the ZM fusion was found only in grade III astrocytomas (1/13; 7.7%) or secondary GBMs (sGBMs, 3/20; 15.0%). In an independent cohort of sGBMs, the ZM fusion was found in three of 20 (15%) specimens. Genomic analysis revealed that the fusion arose from translocation events involving introns 3 or 8 of PTPRZ and intron 1 of MET. ZM fusion transcripts were found in GBMs irrespective of isocitrate dehydrogenase 1 (IDH1) mutation status. sGBMs harboring ZM fusion showed higher expression of genes required for PIK3CA signaling and lowered expression of genes that suppressed RB1 or TP53 function. Expression of the ZM fusion was mutually exclusive with EGFR overexpression in sGBMs. Exogenous expression of the ZM fusion in the U87MG glioblastoma line enhanced cell migration and invasion. Clinically, patients afflicted with ZM fusion-harboring glioblastomas survived poorly relative to those afflicted with non-ZM-harboring sGBMs (P < 0.001). Our study profiles the shifting RNA landscape of gliomas during progression and reveals ZM as a novel, recurrent fusion transcript in sGBMs.

Journal ArticleDOI
TL;DR: This work systematically identified long noncoding natural antisense transcripts (lncNATs), defined as lncRNAs transcribed from the opposite DNA strand of coding or noncoding genes in Arabidopsis.
Abstract: Recent research on long noncoding RNAs (lncRNAs) has expanded our understanding of gene transcription regulation and the generation of cellular complexity. Depending on their genomic origins, lncRNAs can be transcribed from intergenic or intragenic regions or from introns of protein-coding genes. We have recently reported more than 6000 intergenic lncRNAs in Arabidopsis. Here, we systematically identified long noncoding natural antisense transcripts (lncNATs), defined as lncRNAs transcribed from the opposite DNA strand of coding or noncoding genes. We found a total of 37,238 sense-antisense transcript pairs and 70% of annotated mRNAs to be associated with antisense transcripts in Arabidopsis. These lncNATs could be reproducibly detected by different technical platforms, including strand-specific tiling arrays, Agilent custom expression arrays, strand-specific RNA-seq, and qRT-PCR experiments. Moreover, we investigated the expression profiles of sense-antisense pairs in response to light and observed spatial and developmental-specific light effects on 626 concordant and 766 discordant NAT pairs. Genes for a large number of the light-responsive NAT pairs are associated with histone modification peaks, and histone acetylation is dynamically correlated with light-responsive expression changes of NATs.

Journal ArticleDOI
TL;DR: This study surveyed the genotypes and DNA methylomes of 237 neonates and found 1423 punctuate regions of the methylome that were highly variable across individuals, termed variably methylated regions (VMRs), against a backdrop of homogeneity.
Abstract: Integrating the genotype with epigenetic marks holds the promise of better understanding the biology that underlies the complex interactions of inherited and environmental components that define the developmental origins of a range of disorders. The quality of the in utero environment significantly influences health over the lifecourse. Epigenetics, and in particular DNA methylation marks, have been postulated as a mechanism for the enduring effects of the prenatal environment. Accordingly, neonate methylomes contain molecular memory of the individual in utero experience. However, interindividual variation in methylation can also be a consequence of DNA sequence polymorphisms that result in methylation quantitative trait loci (methQTLs) and, potentially, the interaction between fixed genetic variation and environmental influences. We surveyed the genotypes and DNA methylomes of 237 neonates and found 1423 punctuate regions of the methylome that were highly variable across individuals, termed variably methylated regions (VMRs), against a backdrop of homogeneity. MethQTLs were readily detected in neonatal methylomes, and genotype alone best explained ∼25% of the VMRs. We found that the best explanation for 75% of VMRs was the interaction of genotype with different in utero environments, including maternal smoking, maternal depression, maternal BMI, infant birth weight, gestational age, and birth order. Our study sheds new light on the complex relationship between biological inheritance as represented by genotype and individual prenatal experience and suggests the importance of considering both fixed genetic variation and environmental factors in interpreting epigenetic variation.

Journal ArticleDOI
TL;DR: Placental-specific imprinting provides evidence for an inheritable epigenetic state that is independent of DNA methylation and the existence of a novel imprinting mechanism at these loci.
Abstract: Genomic imprinting is a form of epigenetic regulation that results in the expression of either the maternally or paternally inherited allele of a subset of genes (Ramowitz and Bartolomei 2011). This imprinted expression of transcripts is crucial for normal mammalian development. In humans, loss-of-imprinting of specific loci results in a number of diseases exemplified by the reciprocal growth phenotypes of the Beckwith-Wiedemann and Silver-Russell syndromes, and the behavioral disorders Angelman and Prader-Willi syndromes (Kagami et al. 2008; Buiting 2010; Choufani et al. 2010; Eggermann 2010; Kelsey 2010; Mackay and Temple 2010). In addition, aberrant imprinting also contributes to multigenic disorders associated with various complex traits and cancer (Kong et al. 2009; Monk 2010). Imprinted loci contain differentially methylated regions (DMRs) where cytosine methylation marks one of the parental alleles, providing cis-acting regulatory elements that influence the allelic expression of surrounding genes. Some DMRs acquire their allelic methylation during gametogenesis, when the two parental genomes are separated, resulting from the cooperation of the de novo methyltransferase DNMT3A and its cofactor DNMT3L (Bourc'his et al. 2001; Hata et al. 2002). These primary, or germline imprinted DMRs are stably maintained throughout somatic development, surviving the epigenetic reprogramming at the oocyte-to-embryo transition (Smallwood et al. 2011; Smith et al. 2012). To confirm that an imprinted DMR functions as an imprinting control region (ICR), disruption of the imprinted expression upon genetic deletion of that DMR, either through experimental targeting in mouse or that which occurs spontaneously in humans, is required. A subset of DMRs, known as secondary DMRs, acquire methylation during development and are regulated by nearby germline DMRs in a hierarchical fashion (Coombes et al. 2003; Lopes et al. 2003; Kagami et al. 2010). 
With the advent of large-scale, base-resolution methylation technologies, it is now possible to discriminate allelic methylation dictated by sequence variants from imprinted methylation. Yet our knowledge of the total number of imprinted DMRs in humans, and their developmental dynamics, remains incomplete, hampered by genetic heterogeneity of human samples. Here we present high-resolution mapping of human imprinted methylation. We performed whole-genome-wide bisulfite sequencing (WGBS) on leukocyte-, brain-, liver-, and placenta-derived DNA samples to identify partially methylated regions common to all tissues consistent with imprinted DMRs. We subsequently confirmed the partial methylated states in tissues using high-density methylation microarrays. The parental origin of methylation was determined by comparing microarray data for DNA samples from reciprocal genome-wide uniparental disomy (UPD) samples, in which all chromosomes are inherited from one parent (Lapunzina and Monk 2011), and androgenetic hydatidiform moles, which are created by the fertilization of an oocyte lacking a nucleus by a sperm that endoreduplicates. The use of uniparental disomies and hydatidiform moles meant that our analyses were not subjected to genotype influences, enabling us to characterize all known imprinted DMRs at base-pair resolution and to identify 21 imprinted domains, which we show are absent in mice. Lastly, we extended our analyses to determine the methylation profiles of all imprinted DMRs in sperm, stem cells derived from parthenogenetically activated metaphase-2 oocyte blastocytes (phES) (Mai et al. 2007; Harness et al. 2011), and stem cells (hES) generated from both six-cell blastomeres and the inner cell mass of blastocysts, delineating the extent of embryonic reprogramming that occurs at these loci during human development.

Journal ArticleDOI
TL;DR: MultiBLUP is proposed, which extends the BLUP model to include multiple random effects, allowing greatly improved prediction when the random effects correspond to classes of SNPs with distinct effect-size variances, and is computationally very efficient.
Abstract: BLUP (best linear unbiased prediction) is widely used to predict complex traits in plant and animal breeding, and increasingly in human genetics. The BLUP mathematical model, which consists of a single random effect term, was adequate when kinships were measured from pedigrees. However, when genome-wide SNPs are used to measure kinships, the BLUP model implicitly assumes that all SNPs have the same effect-size distribution, which is a severe and unnecessary limitation. We propose MultiBLUP, which extends the BLUP model to include multiple random effects, allowing greatly improved prediction when the random effects correspond to classes of SNPs with distinct effect-size variances. The SNP classes can be specified in advance, for example, based on SNP functional annotations, and we also provide an adaptive procedure for determining a suitable partition of SNPs. We apply MultiBLUP to genome-wide association data from the Wellcome Trust Case Control Consortium (seven diseases), and from much larger studies of celiac disease and inflammatory bowel disease, finding that it consistently provides better prediction than alternative methods. Moreover, MultiBLUP is computationally very efficient; for the largest data set, which includes 12,678 individuals and 1.5 M SNPs, the total analysis can be run on a single desktop PC in less than a day and can be parallelized to run even faster. Tools to perform MultiBLUP are freely available in our software LDAK.