Showing papers on "Human genome" published in 2020


Journal ArticleDOI
06 Feb 2020-Nature
TL;DR: Whole-genome sequencing data for 2,778 cancer samples from 2,658 unique donors is used to reconstruct the evolutionary history of cancer, revealing that driver mutations can precede diagnosis by several years to decades.
Abstract: Cancer develops through a process of somatic evolution1,2. Sequencing data from a single biopsy represent a snapshot of this process that can reveal the timing of specific genomic aberrations and the changing influence of mutational processes3. Here, by whole-genome sequencing analysis of 2,658 cancers as part of the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA)4, we reconstruct the life history and evolution of mutational processes and driver mutation sequences of 38 types of cancer. Early oncogenesis is characterized by mutations in a constrained set of driver genes, and specific copy number gains, such as trisomy 7 in glioblastoma and isochromosome 17q in medulloblastoma. The mutational spectrum changes significantly throughout tumour evolution in 40% of samples. A nearly fourfold diversification of driver genes and increased genomic instability are features of later stages. Copy number alterations often occur in mitotic crises, and lead to simultaneous gains of chromosomal segments. Timing analyses suggest that driver mutations often precede diagnosis by many years, if not decades. Together, these results determine the evolutionary trajectories of cancer, and highlight opportunities for early cancer detection.

565 citations


Journal ArticleDOI
29 Jul 2020-Nature
TL;DR: The spectrum of RBP binding throughout the transcriptome and the connections between these interactions and various aspects of RNA biology, including RNA stability, splicing regulation and RNA localization are described.
Abstract: Many proteins regulate the expression of genes by binding to specific regions encoded in the genome1. Here we introduce a new data set of RNA elements in the human genome that are recognized by RNA-binding proteins (RBPs), generated as part of the Encyclopedia of DNA Elements (ENCODE) project phase III. This class of regulatory elements functions only when transcribed into RNA, as they serve as the binding sites for RBPs that control post-transcriptional processes such as splicing, cleavage and polyadenylation, and the editing, localization, stability and translation of mRNAs. We describe the mapping and characterization of RNA elements recognized by a large collection of human RBPs in K562 and HepG2 cells. Integrative analyses using five assays identify RBP binding sites on RNA and chromatin in vivo, the in vitro binding preferences of RBPs, the function of RBP binding sites and the subcellular localization of RBPs, producing 1,223 replicated data sets for 356 RBPs. We describe the spectrum of RBP binding throughout the transcriptome and the connections between these interactions and various aspects of RNA biology, including RNA stability, splicing regulation and RNA localization. These data expand the catalogue of functional elements encoded in the human genome by the addition of a large set of elements that function at the RNA level by interacting with RBPs.

542 citations


Journal ArticleDOI
03 Sep 2020-Nature
TL;DR: High-coverage, ultra-long-read nanopore sequencing is used to create a new human genome assembly that improves on the coverage and accuracy of the current reference (GRCh38) and includes the gap-free, telomere-to-telomere sequence of the X chromosome.
Abstract: After two decades of improvements, the current human reference genome (GRCh38) is the most accurate and complete vertebrate genome ever produced. However, no single chromosome has been finished end to end, and hundreds of unresolved gaps persist1,2. Here we present a human genome assembly that surpasses the continuity of GRCh38 (ref. 2), along with a gapless, telomere-to-telomere assembly of a human chromosome. This was enabled by high-coverage, ultra-long-read nanopore sequencing of the complete hydatidiform mole CHM13 genome, combined with complementary technologies for quality improvement and validation. Focusing our efforts on the human X chromosome3, we reconstructed the centromeric satellite DNA array (approximately 3.1 Mb) and closed the 29 remaining gaps in the current reference, including new sequences from the human pseudoautosomal regions and from cancer-testis ampliconic gene families (CT-X and GAGE). These sequences will be integrated into future human reference genome releases. In addition, the complete chromosome X, combined with the ultra-long nanopore data, allowed us to map methylation patterns across complex tandem repeats and satellite arrays. Our results demonstrate that finishing the entire human genome is now within reach, and the data presented here will facilitate ongoing efforts to complete the other human chromosomes.

502 citations


Journal ArticleDOI
05 Feb 2020-Nature
TL;DR: Whole-genome sequencing data from more than 2,500 cancers of 38 tumour types reveal 16 signatures that can be used to classify somatic structural variants, highlighting the diversity of genomic rearrangements in cancer.
Abstract: A key mutational process in cancer is structural variation, in which rearrangements delete, amplify or reorder genomic segments that range in size from kilobases to whole chromosomes1-7. Here we develop methods to group, classify and describe somatic structural variants, using data from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA), which aggregated whole-genome sequencing data from 2,658 cancers across 38 tumour types8. Sixteen signatures of structural variation emerged. Deletions have a multimodal size distribution, assort unevenly across tumour types and patients, are enriched in late-replicating regions and correlate with inversions. Tandem duplications also have a multimodal size distribution, but are enriched in early-replicating regions, as are unbalanced translocations. Replication-based mechanisms of rearrangement generate varied chromosomal structures with low-level copy-number gains and frequent inverted rearrangements. One prominent structure consists of 2-7 templates copied from distinct regions of the genome strung together within one locus. Such cycles of templated insertions correlate with tandem duplications and, in liver cancer, frequently activate the telomerase gene TERT. A wide variety of rearrangement processes are active in cancer, which generate complex configurations of the genome upon which selection can act.

479 citations


Journal ArticleDOI
TL;DR: The currently available platforms, how the technologies are being applied to assemble and phase human genomes, and their impact on improving the authors' understanding of human genetic variation are discussed.
Abstract: Over the past decade, long-read, single-molecule DNA sequencing technologies have emerged as powerful players in genomics. With the ability to generate reads tens to thousands of kilobases in length with an accuracy approaching that of short-read sequencing technologies, these platforms have proven their ability to resolve some of the most challenging regions of the human genome, detect previously inaccessible structural variants and generate some of the first telomere-to-telomere assemblies of whole chromosomes. Long-read sequencing technologies will soon permit the routine assembly of diploid genomes, which will revolutionize genomics by revealing the full spectrum of human genetic variation, resolving some of the missing heritability and leading to the discovery of novel mechanisms of disease.

425 citations


Journal ArticleDOI
20 Mar 2020-Science
TL;DR: The authors' study adds data about African, Oceanian, and Amerindian populations and indicates that diversity tends to result from differences at the single-nucleotide level rather than copy number variation.
Abstract: Genome sequences from diverse human groups are needed to understand the structure of genetic variation in our species and the history of, and relationships between, different populations. We present 929 high-coverage genome sequences from 54 diverse human populations, 26 of which are physically phased using linked-read sequencing. Analyses of these genomes reveal an excess of previously undocumented common genetic variation private to southern Africa, central Africa, Oceania, and the Americas, but an absence of such variants fixed between major geographical regions. We also find deep and gradual population separations within Africa, contrasting population size histories between hunter-gatherer and agriculturalist groups in the past 10,000 years, and a contrast between single Neanderthal but multiple Denisovan source populations contributing to present-day human populations.

415 citations


Journal ArticleDOI
06 Feb 2020-Nature
TL;DR: It is shown that although point mutations and structural variants that drive cancer are less frequent in non-coding genes and regulatory sequences than in protein-coding genes, additional examples of these drivers will be found as more cancer genomes become available.
Abstract: The discovery of drivers of cancer has traditionally focused on protein-coding genes1-4. Here we present analyses of driver point mutations and structural variants in non-coding regions across 2,658 genomes from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium5 of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). For point mutations, we developed a statistically rigorous strategy for combining significance levels from multiple methods of driver discovery that overcomes the limitations of individual methods. For structural variants, we present two methods of driver discovery, and identify regions that are significantly affected by recurrent breakpoints and recurrent somatic juxtapositions. Our analyses confirm previously reported drivers6,7, raise doubts about others and identify novel candidates, including point mutations in the 5' region of TP53, in the 3' untranslated regions of NFKBIZ and TOB1, focal deletions in BRD4 and rearrangements in the loci of AKR1C genes. We show that although point mutations and structural variants that drive cancer are less frequent in non-coding genes and regulatory sequences than in protein-coding genes, additional examples of these drivers will be found as more cancer genomes become available.

345 citations


Journal ArticleDOI
29 Jul 2020-Nature
TL;DR: A high-density DNase I cleavage map from 243 human cell and tissue types provides a genome-wide, nucleotide-resolution map of human transcription factor footprints, and shows that the enrichment of genetic variants associated with diseases or phenotypic traits in regulatory regions is almost entirely attributable to variants within footprints.
Abstract: Combinatorial binding of transcription factors to regulatory DNA underpins gene regulation in all organisms. Genetic variation in regulatory regions has been connected with diseases and diverse phenotypic traits1, but it remains challenging to distinguish variants that affect regulatory function2. Genomic DNase I footprinting enables the quantitative, nucleotide-resolution delineation of sites of transcription factor occupancy within native chromatin3–6. However, only a small fraction of such sites have been precisely resolved on the human genome sequence6. Here, to enable comprehensive mapping of transcription factor footprints, we produced high-density DNase I cleavage maps from 243 human cell and tissue types and states and integrated these data to delineate about 4.5 million compact genomic elements that encode transcription factor occupancy at nucleotide resolution. We map the fine-scale structure within about 1.6 million DNase I-hypersensitive sites and show that the overwhelming majority are populated by well-spaced sites of single transcription factor–DNA interaction. Cell-context-dependent cis-regulation is chiefly executed by wholesale modulation of accessibility at regulatory DNA rather than by differential transcription factor occupancy within accessible elements. We also show that the enrichment of genetic variants associated with diseases or phenotypic traits in regulatory regions1,7 is almost entirely attributable to variants within footprints, and that functional variants that affect transcription factor occupancy are nearly evenly partitioned between loss- and gain-of-function alleles. Unexpectedly, we find increased density of human genetic variation within transcription factor footprints, revealing an unappreciated driver of cis-regulatory evolution. Our results provide a framework for both global and nucleotide-precision analyses of gene regulatory mechanisms and functional genetic variation. 

191 citations


Journal ArticleDOI
29 Jul 2020-Nature
TL;DR: The results create a universal, extensible coordinate system and vocabulary for human regulatory DNA marked by DHSs, and provide a new global perspective on the architecture of human gene regulation.
Abstract: DNase I hypersensitive sites (DHSs) are generic markers of regulatory DNA1–5 and contain genetic variations associated with diseases and phenotypic traits6–8. We created high-resolution maps of DHSs from 733 human biosamples encompassing 438 cell and tissue types and states, and integrated these to delineate and numerically index approximately 3.6 million DHSs within the human genome sequence, providing a common coordinate system for regulatory DNA. Here we show that these maps highly resolve the cis-regulatory compartment of the human genome, which encodes unexpectedly diverse cell- and tissue-selective regulatory programs at very high density. These programs can be captured comprehensively by a simple vocabulary that enables the assignment to each DHS of a regulatory barcode that encapsulates its tissue manifestations, and global annotation of protein-coding and non-coding RNA genes in a manner orthogonal to gene expression. Finally, we show that sharply resolved DHSs markedly enhance the genetic association and heritability signals of diseases and traits. Rather than being confined to a small number of distal elements or promoters, we find that genetic signals converge on congruently regulated sets of DHSs that decorate entire gene bodies. Together, our results create a universal, extensible coordinate system and vocabulary for human regulatory DNA marked by DHSs, and provide a new global perspective on the architecture of human gene regulation. High-resolution maps of DNase I hypersensitive sites from 733 human biosamples are used to identify and index regulatory DNA within the human genome.

168 citations


Journal ArticleDOI
02 Jul 2020-Nature
TL;DR: A scalable pipeline is used to map and characterize structural variants in 17,795 deeply sequenced human genomes to create the largest, to the authors' knowledge, whole-genome-sequencing-based structural variant resource so far and infer the dosage sensitivity of genes and noncoding elements.
Abstract: A key goal of whole-genome sequencing for studies of human genetics is to interrogate all forms of variation, including single-nucleotide variants, small insertion or deletion (indel) variants and structural variants. However, tools and resources for the study of structural variants have lagged behind those for smaller variants. Here we used a scalable pipeline1 to map and characterize structural variants in 17,795 deeply sequenced human genomes. We publicly release site-frequency data to create the largest, to our knowledge, whole-genome-sequencing-based structural variant resource so far. On average, individuals carry 2.9 rare structural variants that alter coding regions; these variants affect the dosage or structure of 4.2 genes and account for 4.0-11.2% of rare high-impact coding alleles. Using a computational model, we estimate that structural variants account for 17.2% of rare alleles genome-wide, with predicted deleterious effects that are equivalent to loss-of-function coding alleles; approximately 90% of such structural variants are noncoding deletions (mean 19.1 per genome). We report 158,991 ultra-rare structural variants and show that 2% of individuals carry ultra-rare megabase-scale structural variants, nearly half of which are balanced or complex rearrangements. Finally, we infer the dosage sensitivity of genes and noncoding elements, and reveal trends that relate to element class and conservation. This work will help to guide the analysis and interpretation of structural variants in the era of whole-genome sequencing.

162 citations


Journal ArticleDOI
TL;DR: Reviews efforts to create pan-genomes for a range of species, from bacteria to humans, and further considers the computational methods that have been proposed in order to capture, interpret and compare pan-genome data.
Abstract: Since the early days of the genome era, the scientific community has relied on a single ‘reference’ genome for each species, which is used as the basis for a wide range of genetic analyses, including studies of variation within and across species. As sequencing costs have dropped, thousands of new genomes have been sequenced, and scientists have come to realize that a single reference genome is inadequate for many purposes. By sampling a diverse set of individuals, one can begin to assemble a pan-genome: a collection of all the DNA sequences that occur in a species. Here we review efforts to create pan-genomes for a range of species, from bacteria to humans, and we further consider the computational methods that have been proposed in order to capture, interpret and compare pan-genome data. As scientists continue to survey and catalogue the genomic variation across human populations and begin to assemble a human pan-genome, these efforts will increase our power to connect variation to human diversity, disease and beyond. Although single reference genomes are valuable resources, they do not capture genetic diversity among individuals. Sherman and Salzberg discuss the concept of ‘pan-genomes’, which are reference genomes that encompass the genetic variation within a given species. Focusing particularly on large eukaryotic pan-genomes, they describe the latest progress, the varied methodological approaches and computational challenges, as well as applications in fields such as agriculture and human disease.

Posted ContentDOI
19 Sep 2020-bioRxiv
TL;DR: Presents DNABERT, a novel pre-trained bidirectional encoder representation that forms a global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts, and shows that it can be readily applied to other organisms with exceptional performance.
Abstract: Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationship, which previous informatics methods often fail to capture, especially in data-scarce scenarios. To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, that forms a global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts. We show that the single pre-trained transformers model can simultaneously achieve state-of-the-art performance on many sequence prediction tasks, after easy fine-tuning using small task-specific data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationship within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variants. Finally, we demonstrate that DNABERT pre-trained on the human genome can even be readily applied to other organisms with exceptional performance.

Journal ArticleDOI
TL;DR: A convolutional neural network, Akita, is presented that accurately predicts genome folding from DNA sequence alone and can be used to perform in silico saturation mutagenesis, interpret eQTLs, make predictions for structural variants, and probe species-specific genome folding.
Abstract: In interphase, the human genome sequence folds in three dimensions into a rich variety of locus-specific contact patterns. Cohesin and CTCF (CCCTC-binding factor) are key regulators; perturbing the levels of either greatly disrupts genome-wide folding as assayed by chromosome conformation capture methods. Still, how a given DNA sequence encodes a particular locus-specific folding pattern remains unknown. Here we present a convolutional neural network, Akita, that accurately predicts genome folding from DNA sequence alone. Representations learned by Akita underscore the importance of an orientation-specific grammar for CTCF binding sites. Akita learns predictive nucleotide-level features of genome folding, revealing effects of nucleotides beyond the core CTCF motif. Once trained, Akita enables rapid in silico predictions. Accounting for this, we demonstrate how Akita can be used to perform in silico saturation mutagenesis, interpret eQTLs, make predictions for structural variants and probe species-specific genome folding. Collectively, these results enable decoding genome function from sequence through structure.
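
The in silico saturation mutagenesis mentioned in this abstract can be illustrated generically: every position of a sequence is substituted with each alternative base, and the change in a model's prediction is recorded. The sketch below is a minimal illustration under stated assumptions, not Akita's implementation. The function `saturation_mutagenesis` and the scorer `toy_score` are hypothetical names introduced here, and `toy_score` merely counts a fixed motif as a stand-in for a trained model's output.

```python
from typing import Callable, Dict, Tuple

BASES = "ACGT"


def saturation_mutagenesis(
    seq: str, score: Callable[[str], float]
) -> Dict[Tuple[int, str], float]:
    """Score every single-nucleotide substitution of `seq`.

    Returns a mapping (position, alternative base) -> score(mutant) - score(reference).
    `score` is any callable mapping a DNA string to a number; in practice it
    would wrap a trained sequence model (an assumption, not shown here).
    """
    ref_score = score(seq)
    effects: Dict[Tuple[int, str], float] = {}
    for pos, ref_base in enumerate(seq):
        for alt in BASES:
            if alt == ref_base:
                continue  # skip the reference base itself
            mutant = seq[:pos] + alt + seq[pos + 1:]
            effects[(pos, alt)] = score(mutant) - ref_score
    return effects


def toy_score(seq: str) -> float:
    """Hypothetical stand-in scorer: counts occurrences of the motif CCCTC."""
    return float(seq.count("CCCTC"))


if __name__ == "__main__":
    effects = saturation_mutagenesis("AACCCTCAA", toy_score)
    # Report the substitutions that change the toy score.
    for (pos, alt), delta in sorted(effects.items()):
        if delta != 0:
            print(pos, alt, delta)
```

A sequence of length n yields 3n substitutions, so the cost is 3n model evaluations plus one for the reference, which is why fast prediction matters for this kind of scan.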

Journal ArticleDOI
TL;DR: How TE-derived enhancer sequences can rapidly facilitate changes in existing gene regulatory networks and mediate species- and cell-type-specific regulatory innovations, and a unique contribution of TEs to species-specific gene expression divergence in pluripotency and early embryogenesis are discussed.
Abstract: Eukaryotic gene regulation is mediated by cis-regulatory elements, which are embedded within the vast non-coding genomic space and recognized by the transcription factors in a sequence- and context-dependent manner. A large proportion of eukaryotic genomes, including at least half of the human genome, are composed of transposable elements (TEs), which in their ancestral form carried their own cis-regulatory sequences able to exploit the host trans environment to promote TE transcription and facilitate transposition. Although not all present-day TE copies have retained this regulatory function, the preexisting regulatory potential of TEs can provide a rich source of cis-regulatory innovation for the host. Here, we review recent evidence documenting diverse contributions of TE sequences to gene regulation by functioning as enhancers, promoters, silencers and boundary elements. We discuss how TE-derived enhancer sequences can rapidly facilitate changes in existing gene regulatory networks and mediate species- and cell-type-specific regulatory innovations, and we postulate a unique contribution of TEs to species-specific gene expression divergence in pluripotency and early embryogenesis. With advances in genome-wide technologies and analyses, systematic investigation of TEs' cis-regulatory potential is now possible and our understanding of the biological impact of genomic TEs is increasing. This article is part of a discussion meeting issue 'Crossroads between transposons and gene regulation'.

Journal ArticleDOI
29 Jul 2020-Nature
TL;DR: A map of cohesin-mediated chromatin loops in 24 types of human cells identifies loops that show cell-type-specific variation, indicating that chromatin loops may help to specify cell-specific gene expression programs and functions.
Abstract: Physical interactions between distal regulatory elements have a key role in regulating gene expression, but the extent to which these interactions vary between cell types and contribute to cell-type-specific gene expression remains unclear. Here, to address these questions as part of phase III of the Encyclopedia of DNA Elements (ENCODE), we mapped cohesin-mediated chromatin loops, using chromatin interaction analysis by paired-end tag sequencing (ChIA-PET), and analysed gene expression in 24 diverse human cell types, including core ENCODE cell lines. Twenty-eight per cent of all chromatin loops vary across cell types; these variations modestly correlate with changes in gene expression and are effective at grouping cell types according to their tissue of origin. The connectivity of genes corresponds to different functional classes, with housekeeping genes having few contacts, and dosage-sensitive genes being more connected to enhancer elements. This atlas of chromatin loops complements the diverse maps of regulatory architecture that comprise the ENCODE Encyclopedia, and will help to support emerging analyses of genome structure and function.

Journal ArticleDOI
06 May 2020-Nature
TL;DR: Dimethyl sulfate mutational profiling with sequencing and an algorithm named ‘detection of RNA folding ensembles using expectation–maximization’ (DREEM) are developed, which reveal that heterogeneity of RNA structure in HIV-1 regulates the use of splice sites and expression of viral genes.
Abstract: Human immunodeficiency virus 1 (HIV-1) is a retrovirus with a ten-kilobase single-stranded RNA genome. HIV-1 must express all of its gene products from a single primary transcript, which undergoes alternative splicing to produce diverse protein products that include structural proteins and regulatory factors1,2. Despite the critical role of alternative splicing, the mechanisms that drive the choice of splice site are poorly understood. Synonymous RNA mutations that lead to severe defects in splicing and viral replication indicate the presence of unknown cis-regulatory elements3. Here we use dimethyl sulfate mutational profiling with sequencing (DMS-MaPseq) to investigate the structure of HIV-1 RNA in cells, and develop an algorithm that we name 'detection of RNA folding ensembles using expectation-maximization' (DREEM), which reveals the alternative conformations that are assumed by the same RNA sequence. Contrary to previous models that have analysed population averages4, our results reveal heterogeneous regions of RNA structure across the entire HIV-1 genome. In addition to confirming that in vitro characterized5 alternative structures for the HIV-1 Rev responsive element also exist in cells, we discover alternative conformations at critical splice sites that influence the ratio of transcript isoforms. Our simultaneous measurement of splicing and intracellular RNA structure provides evidence for the long-standing hypothesis6-8 that heterogeneity in RNA conformation regulates splice-site use and viral gene expression.

Journal ArticleDOI
30 Jul 2020-Nature
TL;DR: In the third phase of ENCODE, nearly a million and more than 300,000 cCRE annotations have been generated for human and mouse, respectively, and these have provided a valuable resource for the scientific community.
Abstract: The Encyclopedia of DNA Elements (ENCODE) Project launched in 2003 with the long-term goal of developing a comprehensive map of functional elements in the human genome. These included genes, biochemical regions associated with gene regulation (for example, transcription factor binding sites, open chromatin, and histone marks) and transcript isoforms. The marks serve as sites for candidate cis-regulatory elements (cCREs) that may serve functional roles in regulating gene expression1. The project has been extended to model organisms, particularly the mouse. In the third phase of ENCODE, nearly a million and more than 300,000 cCRE annotations have been generated for human and mouse, respectively, and these have provided a valuable resource for the scientific community.

Journal ArticleDOI
TL;DR: This model, termed Xpresso, more than doubles the accuracy of alternative sequence-based models and isolates rules as predictive as models relying on chromatin immunoprecipitation sequencing data, and its residuals can be used to quantify the influence of enhancers, heterochromatic domains, and microRNAs.

Journal ArticleDOI
TL;DR: These regions were explored by combining association analysis with in silico genomic feature annotations and a Bayesian approach that combines genetic association, linkage disequilibrium and enriched genomic features to determine variants with high posterior probabilities of being causal.
Abstract: Genome-wide association studies have identified breast cancer risk variants in over 150 genomic regions, but the mechanisms underlying risk remain largely unknown. These regions were explored by combining association analysis with in silico genomic feature annotations. We defined 205 independent risk-associated signals with the set of credible causal variants in each one. In parallel, we used a Bayesian approach (PAINTOR) that combines genetic association, linkage disequilibrium and enriched genomic features to determine variants with high posterior probabilities of being causal. Potentially causal variants were significantly over-represented in active gene regulatory regions and transcription factor binding sites. We applied our INQUISIT pipeline for prioritizing genes as targets of those potentially causal variants, using gene expression (expression quantitative trait loci), chromatin interaction and functional annotations. Known cancer drivers, transcription factors and genes in the developmental, apoptosis, immune system and DNA integrity checkpoint gene ontology pathways were over-represented among the highest-confidence target genes.

Journal ArticleDOI
TL;DR: Current findings about expression, functions and potential ceRNA mechanisms of pseudogene-derived lncRNAs in human cancer may provide some crucial clues in developing potential targets for cancer therapy in the future.
Abstract: Pseudogenes, abundant in the human genome, are traditionally considered non-functional "junk genes." However, recent studies have revealed that pseudogenes act as key regulators at the DNA, RNA or protein level in diverse human disorders (including cancer). Among these, pseudogene-derived long non-coding RNA (lncRNA) transcripts have been extensively investigated and have been reported to be frequently dysregulated in various types of human cancer. Growing evidence demonstrates that pseudogene-derived lncRNAs play important roles in cancer initiation and progression by serving as competing endogenous RNAs (ceRNAs) through competitively binding to shared microRNAs (miRNAs), thus affecting both their cognate genes and unrelated genes. Herein, we review current findings about the expression, functions and potential ceRNA mechanisms of pseudogene-derived lncRNAs in human cancer, which may provide some crucial clues for developing potential targets for cancer therapy in the future.

Journal ArticleDOI
TL;DR: Silencers are found to be widely distributed and may function in a tissue-specific fashion, and tissue-specific silencing probably contributes substantially to the regulation of gene expression and human biology.
Abstract: The majority of the human genome does not encode proteins. Many of these noncoding regions contain important regulatory sequences that control gene expression. To date, most studies have focused on activators such as enhancers, but regions that repress gene expression—silencers—have not been systematically studied. We have developed a system that identifies silencer regions in a genome-wide fashion on the basis of silencer-mediated transcriptional repression of caspase 9. We found that silencers are widely distributed and may function in a tissue-specific fashion. These silencers harbor unique epigenetic signatures and are associated with specific transcription factors. Silencers also act at multiple genes, and at the level of chromosomal domains and long-range interactions. Deletion of silencer regions linked to the drug transporter genes ABCC2 and ABCG2 caused chemo-resistance. Overall, our study demonstrates that tissue-specific silencing is widespread throughout the human genome and probably contributes substantially to the regulation of gene expression and human biology. A genome-wide screen identifies silencer regions in human cells. Deletion of silencers linked to the transporter genes ABCC2 and ABCG2 causes their up-regulation and chemo-resistance.

Journal ArticleDOI
TL;DR: The HiFi assembly significantly improves continuity and accuracy in many complex regions of the genome but, with existing assemblers, still falls short on centromeric DNA and the largest regions of segmental duplication. Despite these shortcomings, the results suggest that HiFi may be the most effective standalone technology for de novo assembly of human genomes.
Abstract: The sequence and assembly of human genomes using long-read sequencing technologies has revolutionized our understanding of structural variation and genome organization. We compared the accuracy, continuity, and gene annotation of genome assemblies generated from either high-fidelity (HiFi) or continuous long-read (CLR) datasets from the same complete hydatidiform mole human genome. We find that the HiFi sequence data assemble an additional 10% of duplicated regions and more accurately represent the structure of tandem repeats, as validated with orthogonal analyses. As a result, an additional 5 Mbp of pericentromeric sequences are recovered in the HiFi assembly, resulting in a 2.5-fold increase in the NG50 within 1 Mbp of the centromere (HiFi 480.6 kbp, CLR 191.5 kbp). Additionally, the HiFi genome assembly was generated in significantly less time with fewer computational resources than the CLR assembly. Although the HiFi assembly has significantly improved continuity and accuracy in many complex regions of the genome, it still falls short of the assembly of centromeric DNA and the largest regions of segmental duplication using existing assemblers. Despite these shortcomings, our results suggest that HiFi may be the most effective standalone technology for de novo assembly of human genomes.

Journal ArticleDOI
TL;DR: This review is centered on non-viral vectors, mainly composed of cationic lipids and polymers, for nucleic acid-based delivery in numerous gene therapy applications.
Abstract: The field of gene therapy has experienced a surge of attention for its widespread ability to regulate gene expression by targeting genomic DNA, messenger RNA, microRNA, and short-interfering RNA for treating malignant and non-malignant disorders. Numerous nucleic acid analogs have been developed to target coding or non-coding sequences of the human genome for gene regulation. However, broader clinical applications of nucleic acid analogs have been limited by their poor cell- or organ-specific delivery. To resolve these issues, non-viral vectors based on nanoparticles, liposomes, and polyplexes have been developed. This review is centered on non-viral vectors, mainly composed of cationic lipids and polymers, for nucleic acid-based delivery in numerous gene therapy applications.

Journal ArticleDOI
TL;DR: A novel and powerful approach applies mouse regulatory models to analyze human genetic variants associated with molecular phenotypes and disease, unleashing thousands of non-human epigenetic and transcriptional profiles toward more effective investigation of how gene regulation affects human disease.
Abstract: Machine learning algorithms trained to predict the regulatory activity of nucleic acid sequences have revealed principles of gene regulation and guided genetic variation analysis. While the human genome has been extensively annotated and studied, model organisms have been less explored. Model organism genomes offer both additional training sequences and unique annotations describing tissue and cell states unavailable in humans. Here, we develop a strategy to train deep convolutional neural networks simultaneously on multiple genomes and apply it to learn sequence predictors for large compendia of human and mouse data. Training on both genomes improves gene expression prediction accuracy on held out and variant sequences. We further demonstrate a novel and powerful approach to apply mouse regulatory models to analyze human genetic variants associated with molecular phenotypes and disease. Together these techniques unleash thousands of non-human epigenetic and transcriptional profiles toward more effective investigation of how gene regulation affects human disease.

Journal ArticleDOI
26 Jun 2020-Science
TL;DR: In this paper, the authors describe a systematic complex genetic interaction analysis with yeast paralogs derived from the whole-genome duplication event. Mapping of digenic interactions for a deletion mutant of each paralog, and of trigenic interactions for the double mutant, provides insight into their roles and a quantitative measure of their functional redundancy.
Abstract: Whole-genome duplication has played a central role in the genome evolution of many organisms, including the human genome. Most duplicated genes are eliminated, and factors that influence the retention of persisting duplicates remain poorly understood. We describe a systematic complex genetic interaction analysis with yeast paralogs derived from the whole-genome duplication event. Mapping of digenic interactions for a deletion mutant of each paralog, and of trigenic interactions for the double mutant, provides insight into their roles and a quantitative measure of their functional redundancy. Trigenic interaction analysis distinguishes two classes of paralogs: a more functionally divergent subset and another that retained more functional overlap. Gene feature analysis and modeling suggest that evolutionary trajectories of duplicated genes are dictated by combined functional and structural entanglement factors.

Journal ArticleDOI
TL;DR: This work mapped, at high resolution, the genomic sites of MiDAS in cells treated with the DNA polymerase inhibitor aphidicolin, and found that leading and lagging strand synthesis were uncoupled in MiDAS, consistent with MiDAS being a form of break-induced replication, a repair mechanism for collapsed DNA replication forks.
Abstract: DNA replication stress, a feature of human cancers, often leads to instability at specific genomic loci, such as the common fragile sites (CFSs). Cells experiencing DNA replication stress may also exhibit mitotic DNA synthesis (MiDAS). To understand the physiological function of MiDAS and its relationship to CFSs, we mapped, at high resolution, the genomic sites of MiDAS in cells treated with the DNA polymerase inhibitor aphidicolin. Sites of MiDAS were evident as well-defined peaks that were largely conserved between cell lines and encompassed all known CFSs. The MiDAS peaks mapped within large, transcribed, origin-poor genomic regions. In cells that had been treated with aphidicolin, these regions remained unreplicated even in late S phase; MiDAS then served to complete their replication after the cells entered mitosis. Interestingly, leading and lagging strand synthesis were uncoupled in MiDAS, consistent with MiDAS being a form of break-induced replication, a repair mechanism for collapsed DNA replication forks. Our results provide a better understanding of the mechanisms leading to genomic instability at CFSs and in cancer cells.

Journal ArticleDOI
TL;DR: This work reveals that H3K27me3-marked large DNA methylation grand canyons represent a set of very-long-range loops associated with cellular identity, and suggests that these loops form through interactions between repressive elements in the loci, creating a genomic subcompartment, rather than through cohesin/CTCF-mediated extrusion.

Journal ArticleDOI
TL;DR: It has been reported that lncRNAs interact with DNA, RNA, and/or protein molecules, and regulate chromatin organisation, transcriptional and post-transcriptional regulation, and are directly linked to the transformation of healthy cells into tumour cells.

Journal ArticleDOI
09 Jul 2020-Cell
TL;DR: A comprehensive analysis of structural variation in the Human Genome Diversity panel, a high-coverage dataset of 911 samples from 54 diverse worldwide populations, identifies 126,018 variants, 78% of which were not identified in previous global sequencing projects.

Journal ArticleDOI
TL;DR: Although there is still no evidence for HERV-K (HML-2) as a direct cause of diseases, aberrant expression profiles of HERV-K (HML-2) transcripts and their regulatory effects on proximal host genes have been identified in different diseases.
Abstract: Human endogenous retroviruses (HERVs) are derived from exogenous retrovirus infections during the evolution of primates and account for about 8% of the human genome. They were long considered silent passengers within our genomes. However, reactivation of HERVs has been associated with tumors and autoimmune diseases, especially for the HERV-K (HML-2) family, the most recently integrated group, which carries the fewest mutations and is the most biologically active, encoding functional retroviral proteins and producing retrovirus-like particles. A growing number of studies aim to determine the potential role of HERV-K (HML-2) in pathogenicity. Although there is still no evidence for HERV-K (HML-2) as a direct cause of diseases, aberrant expression profiles of HERV-K (HML-2) transcripts and their regulatory effects on proximal host genes have been identified in different diseases. In this review, we summarize advances in understanding the links between HERV-K (HML-2) and diseases to provide a basis for further studies on their causal relationship. We recommend more attention to polymorphic integrated HERV-K (HML-2) loci, which could be genetic causative factors and be associated with inter-individual differences in tumorigenesis and autoimmune diseases.