Showing papers by "Manolis Kellis" published in 2018


Journal ArticleDOI
TL;DR: A mathematical expression is derived to compute PrediXcan results using summary data, and the effects of gene expression variation on human phenotypes in 44 GTEx tissues and >100 phenotypes are investigated.
Abstract: Scalable, integrative methods to understand mechanisms that link genetic variants with phenotypes are needed. Here we derive a mathematical expression to compute PrediXcan (a gene mapping approach) results using summary data (S-PrediXcan) and show its accuracy and general robustness to misspecified reference sets. We apply this framework to 44 GTEx tissues and 100+ phenotypes from GWAS and meta-analysis studies, creating a growing public catalog of associations that seeks to capture the effects of gene expression variation on human phenotypes. Replication in an independent cohort is shown. Most of the associations are tissue specific, suggesting context specificity of the trait etiology. Colocalized significant associations in unexpected tissues underscore the need for an agnostic scanning of multiple contexts to improve our ability to detect causal regulatory mechanisms. Monogenic disease genes are enriched among significant associations for related traits, suggesting that smaller alterations of these genes may cause a spectrum of milder phenotypes.
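
As an illustration only, the summary-data computation can be sketched as follows. The sketch assumes the commonly cited form of the S-PrediXcan approximation, in which the gene-level z-score is a weighted sum of SNP-level GWAS z-scores, with each prediction weight scaled by the ratio of the SNP's standard deviation to that of the predicted expression. The function name, argument names, and toy numbers are hypothetical and are not taken from the authors' software.

```python
import math


def s_predixcan_z(weights, snp_cov, gwas_z):
    """Approximate a gene-level z-score from SNP-level summary statistics.

    weights : prediction weights w_l for the SNPs in the gene's expression model
    snp_cov : SNP covariance matrix (list of lists) from a reference panel
    gwas_z  : GWAS z-scores (beta / standard error) for the same SNPs
    All three inputs are assumed to be in the same SNP order.
    """
    n = len(weights)
    if len(gwas_z) != n or len(snp_cov) != n:
        raise ValueError("inputs must describe the same set of SNPs")

    # Variance of predicted expression: w^T * Cov * w
    var_g = sum(
        weights[i] * snp_cov[i][j] * weights[j]
        for i in range(n)
        for j in range(n)
    )
    if var_g <= 0:
        raise ValueError("predicted expression has non-positive variance")
    sigma_g = math.sqrt(var_g)

    # Z_g ~= sum_l w_l * (sigma_l / sigma_g) * z_l
    return sum(
        weights[l] * math.sqrt(snp_cov[l][l]) / sigma_g * gwas_z[l]
        for l in range(n)
    )


# Toy example: two uncorrelated SNPs with unit variance and equal weights.
z_gene = s_predixcan_z([1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])
```

Because only weights, a reference covariance, and GWAS z-scores are needed, a calculation of this shape requires no individual-level genotype data, which is the point the abstract makes about using summary data.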

657 citations


Journal ArticleDOI
TL;DR: Integration of expression quantitative trait locus (eQTL) data from the Genotype-Tissue Expression project with genome-wide association study data shows that eQTLs are enriched for trait associations in disease-relevant tissues.
Abstract: We apply integrative approaches to expression quantitative trait loci (eQTLs) from 44 tissues from the Genotype-Tissue Expression project and genome-wide association study data. About 60% of known trait-associated loci are in linkage disequilibrium with a cis-eQTL, over half of which were not found in previous large-scale whole blood studies. Applying polygenic analyses to metabolic, cardiovascular, anthropometric, autoimmune, and neurodegenerative traits, we find that eQTLs are significantly enriched for trait associations in relevant pathogenic tissues and explain a substantial proportion of the heritability (40-80%). For most traits, tissue-shared eQTLs underlie a greater proportion of trait associations, although tissue-specific eQTLs have a greater contribution to some traits, such as blood pressure. By integrating information from biological pathways with eQTL target genes and applying a gene-based approach, we validate previously implicated causal genes and pathways, and propose new variant and gene associations for several complex traits, which we replicate in the UK BioBank and BioVU.

358 citations


Journal ArticleDOI
TL;DR: An essential function for m6A mRNA modification in promoting neural stem cell proliferation is demonstrated and interactions between m6A and histone modification as a novel gene regulatory mechanism are revealed.
Abstract: Internal N6-methyladenosine (m6A) modification is widespread in messenger RNAs (mRNAs) and is catalyzed by heterodimers of methyltransferase-like protein 3 (Mettl3) and Mettl14. To understand the role of m6A in development, we deleted Mettl14 in embryonic neural stem cells (NSCs) in a mouse model. Phenotypically, NSCs lacking Mettl14 displayed markedly decreased proliferation and premature differentiation, suggesting that m6A modification enhances NSC self-renewal. Decreases in the NSC pool led to a decreased number of late-born neurons during cortical neurogenesis. Mechanistically, we discovered a genome-wide increase in specific histone modifications in Mettl14 knockout versus control NSCs. These changes correlated with altered gene expression and observed cellular phenotypes, suggesting functional significance of altered histone modifications in knockout cells. Finally, we found that m6A regulates histone modification in part by destabilizing transcripts that encode histone-modifying enzymes. Our results suggest an essential role for m6A in development and reveal m6A-regulated histone modifications as a previously unknown mechanism of gene regulation in mammalian cells.

298 citations


Journal ArticleDOI
TL;DR: Large-scale deep-coverage whole-genome sequencing (WGS) is now feasible and offers potential advantages for locus discovery; at these sample sizes its incremental value for discovery is limited, but WGS permits simultaneous assessment of monogenic and polygenic models of severe hypercholesterolemia.
Abstract: Large-scale deep-coverage whole-genome sequencing (WGS) is now feasible and offers potential advantages for locus discovery. We perform WGS in 16,324 participants from four ancestries at mean depth >29X and analyze genotypes with four quantitative traits: plasma total cholesterol, low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol, and triglycerides. Common variant association yields known loci except for a few variants previously poorly imputed. Rare coding variant association yields known Mendelian dyslipidemia genes but rare non-coding variant association detects no signals. A high 2M-SNP LDL-C polygenic score (top 5th percentile) confers similar effect size to a monogenic mutation (~30 mg/dl higher for each); however, among those with severe hypercholesterolemia, 23% have a high polygenic score and only 2% carry a monogenic mutation. At these sample sizes and for these phenotypes, the incremental value of WGS for discovery is limited but WGS permits simultaneous assessment of monogenic and polygenic models of severe hypercholesterolemia.
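
For readers unfamiliar with polygenic scores, a minimal sketch of the general idea follows. It assumes the standard definition of a polygenic score as a weighted sum of effect-allele dosages and a simple rank-based cutoff for the top 5% of a cohort. The function names, effect sizes, and cohort values are hypothetical and do not reproduce the authors' 2M-SNP score.

```python
def polygenic_score(dosages, effect_sizes):
    """Weighted sum of allele dosages (0, 1 or 2 copies of the effect allele)."""
    if len(dosages) != len(effect_sizes):
        raise ValueError("one effect size is required per variant")
    return sum(d * b for d, b in zip(dosages, effect_sizes))


def top_fraction_cutoff(scores, fraction=0.05):
    """Smallest score that still falls in the top `fraction` of the cohort."""
    if not scores:
        raise ValueError("at least one score is required")
    ranked = sorted(scores, reverse=True)
    k = max(1, int(len(ranked) * fraction))  # number of individuals in the top group
    return ranked[k - 1]


# Toy cohort of 100 individuals whose scores are simply 0..99.
cohort_scores = [float(s) for s in range(100)]
cutoff = top_fraction_cutoff(cohort_scores, fraction=0.05)
high_score_carriers = [s for s in cohort_scores if s >= cutoff]
```

Individuals at or above the cutoff would be the "high polygenic score" group in an analysis of this kind.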

147 citations


Journal ArticleDOI
TL;DR: High-resolution quantification of enhancer function and high-resolution dissection of millions of accessible DNA fragments are reported, revealing driver nucleotides and helping interpret non-coding disease variants.
Abstract: Genome-wide epigenomic maps have revealed millions of putative enhancers and promoters, but experimental validation of their function and high-resolution dissection of their driver nucleotides remain limited. Here, we present HiDRA (High-resolution Dissection of Regulatory Activity), a combined experimental and computational method for high-resolution genome-wide testing and dissection of putative regulatory regions. We test ~7 million accessible DNA fragments in a single experiment, by coupling accessible chromatin extraction with self-transcribing episomal reporters (ATAC-STARR-seq). By design, fragments are highly overlapping in densely-sampled accessible regions, enabling us to pinpoint driver regulatory nucleotides by exploiting differences in activity between partially-overlapping fragments using a machine learning model (SHARPR-RE). In GM12878 lymphoblastoid cells, we find ~65,000 regions showing enhancer function, and pinpoint ~13,000 high-resolution driver elements. These are enriched for regulatory motifs, evolutionarily-conserved nucleotides, and disease-associated genetic variants from genome-wide association studies. Overall, HiDRA provides a high-throughput, high-resolution approach for dissecting regulatory regions and driver nucleotides.

105 citations


Journal ArticleDOI
TL;DR: Characterization of mRNA structure during the zebrafish maternal-to-zygotic transition identifies the ribosome as a major RNA structure remodeler in vivo and reveals that structural dynamics can affect gene expression, partly by modulating miRNA activity.
Abstract: RNA folding plays a crucial role in RNA function. However, knowledge of the global structure of the transcriptome is limited to cellular systems at steady state, thus hindering the understanding of RNA structure dynamics during biological transitions and how it influences gene function. Here, we characterized mRNA structure dynamics during zebrafish development. We observed that on a global level, translation guides structure rather than structure guiding translation. We detected a decrease in structure in translated regions and identified the ribosome as a major remodeler of RNA structure in vivo. In contrast, we found that 3′ untranslated regions (UTRs) form highly folded structures in vivo, which can affect gene expression by modulating microRNA activity. Furthermore, dynamic 3′-UTR structures contain RNA-decay elements, such as the regulatory elements in nanog and ccna1, two genes encoding key maternal factors orchestrating the maternal-to-zygotic transition. These results reveal a central role of RNA structure dynamics in gene regulatory programs.

91 citations


Journal ArticleDOI
28 Sep 2018-Science
TL;DR: The exploration of the link between SD-ASM, stochastic variation in DNA methylation, and gene regulation requires deep coverage by WGBS across tissues and individuals and the context of other epigenomic marks and gene transcription.
Abstract: INTRODUCTION A majority of imbalances in DNA methylation between homologous chromosomes in humans are sequence-dependent; the DNA sequence differences between the two chromosomes cause differences in the methylation state of neighboring cytosines on the same chromosome. The analyses of this sequence-dependent allele-specific methylation (SD-ASM) traditionally involved measurement of average methylation levels across many cells. Detailed understanding of SD-ASM at the single-cell and single-chromosome levels is lacking. This gap in understanding may hide the connection between SD-ASM, ubiquitous stochastic cell-to-cell and chromosome-to-chromosome variation in DNA methylation, and the puzzling and evolutionarily conserved patterns of intermediate methylation at gene regulatory loci. RATIONALE Whole-genome bisulfite sequencing (WGBS) provides the ultimate single-chromosome level of resolution and comprehensive whole-genome coverage required to explore SD-ASM. However, the exploration of the link between SD-ASM, stochastic variation in DNA methylation, and gene regulation requires deep coverage by WGBS across tissues and individuals and the context of other epigenomic marks and gene transcription. RESULTS We constructed maps of allelic imbalances in DNA methylation, histone marks, and gene transcription in 71 epigenomes from 36 distinct cell and tissue types from 13 donors. Deep (1691-fold) combined WGBS read coverage across 49 methylomes revealed CpG methylation imbalances exceeding 30% differences at 5% of the loci, which is more conservative than previous estimates in the 8 to 10% range; a similar value (8%) is observed in our dataset when we lowered our threshold for detecting allelic imbalance to 20% methylation difference between the two alleles. Extensive sequence-dependent CpG methylation imbalances were observed at thousands of heterozygous regulatory loci. 
Stochastic switching, defined as random transitions between fully methylated and unmethylated states of DNA, occurred at thousands of regulatory loci bound by transcription factors (TFs). Our results explain the conservation of intermediate methylation states at regulatory loci by showing that the intermediate methylation reflects the relative frequencies of fully methylated and fully unmethylated epialleles. SD-ASM is explainable by different relative frequencies of methylated and unmethylated epialleles for the two alleles. The differences in epiallele frequency spectra of the alleles at thousands of TF-bound regulatory loci correlated with the differences in alleles’ affinities for TF binding, which suggests a mechanistic explanation for SD-ASM. We observed an excess of rare variants among those showing SD-ASM, which suggests that an average human genome harbors at least ~200 detrimental rare variants that also show SD-ASM. The methylome’s sensitivity to genetic variation is unevenly distributed across the genome, which is consistent with buffering of housekeeping genes against the effects of random mutations. By contrast, less essential genes with tissue-specific expression patterns show sensitivity, thus providing opportunity for evolutionary innovation through changes in gene regulation. CONCLUSION Analysis of allelic epigenome maps provides a unifying model that links sequence-dependent allelic imbalances of the epigenome, stochastic switching at gene regulatory loci, selective buffering of the regulatory circuitry against the effects of random mutations, and disease-associated genetic variation.
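
The epiallele reasoning in this abstract can be illustrated with a small sketch. It assumes each sequencing read at a locus has already been assigned to one of the two alleles and classified as fully methylated or fully unmethylated; under that assumption, an allele's methylation level is just the fraction of methylated epialleles, and an imbalance is called when the two alleles differ by more than the 30% threshold mentioned above. All names and read counts are hypothetical.

```python
def methylated_fraction(epialleles):
    """Fraction of reads in the fully methylated state.

    `epialleles` is a list of booleans: True for a fully methylated read,
    False for a fully unmethylated read.
    """
    if not epialleles:
        raise ValueError("at least one read is required")
    return sum(epialleles) / len(epialleles)


def allelic_imbalance(allele1_reads, allele2_reads, threshold=0.30):
    """Return the absolute methylation difference and whether it exceeds the threshold."""
    diff = abs(methylated_fraction(allele1_reads) - methylated_fraction(allele2_reads))
    return diff, diff > threshold


# Toy locus: allele 1 is methylated in 8 of 10 reads, allele 2 in 3 of 10.
allele1 = [True] * 8 + [False] * 2
allele2 = [True] * 3 + [False] * 7
difference, imbalanced = allelic_imbalance(allele1, allele2)
```

In this toy case both alleles show intermediate methylation (0.8 and 0.3) even though every individual read is either fully methylated or fully unmethylated, which mirrors the explanation the abstract gives for intermediate methylation states.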

80 citations


Journal ArticleDOI
TL;DR: In this article, the authors used deep-coverage whole genome sequencing in 8392 individuals of European and African ancestry to discover and interpret both single-nucleotide variants and copy number (CN) variation associated with Lp(a).
Abstract: Lipoprotein(a), Lp(a), is a modified low-density lipoprotein particle that contains apolipoprotein(a), encoded by LPA, and is a highly heritable, causal risk factor for cardiovascular diseases that varies in concentrations across ancestries. Here, we use deep-coverage whole genome sequencing in 8392 individuals of European and African ancestry to discover and interpret both single-nucleotide variants and copy number (CN) variation associated with Lp(a). We observe that Europeans and Africans each have several unique genetic determinants. The common variant rs12740374 associated with Lp(a) cholesterol is an eQTL for SORT1 and independent of LDL cholesterol. Observed associations of aggregates of rare non-coding variants are largely explained by LPA structural variation, namely the LPA kringle IV 2 (KIV2)-CN. Finally, we find that LPA risk genotypes confer greater relative risk for incident atherosclerotic cardiovascular diseases compared to directly measured Lp(a), and are significantly associated with measures of subclinical atherosclerosis in African Americans.

72 citations


Journal ArticleDOI
TL;DR: The screen for significant mutation patterns coupled with correlative mutational analysis identified new individual driver candidates and suggests that some non-coding mutations recurrently affect expression and play a role in cancer development.
Abstract: Cancer develops by accumulation of somatic driver mutations, which impact cellular function. Mutations in non-coding regulatory regions can now be studied genome-wide and further characterized by correlation with gene expression and clinical outcome to identify driver candidates. Using a new two-stage procedure, called ncDriver, we first screened 507 ICGC whole-genomes from 10 cancer types for non-coding elements, in which mutations are both recurrent and have elevated conservation or cancer specificity. This identified 160 significant non-coding elements, including the TERT promoter, a well-known non-coding driver element, as well as elements associated with known cancer genes and regulatory genes (e.g., PAX5, TOX3, PCF11, MAPRE3). However, in some significant elements, mutations appear to stem from localized mutational processes rather than recurrent positive selection. To further characterize the driver potential of the identified elements and shortlist candidates, we identified elements where presence of mutations correlated significantly with expression levels (e.g., TERT and CDH10) and survival (e.g., CDH9 and CDH10) in an independent set of 505 TCGA whole-genome samples. In a larger pan-cancer set of 4128 TCGA exomes with expression profiling, we identified mutational correlation with expression for additional elements (e.g., near GATA3, CDC6, ZNF217, and CTCF transcription factor binding sites). Survival analysis further pointed to MIR122, a known marker of poor prognosis in liver cancer. In conclusion, the screen for significant mutation patterns coupled with correlative mutational analysis identified new individual driver candidates and suggests that some non-coding mutations recurrently affect expression and play a role in cancer development.

70 citations


Journal ArticleDOI
TL;DR: RANGER-DTL 2.0 has a particular focus on reconciliation accuracy and can account for many sources of reconciliation uncertainty including uncertain gene tree rooting, gene tree topological uncertainty, multiple optimal reconciliations and alternative event cost assignments.
Abstract: Summary RANGER-DTL 2.0 is a software program for inferring gene family evolution using Duplication-Transfer-Loss reconciliation. This new software is highly scalable and easy to use, and offers many new features not currently available in any other reconciliation program. RANGER-DTL 2.0 has a particular focus on reconciliation accuracy and can account for many sources of reconciliation uncertainty including uncertain gene tree rooting, gene tree topological uncertainty, multiple optimal reconciliations and alternative event cost assignments. RANGER-DTL 2.0 is open-source and written in C++ and Python. Availability and implementation Pre-compiled executables, source code (open-source under GNU GPL) and a detailed manual are freely available from http://compbio.engr.uconn.edu/software/RANGER-DTL/. Supplementary information Supplementary data are available at Bioinformatics online.

61 citations


Journal ArticleDOI
TL;DR: The readthrough efficiency of the annotated stop codon for the sequence encoding vitamin D receptor (VDR), a member of the nuclear receptor superfamily of ligand-inducible transcription factors, was the highest of those tested but all showed notable levels of readthrough.

Journal ArticleDOI
TL;DR: An in-depth investigation on the 2764 genes classified as coding by one or more sets of manual curators and not coding by others suggests that most are not under protein-like purifying selection and so are unlikely to code for functional proteins.
Abstract: Seventeen years after the sequencing of the human genome, the human proteome is still under revision. One in eight of the 22 210 coding genes listed by the Ensembl/GENCODE, RefSeq and UniProtKB reference databases are annotated differently across the three sets. We have carried out an in-depth investigation on the 2764 genes classified as coding by one or more sets of manual curators and not coding by others. Data from large-scale genetic variation analyses suggests that most are not under protein-like purifying selection and so are unlikely to code for functional proteins. A further 1470 genes annotated as coding in all three reference sets have characteristics that are typical of non-coding genes or pseudogenes. These potential non-coding genes also appear to be undergoing neutral evolution and have considerably less supporting transcript and protein evidence than other coding genes. We believe that the three reference databases currently overestimate the number of human coding genes by at least 2000, complicating and adding noise to large-scale biomedical experiments. Determining which potential non-coding genes do not code for proteins is a difficult but vitally important task since the human reference proteome is a fundamental pillar of most basic research and supports almost all large-scale biomedical projects.

Journal ArticleDOI
TL;DR: In this paper, the authors investigated the relationship between the chromatin opening and transcriptional activation and concluded that chromatin accessibility is a central factor for successful transcriptional reprogramming in oocytes.

01 Jan 2018
TL;DR: Deep-coverage whole genome sequencing in 8392 individuals of European and African ancestry is used to discover and interpret both single-nucleotide variants and copy number (CN) variation associated with Lp(a); LPA risk genotypes are found to confer greater relative risk for incident atherosclerotic cardiovascular diseases than directly measured Lp(a).

Journal ArticleDOI
TL;DR: It is found that the four ATE1 isoforms have different, only partially overlapping target site specificity that includes more variability in the target residues than previously believed.
Abstract: Protein arginylation mediated by arginyltransferase ATE1 is a key regulatory process essential for mammalian embryogenesis, cell migration, and protein regulation. Despite decades of studies, very little is known about the specificity of ATE1-mediated target site recognition. Here, we used in vitro assays and computational analysis to dissect target site specificity of mouse arginyltransferases and gain insights into the complexity of the mammalian arginylome. We found that the four ATE1 isoforms have different, only partially overlapping target site specificity that includes more variability in the target residues than previously believed. Based on all the available data, we generated an algorithm for identifying a potential arginylation consensus motif and used this algorithm for global prediction of proteins arginylated in vivo on the N-terminal D and E. Our analysis reveals multiple proteins with potential ATE1 target sites and expands our understanding of the biological complexity of the intracellular arginylome.

Posted ContentDOI
26 Sep 2018-bioRxiv
TL;DR: This work shows that a single amino acid, arginine, is the major contributor to codon usage bias differences across domains of life, and exploits the finding to show that the identified domain-specific codon bias signatures can be used to classify a given sequence into its corresponding domain with high accuracy.
Abstract: Due to the degeneracy of the genetic code, multiple codons are translated into the same amino acid. Despite being ‘synonymous’, these codons are not equally used. Selective pressures are thought to drive the choice among synonymous codons within a genome, while GC content, which is generally attributed to mutational drift, is the major determinant of interspecies codon usage bias. Here we find that in addition to the bias caused by GC content, inter-species codon usage signatures can also be detected. More specifically, we show that a single amino acid, arginine, is the major contributor to codon usage bias differences across domains of life. We then exploit this finding, and show that the identified domain-specific codon bias signatures can be used to classify a given sequence into its corresponding domain with high accuracy. Considering that species belonging to the same domain share similar tRNA decoding strategies, we then wondered whether the inclusion of codon autocorrelation patterns might improve the classification performance of our algorithm. However, we find that autocorrelation patterns are not domain-specific, and surprisingly, are unrelated to tRNA reusage, in contrast to the common belief. Instead, our results reveal that codon autocorrelation patterns are a consequence of codon optimality throughout a sequence, where highly expressed genes display autocorrelated ‘optimal’ codons, whereas lowly expressed genes display autocorrelated ‘non-optimal’ codons.
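
To make the arginine signal concrete, the sketch below computes how a coding sequence distributes its arginine codons among the six synonymous codons of the standard genetic code (CGT, CGC, CGA, CGG, AGA, AGG). This is an illustrative feature extraction only, not the authors' classifier; the function name and the toy sequence are hypothetical, and the input is assumed to be an in-frame coding sequence.

```python
from collections import Counter

# The six arginine codons of the standard genetic code (DNA alphabet).
ARG_CODONS = ("CGT", "CGC", "CGA", "CGG", "AGA", "AGG")


def arginine_codon_usage(cds):
    """Relative usage of each arginine codon in an in-frame coding sequence."""
    seq = cds.upper().replace("U", "T")  # accept RNA or DNA input
    usable = len(seq) - len(seq) % 3     # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in ARG_CODONS)
    total = sum(counts.values())
    if total == 0:
        return {codon: 0.0 for codon in ARG_CODONS}
    return {codon: counts[codon] / total for codon in ARG_CODONS}


# Toy sequence: ATG CGT CGT AGA TAA
usage = arginine_codon_usage("ATGCGTCGTAGATAA")
```

A vector of this kind, computed per genome or per gene, is the sort of input a domain classifier could be trained on; the choice of classifier is left open here.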

Journal ArticleDOI
TL;DR: It is hypothesized that loss of LDAH is involved in PCa and other phenotypes observed, in support of a genotype-phenotype association in an n-of-one human subject.
Abstract: Great strides in gene discovery have been made using a multitude of methods to associate phenotypes with genetic variants, but there still remains a substantial gap between observed symptoms and identified genetic defects. Herein, we use the convergence of various genetic and genomic techniques to investigate the underpinnings of a constellation of phenotypes that include prostate cancer (PCa) and sensorineural hearing loss (SNHL) in a human subject. Through interrogation of the subject's de novo, germline, balanced chromosomal translocation, we first identify a correlation between his disorders and a poorly annotated gene known as lipid droplet associated hydrolase (LDAH). Using data repositories of both germline and somatic variants, we identify convergent genomic evidence that substantiates a correlation between loss of LDAH and PCa. This correlation is validated through both in vitro and in vivo models that show loss of LDAH results in increased risk of PCa and, to a lesser extent, SNHL. By leveraging convergent evidence in emerging genomic data, we hypothesize that loss of LDAH is involved in PCa and other phenotypes observed in support of a genotype-phenotype association in an n-of-one human subject.

Posted ContentDOI
02 Jul 2018-bioRxiv
TL;DR: The evidence used by CHESS is reanalyzed, and it is found that nearly all protein-coding predictions are false positives and that 86% overlap transposons marked by RepeatMasker, which are known to frequently result in false-positive protein-coding predictions.
Abstract: In a 2018 paper posted to bioRxiv, Pertea et al. presented the CHESS database, a new catalog of human gene annotations that includes 1,178 new protein-coding predictions. These are based on evidence of transcription in human tissues and homology to earlier annotations in human and other mammals. Here, we reanalyze the evidence used by CHESS, and find that nearly all protein-coding predictions are false positives. We find that 86% overlap transposons marked by RepeatMasker that are known to frequently result in false positive protein-coding predictions. More than half are homologous to only nine Alu-derived primate sequences corresponding to an erroneous and previously withdrawn Pfam protein domain. The entire set shows poor evolutionary conservation and PhyloCSF protein-coding evolutionary signatures indistinguishable from noncoding RNAs, indicating lack of protein-coding constraint. Only four predictions are supported by mass spectrometry evidence, and even those matches are inconclusive. Overall, the new protein-coding predictions are unsupported by any credible experimental or evolutionary evidence of function, result primarily from homology to genes incorrectly classified as protein-coding, and are unlikely to encode functional proteins.

Posted Content
TL;DR: The platform combines the security of Intel Software Guarded eXtensions (SGX), transparency of blockchain technology, and verifiability of open algorithms and source codes to allow AI models to be trained on medical data securely.
Abstract: Artificial Intelligence (AI) incorporating genetic and medical information has been applied in disease risk prediction, unveiling disease mechanisms, and advancing therapeutics. However, AI training relies on highly sensitive and private data, which significantly limits its applications and robustness evaluation. Moreover, data access management after sharing across organizations relies heavily on legal restrictions, and there is no guarantee of preventing data leaks after sharing. Here, we present Genie, a secure AI platform which allows AI models to be trained on medical data securely. The platform combines the security of Intel Software Guarded eXtensions (SGX), transparency of blockchain technology, and verifiability of open algorithms and source codes. Genie shares insights of genetic and medical data without exposing anyone's raw data. All data is instantly encrypted upon upload and contributed to the models that the user chooses. The usage of the model and the value generated from the genetic and health data will be tracked via a blockchain, giving the data transparent and immutable ownership.

Journal ArticleDOI
TL;DR: In the version of this article initially published online, there were errors in URLs for www.southernbiotech.com, appearing in Methods sections “m6A dot-blot” and “Western blot analysis.”
Abstract: In the version of this article initially published online, there were errors in URLs for www.southernbiotech.com, appearing in Methods sections "m6A dot-blot" and "Western blot analysis." The first two URLs should be https://www.southernbiotech.com/?catno=4030-05&type=Polyclonal#&panel1-1 and the third should be https://www.southernbiotech.com/?catno=6170-05&type=Polyclonal. In addition, some Methods URLs for bioz.com, www.abcam.com and www.sysy.com were printed correctly but not properly linked. The errors have been corrected in the PDF and HTML versions of this article.

Posted ContentDOI
01 Jan 2018-bioRxiv
TL;DR: An algorithm, called DECODE, is developed to assess the extent of joint presence/absence of genes across different cells, and to infer a gene dependency network, and it is shown that this network captures biologically-meaningful pathways, cell-type specific modules, and connectivity patterns characteristic of complex networks.
Abstract: An inherent challenge in interpreting single-cell transcriptomic data is the high frequency of zero values. This phenomenon has been attributed to both biological and technical sources, although the extent of the contribution of each remains unclear. Here, we show that the underlying gene presence/absence sparsity patterns are by themselves highly informative. We develop an algorithm, called DECODE, to assess the extent of joint presence/absence of genes across different cells, and to infer a gene dependency network. We show that this network captures biologically-meaningful pathways, cell-type specific modules, and connectivity patterns characteristic of complex networks. We develop a model that uses this network to discriminate biological vs. technical zeros, by exploiting each gene's local network neighborhood. For inferred non-biological zeros, we build a predictive model that imputes the missing value of each gene based on activity patterns of its most informative neighbors. We show that our framework accurately infers gene-gene functional dependencies, pinpoints technical zeros, and predicts biologically-meaningful missing values in three diverse datasets.
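
The notion of joint presence/absence can be illustrated with a deliberately simple stand-in. The sketch below binarizes two genes' expression across cells and scores their co-presence with a Jaccard index; this is not the DECODE algorithm, whose statistic is not described in the abstract, and all names and values are hypothetical.

```python
def presence(expression, threshold=0.0):
    """Binarize expression values: True where the gene is detected in a cell."""
    return [value > threshold for value in expression]


def co_presence_jaccard(gene_a, gene_b):
    """Jaccard index of the sets of cells in which each gene is detected."""
    if len(gene_a) != len(gene_b):
        raise ValueError("both genes must be measured in the same cells")
    a, b = presence(gene_a), presence(gene_b)
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


# Toy counts for two genes across five cells.
gene_a_counts = [0, 3, 5, 0, 2]
gene_b_counts = [0, 1, 0, 0, 4]
similarity = co_presence_jaccard(gene_a_counts, gene_b_counts)
```

Pairs of genes with high scores across many cells would be candidate edges in a dependency network; a real method would also need to assess statistical significance, which this sketch omits.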


Posted ContentDOI
05 Jul 2018-bioRxiv
TL;DR: A significant over-representation of secondary alleles in chaperonin-encoding genes is found in Bardet-Biedl syndrome cohorts, indicating a complex genetic architecture for BBS that informs the biological properties of disease modules and presents a model paradigm for secondary-variant burden analysis in recessive disorders.
Abstract: The influence of genetic background on driver mutations is well established; however, the mechanisms by which the background interacts with Mendelian loci remain unclear. We performed a systematic secondary-variant burden analysis of two independent Bardet-Biedl syndrome (BBS) cohorts with known recessive biallelic pathogenic mutations in one of 17 BBS genes for each individual. We observed a significant enrichment of trans-acting rare nonsynonymous secondary variants compared to either population controls or to a cohort of individuals with a non-BBS diagnosis and recessive variants in the same gene set. Strikingly, we found a significant over-representation of secondary alleles in chaperonin-encoding genes, a finding corroborated by the observation of epistatic interactions involving this complex in vivo. These data indicate a complex genetic architecture for BBS that informs the biological properties of disease modules and presents a model paradigm for secondary-variant burden analysis in recessive disorders.

Proceedings ArticleDOI
TL;DR: A two-component Bayesian deconvolution model to infer the tumor-derived and immune-derived exosomal contribution to the observed mixed plasma-derived exosome signal paves the way for more widespread usage of plasma-derived exosomes as a clinical prediction and monitoring tool.
Abstract: There is a critical need for robust and minimally invasive biomarkers for predicting and monitoring tumor progression and response to treatment. Transcriptomes of plasma-derived exosomes (PDEs) are suitable candidates to fulfill such a role, since they contain a subtranscriptome of their cell of origin, and, since nearly all cell types secrete exosomes, this allows for the potential monitoring of multiple cell types concurrently. However, a major issue preventing the widespread adoption of plasma-derived exosomes as biomarkers is that observed plasma exosomes actually result from a mixture of exosomes from multiple cell types. This confounds detailed dissection and interpretation of putative plasma-derived exosomal biomarkers. To address this issue, we develop a two-component Bayesian deconvolution model to infer the tumor-derived and immune-derived exosomal contribution to the observed mixed plasma-derived exosome signal. Our model leverages transcriptomic information from 3 different sources: (1) paired patient bulk and plasma-derived exosomes, (2) paired cell-line tumor and cell-line tumor-derived exosomes, and (3) healthy control plasma-derived exosomes to learn gene-by-gene mixing fractions between tumor and immune components and the mapping from tumor to tumor-derived exosomes transcriptomic profiles. Using this information, we are able to further infer the patient-specific tumor-derived and immune-derived exosomal transcriptomic profiles for each gene. The outputs from our model enable us to derive tumor-specific and immune-specific exosomal biomarkers. We first show that our model is performant in an extensive set of in silico simulations. Next, we applied our model to transcriptomes collected prior to and during aPD1 immunotherapy treatment from a pilot cohort of N=44 patients (N=29 responders, N=15 nonresponders) with metastatic melanoma. 
Analysis of our deconvolved profiles yields novel and biologically informative immune-derived and tumor-derived exosomal biomarkers that predict immunotherapy success. Moreover, time-series analysis of the deconvolved profiles show that we are able to identify significantly different tumor and immune related genes whose time dynamics differ significantly between responders and nonresponders, suggesting that plasma-derived exosomes can enable longitudinal tracking of both immune and tumor components of immunotherapy response. Finally, we show that a more sophisticated extension of our deconvolution model is able to provide an estimate of global tumor fraction for each patient, potentially allowing us to infer tumor burden through plasma-derived exosomal transcriptomic signatures. Overall, our plasma-derived exosomal deconvolution model paves the way for more widespread usage of plasma-derived exosomes as a clinical monitoring prediction and monitoring tool. Citation Format: Alvin Shi, Gyulnara Kasumova, Isabel Chien, Jessica Cintolo-Gonzalez, Dennie T. Frederick, Roman Alpatov, William A. Michaud, Deborah Plana, Ryan Corcoran, Keith Flaherty, Ryan Sullivan, Manolis Kellis, Genevieve Boland. Deconvolution of plasma-derived exosomes for tracking and prediction of immunotherapy across multiple tissues [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2018; 2018 Apr 14-18; Chicago, IL. Philadelphia (PA): AACR; Cancer Res 2018;78(13 Suppl):Abstract nr 4282.

Patent
11 May 2018
TL;DR: A secure system for sharing private data, together with related systems and methods for incentivizing and validating private data sharing, in which private data providers may register to selectively share private data under controlled sharing conditions.
Abstract: Described herein are a secure system for sharing private data and related systems and methods for incentivizing and validating private data sharing. In some embodiments, private data providers may register to selectively share private data under controlled sharing conditions. The private data may be cryptographically secured using encryption information corresponding to one or more secure execution environments. To demonstrate to the private data providers that the secure execution environment is secure and trustworthy, attestations demonstrating the security of the secure execution environment may be stored in a distributed ledger (e.g., a public blockchain). Private data users that want access to shared private data may publish applications for operating on the private data to a secure execution environment and publish, in a distributed ledger, an indication that the application is available to receive private data. The distributed ledger may also store sharing conditions under which the private data will be shared.

Posted ContentDOI
01 Dec 2018-bioRxiv
TL;DR: The spatiotemporal transcriptome atlas provides a comprehensive resource to investigate the function of coding genes and noncoding RNAs during critical stages of early neurogenesis, revealing a complex cell type-specific network of mRNAs and lncRNAs.
Abstract: Cell type specification during early nervous system development in Drosophila melanogaster requires precise regulation of gene expression in time and space. Resolving the programs driving neurogenesis has been a major challenge owing to the complexity and rapidity with which distinct cell populations arise. To resolve the cell type-specific gene expression dynamics in early nervous system development, we have sequenced the transcriptomes of purified neurogenic cell types across consecutive time points covering critical events in neurogenesis. The resulting gene expression atlas comprises a detailed resource of global transcriptome dynamics that permits systematic analysis of how cells in the nervous system acquire distinct fates. We resolve known gene expression dynamics and uncover novel expression signatures for hundreds of genes among diverse neurogenic cell types, most of which remain unstudied. We also identified a set of conserved and processed long-noncoding RNAs (lncRNAs) that exhibit spatiotemporal expression during neurogenesis with exquisite specificity. LncRNA expression is highly dynamic and demarcates specific subpopulations within neurogenic cell types. Our spatiotemporal transcriptome atlas provides a comprehensive resource to investigate the function of coding genes and noncoding RNAs during critical stages of early neurogenesis.

Posted ContentDOI
14 Nov 2018-bioRxiv
TL;DR: Methyl-HiC, a method combining in situ Hi-C and whole genome bisulfite sequencing (WGBS), simultaneously captures chromosome conformation and the DNA methylome in a single assay and reveals coordinated DNA methylation between distant yet spatially proximal genomic regions.
Abstract: Dynamic DNA methylation and three-dimensional chromatin architecture compose a major portion of the epigenome and play an essential role in tissue-specific gene expression programs. Currently, DNA methylation and chromatin organization are generally profiled in separate assays. Here, we report Methyl-HiC, a method combining in situ Hi-C and whole genome bisulfite sequencing (WGBS) to simultaneously capture chromosome conformation and the DNA methylome in a single assay. Methyl-HiC analysis of mouse embryonic stem cells reveals coordinated DNA methylation between distant yet spatially proximal genomic regions. Extension of Methyl-HiC to single cells further enables delineation of the heterogeneity of both chromosomal conformation and DNA methylation in a mixed cell population, and uncovers increased dynamics of chromatin contacts and decreased stochasticity in DNA methylation in genomic regions that replicate early during the cell cycle.

Posted ContentDOI
11 Nov 2018-bioRxiv
TL;DR: CONVERGE is the first computational tool to search for co-localization of GWAS causal variants with transcription factor binding sites in the same regulatory regions, without requiring direct overlap, and is useful for exploring the regulatory architecture of complex traits.
Abstract: Genomic regions associated with complex traits and diseases are primarily located in non-coding regions of the genome and have unknown mechanisms of action. A critical step to understanding the genetics of complex traits is to fine-map each associated locus; that is, to find the causal variant(s) that underlie genetic associations with a trait. Fine-mapping approaches are currently focused on identifying genomic annotations, such as transcription factor binding sites, which are enriched in direct overlap with candidate causal variants. We introduce CONVERGE, the first computational tool to search for co-localization of GWAS causal variants with transcription factor binding sites in the same regulatory regions, without requiring direct overlap. As a proof of principle, we demonstrate that CONVERGE is able to identify five novel regulators of type 2 diabetes, which were subsequently validated in knockdown experiments in pancreatic beta cells, while existing fine-mapping methods were unable to find any statistically significant regulators. CONVERGE also recovers more established regulators for total cholesterol compared to other fine-mapping methods. CONVERGE is therefore unique and complementary to existing fine-mapping methods and is useful for exploring the regulatory architecture of complex traits.

Journal ArticleDOI
Abstract: (1) Wellcome Trust Sanger Institute, Hinxton CB10 1SA, Cambridgeshire, UK; (2) Comparative Genomics Lab, Instituto de Biologica Evolutiva, Universitat Pompeu Fabra, Barcelona, Spain; (3) MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA and Broad Institute of MIT and Harvard, Cambridge, MA, USA; (4) Bioinformatics Unit, Spanish National Cancer Research Centre, Madrid, Spain; (5) Computational Biology Life Sciences Group, Barcelona Supercomputing Center, Barcelona, Spain; and (6) Cardiovascular Proteomics Laboratory, Centro Nacional de

Journal Article
01 Dec 2018-Nature
TL;DR: HiDRA is a combined experimental and computational method for high-resolution genome-wide testing and dissection of putative regulatory regions, including enhancers and promoters, that couples accessible chromatin extraction with self-transcribing episomal reporters.
Abstract: Genome-wide epigenomic maps have revealed millions of putative enhancers and promoters, but experimental validation of their function and high-resolution dissection of their driver nucleotides remain limited. Here, we present HiDRA (High-resolution Dissection of Regulatory Activity), a combined experimental and computational method for high-resolution genome-wide testing and dissection of putative regulatory regions. We test ~7 million accessible DNA fragments in a single experiment, by coupling accessible chromatin extraction with self-transcribing episomal reporters (ATAC-STARR-seq). By design, fragments are highly overlapping in densely-sampled accessible regions, enabling us to pinpoint driver regulatory nucleotides by exploiting differences in activity between partially-overlapping fragments using a machine learning model (SHARPR-RE). In GM12878 lymphoblastoid cells, we find ~65,000 regions showing enhancer function, and pinpoint ~13,000 high-resolution driver elements. These are enriched for regulatory motifs, evolutionarily-conserved nucleotides, and disease-associated genetic variants from genome-wide association studies. Overall, HiDRA provides a high-throughput, high-resolution approach for dissecting regulatory regions and driver nucleotides.