Showing papers by "Manolis Kellis" published in 2016


Journal ArticleDOI
TL;DR: HaploReg v4 greatly expands the number of chromatin state maps to 127 reference epigenomes, and a use case is illustrated for attention deficit hyperactivity disorder (ADHD)-associated SNPs with putative brain regulatory mechanisms.
Abstract: More than 90% of common variants associated with complex traits do not affect proteins directly, but instead the circuits that control gene expression. This has increased the urgency of understanding the regulatory genome as a key component for translating genetic results into mechanistic insights and ultimately therapeutics. To address this challenge, we developed HaploReg (http://compbio.mit.edu/HaploReg) to aid the functional dissection of genome-wide association study (GWAS) results, the prediction of putative causal variants in haplotype blocks, the prediction of likely cell types of action, and the prediction of candidate target genes by systematic mining of comparative, epigenomic and regulatory annotations. Since first launching the website in 2011, we have greatly expanded HaploReg, increasing the number of chromatin state maps to 127 reference epigenomes from ENCODE 2012 and Roadmap Epigenomics, incorporating regulator binding data, expanding regulatory motif disruption annotations, and integrating expression quantitative trait locus (eQTL) variants and their tissue-specific target genes from GTEx, Geuvadis, and other recent studies. We present these updates as HaploReg v4, and illustrate a use case of HaploReg for attention deficit hyperactivity disorder (ADHD)-associated SNPs with putative brain regulatory mechanisms.

720 citations


Journal ArticleDOI
TL;DR: A comprehensive resource of 394 cell type– and tissue-specific gene regulatory networks for human, each specifying the genome-wide connectivity among transcription factors, enhancers, promoters and genes is developed.
Abstract: Mapping perturbed molecular circuits that underlie complex diseases remains a great challenge. We developed a comprehensive resource of 394 cell type- and tissue-specific gene regulatory networks for human, each specifying the genome-wide connectivity among transcription factors, enhancers, promoters and genes. Integration with 37 genome-wide association studies (GWASs) showed that disease-associated genetic variants--including variants that do not reach genome-wide significance--often perturb regulatory modules that are highly specific to disease-relevant cell types or tissues. Our resource opens the door to systematic analysis of regulatory programs across hundreds of human cell types and tissues (http://regulatorycircuits.org).

282 citations


01 Mar 2016
TL;DR: In this article, the authors developed a comprehensive resource of 394 cell type and tissue-specific gene regulatory networks for human, each specifying the genome-wide connectivity among transcription factors, enhancers, promoters and genes.
Abstract: Mapping perturbed molecular circuits that underlie complex diseases remains a great challenge. We developed a comprehensive resource of 394 cell type– and tissue-specific gene regulatory networks for human, each specifying the genome-wide connectivity among transcription factors, enhancers, promoters and genes. Integration with 37 genome-wide association studies (GWASs) showed that disease-associated genetic variants—including variants that do not reach genome-wide significance—often perturb regulatory modules that are highly specific to disease-relevant cell types or tissues. Our resource opens the door to systematic analysis of regulatory programs across hundreds of human cell types and tissues (http://regulatorycircuits.org).

203 citations


Journal ArticleDOI
TL;DR: Sharpr-MPRA combines dense tiling of overlapping MPRA constructs with a probabilistic graphical model to recognize functional regulatory nucleotides, and to distinguish activating and repressive nucleotides using their inferred contribution to reporter gene expression.
Abstract: Massively parallel reporter assays (MPRAs) enable nucleotide-resolution dissection of transcriptional regulatory regions, such as enhancers, but only few regions at a time. Here we present a combined experimental and computational approach, Systematic high-resolution activation and repression profiling with reporter tiling using MPRA (Sharpr-MPRA), that allows high-resolution analysis of thousands of regions simultaneously. Sharpr-MPRA combines dense tiling of overlapping MPRA constructs with a probabilistic graphical model to recognize functional regulatory nucleotides, and to distinguish activating and repressive nucleotides, using their inferred contribution to reporter gene expression. We used Sharpr-MPRA to test 4.6 million nucleotides spanning 15,000 putative regulatory regions tiled at 5-nucleotide resolution in two human cell types. Our results recovered known cell-type-specific regulatory motifs and evolutionarily conserved nucleotides, and distinguished known activating and repressive motifs. Our results also showed that endogenous chromatin state and DNA accessibility are both predictive of regulatory function in reporter assays, identified retroviral elements with activating roles, and uncovered 'attenuator' motifs with repressive roles in active chromatin.

130 citations


Journal ArticleDOI
25 Mar 2016-Science
TL;DR: A computational, structure-based approach was developed to evaluate TF variants for their impact on DNA binding activity, and universal protein-binding microarrays were used to assay sequence-specific DNA binding activity across 41 reference and 117 variant alleles found in individuals of diverse ancestries and families with Mendelian diseases.
Abstract: Sequencing of exomes and genomes has revealed abundant genetic variation affecting the coding sequences of human transcription factors (TFs), but the consequences of such variation remain largely unexplored. We developed a computational, structure-based approach to evaluate TF variants for their impact on DNA binding activity and used universal protein-binding microarrays to assay sequence-specific DNA binding activity across 41 reference and 117 variant alleles found in individuals of diverse ancestries and families with Mendelian diseases. We found 77 variants in 28 genes that affect DNA binding affinity or specificity and identified thousands of rare alleles likely to alter the DNA binding activity of human sequence-specific TFs. Our results suggest that most individuals have unique repertoires of TF DNA binding activities, which may contribute to phenotypic variation.

122 citations


Journal ArticleDOI
Pim van der Harst, Jessica van Setten, Niek Verweij, Georg Vogler, +182 more (54 institutions)
TL;DR: A genome-wide association meta-analysis of 4 QRS traits in up to 73,518 individuals of European ancestry provides new insights into genes and biological pathways controlling myocardial mass and may help identify novel therapeutic targets.

109 citations


Journal ArticleDOI
TL;DR: Several steps in the SEP discovery workflow are optimized to improve SEP isolation and identification, leading to the detection of several new human SEPs (novel human genes), improved confidence in the SEP assignments, and quantification of SEPs under different cellular conditions.
Abstract: Computational, genomic, and proteomic approaches have been used to discover nonannotated protein-coding small open reading frames (smORFs). Some novel smORFs have crucial biological roles in cells and organisms, which motivates the search for additional smORFs. Proteomic smORF discovery methods are advantageous because they detect smORF-encoded polypeptides (SEPs) to validate smORF translation and SEP stability. Because SEPs are shorter and less abundant than average proteins, SEP detection using proteomics faces unique challenges. Here, we optimize several steps in the SEP discovery workflow to improve SEP isolation and identification. These changes have led to the detection of several new human SEPs (novel human genes), improved confidence in the SEP assignments, and enabled quantification of SEPs under different cellular conditions. These improvements will allow faster detection and characterization of new SEPs and smORFs.

105 citations


Journal ArticleDOI
10 May 2016-eLife
TL;DR: It is shown that enhancers significantly overlap known loci associated with the cardiac QT interval and QRS duration, and it is demonstrated that these 'sub-threshold' signals represent novel loci, and that epigenomic maps are effective at discriminating true biological signals from noise.
Abstract: Genetic variants identified by genome-wide association studies explain only a modest proportion of heritability, suggesting that meaningful associations lie 'hidden' below current thresholds. Here, we integrate information from association studies with epigenomic maps to demonstrate that enhancers significantly overlap known loci associated with the cardiac QT interval and QRS duration. We apply functional criteria to identify loci associated with QT interval that do not meet genome-wide significance and are missed by existing studies. We demonstrate that these 'sub-threshold' signals represent novel loci, and that epigenomic maps are effective at discriminating true biological signals from noise. We experimentally validate the molecular, gene-regulatory, cellular and organismal phenotypes of these sub-threshold loci, demonstrating that most sub-threshold loci have regulatory consequences and that genetic perturbation of nearby genes causes cardiac phenotypes in mouse. Our work provides a general approach for improving the detection of novel loci associated with complex human traits.

104 citations


Journal ArticleDOI
TL;DR: A new Bayesian model RiVIERA (Risk Variant Inference using Epigenomic Reference Annotations) is introduced for inference of driver variants from summary statistics across multiple traits using hundreds of epigenomic annotations to improve the functional enrichments compared to single-trait models.
Abstract: Genome-wide association studies (GWAS) provide a powerful approach for uncovering disease-associated variants in human, but fine-mapping the causal variants remains a challenge. This is partly remedied by prioritization of disease-associated variants that overlap GWAS-enriched epigenomic annotations. Here, we introduce a new Bayesian model RiVIERA (Risk Variant Inference using Epigenomic Reference Annotations) for inference of driver variants from summary statistics across multiple traits using hundreds of epigenomic annotations. In simulation, RiVIERA showed promising power in detecting causal variants and causal annotations, and the multi-trait joint inference further improved the detection power. We applied RiVIERA to model the existing GWAS summary statistics of 9 autoimmune diseases and Schizophrenia by jointly harnessing the potential causal enrichments among 848 tissue-specific epigenomics annotations from ENCODE/Roadmap consortium covering 127 cell/tissue types and 8 major epigenomic marks. RiVIERA identified meaningful tissue-specific enrichments for enhancer regions defined by H3K4me1 and H3K27ac for Blood T-Cell specifically in the nine autoimmune diseases and Brain-specific enhancer activities exclusively in Schizophrenia. Moreover, the variants from the 95% credible sets exhibited high conservation and enrichments for GTEx whole-blood eQTLs located within transcription-factor-binding-sites and DNA-hypersensitive-sites. Furthermore, jointly modeling the nine immune traits by simultaneously inferring and exploiting the underlying epigenomic correlation between traits further improved the functional enrichments compared to single-trait models.

87 citations


01 Oct 2016
TL;DR: The results showed that endogenous chromatin state and DNA accessibility are both predictive of regulatory function in reporter assays, and identified retroviral elements with activating roles, and uncovered 'attenuator' motifs with repressive roles in active chromatin.
Abstract: Massively parallel reporter assays (MPRAs) enable nucleotide-resolution dissection of transcriptional regulatory regions, such as enhancers, but only few regions at a time. Here we present a combined experimental and computational approach, Systematic high-resolution activation and repression profiling with reporter tiling using MPRA (Sharpr-MPRA), that allows high-resolution analysis of thousands of regions simultaneously. Sharpr-MPRA combines dense tiling of overlapping MPRA constructs with a probabilistic graphical model to recognize functional regulatory nucleotides, and to distinguish activating and repressive nucleotides, using their inferred contribution to reporter gene expression. We used Sharpr-MPRA to test 4.6 million nucleotides spanning 15,000 putative regulatory regions tiled at 5-nucleotide resolution in two human cell types. Our results recovered known cell-type-specific regulatory motifs and evolutionarily conserved nucleotides, and distinguished known activating and repressive motifs. Our results also showed that endogenous chromatin state and DNA accessibility are both predictive of regulatory function in reporter assays, identified retroviral elements with activating roles, and uncovered 'attenuator' motifs with repressive roles in active chromatin.

87 citations


Posted ContentDOI
09 Sep 2016-bioRxiv
TL;DR: The utility of this large compendium of cis-eQTLs for understanding the tissue-specific etiology of complex traits, including coronary artery disease is demonstrated.
Abstract: Expression quantitative trait locus (eQTL) mapping provides a powerful means to identify functional variants influencing gene expression and disease pathogenesis. We report the identification of cis-eQTLs from 7,051 post-mortem samples representing 44 tissues and 449 individuals as part of the Genotype-Tissue Expression (GTEx) project. We find a cis-eQTL for 88% of all annotated protein-coding genes, with one-third having multiple independent effects. We identify numerous tissue-specific cis-eQTLs, highlighting the unique functional impact of regulatory variation in diverse tissues. By integrating large-scale functional genomics data and state-of-the-art fine-mapping algorithms, we identify multiple features predictive of tissue-specific and shared regulatory effects. We improve estimates of cis-eQTL sharing and effect sizes using allele specific expression across tissues. Finally, we demonstrate the utility of this large compendium of cis-eQTLs for understanding the tissue-specific etiology of complex traits, including coronary artery disease. The GTEx project provides an exceptional resource that has improved our understanding of gene regulation across tiss

Journal ArticleDOI
TL;DR: This work uses soft X-ray tomography (SXT) to image chromatin organization, distribution, and biophysical properties during neurogenesis in vivo and reveals that chromatin with similar biophysical properties forms an elaborate connected network throughout the entire nucleus.

Journal ArticleDOI
TL;DR: In this paper, comparative genomic evidence across 21 Anopheles mosquitoes and 20 Drosophila species was used to identify evolutionary signatures of conserved, functional readthrough of 353 stop codons.
Abstract: Translational stop codon readthrough emerged as a major regulatory mechanism affecting hundreds of genes in animal genomes, based on recent comparative genomics and ribosomal profiling evidence, but its evolutionary properties remain unknown. Here, we leverage comparative genomic evidence across 21 Anopheles mosquitoes to systematically annotate readthrough genes in the malaria vector Anopheles gambiae, and to provide the first study of abundant readthrough evolution, by comparison with 20 Drosophila species. Using improved comparative genomics methods for detecting readthrough, we identify evolutionary signatures of conserved, functional readthrough of 353 stop codons in the malaria vector, Anopheles gambiae, and of 51 additional Drosophila melanogaster stop codons, including several cases of double and triple readthrough and of readthrough of two adjacent stop codons. We find that most differences between the readthrough repertoires of the two species arose from readthrough gain or loss in existing genes, rather than birth of new genes or gene death; that readthrough-associated RNA structures are sometimes gained or lost while readthrough persists; that readthrough is more likely to be lost at TAA and TAG stop codons; and that readthrough is under continued purifying evolutionary selection in mosquito, based on population genetic evidence. We also determine readthrough-associated gene properties that predate readthrough, and identify differences in the characteristic properties of readthrough genes between clades. We estimate more than 600 functional readthrough stop codons in mosquito and 900 in fruit fly, provide evidence of readthrough control of peroxisomal targeting, and refine the phylogenetic extent of abundant readthrough as following divergence from centipede.

Posted Content
TL;DR: A generalized graph alignment formulation that considers both matches and mismatches in a standard QAP formulation is proposed; the resulting method significantly outperforms other methods in the alignment of regular graph structures, which is one of the most difficult graph alignment cases.
Abstract: Graph alignment refers to the problem of finding a bijective mapping across vertices of two graphs such that, if two nodes are connected in the first graph, their images are connected in the second graph. This problem arises in many fields such as computational biology, social sciences, and computer vision and is often cast as a quadratic assignment problem (QAP). Most standard graph alignment methods consider an optimization that maximizes the number of matches between the two graphs, ignoring the effect of mismatches. We propose a generalized graph alignment formulation that considers both matches and mismatches in a standard QAP formulation. This modification can have a major impact in aligning graphs with different sizes and heterogenous edge densities. Moreover, we propose two methods for solving the generalized graph alignment problem based on spectral decomposition of matrices. We compare the performance of proposed methods with some existing graph alignment algorithms including Natalie2, GHOST, IsoRank, NetAlign, Klau's approach as well as a semidefinite programming-based method over various synthetic and real graph models. Our proposed method based on simultaneous alignment of multiple eigenvectors leads to consistently good performance in different graph models. In particular, in the alignment of regular graph structures which is one of the most difficult graph alignment cases, our proposed method significantly outperforms other methods.
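
The distinction the abstract draws between counting only matches and also accounting for mismatches can be illustrated with a minimal sketch. This is not the authors' spectral method: it is a brute-force toy for tiny graphs, and the function names and the `mismatch_penalty` weight are hypothetical choices made here for illustration.

```python
from itertools import permutations

import numpy as np


def alignment_score(A, B, perm, mismatch_penalty=0.0):
    """Score the vertex mapping i -> perm[i] between graphs A and B.

    A, B: symmetric 0/1 adjacency matrices of the same size.
    A match is an edge of A mapped onto an edge of B. A mismatch is an
    edge mapped onto a non-edge, or a non-edge mapped onto an edge.
    mismatch_penalty is a hypothetical weight, not taken from the abstract.
    """
    perm = np.asarray(perm)
    # Re-index B so that entry (i, j) holds B[perm[i], perm[j]].
    B_mapped = B[np.ix_(perm, perm)]
    # Use the strict upper triangle so each unordered pair is counted once.
    upper = np.triu_indices_from(A, k=1)
    a, b = A[upper], B_mapped[upper]
    matches = int(np.sum((a == 1) & (b == 1)))
    mismatches = int(np.sum(a != b))
    return matches - mismatch_penalty * mismatches, matches, mismatches


def best_alignment(A, B, mismatch_penalty=0.0):
    """Exhaustively search all bijections; feasible only for tiny graphs."""
    n = A.shape[0]
    return max(
        permutations(range(n)),
        key=lambda p: alignment_score(A, B, p, mismatch_penalty)[0],
    )


def graph(n, edges):
    """Build a symmetric adjacency matrix from an edge list."""
    M = np.zeros((n, n), dtype=int)
    for i, j in edges:
        M[i, j] = M[j, i] = 1
    return M


path = graph(4, [(0, 1), (1, 2), (2, 3)])
relabeled_path = graph(4, [(2, 0), (0, 3), (3, 1)])
complete = graph(4, [(i, j) for i in range(4) for j in range(i + 1, 4)])
```

With `mismatch_penalty=0.0`, aligning `path` to `complete` scores as well as aligning it to `relabeled_path`, because every path edge lands on some edge of the denser graph. A positive penalty separates the two cases, which is the kind of effect the abstract attributes to heterogeneous edge densities.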

01 Sep 2016
TL;DR: It is found that most differences between the readthrough repertoires of the two species arose from readthrough gain or loss in existing genes, rather than birth of new genes or gene death; that readthrough-associated RNA structures are sometimes gained or lost while readthrough persists; and that readthrough is under continued purifying evolutionary selection in mosquito, based on population genetic evidence.
Abstract: Translational stop codon readthrough emerged as a major regulatory mechanism affecting hundreds of genes in animal genomes, based on recent comparative genomics and ribosomal profiling evidence, but its evolutionary properties remain unknown. Here, we leverage comparative genomic evidence across 21 Anopheles mosquitoes to systematically annotate readthrough genes in the malaria vector Anopheles gambiae, and to provide the first study of abundant readthrough evolution, by comparison with 20 Drosophila species. Using improved comparative genomics methods for detecting readthrough, we identify evolutionary signatures of conserved, functional readthrough of 353 stop codons in the malaria vector, Anopheles gambiae, and of 51 additional Drosophila melanogaster stop codons, including several cases of double and triple readthrough and of readthrough of two adjacent stop codons. We find that most differences between the readthrough repertoires of the two species arose from readthrough gain or loss in existing genes, rather than birth of new genes or gene death; that readthrough-associated RNA structures are sometimes gained or lost while readthrough persists; that readthrough is more likely to be lost at TAA and TAG stop codons; and that readthrough is under continued purifying evolutionary selection in mosquito, based on population genetic evidence. We also determine readthrough-associated gene properties that predate readthrough, and identify differences in the characteristic properties of readthrough genes between clades. We estimate more than 600 functional readthrough stop codons in mosquito and 900 in fruit fly, provide evidence of readthrough control of peroxisomal targeting, and refine the phylogenetic extent of abundant readthrough as following divergence from centipede.

Journal ArticleDOI
TL;DR: In this paper, a prospective case-control study was performed on human subjects to characterize the differential expression of mRNA and miRNA in unruptured cerebral aneurysms in comparison with healthy superficial temporal arteries (STA).
Abstract: OBJECTIVE The molecular mechanisms behind cerebral aneurysm formation and rupture remain poorly understood. In the past decade, microRNAs (miRNAs) have been shown to be key regulators in a host of biological processes. They are noncoding RNA molecules, approximately 21 nucleotides long, that posttranscriptionally inhibit mRNAs by attenuating protein translation and promoting mRNA degradation. The miRNA and mRNA interactions and expression levels in cerebral aneurysm tissue from human subjects were profiled. METHODS A prospective case-control study was performed on human subjects to characterize the differential expression of mRNA and miRNA in unruptured cerebral aneurysms in comparison with control tissue (healthy superficial temporal arteries [STA]). Ion Torrent was used for deep RNA sequencing. Affymetrix miRNA microarrays were used to analyze miRNA expression, whereas NanoString nCounter technology was used for validation of the identified targets. RESULTS Overall, 7 unruptured cerebral aneurysm and 10 STA specimens were collected. Several differentially expressed genes were identified in aneurysm tissue, with MMP-13 (fold change 7.21) and various collagen genes (COL1A1, COL5A1, COL5A2) being among the most upregulated. In addition, multiple miRNAs were significantly differentially expressed, with miR-21 (fold change 16.97) being the most upregulated, and miR-143-5p (fold change -11.14) being the most downregulated. From these, miR-21, miR-143, and miR-145 had several significantly anticorrelated target genes in the cohort that are associated with smooth muscle cell function, extracellular matrix remodeling, inflammation signaling, and lipid accumulation. All these processes are crucial to the pathophysiology of cerebral aneurysms. CONCLUSIONS This analysis identified differentially expressed genes and miRNAs in unruptured human cerebral aneurysms, suggesting the possibility of a role for miRNAs in aneurysm formation. Further investigation for their importance as therapeutic targets is needed.


Journal ArticleDOI
TL;DR: In this article, the authors identified and replicated an association for a genetic variant on chromosome 5q22 with 36% increased risk of death in subjects with heart failure (rs9885413, P = 2.7 × 10⁻⁹).
Abstract: Failure of the human heart to maintain sufficient output of blood for the demands of the body, heart failure, is a common condition with high mortality even with modern therapeutic alternatives. To identify molecular determinants of mortality in patients with new-onset heart failure, we performed a meta-analysis of genome-wide association studies and follow-up genotyping in independent populations. We identified and replicated an association for a genetic variant on chromosome 5q22 with 36% increased risk of death in subjects with heart failure (rs9885413, P = 2.7 × 10⁻⁹). We provide evidence from reporter gene assays, computational predictions and epigenomic marks that this polymorphism increases activity of an enhancer region active in multiple human tissues. The polymorphism was further reproducibly associated with a DNA methylation signature in whole blood (P = 4.5 × 10⁻⁴⁰) that also associated with allergic sensitization and expression in blood of the cytokine TSLP (P = 1.1 × 10⁻⁴). Knockdown of the transcription factor predicted to bind the enhancer region (NHLH1) in a human cell line (HEK293) expressing NHLH1 resulted in lower TSLP expression. In addition, we observed evidence of recent positive selection acting on the risk allele in populations of African descent. Our findings provide novel genetic leads to factors that influence mortality in patients with heart failure.

01 May 2016
TL;DR: This work identified and replicated an association for a genetic variant on chromosome 5q22 with 36% increased risk of death in subjects with heart failure and observed evidence of recent positive selection acting on the risk allele in populations of African descent.
Abstract: Failure of the human heart to maintain sufficient output of blood for the demands of the body, heart failure, is a common condition with high mortality even with modern therapeutic alternatives. To identify molecular determinants of mortality in patients with new-onset heart failure, we performed a meta-analysis of genome-wide association studies and follow-up genotyping in independent populations. We identified and replicated an association for a genetic variant on chromosome 5q22 with 36% increased risk of death in subjects with heart failure (rs9885413, P = 2.7 × 10⁻⁹). We provide evidence from reporter gene assays, computational predictions and epigenomic marks that this polymorphism increases activity of an enhancer region active in multiple human tissues. The polymorphism was further reproducibly associated with a DNA methylation signature in whole blood (P = 4.5 × 10⁻⁴⁰) that also associated with allergic sensitization and expression in blood of the cytokine TSLP (P = 1.1 × 10⁻⁴). Knockdown of the transcription factor predicted to bind the enhancer region (NHLH1) in a human cell line (HEK293) expressing NHLH1 resulted in lower TSLP expression. In addition, we observed evidence of recent positive selection acting on the risk allele in populations of African descent. Our findings provide novel genetic leads to factors that influence mortality in patients with heart failure.

Posted ContentDOI
30 Dec 2016-bioRxiv
TL;DR: A convergence framework for recurrence analysis of non-coding mutations using three-dimensional co-localization of epigenomically-identified regions is presented and the PLCB4 plexus and its ability to affect the canonical PI3K cancer pathway is experimentally validated.
Abstract: Cancer sequencing predicts driver genes using recurrent protein-altering mutations, but detecting recurrence for non-coding mutations remains unsolved. Here, we present a convergence framework for recurrence analysis of non-coding mutations, using three-dimensional co-localization of epigenomically-defined regions. We define the regulatory plexus of each gene as its cell-type-specific three-dimensional gene-regulatory neighborhood, inferred using Hi-C chromosomal interactions and chromatin state annotations. Using 16 matched tumor-normal prostate transcriptomes, we predict tumor-upregulated genes, and find enriched plexus mutations in distal regulatory regions normally repressed in prostate, suggesting out-of-context de-repression. Using 55 matched tumor-normal prostate genomes, we predict 15 driver genes by convergence of dispersed, low-frequency mutations into high-frequency dysregulatory events along prostate-specific plexi, controlling for mutational heterogeneity across regions, chromatin states, and patients. These play roles in growth signaling, immune evasion, mitochondrial function, and vascularization, suggesting higher-order pathway-level convergence. We experimentally validate the PLCB4 plexus and its ability to affect the canonical PI3K cancer pathway.

Posted ContentDOI
09 Sep 2016-bioRxiv
TL;DR: These analyses provide a comprehensive characterization of trans-eQTLs across human tissues, which contribute to an improved understanding of the tissue-specific cellular mechanisms of regulatory genetic variation.
Abstract: Understanding the genetics of gene regulation provides information on the cellular mechanisms through which genetic variation influences complex traits. Expression quantitative trait loci, or eQTLs, are enriched for polymorphisms that have been found to be associated with disease risk. While most analyses of human data have focused on regulation of expression by nearby variants (cis-eQTLs), distal or trans-eQTLs may have broader effects on the transcriptome and important phenotypic consequences, necessitating a comprehensive study of the effects of genetic variants on distal gene transcription levels. In this work, we identify trans-eQTLs in the Genotype Tissue Expression (GTEx) project data, consisting of 449 individuals with RNA-sequencing data across 44 tissue types. We find 81 genes with a trans-eQTL in at least one tissue, and we demonstrate that trans-eQTLs are more likely than cis-eQTLs to have effects specific to a single tissue. We evaluate the genomic and functional properties of trans-eQTL variants, identifying strong enrichment in enhancer elements and Piwi-interacting RNA clusters. Finally, we describe three tissue-specific regulatory loci underlying relevant disease associations: 9q22 in thyroid that has a role in thyroid cancer, 5q31 in skeletal muscle, and a previously reported master regulator near KLF14 in adipose. These analyses provide a comprehensive characterization of trans-eQTLs across human tissues, which contribute to an improved understanding of the tissue-specific cellular mechanisms of regulatory genetic variation.

Posted Content
TL;DR: Network Infusion (NI) infers the source of a signal propagated through a network by using a diffusion kernel that approximates standard diffusion models well but lends itself to inversion, by design, via likelihood maximization or error minimization.
Abstract: Several significant models have been developed that enable the study of diffusion of signals across biological, social and engineered networks. Within these established frameworks, the inverse problem of identifying the source of the propagated signal is challenging, owing to the numerous alternative possibilities for signal progression through the network. In real world networks, the challenge of determining sources is compounded as the true propagation dynamics are typically unknown, and when they have been directly measured, they rarely conform to the assumptions of any of the well-studied models. In this paper we introduce a method called Network Infusion (NI) that has been designed to circumvent these issues, making source inference practical for large, complex real world networks. The key idea is that to infer the source node in the network, full characterization of diffusion dynamics, in many cases, may not be necessary. This objective is achieved by creating a diffusion kernel that well-approximates standard diffusion models, but lends itself to inversion, by design, via likelihood maximization or error minimization. We apply NI for both single-source and multi-source diffusion, for both single-snapshot and multi-snapshot observations, and for both homogeneous and heterogeneous diffusion setups. We prove the mean-field optimality of NI for different scenarios, and demonstrate its effectiveness over several synthetic networks. Moreover, we apply NI to a real-data application, identifying news sources in the Digg social network, and demonstrate the effectiveness of NI compared to existing methods. Finally, we propose an integrative source inference framework that combines NI with a distance centrality-based method, which leads to a robust performance in cases where the underlying dynamics are unknown.

Posted ContentDOI
11 Apr 2016-bioRxiv
TL;DR: This work uses epigenomic annotations across 127 tissues and cell types to investigate weak regulatory associations, the specific enhancers they reside in, their downstream target genes, their upstream regulators, and the biological pathways they disrupt in eight common diseases.
Abstract: For most complex traits, known genetic associations explain only a small fraction of the narrow-sense heritability, prompting intense debate on the genetic basis of complex traits. Joint analysis of all common variants together explains much of this missing heritability and reveals that large numbers of weakly associated loci are enriched in regulatory regions, but fails to identify specific regions or biological pathways. Here, we use epigenomic annotations across 127 tissues and cell types to investigate weak regulatory associations, the specific enhancers they reside in, their downstream target genes, their upstream regulators, and the biological pathways they disrupt in eight common diseases. We show that weak associations are significantly enriched in disease-relevant regulatory regions across thousands of independent loci. We develop methods to control for LD between weak associations and overlap between annotations. We show that weak non-coding associations are additionally enriched in relevant biological pathways, implicating additional downstream target genes and upstream disease-specific master regulators. Our results can help guide the discovery of biologically meaningful, but currently undetectable, regulatory loci underlying a number of common diseases.

Posted ContentDOI
16 Jun 2016-bioRxiv
TL;DR: A new Bayesian model, RiVIERA-beta (Risk Variant Inference using Epigenomic Reference Annotations), is introduced for inferring driver variants by modelling summary-statistic p-values with a Beta density function across multiple traits using hundreds of epigenomic annotations, and is applied to GWAS summary statistics of 9 autoimmune diseases and schizophrenia.
Abstract: Genome-wide association studies (GWAS) provide a powerful approach for uncovering disease-associated variants in humans, but fine-mapping the causal variants remains a challenge. This is partly remedied by prioritization of disease-associated variants that overlap GWAS-enriched epigenomic annotations. Here, we introduce a new Bayesian model, RiVIERA-beta (Risk Variant Inference using Epigenomic Reference Annotations), for inference of driver variants by modelling summary-statistic p-values with a Beta density function across multiple traits using hundreds of epigenomic annotations. In simulation, RiVIERA-beta showed promising power in detecting causal variants and causal annotations, and the multi-trait joint inference further improved the detection power. We applied RiVIERA-beta to model the existing GWAS summary statistics of 9 autoimmune diseases and schizophrenia by jointly harnessing the potential causal enrichments among 848 tissue-specific epigenomic annotations from the ENCODE/Roadmap consortium covering 127 cell/tissue types and 8 major epigenomic marks. RiVIERA-beta identified meaningful tissue-specific enrichments for enhancer regions defined by H3K4me1 and H3K27ac for Blood T-Cell specifically in the 9 autoimmune diseases and Brain-specific enhancer activities exclusively in schizophrenia. Moreover, the variants from the 95% credible sets exhibited high conservation and enrichments for GTEx whole-blood eQTLs located within transcription-factor-binding-sites and DNA-hypersensitive-sites. Furthermore, jointly modeling the nine immune traits by simultaneously inferring and exploiting the underlying epigenomic correlation between traits further improved the functional enrichments compared to single-trait models.

Journal ArticleDOI
TL;DR: The proposed SwiSpot approach is capable of identifying the switching sequence inside a putative, complete riboswitch sequence, on the basis of pairing behaviors, which are evaluated on proper sets of configurations and is able to model the switching behavior of riboswitches whose generated ensemble covers both alternate configurations.
Abstract: Motivation: Riboswitches are cis-regulatory elements in mRNA, mostly found in Bacteria, which exhibit two main secondary structure conformations. Although one of them prevents the gene from being expressed, the other conformation allows its expression, and this switching process is typically driven by the presence of a specific ligand. Although there are a handful of known riboswitches, our knowledge in this field has been greatly limited due to our inability to identify their alternate structures from their sequences. Indeed, current methods are not able to predict the presence of the two functionally distinct conformations just from the knowledge of the plain RNA nucleotide sequence. Whether this would be possible, for which cases, and what prediction accuracy can be achieved, are currently open questions. Results: Here we show that the two alternate secondary structures of riboswitches can be accurately predicted once the ‘switching sequence’ of the riboswitch has been properly identified. The proposed SwiSpot approach is capable of identifying the switching sequence inside a putative, complete riboswitch sequence, on the basis of pairing behaviors, which are evaluated on proper sets of configurations. Moreover, it is able to model the switching behavior of riboswitches whose generated ensemble covers both alternate configurations. Beyond structural predictions, the approach can also be paired to homology-based riboswitch searches. Availability and Implementation: SwiSpot software, along with the reference dataset files, is available at: http://www.iet.unipi.it/a.bechini/swispot/ Supplementary information: Supplementary data are available at Bioinformatics online. Contact: a.bechini@ing.unipi.it

Posted ContentDOI
16 Jun 2016-bioRxiv
TL;DR: The general benefits of the proposed integrative framework in elucidating meaningful tissue-specific epigenomic elements from large-scale correlated annotations and the implicated functional variants for future experimental interrogation are demonstrated.
Abstract: Dissecting the physiological circuitry underlying diverse human complex traits associated with heritable common mutations is an ongoing effort. The primary challenge involves identifying the relevant cell types and the causal variants among the vast majority of the associated mutations in the noncoding regions. To address this challenge, we developed an efficient probabilistic framework. First, we propose a sparse group-guided learning algorithm to infer cell-type-specific enrichments. Second, we propose a fine-mapping Bayesian model that incorporates as Bayesian priors the sparse enrichments to infer risk variants. Using the proposed framework to analyze 32 complex human traits revealed meaningful tissue-specific epigenomic enrichments indicative of the relevant disease pathologies. The prioritized variants exhibit prominent tissue-specific epigenomic signatures and significant enrichments for eQTL and conserved elements. Together, we demonstrate the general benefits of the proposed integrative framework in elucidating meaningful tissue-specific epigenomic elements from large-scale correlated annotations and the implicated functional variants for future experimental interrogation.

Posted ContentDOI
16 Jun 2016-bioRxiv
TL;DR: Jointly modeling multiple traits confers further improvement over the single-trait mode of the same model, which is attributable to the more robust estimation of the enrichment parameters, especially when the annotation measurements are noisy.
Abstract: Fine-mapping causal variants is challenging due to linkage disequilibrium and the lack of interpretation of noncoding mutations. Existing fine-mapping methods do not scale well on inferring multiple causal variants per locus and causal variants across multiple related diseases. Moreover, many complex traits are not only genetically related but also potentially share causal mechanisms. We develop a novel integrative Bayesian fine-mapping model named RiVIERA-MT. The key features of RiVIERA-MT include 1) the ability to model epigenomic covariance of multiple related traits; 2) efficient posterior inference of causal configuration; 3) efficient full Bayesian inference of enrichment parameters, allowing incorporation of a large number of functional annotations; 4) simultaneous modeling of the underlying heritability parameters. We conducted comprehensive simulation studies using 1000 Genomes and ENCODE/Roadmap epigenomic data to demonstrate that RiVIERA-MT compares quite favorably with existing methods. In particular, the efficient inference of multiple causal variants per locus led to significantly improved estimation of causal posterior and functional enrichments compared to the state-of-the-art fine-mapping methods. Furthermore, jointly modeling multiple traits confers further improvement over the single-trait mode of the same model, which is attributable to the more robust estimation of the enrichment parameters, especially when the annotation measurements (i.e., ChIP-seq) themselves are noisy. We applied RiVIERA-MT to separately and jointly model 7 well-powered GWAS traits including body mass index, coronary artery disease, four lipid traits, and type 2 diabetes. To leverage potential tissue-specific epigenomic co-enrichments among these traits, we harness 52 baseline functional annotations and 220 tissue-specific epigenomic annotations from well-characterized cell types compiled from the ENCODE/Roadmap consortium. Overall, we observed improved enrichments for GTEx whole blood and tissue-specific eQTL SNPs based on the prioritized SNPs by RiVIERA-MT compared to existing methods.

Posted Content
TL;DR: Network Maximal Correlation (NMC) is introduced as a multivariate measure of nonlinear association among random variables, defined via an optimization that infers transformations of variables by maximizing aggregate inner products between transformed variables.
Abstract: We introduce Network Maximal Correlation (NMC) as a multivariate measure of nonlinear association among random variables. NMC is defined via an optimization that infers transformations of variables by maximizing aggregate inner products between transformed variables. For finite discrete and jointly Gaussian random variables, we characterize a solution of the NMC optimization using basis expansion of functions over appropriate basis functions. For finite discrete variables, we propose an algorithm based on alternating conditional expectation to determine NMC. Moreover, we propose a distributed algorithm to compute an approximation of NMC for large and dense graphs using graph partitioning. For finite discrete variables, we show that the probability of discrepancy greater than any given level between NMC and NMC computed using empirical distributions decays exponentially fast as the sample size grows. For jointly Gaussian variables, we show that under some conditions the NMC optimization is an instance of the Max-Cut problem. We then illustrate an application of NMC in inference of graphical models for bijective functions of jointly Gaussian variables. Finally, we show NMC's utility in a data application of learning nonlinear dependencies among genes in a cancer dataset.
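
Illustrative note: the alternating-conditional-expectation idea mentioned in the abstract can be sketched for the simplest setting only, two finite discrete variables with a known joint distribution, where the quantity computed is the classical pairwise maximal correlation. This is not the authors' algorithm or code. The function names, the initialization, the fixed iteration count, and the assumption that every marginal probability is positive are all choices made here for illustration.

```python
import math

def standardize(values, probs):
    """Shift and scale values to zero mean and unit variance under probs.
    Returns None when the values are (numerically) constant."""
    mean = sum(p * v for p, v in zip(probs, values))
    centered = [v - mean for v in values]
    var = sum(p * c * c for p, c in zip(probs, centered))
    if var < 1e-12:
        return None
    std = math.sqrt(var)
    return [c / std for c in centered]

def maximal_correlation(joint, iters=200):
    """Pairwise maximal correlation of two discrete variables by alternating
    conditional expectations. joint[x][y] is P(X=x, Y=y); all marginals are
    assumed strictly positive (hypothetical helper, not from the paper)."""
    nx, ny = len(joint), len(joint[0])
    px = [sum(joint[x]) for x in range(nx)]
    py = [sum(joint[x][y] for x in range(nx)) for y in range(ny)]
    # Arbitrary non-constant starting transformation of Y.
    g = standardize([float(y) for y in range(ny)], py)
    if g is None:
        return 0.0
    rho = 0.0
    for _ in range(iters):
        # f(x) = E[g(Y) | X = x], then standardize.
        f = standardize(
            [sum(joint[x][y] * g[y] for y in range(ny)) / px[x] for x in range(nx)],
            px,
        )
        if f is None:
            return 0.0  # no dependence left to exploit
        # g(y) = E[f(X) | Y = y], then standardize.
        g_new = standardize(
            [sum(joint[x][y] * f[x] for x in range(nx)) / py[y] for y in range(ny)],
            py,
        )
        if g_new is None:
            return 0.0
        g = g_new
        rho = sum(joint[x][y] * f[x] * g[y] for x in range(nx) for y in range(ny))
    return rho

# Y = X^2 with X uniform on {-1, 0, 1}: linear correlation is zero,
# but a nonlinear transformation of X predicts Y exactly.
square = [[0.0, 1 / 3], [1 / 3, 0.0], [0.0, 1 / 3]]
rho_square = maximal_correlation(square)
```

In this toy case the linear (Pearson) correlation between X and Y is zero, while the transformed variables are perfectly correlated, which is the kind of nonlinear dependency the abstract refers to. Extending the sketch to many variables on a graph, as NMC does, would require the aggregate objective described in the abstract and is not attempted here.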

Posted ContentDOI
15 Apr 2016-bioRxiv
TL;DR: Recombination valleys show increased DNA methylation, reduced double-stranded break initiation, and increased repair efficiency, specifically in the lineage leading to the germ line, providing a potential molecular mechanism facilitating their maintenance by exclusion of recombination events.
Abstract: Human recombination rate varies greatly, but the forces shaping it remain incompletely understood. Here, we study the relationship between recombination rate and gene-regulatory domains defined by a gene and its linked control elements. We define these links using methylation quantitative trait loci (meQTLs), expression quantitative trait loci (eQTLs), chromatin conformation, and correlated activity across cell types. Each link type shows a “recombination valley” of significantly-reduced recombination rate compared to control regions, indicating preferential co-inheritance of genes and linked regulatory elements as a single unit. This recombination valley is most pronounced for gene-regulatory domains of early embryonic developmental genes, housekeeping genes, and constitutive regulatory elements, which are known to show increased evolutionary constraint across species. Recombination valleys show increased DNA methylation, reduced double-stranded break initiation, and increased repair efficiency, specifically in the lineage leading to the germ line, providing a potential molecular mechanism facilitating their maintenance by exclusion of recombination events.

Posted ContentDOI
03 May 2016-bioRxiv
TL;DR: Comparisons between Anopheles and Drosophila allow us to transcend the static picture provided by single-clade analysis to explore the evolutionary dynamics of abundant readthrough and find that most differences between the readthrough repertoires of the two species are due to readthrough gain or loss in existing genes, rather than to birth of new genes or to gene death.
Abstract: Translational stop codon readthrough was virtually unknown in eukaryotic genomes until recent developments in comparative genomics and new experimental techniques revealed evidence of readthrough in hundreds of fly genes and several human, worm, and yeast genes. Here, we use the genomes of 21 species of Anopheles mosquitoes and improved comparative techniques to identify evolutionary signatures of conserved, functional readthrough of 353 stop codons in the malaria vector, Anopheles gambiae, and 51 additional Drosophila melanogaster stop codons, with several cases of double and triple readthrough including readthrough of two adjacent stop codons, supporting our earlier prediction of abundant readthrough in pancrustacea genomes. Comparisons between Anopheles and Drosophila allow us to transcend the static picture provided by single-clade analysis to explore the evolutionary dynamics of abundant readthrough. We find that most differences between the readthrough repertoires of the two species are due to readthrough gain or loss in existing genes, rather than to birth of new genes or to gene death; that RNA structures are sometimes gained or lost while readthrough persists; and that readthrough is more likely to be lost at TAA and TAG stop codons. We also determine which characteristic properties of readthrough predate readthrough and which are clade-specific. We estimate that there are more than 600 functional readthrough stop codons in A. gambiae and 900 in D. melanogaster. We find evidence that readthrough is used to regulate peroxisomal targeting in two genes. Finally, we use the sequenced centipede genome to refine the phylogenetic extent of abundant readthrough.