
Multi-tissue integrative analysis of personal epigenomes

Joel Rozowsky, +104 more
- 26 Apr 2021 - 
Abstract
Evaluating the impact of genetic variants on transcriptional regulation is a central goal in biological science that has been constrained by reliance on a single reference genome. To address this, we constructed phased, diploid genomes for four cadaveric donors (using long-read sequencing) and systematically charted noncoding regulatory elements and transcriptional activity across more than 25 tissues from these donors. Integrative analysis revealed over a million variants with allele-specific activity, coordinated, locus-scale allelic imbalances, and structural variants impacting proximal chromatin structure. We relate the personal genome analysis to the ENCODE encyclopedia, annotating allele- and tissue-specific elements that are strongly enriched for variants impacting expression and disease phenotypes. These experimental and statistical approaches, and the corresponding EN-TEx resource, provide a framework for personalized functional genomics.

Joel Rozowsky 1,2; Jorg Drenkow 3; Yucheng T Yang 1,2; Gamze Gursoy 1,2; Timur Galeev 1,2;
Beatrice Borsari 4; Charles B Epstein 5; Kun Xiong 1,2; Jinrui Xu 1,2; Jiahao Gao 1,2; Keyang Yu 6;
Ana Berthel 1,2; Zhanlin Chen 1,2; Fabio Navarro 1,2; Jason Liu 1,2; Maxwell S Sun 1,2;
James Wright 7; Justin Chang 1,2; Christopher JF Cameron 1,2; Noam Shoresh 5; Elizabeth Gaskell 5;
Jessika Adrian 8; Sergey Aganezov 9; Gabriela Balderrama-Gutierrez 10; Samridhi Banskota 5;
Guillermo Barreto Corona 5; Sora Chee 11; Surya B Chhetri 12; Gabriel Conte Cortez Martins 1,2;
Cassidy Danyko 3; Carrie A Davis 3; Daniel Farid 1,2; Nina P Farrell 5; Idan Gabdank 8;
Yoel Gofin 6; David U Gorkin 11; Mengting Gu 1,2; Vivian Hecht 5; Benjamin C Hitz 8;
Robbyn Issner 5; Melanie Kirsche 9; Xiangmeng Kong 1,2; Bonita R Lam 8; Shantao Li 1,2;
Bian Li 1,2; Tianxiao Li 1,2; Xiqi Li 6; Khine Zin Lin 8; Ruibang Luo 13; Mark Mackiewicz 14;
Jill E Moore 15; Jonathan Mudge 16; Nicholas Nelson 5; Chad Nusbaum 5; Ioann Popov 1,2;
Henry E Pratt 15; Yunjiang Qiu 11; Srividya Ramakrishnan 9; Joe Raymond 5;
Leonidas Salichos 1,2,17; Alexandra Scavelli 3; Jacob M Schreiber 18; Fritz J Sedlazeck 9,19,20;
Lei Hoon See 3; Rachel M Sherman 9; Xu Shi 1,2; Minyi Shi 8; Cricket Alicia Sloan 8;
J Seth Strattan 8; Zhen Tan 1,2; Forrest Y Tanaka 8; Anna Vlasova 4,21,22; Jun Wang 1,2;
Jonathan Werner 3; Brian Williams 23; Min Xu 1,2; Chengfei Yan 1,2; Lu Yu 7;
Christopher Zaleski 3; Jing Zhang 1,2,24; J Michael Cherry 8; Eric M Mendenhall 12;
William S Noble 18; Zhiping Weng 15; Morgan E Levine 1,25; Alexander Dobin 3; Barbara Wold 23;
Ali Mortazavi 10; Bing Ren 11; Jesse Gillis 3; Richard M Myers 14; Michael P Snyder 8;
Jyoti Choudhary 7; Aleksandar Milosavljevic 6; Michael C Schatz 9,19; Roderic Guigó 4,26;
Bradley E Bernstein 5,27; Thomas R Gingeras 3; Mark Gerstein 1,2
1 - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
2 - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
3 - Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
4 - Centre for Genomic Regulation, The Barcelona Institute of Science and Technology, Barcelona, Catalonia, Spain
5 - Broad Institute of MIT and Harvard, Cambridge, MA, USA
6 - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
7 - Institute of Cancer Research, London, UK
8 - Department of Genetics, School of Medicine, Stanford University, Palo Alto, CA, USA
9 - Departments of Computer Science and Biology, Johns Hopkins University, Baltimore, MD, USA
10 - Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, USA
11 - Ludwig Institute for Cancer Research, University of California, San Diego, La Jolla, CA, USA
12 - Biological Sciences, University of Alabama in Huntsville, Huntsville, AL, USA
13 - Department of Computer Science, The University of Hong Kong, Hong Kong, CHN
14 - HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA
15 - Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA, USA
16 - European Bioinformatics Institute, Cambridge, Cambridgeshire, GB
17 - Department of Biological and Chemical Sciences, New York Institute of Technology, Old Westbury, NY, USA
18 - Department of Genome Sciences, University of Washington, Seattle, WA, USA
19 - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
20 - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
21 - Comparative Genomics Group, Life Science Programme, Barcelona Supercomputing Centre, Barcelona, Spain
22 - Institute of Research in Biomedicine, Barcelona, Spain
23 - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
24 - Department of Computer Science, University of California, Irvine, CA, USA
25 - Department of Pathology, Yale University School of Medicine, New Haven, CT, USA
26 - Universitat Pompeu Fabra, Barcelona, Catalonia, Spain
27 - Department of Pathology and Center for Cancer Research, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
bioRxiv preprint doi: https://doi.org/10.1101/2021.04.26.441442; this version posted April 26, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

The Human Genome Project assembled one representative haploid genome sequence 20 years
ago (1). Since then, millions of individual genomes have been sequenced (2). Compared to the
reference, a personal genome typically contains ~4.5 million variants (3). Understanding their
functional impact is a fundamental question in biology and medicine. To this end, researchers
have conducted many genome-wide association studies (GWASes) and expression quantitative
trait loci (eQTL) analyses, associating genetic variants with changes in gene expression and
phenotypic traits (4). In particular, the Genotype-Tissue Expression (GTEx) project performed
RNA-seq on >40 human tissues from nearly 1000 individuals, allowing for the identification of
>175K eQTLs (5, 6). In complementary fashion, the Encyclopedia of DNA Elements (ENCODE)
project was initiated in 2003 to identify and annotate genomic regions (7). During the ensuing
decades, the project utilized functional genomic techniques to chart the transcriptional and
epigenomic landscapes of numerous human tissues and cell lines, producing a catalog of
candidate cis-regulatory elements (cCREs) on the reference genome (8-10). These are widely
used for predicting the impact of genetic variants (10-13). However, there is a lack of one-to-one
correspondence between this epigenetic annotation, based on the generic reference genome,
and genetic variants, which fundamentally relate to an individual's personal genome.
To overcome this limitation, we initiated the EN-TEx study (ENCODE assays applied to GTEx
samples) to connect personal genomes and functional genomics. First, we built the diploid
genomes for each of four individuals with long-read sequencing. Second, for each individual, we
uniformly carried out a full range of functional genomic assays for 25 tissues, resulting in >1,500
datasets for histone modifications, gene expression, protein abundance, and three-dimensional
genome structure. These raw data were processed in relation to each individual's personal
genome, making the interpretation of genetic variants more direct.
In particular, by using an individual’s diploid genome, heterozygous loci can distinguish reads
that arise from each haplotype, assigning distinct molecular signals (e.g., RNA expression or TF
binding) to each. The imbalance between the haplotypes can be accurately measured by taking
the wild-type allele as a baseline, avoiding biological and technical biases, and if the imbalance
is statistically significant, the heterozygous variant is termed allele-specific (AS). AS variants
have been determined in numerous previous studies (14-20). (Note that only some AS variants
are causal for the observed changes, such as those directly affecting TF-binding sites on one
haplotype.)
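
As a minimal sketch of how such an imbalance could be tested (an illustration only, not the authors' pipeline, whose statistical details are given in the Supplement), one can assume that reads overlapping a heterozygous site are split between the two haplotypes under a binomial null with probability 0.5. The function names and the 0.05 threshold below are hypothetical.

```python
from math import comb


def binomial_two_sided_p(k: int, n: int) -> float:
    """Exact two-sided binomial p-value under a null allelic ratio of 0.5.

    Sums the probability of every outcome that is no more likely than
    observing k reads out of n on haplotype 1. Integer weights are used
    so that the result stays exact until the final division.
    """
    weights = [comb(n, i) for i in range(n + 1)]
    extreme = sum(w for w in weights if w <= weights[k])
    return extreme / 2**n


def is_allele_specific(hap1_reads: int, hap2_reads: int, alpha: float = 0.05) -> bool:
    """Call a site AS when the haplotype imbalance is significant (assumed threshold)."""
    n = hap1_reads + hap2_reads
    if n == 0:
        return False  # no reads, nothing to test
    return binomial_two_sided_p(hap1_reads, n) < alpha
```

Under these assumptions, a site with 30 versus 10 reads would be called AS, whereas one with 11 versus 9 reads would not. A production analysis would also need a multiple-testing correction and a model of overdispersion, both of which are omitted here.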
Personal genomes & matched data matrix
Phasing & SVs. We sequenced the genomes of four individuals from the GTEx cohort (identified
as 1 through 4), with a variety of sequencing technologies (10x Genomics linked-read, Illumina,
and PacBio). After calling single-nucleotide variants (SNVs) and small insertions and deletions,
we integrated the haplotype information from linked-reads and proximal ligation sequencing
(Hi-C) to phase the variants (Fig. S1.1) (21). This step generated large blocks of phased variants
across the genome, which were stitched together, forming phased personal genomes for the
four individuals (Fig. 1A). We further determined the paternal/maternal origin of the phased
segments by checking the AS expression levels of known imprinted loci (Fig. 1A and
Supplement). For individuals 2 and 3, we also identified 17,649 and 18,542 structural variants,
respectively (SVs, greater than 50 bp; Supplement, Fig. 1B & S1.2), incorporating them into their
personal genomes. We found that the SVs tended to be short (<1 kb), to be insertions, to be
depleted in most functional regions (e.g., exons and cCREs), and to have typical allele-frequency
spectra, all of which agree with previous findings (Fig. S1.2) (22, 23).
Diploid Mappings from >1500 Experiments. Next, we carried out a comprehensive set of 1635
experiments on the four individuals (i.e., ChIP-seq, ATAC-seq, Hi-C, DNase-seq, whole-genome
bisulfite sequencing [WGBS], short and long-read RNA-seq, eCLIP, and labeled proteomic
mass-spectrometry; Fig. 1D & S1.3a). All our datasets were processed according to both the
personal diploid and reference genomes, giving rise to three mappings and signal tracks for
each assay (maternal and paternal haplotypes and the reference; Fig. S1.4). When we applied
strict mapping criteria (in terms of allowed mismatches) we found ~2.5% more reads mapped to
the personal genomes than to the reference (Supplement). The increase was smaller in
annotated regions (genes and cCREs) than in the genome overall. Still, mapping to the personal
rather than the reference genome affects gene expression quantification (e.g., it better resolves
the expression levels of immune-related genes; Fig. S1.4).
Measuring AS activity in diverse assays
(RNA/ChIP/ATAC/DNase)-seq. For the assays making up the bulk of the dataset, AS
measurement involves the direct comparison of the number of mapped reads at a locus
containing heterozygous SNVs (hetSNVs), and we report the number of significantly imbalanced
hetSNVs relative to accessible hetSNVs (i.e., hetSNVs with enough sequencing depth to be
able to detect statistically significant imbalances; Fig. 2A and Supplement). We performed these
calculations uniformly on a large scale with a standard pipeline, making possible consistent call-
set comparison, and with reads mapped to personal genomes, avoiding reference and
ambiguous-mapping biases (Fig. 2 & Supplement) (7, 8, 17, 24-27). We also developed
alternate call sets, including "high-power" ones based on joint calling across tissues (Fig.
S2.1e). As shown in Fig. 2D, we consistently detected ~800 AS hetSNVs per sample, about 3%
of the potential 27.5K accessible hetSNVs.
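
The notion of an accessible hetSNV can be illustrated with a small calculation. Assuming (hypothetically) an exact two-sided binomial test with a null ratio of 0.5, the smallest attainable p-value at read depth n is 2 x 0.5^n, so a site is testable only if that value falls below the chosen threshold. The threshold and function names are assumptions made for illustration; the criteria actually used are described in the Supplement. The last function simply reproduces the ~3% figure from the counts quoted above.

```python
def min_attainable_p(depth: int) -> float:
    """Smallest two-sided p-value possible at this depth (all reads on one haplotype)."""
    return min(1.0, 2.0 ** (1 - depth))


def is_accessible(depth: int, alpha: float = 0.05) -> bool:
    """A hetSNV is 'accessible' here if even a complete imbalance could reach significance."""
    return depth > 0 and min_attainable_p(depth) < alpha


def as_fraction(n_as: int, n_accessible: int) -> float:
    """Fraction of accessible hetSNVs that are called AS."""
    return n_as / n_accessible


# Counts quoted in the text: ~800 AS hetSNVs out of ~27.5K accessible ones.
print(f"{as_fraction(800, 27_500):.1%}")
```

With an assumed threshold of 0.05 and no multiple-testing correction, a site needs at least six reads to be accessible in this sense; a stricter threshold raises the required depth.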
WGBS, Hi-C & Proteomics. For three assays, we had to assess AS activity in a specialized
fashion. For WGBS, we accounted for base changes at potentially methylated CpG
sites (Supplement and Fig. S2.2). We identified ~130K AS methylation events per sample. For
Hi-C, we mapped the reads onto the personal genomes and generated haplotype-resolved
contact matrices, partitioning the Hi-C contacts in which at least one of the contacting regions
was AS into AS interactions (Supplement & Fig. S2.3a). Of the ~6.5M interactions per sample on
average, ~500K showed significant AS behavior (Fig. S2.3). Finally, for proteomics, we mapped
peptides directly to the personal genomes, calling AS peptides in a manner consistent with the
processing for AS RNA-seq (in contrast to other approaches (28-30)); in total, we found 2,028
potential AS peptides (Fig. S4.4c and Supplement).
Aggregating AS events, forming a catalog
AS Elements. In addition to determining AS activity at the SNV level, it is possible to pool the
reads from multiple phased SNVs into a single genomic element, allowing the determination of
AS elements (cCREs and genes, Fig. 2A and Fig. S3.1). In particular, for each individual and
tissue, 182 cCREs and 351 genes showed a significant AS imbalance per assay; further
aggregating across individuals resulted in ~400 AS elements per tissue (Fig. 2D). When
comparing the resulting list with genes associated with specific diseases, we found sensible
correlations; for example, TSHR, TG, and PAX8, which are associated with hyperthyroidism,
showed AS behavior in thyroid (more examples in Supplement).
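
The pooling step can be sketched as follows. Because the hetSNVs are phased, the reads assigned to haplotype 1 at different sites within one element can be summed, and likewise for haplotype 2, giving a single pair of counts per element. The record type, field names, and coordinates are hypothetical, and the sketch assumes each read has already been assigned to one site; a real pipeline would need to avoid double-counting reads that overlap several hetSNVs.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class HetSNV:
    position: int     # coordinate of the phased heterozygous site
    hap1_reads: int   # reads assigned to haplotype 1
    hap2_reads: int   # reads assigned to haplotype 2


def pool_element_counts(start: int, end: int, snvs: Iterable[HetSNV]) -> Tuple[int, int]:
    """Sum haplotype read counts over hetSNVs located in the half-open interval [start, end)."""
    inside = [s for s in snvs if start <= s.position < end]
    hap1 = sum(s.hap1_reads for s in inside)
    hap2 = sum(s.hap2_reads for s in inside)
    return hap1, hap2


def haplotype1_fraction(hap1: int, hap2: int) -> Optional[float]:
    """Share of pooled reads on haplotype 1, or None when the element has no reads."""
    total = hap1 + hap2
    return hap1 / total if total else None


snvs = [HetSNV(105, 8, 2), HetSNV(150, 7, 3), HetSNV(400, 1, 9)]
pooled = pool_element_counts(100, 200, snvs)  # only the first two sites fall in the element
```

The pooled counts can then be tested for imbalance in the same way as the counts at a single hetSNV, with more reads and therefore more power.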
Tissue & Assay Merging. Next, we merged the 25 tissues, using a simple union of the tissue-
specific AS call sets, detecting ~5.5K unique AS hetSNVs (for either binding or expression) and
~1K AS genomic elements for each individual per assay (Fig. 2C). Pooling the reads from each
assay across all tissues dramatically increased (by >5X) the detection power, making it possible
to identify ~27K AS hetSNVs per assay for each individual (Fig. 2D). Finally, merging across all
assays provided a catalog of all loci where AS activity could be assessed in any of the tissues of
the four individuals (Fig. 2D). For (RNA/ChIP/ATAC)-seq, the catalog contains 232K unique AS
hetSNVs and 37K AS elements (28K cCREs and 9K genes, occurring in at least one donor and
assay). The number of AS hetSNVs increases ~2-fold (to 0.5M hetSNVs) when aggregating
across tissues by pooling all the available reads. When AS sites from DNase-seq and
methylation are added, the total number of hetSNVs increases to 1.3M (a large number relative
to previous efforts; Supplement).
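
The two merging strategies described in this section differ in what is combined. The union merges call sets that were each made within one tissue, whereas pooling sums the reads first and makes one call on the combined counts, which increases the depth at every site. A minimal sketch of both, with hypothetical tissue names, SNV identifiers, and data layout:

```python
from typing import Dict, Set, Tuple

Counts = Tuple[int, int]  # (haplotype 1 reads, haplotype 2 reads)


def union_calls(calls_by_tissue: Dict[str, Set[str]]) -> Set[str]:
    """Union of tissue-specific AS call sets."""
    merged: Set[str] = set()
    for calls in calls_by_tissue.values():
        merged |= calls
    return merged


def pool_counts(counts_by_tissue: Dict[str, Dict[str, Counts]]) -> Dict[str, Counts]:
    """Sum per-haplotype read counts for each hetSNV across all tissues."""
    pooled: Dict[str, Counts] = {}
    for counts in counts_by_tissue.values():
        for snv, (hap1, hap2) in counts.items():
            prev1, prev2 = pooled.get(snv, (0, 0))
            pooled[snv] = (prev1 + hap1, prev2 + hap2)
    return pooled


calls = {"thyroid": {"snvA"}, "lung": {"snvA", "snvB"}}
counts = {
    "thyroid": {"snvC": (4, 1)},
    "lung": {"snvC": (5, 1)},
    "spleen": {"snvC": (4, 2), "snvD": (3, 3)},
}
merged_calls = union_calls(calls)
pooled_counts = pool_counts(counts)
```

In this toy example, snvC has too few reads in any single tissue to support a confident call, but its pooled counts of 13 versus 4 give a test far more to work with. This is the intuition behind the gain in detection power from pooling.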
Mining the catalog
Rare Variants. After constructing the catalog, we mined it for features associated with AS
activity. First, consistent with previous studies (17, 18, 26, 31), we found that AS elements,
particularly distal ones, were under less purifying selection (depleted in rare variants) than non-
AS ones (Fig. 3A & S4.2). That said, a substantial number of AS variants are rare (8,294 and
2,961 for binding and expression, respectively; Supplement). Moreover, 14 of these were
deleterious and pathogenic, based on cross-referencing with ClinGen/ClinVar (Supplement) (32).
Model. We built a deep-learning model to predict whether a hetSNV position in an individual is
AS in a particular assay based solely on the surrounding nucleotide sequence (33). In particular,
the model was trained as a binary classifier in one individual and was then used to make
predictions for non-shared hetSNVs in another (Supplement). As shown in Fig. 3B, the CTCF model
performs better than the models for other assays (e.g., RNA-seq) and attaches higher
importance to the central region surrounding the hetSNV, perhaps because of the well-defined
CTCF binding motif.
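
A sequence-only classifier of this kind needs the nucleotides around each hetSNV in numeric form. The sketch below shows one common input encoding (one-hot over A, C, G, T) for a window centered on the variant. It illustrates the input representation only: the architecture, window size, and encoding actually used are described in the Supplement and in (33), and the function names and flank length here are hypothetical.

```python
from typing import List

BASES = "ACGT"


def extract_window(sequence: str, center: int, flank: int) -> str:
    """Return the bases from center - flank to center + flank (inclusive, 0-based)."""
    start, end = center - flank, center + flank + 1
    if start < 0 or end > len(sequence):
        raise ValueError("window extends beyond the sequence")
    return sequence[start:end]


def one_hot(window: str) -> List[List[int]]:
    """Encode each base as a 4-element indicator vector; unknown bases (e.g., N) become all zeros."""
    return [[1 if base == b else 0 for b in BASES] for base in window.upper()]


window = extract_window("AACGTTT", center=3, flank=2)  # hetSNV at index 3
encoded = one_hot(window)
```

Each window becomes a matrix with one row per position and one column per base, which is the usual input shape for convolutional sequence models.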