Identification of methylation haplotype blocks aids in deconvolution of heterogeneous tissue samples and tumor tissue-of-origin mapping from plasma DNA

Abstract
Adjacent CpG sites in mammalian genomes can be co-methylated owing to the processivity of methyltransferases or demethylases, yet discordant methylation patterns have also been observed, which are related to stochastic or uncoordinated molecular processes. We focused on a systematic search and investigation of regions in the full human genome that show highly coordinated methylation. We defined 147,888 blocks of tightly coupled CpG sites, called methylation haplotype blocks, after analysis of 61 whole-genome bisulfite sequencing data sets and validation with 101 reduced-representation bisulfite sequencing data sets and 637 methylation array data sets. Using a metric called methylation haplotype load, we performed tissue-specific methylation analysis at the block level. Subsets of informative blocks were further identified for deconvolution of heterogeneous samples. Finally, using methylation haplotypes we demonstrated quantitative estimation of tumor load and tissue-of-origin mapping in the circulating cell-free DNA of 59 patients with lung or colorectal cancer.


UC San Diego
UC San Diego Previously Published Works
Title
Identification of methylation haplotype blocks aids in deconvolution of heterogeneous tissue
samples and tumor tissue-of-origin mapping from plasma DNA.
Permalink
https://escholarship.org/uc/item/0nj3h020
Journal
Nature genetics, 49(4)
ISSN
1061-4036
Authors
Guo, Shicheng
Diep, Dinh
Plongthongkum, Nongluk
et al.
Publication Date
2017-04-01
DOI
10.1038/ng.3805
Peer reviewed

Identification of methylation haplotype blocks aids in
deconvolution of heterogeneous tissue samples and tumor
tissue-of-origin mapping from plasma DNA
Shicheng Guo¹,³, Dinh Diep¹,³, Nongluk Plongthongkum¹, Ho-Lim Fung¹, Kang Zhang², and Kun Zhang¹,²,*
¹ Department of Bioengineering, University of California at San Diego, La Jolla, California, USA
² Institute for Genomic Medicine, University of California at San Diego, La Jolla, California, USA
Abstract
Adjacent CpG sites in mammalian genomes can be co-methylated due to the processivity of
methyltransferases or demethylases. Yet discordant methylation patterns have also been observed,
and have been related to stochastic or uncoordinated molecular processes. We focused on a systematic
search and investigation of regions in the full human genome that exhibit highly coordinated
methylation. We defined 147,888 blocks of tightly coupled CpG sites, called methylation
haplotype blocks (MHBs) with 61 sets of whole genome bisulfite sequencing (WGBS) data, and
further validated with 101 sets of reduced representation bisulfite sequencing (RRBS) data and
637 sets of methylation array data. Using a metric called methylation haplotype load (MHL), we
performed tissue-specific methylation analysis at the block level. Subsets of informative blocks
were further identified for deconvolution of heterogeneous samples. Finally, we demonstrated
quantitative estimation of tumor load and tissue-of-origin mapping in the circulating cell-free
DNA of 59 cancer patients using methylation haplotypes.
HHS Public Access
Author manuscript
Nat Genet. Author manuscript; available in PMC 2017 September 06.
Published in final edited form as: Nat Genet. 2017 April; 49(4): 635–642. doi:10.1038/ng.3805.
Users may view, print, copy, and download text and data-mine the content in such documents, for the purposes of academic research,
subject always to the full Conditions of use: http://www.nature.com/authors/editorial_policies/license.html#terms
* Corresponding author: Kun Zhang, kzhang@bioeng.ucsd.edu.
³ Equally contributed authors.
Author Contributions
Ku.Z. conceived the initial concept and oversaw the study. S.G., D.D. and Ku.Z. performed bioinformatics analyses. N.P., D.D., and
H.F. performed experiments. Ka.Z. contributed normal plasma samples. Ku.Z., S.G. and D.D. wrote the manuscript with inputs from
all co-authors.
Competing Financial interests
S. Guo, D. Diep and Ku. Zhang were listed as inventors in patent applications related to the methods disclosed in this manuscript.
Ku.Z. is a co-founder and scientific advisor of Singlera Genomics Inc.
Introduction
Mammalian CpG methylation is a relatively stable epigenetic modification, which can be
transmitted across cell division [1] through DNMT1, and dynamically established or removed
by DNMT3A/B and TET proteins. Due to the locally coordinated activities of these
enzymes, adjacent CpG sites on the same DNA molecules can share similar methylation
status, although discordant CpG methylation has been observed, especially in cancer [2]. The
theoretical framework of linkage disequilibrium [3], which was developed to model the
co-segregation of adjacent genetic variants on human chromosomes in human populations, can
be applied to the analysis of CpG co-methylation in cell populations. A number of studies
related to the concepts of methylation haplotypes [4], epi-alleles [5], or epi-haplotypes [6]
have been reported, albeit at small numbers of genomic regions or limited numbers of
cell/tissue types. Recent data production efforts, especially by large consortia [7], have
produced a large number of whole-genome, base-resolution bisulfite sequencing data sets for
many tissue and cell types. These public data sets, in combination with additional WGBS data
generated in this study, allowed us to perform full-genome characterization of locally coupled
CpG methylation across the largest set of human tissue types available to date, and annotate
these blocks of co-methylated CpGs as a distinct set of genomic features.
DNA methylation is cell-type specific, and the pattern can be harnessed for analyzing the
relative cell composition of heterogeneous samples, such as different white blood cells in
whole blood [8], fetal components in maternal circulating cell-free DNA (cfDNA) [9], or
circulating tumor DNA (ctDNA) in plasma [9]. Most of these recent efforts rely on the
methylation level of individual CpG sites, and are fundamentally limited by the technical
noise and sensitivity in measuring single CpG methylation. Recently, Lehmann-Werman et al.
demonstrated superior sensitivity with multi-CpG haplotypes in detecting tissue-specific
signatures in cfDNA [10], although that work was based on the sparse genome coverage of
Illumina 450k methylation arrays (HM450K). Here we performed an exhaustive search of
tissue-specific methylation haplotype blocks across the full genome, and proposed a block-level
metric, termed methylation haplotype load (MHL), for a systematic discovery of informative
markers. Applying our analytic framework and identified markers, we demonstrated accurate
determination of tissue origin and prediction of cancer status in clinical plasma samples
from patients with lung cancer (LC) and colorectal cancer (CRC) (Fig. 1a).
Results
Identification and characterization of methylation haplotype blocks
To investigate the co-methylation status of adjacent CpG sites along single DNA molecules,
we extended the concept of genetic linkage disequilibrium [3,4] and the r² metric to quantify
the degree of coupled CpG methylation among different DNA molecules. The CpG methylation
status of multiple CpG sites in single- or paired-end Illumina sequencing reads was
extracted to form methylation haplotypes, and the pairwise “linkage disequilibrium” of CpG
methylation, r², was calculated from the fractions of different methylation haplotypes (see
Methods).
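
The pairwise calculation described above can be illustrated with a minimal Python sketch. It assumes the standard linkage-disequilibrium definition, r² = D² / (p_i(1 − p_i) p_j(1 − p_j)) with D = p_ij − p_i p_j, applied to methylation haplotypes encoded as strings of '1' (methylated) and '0' (unmethylated). The function name `methylation_r2` and the string encoding are illustrative assumptions, not taken from the Methods.

```python
from collections import Counter


def methylation_r2(haplotypes, i, j):
    """Return r^2 between CpG positions i and j, or None if undefined.

    Assumption: each haplotype is a string over '0'/'1' read from one DNA
    molecule, and every haplotype covers both positions i and j.
    """
    # Count the four possible two-site haplotypes: 00, 01, 10, 11.
    pairs = Counter((h[i], h[j]) for h in haplotypes)
    n = sum(pairs.values())
    if n == 0:
        return None
    p_i = (pairs[("1", "0")] + pairs[("1", "1")]) / n   # methylated fraction at site i
    p_j = (pairs[("0", "1")] + pairs[("1", "1")]) / n   # methylated fraction at site j
    p_ij = pairs[("1", "1")] / n                        # fraction methylated at both sites
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        # r^2 is undefined when either site shows no variation in this sketch.
        return None
    d = p_ij - p_i * p_j
    return d * d / denom
```

For example, reads that are either fully methylated or fully unmethylated at both sites give r² = 1, whereas four reads covering all four combinations equally give r² = 0.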
We started with 51 sets of published WGBS data from human primary tissues [11,12], as well
as the H1 human embryonic stem cells, in vitro derived progenitors [13] and human cancer cell
lines [14,15]. We also included an in-house generated WGBS dataset from 10 adult tissues of
one human donor. Across these 61 samples (>2000x combined genome coverage) we
identified a total of ~55 billion methylation haplotype informative reads that cover 58.2% of
autosomal CpGs. The uncovered CpG sites were either in regions with low mappability, or in
CpG-sparse regions where there are too few CpG sites within Illumina read pairs for
deriving informative haplotypes. We partitioned the human genome into blocks of tightly
coupled CpG methylation sites, called methylation haplotype blocks (MHBs, Fig. 1b), using
an r² cutoff of 0.5. We identified 147,888 MHBs with an average size of 95 bp and a minimum
of 3 CpGs per block, which together represent ~0.5% of the human genome that tends to be
tightly co-regulated in epigenetic status at the level of single DNA molecules (Supplementary
Table 1, Supplementary Fig. 1a, b). The majority of CpG sites within the same MHBs are
near perfectly coupled (r² ~1.0) regardless of the sample type. We found that the fraction of
tightly coupled CpG pairs (r² > 0.9, Fig. 1c) slightly decreased over CpG spacing from stem
and progenitor cells (94.8%, mostly cultured cells) to somatic cells (91.2%, mixture of
primary adult tissues) to cancer cells (87.8%, mixture of CRC tissues and LC cell lines). The
loss of LD in cancer cells was validated with an independent WGBS data set from primary
kidney cancer tissues [16] (Supplementary Fig. 2). Although the WGBS data came from
different laboratories that might have batch technical differences, we found that
methylation LD extends further over CpG distance in stem and progenitor cells, which is
consistent with our previous observations on 2,020 CpG islands [4] for cultured cell lines and
with another report [17]. Interestingly, in cancer samples, we observed a reduction of perfectly
coupled CpG pairs, which could be related to the pattern of discordant methylation recently
reported in variable methylation regions (VMR) [2,18]. The cancer-specific decayed MHBs
were enriched for cancer related pathways and functions (Supplementary Table 2).
Nonetheless, the majority of MHBs in cancers still contain tightly coupled CpGs (87.8%),
allowing us to harness the pattern for detecting tumor in plasma. We further validated the
co-methylation of these MHBs in 101 ENCODE RRBS datasets and 637 TCGA HM450K
datasets (Supplementary Note, Supplementary Fig. 3).
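
As a rough illustration of how a genome could be partitioned with an r² cutoff of 0.5 and a minimum of 3 CpGs per block, the following Python sketch extends a block for as long as neighbouring CpG sites stay at or above the cutoff. The adjacent-pair rule, the function name `partition_blocks` and the input layout are assumptions made for this sketch; the exact partitioning rule is given in the Methods, which are not reproduced here.

```python
def partition_blocks(adjacent_r2, cutoff=0.5, min_cpgs=3):
    """Split an ordered run of CpG sites into candidate blocks.

    Assumption: adjacent_r2[k] holds r^2 between CpG k and CpG k+1
    (None where r^2 is undefined), so there are len(adjacent_r2) + 1 CpGs.
    Returns inclusive (first_cpg_index, last_cpg_index) tuples.
    """
    n_cpgs = len(adjacent_r2) + 1
    blocks = []
    start = 0
    for k, r2 in enumerate(adjacent_r2):
        if r2 is None or r2 < cutoff:
            # The link between CpG k and k+1 is broken: close the current run.
            if k - start + 1 >= min_cpgs:
                blocks.append((start, k))
            start = k + 1
    # Close the final run, which ends at the last CpG.
    if n_cpgs - start >= min_cpgs:
        blocks.append((start, n_cpgs - 1))
    return blocks
```

With `adjacent_r2 = [0.9, 0.8, 0.2, 0.95, None, 0.7, 0.6, 0.99]`, the sketch keeps CpGs 0 to 2 and 5 to 8 as blocks and discards the two-CpG run at positions 3 to 4 for being shorter than the minimum.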
Co-localization of methylation haplotype blocks with known regulatory elements
The MHBs established by 61 sets of WGBS data represent a distinct type of genomic feature
that partially overlaps with multiple known genomic elements (Fig. 1d). Among all MHBs,
60,828 (41.1%) were located in intergenic regions, whereas 87,060 (58.9%) were located in
transcribed regions. These MHBs were significantly enriched (P < 1.0 × 10⁻⁶) in enhancers,
super enhancers, promoters, CpG islands and imprinted genes. In addition, we observed modest
depletion in the lamina-associated domains (LAD) [19] and the large organized chromatin K9
modifications (LOCK) regions [20], and modest enrichment in TAD [21]. Importantly, we
observed a strong (26-fold) enrichment in VMR (Fig. 1e), suggesting that increased epigenetic
variability in a cell population or tissue can be coordinated locally among hundreds of
thousands of genomic regions [22]. We further examined a subset of MHBs that do not overlap
with CpG islands, and observed a consistent enrichment pattern (Fig. 1e, Supplementary
Fig. 1c), suggesting that local CpG density alone does not account for the enrichment.
Previous studies on mouse and human [23,24] demonstrated that dynamically methylated
regions were associated with regulatory regions such as enhancer-like regions marked by
H3K27ac and transcription factor binding sites. Using publicly available histone mapping data
for human adult tissues, we found co-localization of methylation haplotype blocks with marks
for active promoters (H3K4me3 with H3K27ac), but not for active enhancers [25] (no peak for
H3K4me1) (Supplementary Fig. 4). We found that enhancers tend to overlap with CpG-sparse
MHBs, whereas the co-localization with super enhancers was independent of CpG
density (Supplementary Fig. 1c). Therefore, MHBs likely capture the local coherent
epigenetic signatures that are directly or indirectly coupled to transcriptional regulation.
Block-level analysis of human normal tissues and stem cell lines with methylation
haplotype load
To enable quantitative analysis of the methylation patterns within individual MHBs across
many samples, we needed a single metric to define the methylation pattern of multiple CpG
sites within each block. Ideally this metric is not only a function of the average methylation
level of all the CpG sites in the block, but can also capture the pattern of co-methylation on
single DNA molecules. Therefore, we defined methylation haplotype load (MHL), a
weighted mean of the fraction of fully methylated haplotypes and substrings at different
lengths (i.e. all possible substrings, see Methods). Compared with other metrics used in the
literature (methylation level, methylation entropy, epi-polymorphism and haplotype
counts), MHL is capable of distinguishing blocks that have the same average methylation
but various degrees of coordinated methylation (Fig. 2). In addition, MHL is bounded
between 0 and 1, which allows for direct comparison of different regions across many data
sets.
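
The behaviour of such a metric can be sketched in Python as follows. The sketch assumes that the weight for substrings of length i is w_i = i, and that the fraction for each length is taken over all substrings of that length pooled across reads; both choices, and the function name `mhl`, are assumptions for illustration, since the exact definition is given in the Methods.

```python
def mhl(haplotypes):
    """Weighted mean of fully methylated substring fractions for one block.

    Assumptions: haplotypes are strings over '0'/'1', one per read;
    the weight of length-i substrings is i.
    """
    if not haplotypes:
        return 0.0
    max_len = max(len(h) for h in haplotypes)
    weighted_sum = 0.0
    weight_total = 0.0
    for i in range(1, max_len + 1):
        total = 0   # number of length-i substrings across all reads
        full = 0    # number of those that are fully methylated
        target = "1" * i
        for h in haplotypes:
            for s in range(len(h) - i + 1):
                total += 1
                if h[s:s + i] == target:
                    full += 1
        if total:
            weighted_sum += i * (full / total)
            weight_total += i
    return weighted_sum / weight_total if weight_total else 0.0
```

Under these assumptions, a block covered by equal numbers of `1111` and `0000` reads and a block covered by equal numbers of `1010` and `0101` reads both have 50% average methylation, but the first scores 0.5 and the second scores 0.05, which is the kind of separation between coordinated and uncoordinated methylation that the text attributes to MHL.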
We next asked whether treating MHBs as individual genomic features and performing
quantitative analysis based on MHL would provide an advantage over previous approaches
using individual CpG sites or weighted (or unweighted) averaging of multiple CpG sites in
certain genomic windows. Therefore, we clustered 65 WGBS data sets (including 4
additional colon and lung cancer WGBS sets [26]) from human solid tissues based on MHL.
Unsupervised clustering with the 15% most variable MHBs showed that, regardless of the
data sources, samples of the same tissue origin clustered together (Fig. 3a), while cancer
samples and stem cell samples exhibited distinct patterns from human adult tissues. PCA
analysis on all MHBs yielded a similar pattern (Supplementary Fig. 5). To identify a subset
of MHBs for effective clustering of human somatic tissues, we calculated a tissue specific
index (TSI) for each MHB. Feature selection using random forest identified a set of 1,365
tissue-specific MHBs (Supplementary Table 3) that can predict tissue type at an accuracy of
0.89 (95% CI: 0.84–0.93), although several tissue types share rather similar cell
compositions (e.g. muscle vs. heart). Using these MHBs, we compared the performance
between MHL, average methylation fraction (AMF) in the MHBs and all individual CpG
methylation fraction (IMF). MHL and AMF provided similar tissue specificity, while MHL
had lower noise (background: 0.29, 95% CI: 0.23–0.35) compared with AMF (background:
0.4, 95% CI: 0.32–0.48). Clustering based on individual CpGs in the blocks had the worst
performance, which might be due to higher biological or technical variability of individual
CpG sites (Fig. 3c). Thus, block-level analysis based on MHL is advantageous over single
CpG or local averaging of multiple CpG sites in distinguishing tissue types.
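
The first step of the unsupervised analysis, keeping the most variable 15% of blocks before clustering, can be sketched in Python as below. Ranking by population variance of MHL across samples, the function name `most_variable_blocks` and the dictionary input are assumptions for this sketch; the variability measure actually used is not specified in this section.

```python
from statistics import pvariance


def most_variable_blocks(mhl_by_block, fraction=0.15):
    """Return the block ids with the highest MHL variance across samples.

    Assumption: mhl_by_block maps a block id to a list of MHL values,
    one per sample, in the same sample order for every block.
    """
    ranked = sorted(
        mhl_by_block,
        key=lambda block: pvariance(mhl_by_block[block]),
        reverse=True,
    )
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]
```

The selected blocks would then be passed as features to whichever clustering routine is used; the sketch covers only the selection step.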
The human adult tissues that we used have various degrees of similarity amongst each other.
We hypothesized that this is primarily defined by their developmental lineage, and that the
related MHBs might reveal epigenetic insights relevant to germ layer specification. We
searched for MHBs that have differential MHL among data sets from the three germ layers.
In total we identified 114 ectoderm-specific MHBs (99 hyper- and 15 hypo-methylated), 75
endoderm-specific MHBs (58 hyper- and 17 hypo-methylated) and 31 mesoderm-specific
MHBs (9 hyper- and 22 hypo-methylated) (Supplementary Table 4). Cluster analysis based
on layer-specific MHBs shows expected aggregation among tissues of the same lineage (Fig.