Unsupervised pattern discovery in human chromatin structure through genomic segmentation

doi:10.1145/2506583.2506701

Proceedings Article

Michael M. Hoffman¹, Orion J. Buske¹, Jie Wang², Zhiping Weng³, Jeff A. Bilmes¹, William Stafford Noble¹

University of Washington¹, University at Buffalo², University of Massachusetts Medical School³

22 Sep 2013-Vol. 9, Iss: 5, pp 813

TL;DR: An integrative method is developed to identify patterns from multiple experiments simultaneously while taking full advantage of high-resolution data, discovering joint patterns across different assay types, and yields a model which elucidates the relationship between assay observations and functional elements in the genome.

Abstract: Sequence census methods like ChIP-seq now produce an unprecedented amount of genome-anchored data. We have developed an integrative method to identify patterns from multiple experiments simultaneously while taking full advantage of high-resolution data, discovering joint patterns across different assay types. We apply this method to ENCODE chromatin data for the human chronic myeloid leukemia cell line K562, including ChIP-seq data on covalent histone modifications and transcription factor binding, and DNase-seq and FAIRE-seq readouts of open chromatin. In an unsupervised fashion, we identify patterns associated with transcription start sites, gene ends, enhancers, CTCF elements, and repressed regions. The method yields a model which elucidates the relationship between assay observations and functional elements in the genome. This model identifies sequences likely to affect transcription, and we verify these predictions in laboratory experiments. We have made software and an integrative genome browser track freely available (noble.gs.washington.edu/proj/segway/).
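To make the idea of segmenting a genome from several signal tracks at once more concrete, here is a minimal sketch. It is not the authors' method: the abstract does not describe the model, so the sketch substitutes a small hidden Markov model with independent Gaussian emissions per track and decodes it with the Viterbi algorithm. All names, parameter values, and the toy signal are assumptions made for illustration.

```python
import math


def gaussian_logpdf(x, mean, std):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def viterbi_segment(tracks, means, stds, log_trans, log_start):
    """Label each position with its most likely hidden state.

    tracks: list of positions, each a tuple with one signal value per assay.
    means[s][t], stds[s][t]: Gaussian emission parameters for state s, track t
    (hypothetical values; a real model would learn them, e.g. with EM).
    log_trans[p][s]: log probability of moving from state p to state s.
    log_start[s]: log probability of starting in state s.
    """
    n_states = len(means)

    def emit(state, obs):
        # Tracks are treated as independent given the state (an assumption).
        return sum(
            gaussian_logpdf(x, means[state][t], stds[state][t])
            for t, x in enumerate(obs)
        )

    scores = [log_start[s] + emit(s, tracks[0]) for s in range(n_states)]
    backptrs = []
    for obs in tracks[1:]:
        new_scores, ptrs = [], []
        for s in range(n_states):
            best_prev = max(range(n_states), key=lambda p: scores[p] + log_trans[p][s])
            ptrs.append(best_prev)
            new_scores.append(scores[best_prev] + log_trans[best_prev][s] + emit(s, obs))
        scores = new_scores
        backptrs.append(ptrs)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: scores[s])
    path = [state]
    for ptrs in reversed(backptrs):
        state = ptrs[state]
        path.append(state)
    path.reverse()
    return path


# Toy input: five positions, two assay tracks, two states (low and high signal).
toy_tracks = [(0.1, 0.0), (0.2, 0.1), (2.9, 3.1), (3.0, 2.8), (0.0, 0.2)]
toy_means = [[0.0, 0.0], [3.0, 3.0]]
toy_stds = [[1.0, 1.0], [1.0, 1.0]]
toy_log_trans = [[math.log(0.9), math.log(0.1)], [math.log(0.1), math.log(0.9)]]
toy_log_start = [math.log(0.5), math.log(0.5)]

labels = viterbi_segment(toy_tracks, toy_means, toy_stds, toy_log_trans, toy_log_start)
print(labels)
```

Consecutive positions that share a label form one segment, so the two high-signal positions in the middle of the toy input come out as a single segment flanked by low-signal ones.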

Citations

Journal Article

An integrated encyclopedia of DNA elements in the human genome

Principal investigators¹, NHGRI groups², Data production leads³, Lead analysts³

Wellcome Trust¹, University of Washington², Pennsylvania State University³

06 Sep 2012-Nature

TL;DR: The Encyclopedia of DNA Elements project provides new insights into the organization and regulation of human genes and the human genome, and is an expansive resource of functional annotations for biomedical research.

Abstract: The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.

13,548 citations

Journal Article

An integrated encyclopedia of DNA elements in the human genome.

ENCODE Consortium

01 Jan 2012-Nature

8,106 citations

Journal Article

A general framework for estimating the relative pathogenicity of human genetic variants

Martin Kircher¹, Daniela Witten¹, Preti Jain, Brian J. O'Roak¹,², Gregory M. Cooper, Jay Shendure¹

University of Washington¹, Oregon Health & Science University²

01 Mar 2014-Nature Genetics

TL;DR: The ability of CADD to prioritize functional, deleterious and pathogenic variants across many functional categories, effect sizes and genetic architectures is unmatched by any current single-annotation method.

Abstract: Our capacity to sequence human genomes has exceeded our ability to interpret genetic variation. Current genomic annotations tend to exploit a single information type (e.g. conservation) and/or are restricted in scope (e.g. to missense changes). Here, we describe Combined Annotation Dependent Depletion (CADD), a framework that objectively integrates many diverse annotations into a single, quantitative score. We implement CADD as a support vector machine trained to differentiate 14.7 million high-frequency human derived alleles from 14.7 million simulated variants. We pre-compute “C-scores” for all 8.6 billion possible human single nucleotide variants and enable scoring of short insertions/deletions. C-scores correlate with allelic diversity, annotations of functionality, pathogenicity, disease severity, experimentally measured regulatory effects, and complex trait associations, and highly rank known pathogenic variants within individual genomes. The ability of CADD to prioritize functional, deleterious, and pathogenic variants across many functional categories, effect sizes and genetic architectures is unmatched by any current annotation.

4,956 citations

Journal Article

An atlas of active enhancers across human cell types and tissues

Robin Andersson¹, Claudia Gebhard², Irene Miguel-Escalada³, Ilka Hoof¹, Jette Bornholdt¹, Mette Boyd¹, Yun Chen¹, Xiaobei Zhao¹,⁴, Christian Schmidl², Takahiro Suzuki, Evgenia Ntini, Erik Arner, Eivind Valen¹,⁵, Kang Li¹, Lucia Schwarzfischer², Dagmar Glatz², Johanna Raithel², Berit Lilje¹, Nicolas Rapin¹, Frederik Otzen Bagger¹, Mette Rose Jørgensen¹, Peter Refsing Andersen⁶, Nicolas Bertin, Owen J. L. Rackham, A. Maxwell Burroughs, J Kenneth Baillie⁷, Yuri Ishizu, Yuri Shimizu, Erina Furuhata, Shiori Maeda, Yutaka Negishi, Christopher J. Mungall⁸, Terrence F. Meehan⁹, Timo Lassmann, Masayoshi Itoh, Hideya Kawaji, Naoto Kondo, Jun Kawai, Andreas Lennartsson¹⁰, Carsten O. Daub¹⁰, Peter Heutink¹¹, David A. Hume⁷, Torben Heick Jensen⁶, Harukazu Suzuki, Yoshihide Hayashizaki, Ferenc Müller³, Alistair R. R. Forrest, Piero Carninci, Michael Rehli², Albin Sandelin¹

University of Copenhagen¹, University Hospital Regensburg², University of Birmingham³, University of North Carolina at Chapel Hill⁴, Harvard University⁵, Aarhus University⁶, University of Edinburgh⁷, Lawrence Berkeley National Laboratory⁸, European Bioinformatics Institute⁹, Karolinska Institutet¹⁰, VU University Medical Center¹¹

27 Mar 2014-Nature

TL;DR: It is shown that enhancers share properties with CpG-poor messenger RNA promoters but produce bidirectional, exosome-sensitive, relatively short unspliced RNAs, the generation of which is strongly related to enhancer activity.

Abstract: Enhancers control the correct temporal and cell-type-specific activation of gene expression in multicellular eukaryotes. Knowing their properties, regulatory activity and targets is crucial to understand the regulation of differentiation and homeostasis. Here we use the FANTOM5 panel of samples, covering the majority of human tissues and cell types, to produce an atlas of active, in vivo-transcribed enhancers. We show that enhancers share properties with CpG-poor messenger RNA promoters but produce bidirectional, exosome-sensitive, relatively short unspliced RNAs, the generation of which is strongly related to enhancer activity. The atlas is used to compare regulatory programs between different cells at unprecedented depth, to identify disease-associated regulatory single nucleotide polymorphisms, and to classify cell-type-specific and ubiquitous enhancers. We further explore the utility of enhancer redundancy, which explains gene expression strength rather than expression patterns. The online FANTOM5 enhancer atlas represents a unique resource for studies on cell-type-specific enhancers and gene regulation.

2,260 citations

Journal Article

Machine learning applications in genetics and genomics

Maxwell W. Libbrecht¹, William Stafford Noble¹

University of Washington¹

07 May 2015-Nature Reviews Genetics

TL;DR: An overview of machine learning applications for the analysis of genome sequencing data sets, including the annotation of sequence elements and epigenetic, proteomic or metabolomic data is provided.

Abstract: The field of machine learning, which aims to develop computer algorithms that improve with experience, holds promise to enable computers to assist humans in the analysis of large, complex data sets. Here, we provide an overview of machine learning applications for the analysis of genome sequencing data sets, including the annotation of sequence elements and epigenetic, proteomic or metabolomic data. We present considerations and recurrent challenges in the application of supervised, semi-supervised and unsupervised machine learning methods, as well as of generative and discriminative modelling approaches. We provide general guidelines to assist in the selection of these machine learning methods and their practical application for the analysis of genetic and genomic data sets.

1,317 citations

References

Journal Article

Maximum likelihood from incomplete data via the EM algorithm

Arthur P. Dempster¹, Nan M. Laird¹, Donald B. Rubin¹

Harvard University¹

01 Sep 1977-Journal of the Royal Statistical Society, Series B (Methodological)

49,597 citations

Journal Article

Error bounds for convolutional codes and an asymptotically optimum decoding algorithm

Andrew J. Viterbi¹

University of California, Los Angeles¹

01 Apr 1967-IEEE Transactions on Information Theory

TL;DR: The upper bound is obtained for a specific probabilistic nonsequential decoding algorithm which is shown to be asymptotically optimum for rates above R₀ and whose performance bears certain similarities to that of sequential decoding algorithms.

Abstract: The probability of error in decoding an optimal convolutional code transmitted over a memoryless channel is bounded from above and below as a function of the constraint length of the code. For all but pathological channels the bounds are asymptotically (exponentially) tight for rates above R₀, the computational cutoff rate of sequential decoding. As a function of constraint length the performance of optimal convolutional codes is shown to be superior to that of block codes of the same length, the relative improvement increasing with rate. The upper bound is obtained for a specific probabilistic nonsequential decoding algorithm which is shown to be asymptotically optimum for rates above R₀ and whose performance bears certain similarities to that of sequential decoding algorithms.

6,804 citations

Proceedings Article

The relationship between Precision-Recall and ROC curves

Jesse Davis¹, Mark Goadrich¹

University of Wisconsin-Madison¹

25 Jun 2006

TL;DR: It is shown that a deep connection exists between ROC space and PR space, such that a curve dominates in ROC space if and only if it dominates in PR space.

Abstract: Receiver Operator Characteristic (ROC) curves are commonly used to present results for binary decision problems in machine learning. However, when dealing with highly skewed datasets, Precision-Recall (PR) curves give a more informative picture of an algorithm's performance. We show that a deep connection exists between ROC space and PR space, such that a curve dominates in ROC space if and only if it dominates in PR space. A corollary is the notion of an achievable PR curve, which has properties much like the convex hull in ROC space; we show an efficient algorithm for computing this curve. Finally, we also note differences in the two types of curves are significant for algorithm design. For example, in PR space it is incorrect to linearly interpolate between points. Furthermore, algorithms that optimize the area under the ROC curve are not guaranteed to optimize the area under the PR curve.
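The remark that linear interpolation is incorrect in PR space can be illustrated with a small sketch. It assumes the usual count-based alternative: interpolate the true-positive and false-positive counts linearly between two operating points and recompute precision at each step. The function name and the counts in the example are hypothetical values chosen for illustration.

```python
def interpolate_pr(tp_a, fp_a, tp_b, fp_b, total_pos):
    """Return (recall, precision) points between operating points A and B.

    Assumes tp_b > tp_a and tp_a + fp_a > 0. For every extra true positive
    gained when moving from A to B, a fixed number of false positives
    (the local slope) is gained as well, and precision is recomputed
    from the interpolated counts.
    """
    fp_per_tp = (fp_b - fp_a) / (tp_b - tp_a)
    points = []
    for extra in range(tp_b - tp_a + 1):
        tp = tp_a + extra
        fp = fp_a + fp_per_tp * extra
        points.append((tp / total_pos, tp / (tp + fp)))
    return points


# Illustrative counts: 20 positives in total.
# Point A: 5 true positives, 5 false positives  -> recall 0.25, precision 0.50
# Point B: 10 true positives, 30 false positives -> recall 0.50, precision 0.25
curve = interpolate_pr(5, 5, 10, 30, 20)

# A straight line between A and B, evaluated at recall 0.30.
straight_line = 0.5 + (0.30 - 0.25) / (0.50 - 0.25) * (0.25 - 0.5)

for recall, precision in curve:
    print(f"recall={recall:.2f} precision={precision:.3f}")
print(f"straight-line precision at recall 0.30: {straight_line:.3f}")
```

In this example the count-based precision at recall 0.30 is 0.375, below the 0.45 that a straight line between the two points would suggest, so linear interpolation would overstate performance.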

5,063 citations

Journal Article

Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes

Adam Siepel¹, Gill Bejerano, Jakob Skou Pedersen², Angie S. Hinrichs, Minmei Hou, Kate R. Rosenbloom, Hiram Clawson, John Spieth, LaDeana W. Hillier, Stephen Richards, George M. Weinstock, Richard K. Wilson, Richard A. Gibbs, W. James Kent, Webb Miller, David Haussler

University of California, Santa Cruz¹, Aarhus University²

01 Aug 2005-Genome Research

TL;DR: A comprehensive search for conserved elements in vertebrate genomes is conducted on genome-wide multiple alignments of five vertebrate species (human, mouse, rat, chicken, and Fugu rubripes) with phastCons, a program based on a two-state phylogenetic hidden Markov model (phylo-HMM).

Abstract: We have conducted a comprehensive search for conserved elements in vertebrate genomes, using genome-wide multiple alignments of five vertebrate species (human, mouse, rat, chicken, and Fugu rubripes). Parallel searches have been performed with multiple alignments of four insect species (three species of Drosophila and Anopheles gambiae), two species of Caenorhabditis, and seven species of Saccharomyces. Conserved elements were identified with a computer program called phastCons, which is based on a two-state phylogenetic hidden Markov model (phylo-HMM). PhastCons works by fitting a phylo-HMM to the data by maximum likelihood, subject to constraints designed to calibrate the model across species groups, and then predicting conserved elements based on this model. The predicted elements cover roughly 3%-8% of the human genome (depending on the details of the calibration procedure) and substantially higher fractions of the more compact Drosophila melanogaster (37%-53%), Caenorhabditis elegans (18%-37%), and Saccharomyces cerevisiae (47%-68%) genomes. From yeasts to vertebrates, in order of increasing genome size and general biological complexity, increasing fractions of conserved bases are found to lie outside of the exons of known protein-coding genes. In all groups, the most highly conserved elements (HCEs), by log-odds score, are hundreds or thousands of bases long. These elements share certain properties with ultraconserved elements, but they tend to be longer and less perfectly conserved, and they overlap genes of somewhat different functional categories. In vertebrates, HCEs are associated with the 3' UTRs of regulatory genes, stable gene deserts, and megabase-sized regions rich in moderately conserved noncoding sequences. Noncoding HCEs also show strong statistical evidence of an enrichment for RNA secondary structure.

3,719 citations

Journal Article

FIMO: scanning for occurrences of a given motif.

Charles E. Grant¹, Timothy L. Bailey², William Stafford Noble²

University of Washington¹, University of Queensland²

01 Apr 2011-Bioinformatics

TL;DR: Find Individual Motif Occurrences (FIMO) is a software tool for scanning DNA or protein sequences with motifs described as position-specific scoring matrices; it provides output in a variety of formats, including HTML, XML and several Santa Cruz Genome Browser formats.

Abstract: Summary: A motif is a short DNA or protein sequence that contributes to the biological function of the sequence in which it resides. Over the past several decades, many computational methods have been described for identifying, characterizing and searching with sequence motifs. Critical to nearly any motif-based sequence analysis pipeline is the ability to scan a sequence database for occurrences of a given motif described by a position-specific frequency matrix. Results: We describe Find Individual Motif Occurrences (FIMO), a software tool for scanning DNA or protein sequences with motifs described as position-specific scoring matrices. The program computes a log-likelihood ratio score for each position in a given sequence database, uses established dynamic programming methods to convert this score to a P-value and then applies false discovery rate analysis to estimate a q-value for each position in the given sequence. FIMO provides output in a variety of formats, including HTML, XML and several Santa Cruz Genome Browser formats. The program is efficient, allowing for the scanning of DNA sequences at a rate of 3.5 Mb/s on a single CPU. Availability and Implementation: FIMO is part of the MEME Suite software toolkit. A web server and source code are available at

3,266 citations