Showing papers by "Mark Gerstein published in 2013"


Journal ArticleDOI
04 Oct 2013-Science
TL;DR: In this article, the authors used patterns of polymorphisms in functionally annotated regions in 1092 humans to identify deleterious variants; then, they experimentally validated candidates, finding regions particularly sensitive to mutations and variants that are disruptive because of mechanistic effects on transcription-factor binding.
Abstract: Interpreting variants, especially noncoding ones, in the increasing number of personal genomes is challenging. We used patterns of polymorphisms in functionally annotated regions in 1092 humans to identify deleterious variants; then we experimentally validated candidates. We analyzed both coding and noncoding regions, with the former corroborating the latter. We found regions particularly sensitive to mutations ("ultrasensitive") and variants that are disruptive because of mechanistic effects on transcription-factor binding (that is, "motif-breakers"). We also found variants in regions with higher network centrality tend to be deleterious. Insertions and deletions followed a similar pattern to single-nucleotide variants, with some notable exceptions (e.g., certain deletions and enhancers). On the basis of these patterns, we developed a computational tool (FunSeq), whose application to ~90 cancer genomes reveals nearly a hundred candidate noncoding drivers.

371 citations


Journal ArticleDOI
TL;DR: It is found that indel length modulates selection strength, that indels affecting multiple functionally constrained nucleotides undergo stronger purifying selection than SNPs, and that the causal variants underlying some GWAS associations may be indels.
Abstract: Short insertions and deletions (indels) are the second most abundant form of human genetic variation, but our understanding of their origins and functional effects lags behind that of other types of variants. Using population-scale sequencing, we have identified a high-quality set of 1.6 million indels from 179 individuals representing three diverse human populations. We show that rates of indel mutagenesis are highly heterogeneous, with 43%-48% of indels occurring in 4.03% of the genome, whereas in the remaining 96% their prevalence is 16 times lower than SNPs. Polymerase slippage can explain upwards of three-fourths of all indels, with the remainder being mostly simple deletions in complex sequence. However, insertions do occur and are significantly associated with pseudo-palindromic sequence features compatible with the fork stalling and template switching (FoSTeS) mechanism more commonly associated with large structural variations. We introduce a quantitative model of polymerase slippage, which enables us to identify indel-hypermutagenic protein-coding genes, some of which are associated with recurrent mutations leading to disease. Accounting for mutational rate heterogeneity due to sequence context, we find that indels across functional sequence are generally subject to stronger purifying selection than SNPs. We find that indel length modulates selection strength, and that indels affecting multiple functionally constrained nucleotides undergo stronger purifying selection. We further find that indels are enriched in associations with gene expression and find evidence for a contribution of nonsense-mediated decay. Finally, we show that indels can be integrated in existing genome-wide association studies (GWAS); although we do not find direct evidence that potentially causal protein-coding indels are enriched with associations to known disease-associated SNPs, our findings suggest that the causal variant underlying some of these associations may be indels.

217 citations


Journal ArticleDOI
TL;DR: A systems-based classifier is built to quantitatively estimate the global perturbation caused by deleterious mutations in each gene and shows its strong potential for interpretation of variants involved in Mendelian diseases and in complex disorders probed by genome-wide association studies.
Abstract: The decreasing cost of sequencing is leading to a growing repertoire of personal genomes. However, we are lagging behind in understanding the functional consequences of the millions of variants obtained from sequencing. Global system-wide effects of variants in coding genes are particularly poorly understood. It is known that while variants in some genes can lead to diseases, complete disruption of other genes, called ‘loss-of-function tolerant’, is possible with no obvious effect. Here, we build a systems-based classifier to quantitatively estimate the global perturbation caused by deleterious mutations in each gene. We first survey the degree to which gene centrality in various individual networks and a unified ‘Multinet’ correlates with the tolerance to loss-of-function mutations and evolutionary conservation. We find that functionally significant and highly conserved genes tend to be more central in physical protein-protein and regulatory networks. However, this is not the case for metabolic pathways, where the highly central genes have more duplicated copies and are more tolerant to loss-of-function mutations. Integration of three-dimensional protein structures reveals that the correlation with centrality in the protein-protein interaction network is also seen in terms of the number of interaction interfaces used. Finally, combining all the network and evolutionary properties allows us to build a classifier distinguishing functionally essential and loss-of-function tolerant genes with higher accuracy (AUC = 0.91) than any individual property. Application of the classifier to the whole genome shows its strong potential for interpretation of variants involved in Mendelian diseases and in complex disorders probed by genome-wide association studies.
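
The AUC figure quoted in the abstract above can be read as the probability that a randomly chosen essential gene receives a higher score than a randomly chosen loss-of-function-tolerant gene. The following is a minimal sketch of that calculation only; it is not the authors' classifier, and the labels and scores are made-up toy values used purely for illustration.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: the fraction of (positive, negative) pairs in which
    the positive example has the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = essential gene, 0 = loss-of-function tolerant.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]

auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

Here eight of the nine positive/negative pairs are ranked correctly, so the toy AUC is 8/9. An AUC of 0.5 would correspond to random ranking and 1.0 to perfect separation.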

163 citations


Journal ArticleDOI
TL;DR: This study examined primary and metastatic prostate cancer and found that miR-31 expression was reduced as a result of promoter hypermethylation and, importantly, that the levels of miR-31 expression were inversely correlated with the aggressiveness of the disease.
Abstract: Androgen receptor (AR) signaling plays a critical role in prostate cancer (PCA) pathogenesis. Yet, the regulation of AR signaling remains elusive. Even with stringent androgen deprivation therapy, AR signaling persists. Here, our data suggest that there is a complex interaction between the expression of the tumor suppressor miRNA, miR-31, and AR signaling. We examined primary and metastatic PCA and found that miR-31 expression was reduced as a result of promoter hypermethylation and, importantly, that the levels of miR-31 expression were inversely correlated with the aggressiveness of the disease. As the expression of AR and miR-31 was inversely correlated in the cell lines, our study further suggested that miR-31 and AR could mutually repress each other. Upregulation of miR-31 effectively suppressed AR expression through multiple mechanisms and inhibited PCA growth in vivo. Notably, we found that miR-31 targeted AR directly at a site located in the coding region, which was commonly mutated in PCA. Additionally, miR-31 suppressed cell cycle regulators, including E2F1, E2F2, EXO1, FOXM1, and MCM2. Together, our findings suggest a novel AR regulatory mechanism mediated through miR-31 expression. The downregulation of miR-31 may disrupt cellular homeostasis and contribute to the evolution and progression of PCA. We provide implications for epigenetic treatment and support clinical development of detecting miR-31 promoter methylation as a novel biomarker.

159 citations


Journal ArticleDOI
TL;DR: Some key aspects of machine learning that make it useful for genome annotation are explained, with illustrative examples from ENCODE.
Abstract: By its very nature, genomics produces large, high-dimensional datasets that are well suited to analysis by machine learning approaches. Here, we explain some key aspects of machine learning that make it useful for genome annotation, with illustrative examples from ENCODE.

88 citations


Journal ArticleDOI
TL;DR: This network provides a framework for stratifying and predicting patient outcomes, and it is used to show that the peroxisome proliferator-activated receptor delta binds to a set of genes also regulated by the retinoic acid receptors and whose expression is associated with poor prognosis in breast cancer.

76 citations


Journal ArticleDOI
TL;DR: This work uses long-read complementary DNA datasets for the analysis of a eukaryotic transcriptome and demonstrates that long read sequence can be assembled into full-length transcripts with considerable success and is applicable to all long read sequencing technologies.
Abstract: Precise identification of RNA-coding regions and transcriptomes of eukaryotes is a significant problem in biology. Currently, eukaryote transcriptomes are analyzed using deep short-read sequencing experiments of complementary DNAs. The resulting short reads are then aligned against a genome and annotated junctions to infer biological meaning. Here we use long-read complementary DNA datasets for the analysis of a eukaryotic transcriptome and generate two large datasets in the human K562 and HeLa S3 cell lines. Both datasets comprised at least 4 million reads and had median read lengths greater than 500 bp. We show that annotation-independent alignments of these reads provide partial gene structures that are very much in line with annotated gene structures, 15% of which have not been obtained in a previous de novo analysis of short reads. For long noncoding RNA (lncRNA) genes, however, we find an increased fraction of novel gene structures among our alignments. Other important aspects of transcriptome analysis, such as the description of cell type-specific splicing, can be performed in an accurate, reliable and completely annotation-free manner, making the approach ideal for the analysis of transcriptomes of newly sequenced genomes. Furthermore, we demonstrate that long-read sequence can be assembled into full-length transcripts with considerable success. Our method is applicable to all long-read sequencing technologies.

65 citations


Journal ArticleDOI
TL;DR: The study suggests how evolution has shaped this network and provides direct biological network support that selective pressure is not on individual genes but rather on the relationship between genes, which highlights the importance of integrating phylogenetic analysis into biological network studies.
Abstract: The genetic network involved in the bacterial cell cycle is poorly understood even though it underpins the remarkable ability of bacteria to proliferate. How such a network evolves is even less clear. The major aims of this work were to identify and examine the genes and pathways that are differentially expressed during the Caulobacter crescentus cell cycle, and to analyze the evolutionary features of the cell cycle network. We used deep RNA sequencing to obtain high coverage RNA-Seq data of five C. crescentus cell cycle stages, each with three biological replicates. We found that 1,586 genes (over a third of the genome) display significant differential expression between stages. This gene list, which contains many genes not previously known to be cell cycle regulated, includes almost half of the genes involved in primary metabolism, suggesting that these “house-keeping” genes are not constitutively transcribed during the cell cycle, as often assumed. Gene and module co-expression clustering reveal co-regulated pathways and suggest functionally coupled genes. In addition, an evolutionary analysis of the cell cycle network shows a high correlation between co-expression and co-evolution. Most co-expression modules have strong phylogenetic signals, with broadly conserved genes and clade-specific genes predominating in different substructures of the cell cycle co-expression network. We also found that conserved genes tend to determine the expression profile of their module. We describe the first phylogenetic and single-nucleotide-resolution transcriptomic analysis of a bacterial cell cycle network. In addition, the study suggests how evolution has shaped this network and provides direct biological network support that selective pressure is not on individual genes but rather on the relationship between genes, which highlights the importance of integrating phylogenetic analysis into biological network studies.

55 citations


Journal ArticleDOI
TL;DR: Identification of tissue-specific binding profiles and effector target genes reveals important insights into the mechanisms by which Rb/E2F controls distinct cell fates in vivo.
Abstract: Background: The tumor suppressor Rb/E2F regulates gene expression to control differentiation in multiple tissues during development, although how it directs tissue-specific gene regulation in vivo is poorly understood.

49 citations


Journal ArticleDOI
TL;DR: It is shown that cell division is coupled to retrotransposition and, perhaps, is even a requirement for it, and that a correct phylogenetic tree of human subpopulations based solely on retroduplications can be reconstructed.
Abstract: In primates and other animals, reverse transcription of mRNA followed by genomic integration creates retroduplications. Expressed retroduplications are either "retrogenes" coding for functioning proteins, or expressed "processed pseudogenes," which can function as noncoding RNAs. To date, little is known about the variation in retroduplications in terms of their presence or absence across individuals in the human population. We have developed new methodologies that allow us to identify "novel" retroduplications (i.e., those not present in the reference genome), to find their insertion points, and to genotype them. Using these methods, we catalogued and analyzed 174 retroduplication variants in almost one thousand humans, which were sequenced as part of Phase 1 of The 1000 Genomes Project Consortium. The accuracy of our data set was corroborated by (1) multiple lines of sequencing evidence for retroduplication (e.g., depth of coverage in exons vs. introns), (2) experimental validation, and (3) the fact that we can reconstruct a correct phylogenetic tree of human subpopulations based solely on retroduplications. We also show that parent genes of retroduplication variants tend to be expressed at the M-to-G1 transition in the cell cycle and that M-to-G1 expressed genes have more copies of fixed retroduplications than genes expressed at other times. These findings suggest that cell division is coupled to retrotransposition and, perhaps, is even a requirement for it.

49 citations


01 Oct 2013
TL;DR: The approach can be readily used to prioritize variants in cancer and is immediately applicable in a precision-medicine context and can be further improved by incorporation of larger-scale population sequencing, better annotations, and expression data from large cohorts.
Abstract: Interpreting variants, especially noncoding ones, in the increasing number of personal genomes is challenging. We used patterns of polymorphisms in functionally annotated regions in 1092 humans to identify deleterious variants; then we experimentally validated candidates. We analyzed both coding and noncoding regions, with the former corroborating the latter. We found regions particularly sensitive to mutations ("ultrasensitive") and variants that are disruptive because of mechanistic effects on transcription-factor binding (that is, "motif-breakers"). We also found variants in regions with higher network centrality tend to be deleterious. Insertions and deletions followed a similar pattern to single-nucleotide variants, with some notable exceptions (e.g., certain deletions and enhancers). On the basis of these patterns, we developed a computational tool (FunSeq), whose application to ~90 cancer genomes reveals nearly a hundred candidate noncoding drivers.

Journal ArticleDOI
08 Jul 2013-Viruses
TL;DR: A high-throughput method based on a novel gene expression analysis, RNA-Seq, is used to give a global picture of differential gene expression by primary human macrophages of 10 healthy donors infected in vitro with WNV.
Abstract: The West Nile virus (WNV) is an emerging infection of biodefense concern and there are no available treatments or vaccines. Here we used a high-throughput method based on a novel gene expression analysis, RNA-Seq, to give a global picture of differential gene expression by primary human macrophages of 10 healthy donors infected in vitro with WNV. From a total of 28 million reads per sample, we identified 1,514 transcripts that were differentially expressed after infection. Both predicted and novel gene changes were detected, as were gene isoforms, and while many of the genes were expressed by all donors, some were unique. Knock-down of genes not previously known to be associated with WNV resistance identified their critical role in control of viral infection. Our study distinguishes both common gene pathways as well as novel cellular responses. Such analyses will be valuable for translational studies of susceptible and resistant individuals—and for targeting therapeutics—in multiple biological settings.

Journal ArticleDOI
TL;DR: An overview of the phenomenon of structural variation in the human genome sequence is provided, describing the novel genomics technologies that are revolutionizing the way structural variation is studied and giving examples of genomic structural variations that affect child development.
Abstract: Structural variation of the human genome sequence is the insertion, deletion, or rearrangement of stretches of DNA sequence sized from around 1,000 to millions of base pairs. Over the past few years, structural variation has been shown to be far more common in human genomes than previously thought. Very little is currently known about the effects of structural variation on normal child development, but such effects could be of considerable significance. This review provides an overview of the phenomenon of structural variation in the human genome sequence, describing the novel genomics technologies that are revolutionizing the way structural variation is studied and giving examples of genomic structural variations that affect child development.

Journal ArticleDOI
TL;DR: This work describes a workflow in which localized frustration, quantifying unfavorable local interactions, is employed as a metric to investigate the impact of SNVs on protein stability, and finds that frustration produces many immediately intuitive results.
Abstract: Population-scale sequencing is increasingly uncovering large numbers of rare single-nucleotide variants (SNVs) in coding regions of the genome. The rarity of these variants makes it challenging to evaluate their deleteriousness with conventional phenotype-genotype associations. Protein structures provide a way of addressing this challenge. Previous efforts have focused on globally quantifying the impact of SNVs on protein stability. However, local perturbations may severely impact protein functionality without strongly disrupting global stability (e.g. in relation to catalysis or allostery). Here, we describe a workflow in which localized frustration, quantifying unfavorable local interactions, is employed as a metric to investigate such effects. Using this workflow on the Protein Data Bank, we find that frustration produces many immediately intuitive results: for instance, disease-related SNVs create stronger changes in localized frustration than non-disease-related variants, and rare SNVs tend to disrupt local interactions to a larger extent than common variants. Less obviously, we observe that somatic SNVs associated with oncogenes and tumor suppressor genes (TSGs) induce very different changes in frustration. In particular, those associated with TSGs change the frustration more in the core than the surface (by introducing loss-of-function events), whereas those associated with oncogenes manifest the opposite pattern, creating gain-of-function events.

Journal ArticleDOI
TL;DR: Sixty years after Watson and Crick published the double helix model of DNA's structure, thirteen members of Genome Biology's Editorial Board select key advances in the field of genome biology subsequent to that discovery.
Abstract: Sixty years after Watson and Crick published the double helix model of DNA's structure, thirteen members of Genome Biology's Editorial Board select key advances in the field of genome biology subsequent to that discovery.

Journal ArticleDOI
TL;DR: A computational method to predict cell cycle regulated genes based on their genomic features – transcription factor binding and motif profiles shows high accuracy and suggests that the periodical pattern of cell cycle genes is largely coded in their promoter regions.
Abstract: Background: Time-course microarray experiments have been widely used to identify cell cycle regulated genes. However, the method is not effective for lowly expressed genes and is sensitive to experimental conditions. To complement microarray experiments, we propose a computational method to predict cell cycle regulated genes based on their genomic features – transcription factor binding and motif profiles.

Proceedings ArticleDOI
22 Sep 2013
TL;DR: A computational framework of comparative network analysis was established to identify the conserved and species-specific functions of cell-wall (CW) related genes across multiple tissue types in Arabidopsis and Poplar by studying gene co-expression networks for these genes in each tissue type.
Abstract: In this study, we established a computational framework of comparative network analysis to identify the conserved and species-specific functions of cell-wall (CW) related genes [1, 2], an important gene family related to plant biofuel production, across multiple tissue types in Arabidopsis and Poplar. Co-expressed genes are believed to be coordinated in transcription and may therefore have similar functions [3, 4]. Also, a comparative analysis of gene co-expression networks (GCNs) across species provides a systematic way to understand conserved or species-specific genomic functions [5]. Therefore, to understand the functions of CW genes in different tissue types, we integrated and compared the network characteristics of CW genes across GCNs from different tissue types, including leaf, flower and shoot, for Arabidopsis and Poplar [6]. First, by aligning the gene co-expression sub-networks associated with CW genes between the two plants for each tissue type, we grouped the tissue types based on the alignment of the CW genes along with their neighboring orthologous genes. For tissues with good alignments, CW genes appear to coordinate in a similar way in both plants and may be involved in conserved functions. For tissues with poor alignments, however, CW genes may take part in species-specific functions. The gene ontology enrichment and signaling pathways of their co-expressed neighboring genes were identified to provide new insight into cell wall biology. Second, since the genes with high network centrality in a GCN, so-called "hub" genes, are believed to have key functions [7], we investigated the network centralities of the CW genes in the two plants to understand their functions from a global network point of view. The network centralities that we used are the clustering coefficient (CC), which measures a gene's local cliqueness, and eigenvector centrality (EC), which measures a gene's global influence over the entire network. Besides finding hub genes for each tissue type within and across the two plants, we also identified the conserved hub genes and tissue-specific hub genes in either a local or a global fashion. The CW genes that are hubs were of particular interest. If many CW genes are global hubs in certain tissues, this implies that cell-wall-related activities may interact with the whole plant in those tissues; if they are local hubs, they may coordinate with certain local activities only. Finally, we used genomic variation data to identify species-specific SNPs, especially in the promoter regions of the CW co-expressed neighboring genes across tissues, and associated them with the corresponding species-specific functions. In summary, our comparative network analysis framework studied gene co-expression networks for the cell wall related genes across different tissue types in Arabidopsis and Poplar, and identified their conserved and species-specific functions and variations. This framework can also be used to study other gene families along with their functions across multiple species.
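
The two centrality measures named in the abstract above, clustering coefficient (local cliqueness) and eigenvector centrality (global influence), can be illustrated on a toy undirected network. This is a minimal sketch under stated assumptions: the five-node graph and its node names are hypothetical, the co-expression network is treated as unweighted, and the power iteration is a generic textbook method rather than the authors' pipeline.

```python
from itertools import combinations


def clustering_coefficient(adj, node):
    """Fraction of pairs of a node's neighbors that are themselves linked."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def eigenvector_centrality(adj, iterations=200, tol=1e-10):
    """Power iteration on (A + I), which has the same leading eigenvector
    as the adjacency matrix A but avoids oscillation on bipartite graphs.
    Scores are L2-normalized at every step."""
    scores = {n: 1.0 for n in adj}
    for _ in range(iterations):
        new = {n: scores[n] + sum(scores[m] for m in adj[n]) for n in adj}
        norm = sum(v * v for v in new.values()) ** 0.5
        new = {n: v / norm for n, v in new.items()}
        if max(abs(new[n] - scores[n]) for n in adj) < tol:
            return new
        scores = new
    return scores


# Hypothetical toy network: A, B, C form a triangle; D hangs off C; E off D.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}

cc = {gene: clustering_coefficient(adj, gene) for gene in adj}
ec = eigenvector_centrality(adj)
for gene in sorted(adj):
    print(f"{gene}: CC = {cc[gene]:.2f}, EC = {ec[gene]:.2f}")
```

In this toy graph, A and B are the strongest local hubs (all of their neighbors are connected to each other), whereas C has the highest eigenvector centrality because it links the triangle to the rest of the network. The contrast mirrors the local versus global hub distinction drawn in the abstract.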

Journal ArticleDOI
01 Nov 2013-Science
TL;DR: Through a collection of examples, Epstein provides a balanced and nuanced consideration of genetic influences on athletic performance.
Abstract: Through a collection of examples, Epstein provides a balanced and nuanced consideration of genetic influences on athletic performance.