scispace - formally typeset
Search or ask a question
Author

Mark Gerstein

Bio: Mark Gerstein is an academic researcher from Yale University. The author has contributed to research in topics: Genome & Gene. The author has an hindex of 168, co-authored 751 publications receiving 149578 citations. Previous affiliations of Mark Gerstein include Rutgers University & Structural Genomics Consortium.
Topics: Genome, Gene, Human genome, Genomics, Pseudogene


Papers
More filters
Posted ContentDOI
20 Aug 2021-bioRxiv
TL;DR: In this article, a yeast-based genetic screening pipeline is described to evaluate large collections of MAPK docking sequences in parallel, using a combinatorial library based on the docking sequences from the MAPK kinases MKK6 and MKK7, defining features critical for binding to stress-activated MAPKs JNK1 and p38α.
Abstract: Essential functions of mitogen-activated protein kinases (MAPKs) depend on their capacity to selectively phosphorylate a limited repertoire of substrates. MAPKs harbor a conserved groove located outside of the catalytic cleft that binds to short linear sequence motifs found in substrates and regulators. However, the weak and transient nature of these “docking” interactions poses a challenge to defining MAPK interactomes and associated sequence motifs. Here, we describe a yeast-based genetic screening pipeline to evaluate large collections of MAPK docking sequences in parallel. Using this platform we analyzed a combinatorial library based on the docking sequences from the MAPK kinases MKK6 and MKK7, defining features critical for binding to the stress-activated MAPKs JNK1 and p38α. We subsequently screened a library consisting of ~12,000 sequences from the human proteome, revealing a large number of MAPK-selective interactors, including many not conforming to previously defined docking motifs. Analysis of p38α/JNK1 exchange mutants identified specific docking groove residues mediating selective binding. Finally, we verified that docking sequences identified in the screen could function in substrate recruitment in vitro and in cultured cells. Collectively, these studies establish an approach for characterization of MAPK docking sequences and provide a resource for future investigation of signaling downstream of p38 and JNK MAP kinases.

1 citations

Posted ContentDOI
17 Sep 2022-bioRxiv
TL;DR: In this paper , the authors show that both aging and cancer share a common epigenetic replication signature, which they modeled from DNA methylation data in extensively passaged immortalized human cells in vitro and tested on clinical tissues.
Abstract: Aging is the leading risk factor for cancer. While it’s been proposed that the age-related accumulation of somatic mutations drives this relationship, it is likely not the full story. Here, we show that both aging and cancer share a common epigenetic replication signature, which we modeled from DNA methylation data in extensively passaged immortalized human cells in vitro and tested on clinical tissues. This epigenetic signature of replication – termed CellDRIFT – increased with age across multiple tissues, distinguished tumor from normal tissue, and was escalated in normal breast tissue from cancer patients. Additionally, within-person tissue differences were correlated with both predicted lifetime tissue-specific stem cell divisions and tissue-specific cancer risk. Overall, our findings suggest that age-related replication drives epigenetic changes in cells, pushing them towards a more tumorigenic state. One sentence summary Cellular replication leaves an epigenetic fingerprint that may partially underly the age-associated increase in cancer risk.

1 citations

Posted ContentDOI
19 Nov 2018-bioRxiv
TL;DR: RADAR can successfully pinpoint variants, both somatic and germline, associated with RBP-function dysregulation, that cannot be found by most current prioritization methods, for example variants affecting splicing.
Abstract: RNA-binding proteins (RBPs) play key roles in post-transcriptional regulation and disease. Their binding sites cover more of the genome than coding exons; nevertheless, most noncoding variant-prioritization methods only focus on transcriptional regulation. Here, we integrate the portfolio of ENCODE-RBP experiments to develop RADAR, a variant-scoring framework. RADAR uses conservation, RNA structure, network centrality, and motifs to provide an overall impact score. Then it further incorporates tissue-specific inputs to highlight disease-specific variants. Our results demonstrate RADAR can successfully pinpoint variants, both somatic and germline, associated with RBP-function dysregulation, that cannot be found by most current prioritization methods, for example variants affecting splicing.

1 citations

Journal ArticleDOI
TL;DR: The denoising autoencoder framework can extract meaningful patterns corresponding to functional gene groups and patient clusters from the gene expression of asthma patients, and can estimate FEV1/FVC ratios across patients, via support-vector-machine regression.
Abstract: The pathogenesis of asthma is a complex process involving multiple genes and pathways. Identifying biomarkers from asthma datasets, especially those that include heterogeneous subpopulations, is challenging. Potentially, autoencoders provide ideal frameworks for such tasks as they can embed complex, noisy high-dimensional gene expression data into a low-dimensional latent space in an unsupervised fashion, enabling us to extract distinguishing features from expression data. Here, we developed a framework combining a denoising autoencoder and a supervised learning classifier to identify gene signatures related to asthma severity. Using the trained autoencoder with 50 hidden units, we found that hierarchical clustering on the low-dimensional embedding corresponds well with previously defined and clinically relevant clusters of patients. Moreover, each hidden unit has contributions from each of the genes, and pathway analysis of these contributions shows that the hidden units are significantly enriched in known asthma-related pathways. We then used genes that contribute most to the hidden units to develop a secondary random-forest classifier for directly predicting asthma severity. The feature importance metric from this classifier identified a signature based on 50 key genes, which are associated with severity. Furthermore, we can use these key genes to successfully estimate FEV1/FVC ratios across patients, via support-vector-machine regression. We found that the denoising autoencoder framework can extract meaningful patterns corresponding to functional gene groups and patient clusters from the gene expression of asthma patients.

1 citations

Posted ContentDOI
30 Nov 2020-bioRxiv
TL;DR: It is found that even if a crowded site contains recognizable target motifs, it can still be uninformative for motif inference due to the presence of interfering motifs from other TFs, implying a “less-is-more” effect.
Abstract: Many statistical methods have been developed to infer the binding motifs of a transcription factor (TF) from a subset of its numerous binding regions in the genome. We refer to such regions, e.g. detected by ChIP-seq, as binding sites. The sites with strong binding signals are selected for motif inference. However, binding signals do not necessarily indicate the existence of target motifs. Moreover, even strong binding signals can be spurious due to experimental artifacts. Here, we observe that such uninformative sites without target motifs tend to be “crowded” -- i.e. have many other TF binding sites present nearby. In addition, we find that even if a crowded site contains recognizable target motifs, it can still be uninformative for motif inference due to the presence of interfering motifs from other TFs. We propose using less crowded and shorter binding sites in motif interference and develop specific recommendations for carrying this out. We find our recommendations substantially improve the resulting motifs in various contexts by 30%-70%, implying a “less-is-more” effect.

1 citations


Cited by
More filters
Journal ArticleDOI
TL;DR: A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original.
Abstract: The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSIBLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.

70,111 citations

Journal ArticleDOI
TL;DR: The goals of the PDB are described, the systems in place for data deposition and access, how to obtain further information and plans for the future development of the resource are described.
Abstract: The Protein Data Bank (PDB; http://www.rcsb.org/pdb/ ) is the single worldwide archive of structural data of biological macromolecules. This paper describes the goals of the PDB, the systems in place for data deposition and access, how to obtain further information, and near-term plans for the future development of the resource.

34,239 citations

Journal ArticleDOI
TL;DR: The Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure outperforms other aligners by a factor of >50 in mapping speed.
Abstract: Motivation Accurate alignment of high-throughput RNA-seq data is a challenging and yet unsolved problem because of the non-contiguous transcript structure, relatively short read lengths and constantly increasing throughput of the sequencing technologies. Currently available RNA-seq aligners suffer from high mapping error rates, low mapping speed, read length limitation and mapping biases. Results To align our large (>80 billon reads) ENCODE Transcriptome RNA-seq dataset, we developed the Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure. STAR outperforms other aligners by a factor of >50 in mapping speed, aligning to the human genome 550 million 2 × 76 bp paired-end reads per hour on a modest 12-core server, while at the same time improving alignment sensitivity and precision. In addition to unbiased de novo detection of canonical junctions, STAR can discover non-canonical splices and chimeric (fusion) transcripts, and is also capable of mapping full-length RNA sequences. Using Roche 454 sequencing of reverse transcription polymerase chain reaction amplicons, we experimentally validated 1960 novel intergenic splice junctions with an 80-90% success rate, corroborating the high precision of the STAR mapping strategy. Availability and implementation STAR is implemented as a standalone C++ code. STAR is free open source software distributed under GPLv3 license and can be downloaded from http://code.google.com/p/rna-star/.

30,684 citations

Journal ArticleDOI
TL;DR: Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches and can be used simultaneously to achieve even greater alignment speeds.
Abstract: Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. For the human genome, Burrows-Wheeler indexing allows Bowtie to align more than 25 million reads per CPU hour with a memory footprint of approximately 1.3 gigabytes. Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve even greater alignment speeds. Bowtie is open source http://bowtie.cbcb.umd.edu.

20,335 citations

28 Jul 2005
TL;DR: PfPMP1)与感染红细胞、树突状组胞以及胎盘的单个或多个受体作用,在黏附及免疫逃避中起关键的作�ly.
Abstract: 抗原变异可使得多种致病微生物易于逃避宿主免疫应答。表达在感染红细胞表面的恶性疟原虫红细胞表面蛋白1(PfPMP1)与感染红细胞、内皮细胞、树突状细胞以及胎盘的单个或多个受体作用,在黏附及免疫逃避中起关键的作用。每个单倍体基因组var基因家族编码约60种成员,通过启动转录不同的var基因变异体为抗原变异提供了分子基础。

18,940 citations