Author

Mark Gerstein

Bio: Mark Gerstein is an academic researcher at Yale University. The author has contributed to research on the topics Genome and Gene, has an h-index of 168, and has co-authored 751 publications that have received 149578 citations. Previous affiliations of Mark Gerstein include Rutgers University and the Structural Genomics Consortium.
Topics: Genome, Gene, Human genome, Genomics, Pseudogene


Papers
Journal Article
TL;DR: A systematic error that arises from microarrays is addressed and current methods to solve the problem are discussed, including a local averaging approach called standardization and normalization of microarray data (SNOMAD).
Abstract: Microarray experiments are providing a huge amount of genome-wide data on gene expression. Many prior expression analyses have focused on inferring functional relationships (1–7); however, the quality control and normalization of the raw data that result from microarrays have received less attention. Here we address a systematic error that arises from microarrays and discuss current methods to resolve the problem. It is well known that the data from high-throughput experiments embody a significant component of measurement error that must be removed before any analysis can be applied to the data. An intuitive idea is to repeat the experiments and decrease the noise by averaging the measurements from replicates (8). Unfortunately, microarrays are still difficult to repeat; in most cases, researchers do not have many replicates for analysis. A Bayesian probabilistic approach has been proposed to address the problem of the small repetition number for microarray experiments (9). While random error can be canceled by replicate experiments, systematic error will not diminish by averaging replicates. For example, a notorious systematic error in microarray experiments is that the expression ratio of a particular gene at different conditions is a function of its absolute expression levels. If one uses a simple fold-change cut off, the genes with low expression levels tend to numerically meet the given cut off, even though they are not truly differentially expressed. Different methods have been proposed to deal with this problem (10–15). In this review, we want to direct attention toward a type of systematic error that is manifested by the strong interaction between neighboring spots on the array. If the replicate experiments are performed on the arrays with same-chip geometry, then these interactions will not be canceled by the replicates. We will first demonstrate this noise via a case study, and then we will discuss the possible source of these artifacts. Finally, we will discuss current methods to solve the problem; in particular, a local averaging approach called standardization and normalization of microarray data (SNOMAD) (16). We examined several different yeast microarray data sets: diauxic shift, α-factor-arrested cell cycle, cdc15-arrested cell cycle, and cdc28-arrested cell cycle (17–19). To demonstrate the artifact in the microarray data, we offer the following evidence. The relationship between gene expression and physical chip distance can be revealed by comparing the chip distance map (Figure 1A) to an expression correlation coefficient map (Figure 1B). The horizontal and vertical axes of these two maps represent the positions of the genes along a chromosome. The colors on the distance and correlation maps represent the chip distance and expression correlation coefficient between gene pairs, respectively. Interestingly, the highly correlated gene expression regions (Figure 1B, red blocks) always correspond to the short chip distance regions (Figure 1A, red blocks), which suggests that the major reason why two genes are detected to be co-expressed is that these genes are located near each other on the chip. We also calculated the average correlation coefficient of gene expression profiles as a function of the physical chip distance between two genes. Figure 2 shows the result for a microarray data set of the yeast α-arrested cell cycle. Without an artifact, the average correlation coefficient should be independent of the chip distance. However, Figure 2 shows that the closer two genes are on the chip, the higher their average correlation coefficient is. This indicates that this data set contains a large proportion of artifacts. Actually,
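
The distance-dependence check described in this abstract (average correlation of expression profiles as a function of physical chip distance) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `mean_correlation_by_distance`, the use of Pearson correlation and Euclidean spot distance, the bin edges, and the toy data are all hypothetical and are not taken from the paper.

```python
import numpy as np

def mean_correlation_by_distance(expr, positions, bin_edges):
    """Average pairwise expression correlation per chip-distance bin.

    expr      : (n_genes, n_conditions) expression profiles (assumed layout)
    positions : (n_genes, 2) spot coordinates on the chip (assumed layout)
    bin_edges : increasing distance edges; bin b covers [edge[b], edge[b+1])
    """
    expr = np.asarray(expr, dtype=float)
    positions = np.asarray(positions, dtype=float)
    corr = np.corrcoef(expr)  # gene-by-gene Pearson correlation
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # gene-by-gene chip distance
    iu = np.triu_indices(len(expr), k=1)  # count each unordered pair once
    d, r = dist[iu], corr[iu]
    which = np.digitize(d, bin_edges) - 1  # bin index of each pair
    means = np.full(len(bin_edges) - 1, np.nan)  # NaN marks an empty bin
    for b in range(len(means)):
        mask = which == b
        if mask.any():
            means[b] = r[mask].mean()
    return means

# Toy data: two pairs of neighboring spots with similar profiles within a pair.
toy_expr = [[1, 2, 3, 4], [1, 2, 3, 5], [4, 3, 2, 1], [5, 3, 2, 1]]
toy_pos = [[0, 0], [0, 1], [5, 0], [5, 1]]
toy_means = mean_correlation_by_distance(toy_expr, toy_pos, [0, 2, 6])
print(toy_means)  # first entry: near pairs, second entry: distant pairs
```

In data free of the artifact, the returned averages should be roughly flat across bins; a clear decline with distance would be the kind of signal the abstract describes.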

30 citations

Journal Article
TL;DR: An approach to interrogate phosphorylation and its role in protein-protein interactions on a proteome-wide scale was developed and hundreds of known and potentially new phosphoserine-dependent interactors with 14-3-3 proteins and WW domains were found.
Abstract: Post-translational phosphorylation is essential to human cellular processes, but the transient, heterogeneous nature of this modification complicates its study in native systems. We developed an approach to interrogate phosphorylation and its role in protein-protein interactions on a proteome-wide scale. We genetically encoded phosphoserine in recoded E. coli and generated a peptide-based heterologous representation of the human serine phosphoproteome. We designed a single-plasmid library encoding >100,000 human phosphopeptides and confirmed the site-specific incorporation of phosphoserine in >36,000 of these peptides. We then integrated our phosphopeptide library into an approach known as Hi-P to enable proteome-level screens for serine-phosphorylation-dependent human protein interactions. Using Hi-P, we found hundreds of known and potentially new phosphoserine-dependent interactors with 14-3-3 proteins and WW domains. These phosphosites retained important binding characteristics of the native human phosphoproteome, as determined by motif analysis and pull-downs using full-length phosphoproteins. This technology can be used to interrogate user-defined phosphoproteomes in any organism, tissue, or disease of interest.

30 citations

Journal Article
12 Nov 2020-Cell
TL;DR: A data-sanitization procedure allowing raw functional genomics reads to be shared while minimizing privacy leakage is developed, enabling principled privacy-utility trade-offs.

30 citations

Journal Article
01 Feb 2010-Proteins
TL;DR: A novel method, RigidFinder, for the identification of rigid blocks from different conformations—across many scales, from large complexes to small loops, is described.
Abstract: Advances in structure determination have made possible the analysis of large macromolecular complexes (some with nearly 10,000 residues, such as GroEL). The large-scale conformational changes associated with these complexes require new approaches. Historically, a crucial component of motion analysis has been the identification of moving rigid blocks from the comparison of different conformations. However, existing tools do not allow consistent block identification in very large structures. Here, we describe a novel method, RigidFinder, for such identification of rigid blocks from different conformations, across many scales, from large complexes to small loops. RigidFinder defines rigidity in terms of blocks, where inter-residue distances are conserved across conformations. Distance conservation, unlike the averaged values (e.g., RMSD) used by many other methods, allows for sensitive identification of motions. A further distinguishing feature of our method is that it is capable of finding blocks made from nonconsecutive fragments of multiple polypeptide chains. In our implementation, we utilize an efficient quasi-dynamic programming search algorithm that allows for real-time application to very large structures. RigidFinder can be used at a dedicated web server (http://rigidfinder.molmovdb.org). The server also provides links to examples at various scales such as loop closure, domain motions, partial refolding, and subunit shifts. Moreover, here we describe the detailed application of RigidFinder to four large structures: Pyruvate Phosphate Dikinase, T7 RNA polymerase, RNA polymerase II, and GroEL. The results of the method are in excellent agreement with the expert-described rigid blocks. Proteins 2010. © 2009 Wiley-Liss, Inc.
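
The idea of defining rigidity through conserved inter-residue distances can be illustrated with a small sketch. It is not RigidFinder itself: the abstract mentions a quasi-dynamic programming search, whereas the code below substitutes a simple greedy growth from a seed residue. The function names, the 0.5 tolerance, and the toy coordinates are hypothetical choices made for illustration.

```python
import numpy as np

def distance_matrix(coords):
    """All-against-all Euclidean distances for an (n, 3) coordinate array."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def conserved_pairs(conf_a, conf_b, tol=0.5):
    """True where a pairwise distance changes by at most tol between conformations."""
    return np.abs(distance_matrix(conf_a) - distance_matrix(conf_b)) <= tol

def greedy_rigid_block(conf_a, conf_b, seed, tol=0.5):
    """Grow a block from `seed`, adding residues whose distances to every
    current member are conserved (a greedy stand-in, not the published search)."""
    ok = conserved_pairs(conf_a, conf_b, tol)
    block = [seed]
    for i in range(len(ok)):
        if i != seed and all(ok[i, j] for j in block):
            block.append(i)
    return sorted(block)

# Toy example: residues 0-3 stay put, residues 4-5 swing about a hinge.
conf_a = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0], [5, 0, 0], [6, 0, 0]]
conf_b = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0], [4, 1, 0], [4, 2, 0]]
print(greedy_rigid_block(conf_a, conf_b, seed=0))
print(greedy_rigid_block(conf_a, conf_b, seed=4))
```

Because membership depends only on pairwise distances, a block found this way need not be contiguous in sequence, which matches the abstract's point about nonconsecutive fragments. A greedy pass is order-dependent, though, so it should be read as a didactic shortcut.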

30 citations

Journal Article
TL;DR: The utility of the current regulatory distinction between identifiable and nonidentifiable genomic information is discussed, particularly given the seemingly anomalous preferences of the patient population surveyed by Hull and colleagues (2008).
Abstract: Hull and colleagues (2008) discuss the utility of the current regulatory distinction between identifiable and nonidentifiable genomic information, particularly given the seemingly anomalous preferences of their surveyed patient population. As the authors note, this regulatory distinction will become even less meaningful with the proliferation of genomic databases. As industries such as personal genomics expand, flooding both private and public databases with readily identifiable genomic data, they will effectively prevent an ever-growing number of individuals from remaining genetically anonymous (Lowrance and Collins 2007). In fact, recent research has already shown that individual genomes can be readily identified within larger mixed groups of publicly available data from genome-wide association studies using only a small subset of one's genome (Homer et al. 2008). Once it is known that a person has participated in a genome-wide association study, it becomes fairly straightforward to use their or their relative's genomic data (which may well be made available through personal genomics) to re-identify that individual (National Institutes of Health [NIH] 2008). The general expansion of genomics into our medical system, both through personal genomics and through other evolving biomedical technologies such as targeted personalized medicine, also raises non-trivial privacy concerns, both for the patient herself and for her extended family, who share much of her genomic complement.

30 citations


Cited by
Journal Article
TL;DR: A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original.
Abstract: The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.

70,111 citations

Journal Article
TL;DR: The goals of the PDB, the systems in place for data deposition and access, how to obtain further information, and near-term plans for the future development of the resource are described.
Abstract: The Protein Data Bank (PDB; http://www.rcsb.org/pdb/ ) is the single worldwide archive of structural data of biological macromolecules. This paper describes the goals of the PDB, the systems in place for data deposition and access, how to obtain further information, and near-term plans for the future development of the resource.

34,239 citations

Journal Article
TL;DR: The Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure outperforms other aligners by a factor of >50 in mapping speed.
Abstract: Motivation: Accurate alignment of high-throughput RNA-seq data is a challenging and yet unsolved problem because of the non-contiguous transcript structure, relatively short read lengths and constantly increasing throughput of the sequencing technologies. Currently available RNA-seq aligners suffer from high mapping error rates, low mapping speed, read length limitation and mapping biases. Results: To align our large (>80 billion reads) ENCODE Transcriptome RNA-seq dataset, we developed the Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure. STAR outperforms other aligners by a factor of >50 in mapping speed, aligning to the human genome 550 million 2 × 76 bp paired-end reads per hour on a modest 12-core server, while at the same time improving alignment sensitivity and precision. In addition to unbiased de novo detection of canonical junctions, STAR can discover non-canonical splices and chimeric (fusion) transcripts, and is also capable of mapping full-length RNA sequences. Using Roche 454 sequencing of reverse transcription polymerase chain reaction amplicons, we experimentally validated 1960 novel intergenic splice junctions with an 80-90% success rate, corroborating the high precision of the STAR mapping strategy. Availability and implementation: STAR is implemented as standalone C++ code. STAR is free open source software distributed under the GPLv3 license and can be downloaded from http://code.google.com/p/rna-star/.

30,684 citations

Journal Article
TL;DR: Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches; multiple processor cores can be used simultaneously to achieve even greater alignment speeds.
Abstract: Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. For the human genome, Burrows-Wheeler indexing allows Bowtie to align more than 25 million reads per CPU hour with a memory footprint of approximately 1.3 gigabytes. Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve even greater alignment speeds. Bowtie is open source (http://bowtie.cbcb.umd.edu).
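
For readers unfamiliar with Burrows-Wheeler indexing, the following sketch shows the general technique of exact-match backward search over a Burrows-Wheeler transform. It is a textbook-style illustration only: it does not reproduce Bowtie's code, it omits the quality-aware backtracking that permits mismatches, and it builds the suffix array naively, which would not scale to a genome. The names `bwt_index` and `find` are hypothetical.

```python
from collections import Counter

def bwt_index(text):
    """Build a toy index: suffix array, first-column offsets, occurrence table."""
    text += "$"  # sentinel, assumed absent from the text and smaller than any base
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix array
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each suffix
    counts = Counter(text)
    first, total = {}, 0  # first[c]: number of characters smaller than c
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    occ = {c: [0] * (len(bwt) + 1) for c in counts}  # occ[c][k]: count of c in bwt[:k]
    for k, ch in enumerate(bwt):
        for c in counts:
            occ[c][k + 1] = occ[c][k] + (ch == c)
    return sa, first, occ

def find(pattern, index):
    """Return sorted start positions of exact occurrences of pattern."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # current suffix-array interval [lo, hi)
    for c in reversed(pattern):  # extend the match one character to the left
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])

index = bwt_index("GATTACATTA")
print(find("TTA", index))
```

Each pattern character narrows the suffix-array interval with two table lookups, so query time depends on the read length and not on the size of the indexed text.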

20,335 citations

28 Jul 2005
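TL;DR: PfPMP1 interacts with one or more receptors on infected erythrocytes, dendritic cells, and the placenta, and plays a key role in adhesion and immune evasion.
Abstract: Antigenic variation allows many pathogenic microorganisms to evade the host immune response. Plasmodium falciparum erythrocyte surface protein 1 (PfPMP1), expressed on the surface of infected erythrocytes, interacts with one or more receptors on infected erythrocytes, endothelial cells, dendritic cells, and the placenta, and plays a key role in adhesion and immune evasion. The var gene family of each haploid genome encodes about 60 members, and switching on the transcription of different var gene variants provides the molecular basis for antigenic variation.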

18,940 citations