
Showing papers by "Mark Gerstein published in 2006"


Journal ArticleDOI
30 Mar 2006-Nature
TL;DR: Tandem affinity purification was used to process 4,562 different tagged proteins of the yeast Saccharomyces cerevisiae to identify protein–protein interactions, which will help future studies on individual proteins as well as functional genomics and systems biology.
Abstract: Identification of protein-protein interactions often provides insight into protein function, and many cellular processes are performed by stable protein complexes. We used tandem affinity purification to process 4,562 different tagged proteins of the yeast Saccharomyces cerevisiae. Each preparation was analysed by both matrix-assisted laser desorption/ionization-time of flight mass spectrometry and liquid chromatography tandem mass spectrometry to increase coverage and accuracy. Machine learning was used to integrate the mass spectrometry scores and assign probabilities to the protein-protein interactions. Among 4,087 different proteins identified with high confidence by mass spectrometry from 2,357 successful purifications, our core data set (median precision of 0.69) comprises 7,123 protein-protein interactions involving 2,708 proteins. A Markov clustering algorithm organized these interactions into 547 protein complexes averaging 4.9 subunits per complex, about half of them absent from the MIPS database, as well as 429 additional interactions between pairs of complexes. The data (all of which are available online) will help future studies on individual proteins as well as functional genomics and systems biology.

2,975 citations
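
The Markov clustering step lends itself to a compact illustration. Below is a minimal sketch of MCL on an adjacency matrix, assuming a standard expansion/inflation loop with placeholder parameters (inflation, tolerance); the study's actual implementation and settings are not specified here.

```python
import numpy as np

def markov_cluster(adj, inflation=2.0, max_iter=100, tol=1e-6):
    """Minimal Markov clustering (MCL) of a symmetric adjacency matrix."""
    n = adj.shape[0]
    # Add self-loops and column-normalize to get a stochastic matrix.
    M = adj.astype(float) + np.eye(n)
    M /= M.sum(axis=0, keepdims=True)
    for _ in range(max_iter):
        prev = M.copy()
        M = M @ M                              # expansion: two random-walk steps
        M = np.power(M, inflation)             # inflation: boost strong flows
        M /= M.sum(axis=0, keepdims=True)
        if np.abs(M - prev).max() < tol:       # converged (nearly idempotent)
            break
    # Nonzero rows of the converged matrix are cluster "attractors";
    # the columns with mass in an attractor row form one cluster.
    clusters = set()
    for attractor in np.where(M.sum(axis=1) > tol)[0]:
        clusters.add(frozenset(np.where(M[attractor] > tol)[0]))
    return [sorted(c) for c in clusters]

# Toy network: two triangles joined by one edge typically split cleanly.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]])
print(markov_cluster(A))   # e.g. [[0, 1, 2], [3, 4, 5]]
```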


Journal Article
TL;DR: The in vitro substrates recognized by most yeast protein kinases are described, with the use of proteome chip technology, and these results will provide insights into the mechanisms and roles of protein phosphorylation in many eukaryotes.
Abstract: Protein phosphorylation is estimated to affect 30% of the proteome and is a major regulatory mechanism that controls many basic cellular processes. Until recently, our biochemical understanding of protein phosphorylation on a global scale has been extremely limited; only one half of the yeast kinases have known in vivo substrates and the phosphorylating kinase is known for less than 160 phosphoproteins. Here we describe, with the use of proteome chip technology, the in vitro substrates recognized by most yeast protein kinases: we identified over 4,000 phosphorylation events involving 1,325 different proteins. These substrates represent a broad spectrum of different biochemical functions and cellular roles. Distinct sets of substrates were recognized by each protein kinase, including closely related kinases of the protein kinase A family and four cyclin-dependent kinases that vary only in their cyclin subunits. Although many substrates reside in the same cellular compartment or belong to the same functional category as their phosphorylating kinase, many others do not, indicating possible new roles for several kinases. Furthermore, integration of the phosphorylation results with protein-protein interaction and transcription factor binding data revealed novel regulatory modules. Our phosphorylation results have been assembled into a first-generation phosphorylation map for yeast. Because many yeast proteins and pathways are conserved, these results will provide insights into the mechanisms and roles of protein phosphorylation in many eukaryotes.

923 citations


Journal ArticleDOI
22 Dec 2006-Science
TL;DR: This work characterizes interactions of protein networks by using atomic-resolution information from three-dimensional protein structures, finding that some previously recognized relationships between network topology and genomic features are actually more reflective of a structural quantity, the number of distinct binding interfaces.
Abstract: Most studies of protein networks operate on a high level of abstraction, neglecting structural and chemical aspects of each interaction. Here, we characterize interactions by using atomic-resolution information from three-dimensional protein structures. We find that some previously recognized relationships between network topology and genomic features (e.g., hubs tending to be essential proteins) are actually more reflective of a structural quantity, the number of distinct binding interfaces. Subdividing hubs with respect to this quantity provides insight into their evolutionary rate and indicates that additional mechanisms of network growth are active in evolution (beyond effective preferential attachment through gene duplication).

480 citations


Journal ArticleDOI
TL;DR: This work develops algorithms for identifying generalized hierarchies and uses these approaches to illuminate extensive pyramid-shaped hierarchical structures existing in the regulatory networks of representative prokaryotes and eukaryotes, finding that TFs at the bottom of the regulatory hierarchy are more essential to the viability of the cell.
Abstract: A fundamental question in biology is how the cell uses transcription factors (TFs) to coordinate the expression of thousands of genes in response to various stimuli. The relationships between TFs and their target genes can be modeled in terms of directed regulatory networks. These relationships, in turn, can be readily compared with commonplace “chain-of-command” structures in social networks, which have characteristic hierarchical layouts. Here, we develop algorithms for identifying generalized hierarchies (allowing for various loop structures) and use these approaches to illuminate extensive pyramid-shaped hierarchical structures existing in the regulatory networks of representative prokaryotes (Escherichia coli) and eukaryotes (Saccharomyces cerevisiae), with most TFs at the bottom levels and only a few master TFs on top. These masters are situated near the center of the protein–protein interaction network, a different type of network from the regulatory one, and they receive most of the input for the whole regulatory hierarchy through protein interactions. Moreover, they have maximal influence over other genes, in terms of affecting expression-level changes. Surprisingly, however, TFs at the bottom of the regulatory hierarchy are more essential to the viability of the cell. Finally, one might think master TFs achieve their wide influence through directly regulating many targets, but TFs with most direct targets are in the middle of the hierarchy. We find, in fact, that these midlevel TFs are “control bottlenecks” in the hierarchy, and this great degree of control for “middle managers” has parallels in efficient social structures in various corporate and governmental settings.

355 citations
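
As a rough illustration of the layering idea, the sketch below assigns bottom-up levels to TFs in a toy acyclic regulatory network: TFs regulating no other TFs sit at level 0, and each TF sits one level above the highest TF it regulates. The published algorithm generalizes this to networks with loops, so treat this as a simplification.

```python
from collections import defaultdict

def hierarchy_levels(edges):
    """Bottom-up levels for TFs in a TF -> target edge list.

    Assumes the TF-TF subnetwork is acyclic; loop handling (which the
    paper's generalized-hierarchy algorithm provides) is omitted.
    """
    targets = defaultdict(set)
    tfs = set()
    for tf, target in edges:
        tfs.add(tf)
        targets[tf].add(target)

    level = {}
    def assign(tf):
        if tf not in level:
            below = [assign(t) for t in targets[tf] if t in tfs]
            level[tf] = 1 + max(below) if below else 0
        return level[tf]

    for tf in tfs:
        assign(tf)
    return level

# Toy chain: a master TF on top, a middle manager, and a bottom-level TF.
edges = [("master", "middle"), ("middle", "worker"), ("worker", "geneX")]
print(hierarchy_levels(edges))   # {'worker': 0, 'middle': 1, 'master': 2}
```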


Journal ArticleDOI
TL;DR: Of the amplification methodologies examined in this paper, multiple displacement amplification generated the least bias and produced significantly higher yields of amplified DNA.
Abstract: Whole genome amplification is an increasingly common technique through which minute amounts of DNA can be multiplied to generate quantities suitable for genetic testing and analysis. Questions of amplification-induced error and template bias generated by these methods have previously been addressed through either small scale (SNPs) or large scale (CGH array, FISH) methodologies. Here we utilized whole genome sequencing to assess amplification-induced bias in both coding and non-coding regions of two bacterial genomes. Halobacterium species NRC-1 DNA and Campylobacter jejuni were amplified by several common, commercially available protocols: multiple displacement amplification, primer extension pre-amplification and degenerate oligonucleotide primed PCR. The amplification-induced bias of each method was assessed by sequencing both genomes in their entirety using the 454 Sequencing System technology and comparing the results with those obtained from unamplified controls. All amplification methodologies induced statistically significant bias relative to the unamplified control. For the Halobacterium species NRC-1 genome, assessed at 100 base resolution, the D-statistics from GenomiPhi-amplified material were 119 times greater than those from unamplified material, 164.0 times greater for Repli-G, 165.0 times greater for PEP-PCR and 252.0 times greater than the unamplified controls for DOP-PCR. For Campylobacter jejuni, also analyzed at 100 base resolution, the D-statistics from GenomiPhi-amplified material were 15 times greater than those from unamplified material, 19.8 times greater for Repli-G, 61.8 times greater for PEP-PCR and 220.5 times greater than the unamplified controls for DOP-PCR. Of the amplification methodologies examined in this paper, the multiple displacement amplification products generated the least bias, and produced significantly higher yields of amplified DNA.

338 citations


Journal ArticleDOI
TL;DR: The geometry of the polypeptide exit tunnel has been determined using the crystal structure of the large ribosomal subunit from Haloarcula marismortui; it is the only passage in the solvent channel system that is both large enough to accommodate nascent peptides and that traverses the particle.

314 citations


Journal ArticleDOI
TL;DR: A model was developed for the regulation of spontaneous switching between the opaque state and the white state that includes stochastic changes of Tos9p levels above and below a threshold that induce changes in the chromatin state of an as-yet-unidentified switching locus.
Abstract: In Candida albicans, the a1-2 complex represses white-opaque switching, as well as mating. Based upon the assumption that the a1-2 corepressor complex binds to the gene that regulates white-opaque switching, a chromatin immunoprecipitation-microarray analysis strategy was used to identify 52 genes that bound to the complex. One of these genes, TOS9, exhibited an expression pattern consistent with a “master switch gene.” TOS9 was only expressed in opaque cells, and its gene product, Tos9p, localized to the nucleus. Deletion of the gene blocked cells in the white phase, misexpression in the white phase caused stable mass conversion of cells to the opaque state, and misexpression blocked temperature-induced mass conversion from the opaque state to the white state. A model was developed for the regulation of spontaneous switching between the opaque state and the white state that includes stochastic changes of Tos9p levels above and below a threshold that induce changes in the chromatin state of an as-yet-unidentified switching locus. TOS9 has also been referred to as EAP2 and WOR1.

214 citations


Journal ArticleDOI
TL;DR: The central idea of the method is to search the protein interaction network for defective cliques and predict the interactions that complete them; in practice, the method is efficient and has good predictive performance.
Abstract: Datasets obtained by large-scale, high-throughput methods for detecting protein-protein interactions typically suffer from a relatively high level of noise. We describe a novel method for improving the quality of these datasets by predicting missed protein-protein interactions, using only the topology of the protein interaction network observed by the large-scale experiment. The central idea of the method is to search the protein interaction network for defective cliques (nearly complete complexes of pairwise interacting proteins), and predict the interactions that complete them. We formulate an algorithm for applying this method to large-scale networks, and show that in practice it is efficient and has good predictive performance. More information can be found on our website: http://topnet.gersteinlab.org/clique/ Contact: Mark.Gerstein@yale.edu Supplementary information: Supplementary Materials are available at Bioinformatics online.

182 citations
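
The defective-clique idea reduces to a simple rule: two proteins that do not interact but whose common neighbors form a clique are predicted to interact. A minimal sketch, assuming a brute-force scan over all non-adjacent pairs (the paper formulates a more scalable enumeration):

```python
import networkx as nx
from itertools import combinations

def predict_defective_clique_edges(G, min_common=3):
    """Predict edges that would complete nearly-complete cliques.

    If non-adjacent proteins u and v share >= min_common neighbors that
    are all pairwise connected, predict the (u, v) interaction.
    """
    predictions = set()
    for u, v in combinations(G.nodes, 2):
        if G.has_edge(u, v):
            continue
        common = set(G[u]) & set(G[v])
        if len(common) >= min_common and all(
                G.has_edge(a, b) for a, b in combinations(common, 2)):
            predictions.add((u, v))
    return predictions

# Toy example: a 4-clique missing one edge; the missing edge is predicted.
G = nx.complete_graph(4)
G.remove_edge(0, 1)
print(predict_defective_clique_edges(G, min_common=2))   # {(0, 1)}
```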


Journal ArticleDOI
TL;DR: A homology-based computational pipeline ('PseudoPipe') is developed that can search a mammalian genome and identify pseudogene sequences in a comprehensive and consistent manner.
Abstract: Motivation: Mammalian genomes contain many 'genomic fossils', i.e. pseudogenes. These are disabled copies of functional genes that have been retained in the genome by gene duplication or retrotransposition events. Pseudogenes are important resources in understanding the evolutionary history of genes and genomes. Results: We have developed a homology-based computational pipeline ('PseudoPipe') that can search a mammalian genome and identify pseudogene sequences in a comprehensive and consistent manner. The key steps in the pipeline involve using BLAST to rapidly cross-reference potential 'parent' proteins against the intergenic regions of the genome and then processing the resulting 'raw hits': eliminating redundant ones, clustering together neighbors, and associating and aligning clusters with a unique parent. Finally, pseudogenes are classified based on a combination of criteria including homology, intron-exon structure, and existence of stop codons and frameshifts. Availability: The PseudoPipe program is implemented in Python and can be downloaded at http://pseudogene.org/ Contact: Mark.Gerstein@yale.edu or zhaolei.zhang@utoronto.ca

173 citations
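
The final step of the pipeline, classifying each candidate, can be caricatured in a few lines. The thresholds and criteria below are illustrative placeholders, not PseudoPipe's published cutoffs:

```python
def classify_pseudogene(identity, has_introns, n_disablements):
    """Toy version of a pseudogene classification step.

    identity       -- fractional identity of the alignment to the parent protein
    has_introns    -- whether the hit retains the parent's intron-exon structure
    n_disablements -- premature stop codons plus frameshifts in the alignment

    The 0.4 identity cutoff is a hypothetical placeholder.
    """
    if identity < 0.4 or n_disablements == 0:
        return "not called"              # too diverged, or possibly functional
    if has_introns:
        return "duplicated pseudogene"   # arose by gene duplication
    return "processed pseudogene"        # intronless: arose by retrotransposition

print(classify_pseudogene(0.85, has_introns=False, n_disablements=3))
# processed pseudogene
```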


Journal ArticleDOI
TL;DR: Investigation of transcriptional circuitry controlling pseudohyphal development in Saccharomyces cerevisiae indicates that target hubs can serve as master regulators whose activity is sufficient for the induction of complex developmental responses and therefore represent important regulatory nodes in biological networks.
Abstract: To understand the organization of the transcriptional networks that govern cell differentiation, we have investigated the transcriptional circuitry controlling pseudohyphal development in Saccharomyces cerevisiae. The binding targets of Ste12, Tec1, Sok2, Phd1, Mga1, and Flo8 were globally mapped across the yeast genome. The factors and their targets form a complex binding network, containing patterns characteristic of autoregulation, feedback and feed-forward loops, and cross-talk. Combinatorial binding to intergenic regions was commonly observed, which allowed for the identification of a novel binding association between Mga1 and Flo8, in which Mga1 requires Flo8 for binding to promoter regions. Further analysis of the network showed that the promoters of MGA1 and PHD1 were bound by all of the factors used in this study, identifying them as key target hubs. Overexpression of either of these two proteins specifically induced pseudohyphal growth under noninducing conditions, highlighting them as master regulators of the system. Our results indicate that target hubs can serve as master regulators whose activity is sufficient for the induction of complex developmental responses and therefore represent important regulatory nodes in biological networks.

172 citations


Journal ArticleDOI
TL;DR: High-resolution CGH (HR-CGH) is developed to detect accurately and with relatively little bias the presence and extent of chromosomal aberrations in human DNA.
Abstract: Deletions and amplifications of the human genomic sequence (copy number polymorphisms) are the cause of numerous diseases and a potential cause of phenotypic variation in the normal population. Comparative genomic hybridization (CGH) has been developed as a useful tool for detecting alterations in DNA copy number that involve blocks of DNA several kilobases or larger in size. We have developed high-resolution CGH (HR-CGH) to detect accurately and with relatively little bias the presence and extent of chromosomal aberrations in human DNA. Maskless array synthesis was used to construct arrays containing 385,000 oligonucleotides with isothermal probes of 45–85 bp in length; arrays tiling the β-globin locus and chromosome 22q were prepared. Arrays with a 9-bp tiling path were used to map a 622-bp heterozygous deletion in the β-globin locus. Arrays with an 85-bp tiling path were used to analyze DNA from patients with copy number changes in the pericentromeric region of chromosome 22q. Heterozygous deletions and duplications as well as partial triploidies and partial tetraploidies of portions of chromosome 22q were mapped with high resolution (typically up to 200 bp) in each patient, and the precise breakpoints of two deletions were confirmed by DNA sequencing. Additional peaks potentially corresponding to known and novel additional CNPs were also observed. Our results demonstrate that HR-CGH allows the detection of copy number changes in the human genome at an unprecedented level of resolution.

Journal ArticleDOI
TL;DR: The database of molecular motions, MolMovDB (http://molmovdb.org), has been in existence for the past decade; it provides tools to interpolate between two conformations (the Morph Server) and predict possible motions in a single structure, and now includes tools to relate points of flexibility in a structure to particular key residue positions.
Abstract: The database of molecular motions, MolMovDB (http://molmovdb.org), has been in existence for the past decade. It classifies macromolecular motions and provides tools to interpolate between two conformations (the Morph Server) and predict possible motions in a single structure. In 2005, we expanded the services offered on MolMovDB. In particular, we further developed the Morph Server to produce improved interpolations between two submitted structures. We added support for multiple chains to the original adiabatic mapping interpolation, allowing the analysis of subunit motions. We also added the option of using FRODA interpolation, which allows for more complex pathways, potentially overcoming steric barriers. We added an interface to a hinge prediction service, which acts on single structures and predicts likely residue points for flexibility. We developed tools to relate such points of flexibility in a structure to particular key residue positions, i.e. active sites or highly conserved positions. Lastly, we began relating our motion classification scheme to function using descriptions from the Gene Ontology Consortium.

Journal ArticleDOI
TL;DR: This work identifies 14 characteristic sequence features potentially associated with essentiality, such as localization signals, codon adaptation, GC content, and overall hydrophobicity, trains a machine learning classifier capable of predicting essential genes in S. mikatae, and verifies a subset of the predictions with eight in vivo knockouts.
Abstract: Essential genes are required for an organism’s viability, and the ability to identify these genes in pathogens is crucial to directed drug development. Predicting essential genes through computational methods is appealing because it circumvents expensive and difficult experimental screens. Most such prediction is based on homology mapping to experimentally verified essential genes in model organisms. We present here a different approach, one that relies exclusively on sequence features of a gene to estimate essentiality and offers a promising way to identify essential genes in unstudied or uncultured organisms. We identified 14 characteristic sequence features potentially associated with essentiality, such as localization signals, codon adaptation, GC content, and overall hydrophobicity. Using the well-characterized baker’s yeast Saccharomyces cerevisiae, we employed a simple Bayesian framework to measure the correlation of each of these features with essentiality. We then employed the 14 features to learn the parameters of a machine learning classifier capable of predicting essential genes. We trained our classifier on known essential genes in S. cerevisiae and applied it to the closely related and relatively unstudied yeast Saccharomyces mikatae. We assessed predictive success in two ways: First, we compared all of our predictions with those generated by homology mapping between these two species. Second, we verified a subset of our predictions with eight in vivo knockouts in S. mikatae, and we present here the first experimentally confirmed essential genes in this species.
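
A minimal sketch of the classification step, assuming a Gaussian naive Bayes model over a few made-up feature values (the paper uses 14 sequence features and its own Bayesian framework, so this only illustrates the shape of the computation):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical training rows: S. cerevisiae genes described by sequence
# features such as codon adaptation, GC content and hydrophobicity
# (three of the paper's 14 features, with invented values).
X_train = np.array([[0.72, 0.41, -0.3],   # essential
                    [0.65, 0.39, -0.1],   # essential
                    [0.21, 0.35,  0.4],   # non-essential
                    [0.18, 0.44,  0.5]])  # non-essential
y_train = np.array([1, 1, 0, 0])          # 1 = essential

clf = GaussianNB().fit(X_train, y_train)

# Score an unstudied gene (e.g. from S. mikatae) by sequence features alone.
X_new = np.array([[0.70, 0.40, -0.2]])
print(clf.predict_proba(X_new)[0, 1])     # estimated P(essential)
```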

Journal ArticleDOI
TL;DR: TopNet-like Yale Network Analyzer (tYNA) is a Web system for managing, comparing and mining multiple networks, both directed and undirected, that efficiently implements methods that have proven useful in network analysis.
Abstract: Summary: Biological processes involve complex networks of interactions between molecules. Various large-scale experiments and curation efforts have led to preliminary versions of complete cellular networks for a number of organisms. To grapple with these networks, we developed TopNet-like Yale Network Analyzer (tYNA), a Web system for managing, comparing and mining multiple networks, both directed and undirected. tYNA efficiently implements methods that have proven useful in network analysis, including identifying defective cliques, finding small network motifs (such as feed-forward loops), calculating global statistics (such as the clustering coefficient and eccentricity), and identifying hubs and bottlenecks. It also allows one to manage a large number of private and public networks using a flexible tagging system, to filter them based on a variety of criteria, and to visualize them through an interactive graphical interface. A number of commonly used biological datasets have been pre-loaded into tYNA, standardized and grouped into different categories. Availability: The tYNA system can be accessed at http://networks.gersteinlab.org/tyna. The source code, JavaDoc API and WSDL can also be downloaded from the website. tYNA can also be accessed from the Cytoscape software using a plugin. Contact: mark.gerstein@yale.edu Supplementary information: Additional figures and tables can be found at http://networks.gersteinlab.org/tyna/supp
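
The statistics tYNA implements are standard graph measures, so their meaning can be illustrated with networkx on a toy network (this is not tYNA's own code):

```python
import networkx as nx

# Toy undirected interaction network.
G = nx.Graph([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])

degree = dict(G.degree())
betweenness = nx.betweenness_centrality(G)

# Hubs: highest-degree nodes; bottlenecks: highest-betweenness nodes.
hubs = sorted(degree, key=degree.get, reverse=True)[:2]
bottlenecks = sorted(betweenness, key=betweenness.get, reverse=True)[:2]

print("average clustering coefficient:", nx.average_clustering(G))
print("eccentricity per node:", nx.eccentricity(G))
print("hubs:", hubs, "bottlenecks:", bottlenecks)
```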

Journal ArticleDOI
TL;DR: A recent development in microarray research entails the unbiased coverage, or tiling, of genomic DNA for the large-scale identification of transcribed sequences and regulatory elements, and two algorithms for finding an optimal tile path composed of longer sequence tiles are developed.
Abstract: A recent development in microarray research entails the unbiased coverage, or tiling, of genomic DNA for the large-scale identification of transcribed sequences and regulatory elements. A central issue in designing tiling arrays is that of arriving at a single-copy tile path, as significant sequence cross-hybridization can result from the presence of non-unique probes on the array. Due to the fragmentation of genomic DNA caused by the widespread distribution of repetitive elements, the problem of obtaining adequate sequence coverage increases with the sizes of subsequence tiles that are to be included in the design. This becomes increasingly problematic when considering complex eukaryotic genomes that contain many thousands of interspersed repeats. The general problem of sequence tiling can be framed as finding an optimal partitioning of non-repetitive subsequences over a prescribed range of tile sizes, on a DNA sequence comprising repetitive and non-repetitive regions. Exact solutions to the tiling problem become computationally infeasible when applied to large genomes, but successive optimizations are developed that allow their practical implementation. These include an efficient method for determining the degree of similarity of many oligonucleotide sequences over large genomes, and two algorithms for finding an optimal tile path composed of longer sequence tiles. The first algorithm, a dynamic programming approach, finds an optimal tiling in linear time and space; the second applies a heuristic search to reduce the space complexity to a constant requirement. A Web resource has also been developed, accessible at http://tiling.gersteinlab.org, to generate optimal tile paths from user-provided DNA sequences.
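
The dynamic-programming algorithm can be sketched as follows, assuming the objective is simply to maximize the number of non-repetitive bases covered by non-overlapping tiles within a prescribed size range (the paper's scoring is richer):

```python
def optimal_tile_path(repeat_mask, min_len, max_len):
    """Choose non-overlapping, repeat-free tiles of length in
    [min_len, max_len] maximizing total covered sequence.

    repeat_mask[i] is True where base i lies in a repeat. Runs in
    O(n * (max_len - min_len + 1)) time, i.e. linear in sequence length
    for a fixed tile-size range.
    """
    n = len(repeat_mask)
    prefix = [0] * (n + 1)                 # prefix[i]: repeats in mask[:i]
    for i, rep in enumerate(repeat_mask):
        prefix[i + 1] = prefix[i] + int(rep)

    best = [0] * (n + 1)                   # best[i]: max coverage of first i bases
    choice = [None] * (n + 1)              # tile length ending at i, if any
    for i in range(1, n + 1):
        best[i], choice[i] = best[i - 1], None        # option: leave base i-1 bare
        for L in range(min_len, max_len + 1):
            if L <= i and prefix[i] - prefix[i - L] == 0:   # repeat-free window
                if best[i - L] + L > best[i]:
                    best[i], choice[i] = best[i - L] + L, L

    tiles, i = [], n                       # trace back chosen (start, end) tiles
    while i > 0:
        if choice[i] is None:
            i -= 1
        else:
            tiles.append((i - choice[i], i))
            i -= choice[i]
    return tiles[::-1]

# A repeat in the middle splits the sequence into two tileable blocks.
mask = [False] * 10 + [True] * 5 + [False] * 8
print(optimal_tile_path(mask, min_len=4, max_len=6))
```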

Journal ArticleDOI
TL;DR: A new approach, ProCAT, is reported, which corrects for background bias and spatial artifacts, identifies significant signals, filters nonspecific spots, and normalizes the resulting signal to protein abundance.
Abstract: Protein microarrays provide a versatile method for the analysis of many protein biochemical activities. Existing DNA microarray analytical methods do not translate to protein microarrays due to differences between the technologies. Here we report a new approach, ProCAT, which corrects for background bias and spatial artifacts, identifies significant signals, filters nonspecific spots, and normalizes the resulting signal to protein abundance. ProCAT provides a powerful and flexible new approach for analyzing many types of protein microarrays.
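
One of the corrections ProCAT performs, removal of spatial artifacts, can be approximated by subtracting a locally smoothed background surface. The sketch below uses a median filter; ProCAT's actual procedure differs in detail, so this only conveys the idea:

```python
import numpy as np
from scipy.ndimage import median_filter

def correct_spatial_artifacts(signal, window=15):
    """Subtract a local-median trend from a 2D grid of spot intensities.

    A smooth regional bias is captured by the median surface and removed,
    while isolated true signals survive (a sketch, not ProCAT itself).
    """
    trend = median_filter(signal, size=window)
    return signal - trend + np.median(signal)

# A tilted background plane plus two genuine signal spots.
grid = np.fromfunction(lambda r, c: 0.05 * r + 0.02 * c, (64, 64))
grid[10, 10] += 5.0
grid[50, 40] += 5.0
corrected = correct_spatial_artifacts(grid)
print(round(corrected[10, 10] - np.median(corrected), 1))   # ~5.0: spot kept
```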

Journal ArticleDOI
TL;DR: Pseudogenes, the molecular remains of broken genes that are unable to function because of a lethal injury to their structures, are providing scientists with insights that may help in the process of mapping genomes.
Abstract: This article discusses pseudogenes, the molecular remains of broken genes which are unable to function because of a lethal injury to their structures. These pseudogenes are providing scientists with insights that may help in the process of mapping genomes, as well as hints about their own, possibly ongoing, role within the human genome.

Journal ArticleDOI
TL;DR: This work utilizes a systematic approach to discover genotype-phenotype associations that combines phenotypic information from a biomedical informatics database, GIDEON, with the molecular information contained in National Center for Biotechnology Information's Clusters of Orthologous Groups database (NCBI COGs).
Abstract: The ability to rapidly characterize an unknown microorganism is critical in both responding to infectious disease and biodefense. To do this, we need some way of anticipating an organism's phenotype based on the molecules encoded by its genome. However, the link between molecular composition (i.e. genotype) and phenotype for microbes is not obvious. While there have been several studies that address this challenge, none have yet proposed a large-scale method integrating curated biological information. Here we utilize a systematic approach to discover genotype-phenotype associations that combines phenotypic information from a biomedical informatics database, GIDEON, with the molecular information contained in National Center for Biotechnology Information's Clusters of Orthologous Groups database (NCBI COGs). Integrating the information in the two databases, we are able to correlate the presence or absence of a given protein in a microbe with its phenotype as measured by certain morphological characteristics or survival in a particular growth media. With a 0.8 correlation score threshold, 66% of the associations found were confirmed by the literature and at a 0.9 correlation threshold, 86% were positively verified. Our results suggest possible phenotypic manifestations for proteins biochemically associated with sugar metabolism and electron transport. Moreover, we believe our approach can be extended to linking pathogenic phenotypes with functionally related proteins.
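
The core association step can be sketched directly: encode, for each microbe, whether it carries a given COG and whether it shows a given phenotype, then score the pair of binary vectors. Pearson correlation (the phi coefficient) is assumed here as the score; the paper's exact measure may differ:

```python
import numpy as np

def genotype_phenotype_score(cog_presence, phenotype):
    """Correlation between a COG's presence/absence and a binary phenotype.

    cog_presence[i] = 1 if organism i encodes the COG, else 0;
    phenotype[i]    = 1 if organism i shows the trait, else 0.
    """
    return np.corrcoef(cog_presence, phenotype)[0, 1]

# Six hypothetical microbes: the COG tracks the trait in five of six.
cog = np.array([1, 1, 1, 0, 0, 1])
trait = np.array([1, 1, 1, 0, 0, 0])
print(round(genotype_phenotype_score(cog, trait), 2))   # 0.71
```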

Journal ArticleDOI
TL;DR: This work identifies about 160 pseudogenes, 10% of which have clear 'intron-exon' structure and are thus likely generated from recent duplications, and demonstrates that the computational pipeline provides a good balance between identifying all pseudogenes and delineating the precise structure of duplicated genes.
Abstract: Background: Pseudogenes are inheritable genetic elements showing sequence similarity to functional genes but with deleterious mutations. We describe a computational pipeline for identifying them, which in contrast to previous work explicitly uses intron-exon structure in parent genes to classify pseudogenes. We require alignments between duplicated pseudogenes and their parents to span intron-exon junctions, and this can be used to distinguish between true duplicated and processed pseudogenes (with insertions).

Journal ArticleDOI
TL;DR: It is shown that the HMM framework is able to efficiently process tiling array data as well as or better than previous approaches; the results have strong implications for the optimum way medium-scale validation experiments should be carried out to verify the results of genome-scale tiling array experiments.
Abstract: Motivation: Large-scale tiling array experiments are becoming increasingly common in genomics. In particular, the ENCODE project requires the consistent segmentation of many different tiling array datasets into 'active regions' (e.g. finding transfrags from transcriptional data and putative binding sites from ChIP-chip experiments). Previously, such segmentation was done in an unsupervised fashion mainly based on characteristics of the signal distribution in the tiling array data itself. Here we propose a supervised framework for doing this. It has the advantage of explicitly incorporating validated biological knowledge into the model and allowing for formal training and testing. Methodology: In particular, we use a hidden Markov model (HMM) framework, which is capable of explicitly modeling the dependency between neighboring probes and whose extended version (the generalized HMM) also allows explicit description of state duration density. We introduce a formal definition of the tiling-array analysis problem, and explain how we can use this to describe sampling small genomic regions for experimental validation to build up a gold-standard set for training and testing. We then describe various ideal and practical sampling strategies (e.g. maximizing signal entropy within a selected region versus using gene annotation or known promoters as positives for transcription or ChIP-chip data, respectively). Results: For the practical sampling and training strategies, we show how the size and noise in the validated training data affects the performance of an HMM applied to the ENCODE transcriptional and ChIP-chip experiments. In particular, we show that the HMM framework is able to efficiently process tiling array data as well as or better than previous approaches. For the idealized sampling strategies, we show how we can assess their performance in a simulation framework and how a maximum entropy approach, which samples sub-regions with very different signal intensities, gives the maximally performing gold-standard. This latter result has strong implications for the optimum way medium-scale validation experiments should be carried out to verify the results of the genome-scale tiling array experiments. Supplementary information: The supplementary data are available at http://tiling.gersteinlab.org/hmm/ Contact: mark.gerstein@yale.edu
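
The segmentation can be illustrated with a minimal two-state Gaussian HMM decoded by Viterbi. The emission means and transition probability below are placeholders; in the paper these parameters are trained on the validated gold-standard set:

```python
import numpy as np
from scipy.stats import norm

def viterbi_segment(signal, means=(0.0, 2.0), sds=(1.0, 1.0), p_stay=0.99):
    """Label each probe as background (0) or active region (1)."""
    n = len(signal)
    log_trans = np.log(np.array([[p_stay, 1 - p_stay],
                                 [1 - p_stay, p_stay]]))
    log_emit = np.stack([norm.logpdf(signal, means[s], sds[s])
                         for s in (0, 1)])          # shape (2, n)
    V = np.empty((2, n))                            # best log-prob per state
    back = np.zeros((2, n), dtype=int)              # backpointers
    V[:, 0] = np.log(0.5) + log_emit[:, 0]
    for t in range(1, n):
        scores = V[:, t - 1][:, None] + log_trans   # (from-state, to-state)
        back[:, t] = np.argmax(scores, axis=0)
        V[:, t] = scores[back[:, t], [0, 1]] + log_emit[:, t]
    path = [int(np.argmax(V[:, -1]))]               # trace back best path
    for t in range(n - 1, 0, -1):
        path.append(back[path[-1], t])
    return path[::-1]

# Noisy signal with an elevated block in the middle is segmented out.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 50), rng.normal(2, 1, 30),
                    rng.normal(0, 1, 50)])
print("".join(map(str, viterbi_segment(x))))
```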

Journal ArticleDOI
15 Dec 2006-Science
TL;DR: The science of the Web should enumerate the range of information requests that can be fruitfully made and the kinds of information infrastructure and data-mining techniques needed to fulfill them, to help develop a new science of practical data mining focusing on questions answerable with the existing digital libraries of information.
Abstract: We read with great interest the Perspective “Creating a science of the Web” by T. Berners-Lee et al. (11 Aug, p. 769). We agree that evolving Web technologies enable the creation of novel structures of information, whose properties and dynamics can be fruitfully studied. More generally, we would like to point out that the Web is a specific phenomenon associated with the increasing prevalence of information being digitized and linked together into complicated structures. The complexity of these structures underscores the need for systematic, large-scale data mining both to uncover new patterns in social interactions and to make discoveries in science through connecting disparate findings. For this vision to be realized, we have to develop a new science of practical data mining focusing on questions answerable with the existing digital libraries of information. In particular, today, free-text search (as embodied by Google) is the primary means of mining the Web, but there are many kinds of information requests it cannot handle. Queries combining general, standardized annotation about pages (such as from the semantic Web) with free-text search within them are often not supported—e.g., doing a full-text search of all biophysics blogs emanating just from governmental institutions within 100 miles of Chicago. Furthermore, it would be useful to develop ways of leveraging the small amounts of highly structured information in the semantic Web as “gold-standard training sets” to help bootstrap the querying and clustering of the large bodies of unstructured information on the Web as a whole. Thus, the science of the Web should enumerate the range of information requests that can be fruitfully made and the kinds of information infrastructure and data-mining techniques needed to fulfill them. Response: We agree with Smith and Gerstein's view that data mining is among the many important areas of research that are considering the Web as an object of scientific inquiry. They are correct in pointing out the importance of “text mining,” the basis of current Web search, for providing new Web capabilities. However, with the increasing amount of directly machine-readable data that are available on the Web (coming from, for example, database-producing equipment such as modern scientific devices and data-oriented applications), it is also clear that text mining needs to be augmented with new data technologies that work more directly with data and meta-data. Data mining is also an excellent case in point for the main focus of our Perspective in relation to the interdisciplinary nature of the emerging science of the Web. Analytic modeling techniques will be needed to understand where Web data reside and how they can best be accessed and integrated. Engineering and language development are needed if we are to be able to perform data mining without having to pull all the information into centralized data servers of a scale that only the few largest search companies can currently afford. In addition, data mining provides not just opportunities for better search, but also real policy issues with respect to information access and user privacy, especially where multiple data sources are aggregated into searchable forms.

Journal ArticleDOI
TL;DR: This work defines protein-protein interaction broadly as co-complexation, and develops a weighted-voting procedure to predict interactions among yeast helical membrane proteins by optimally combining evidence based on diverse genome-wide information.

Journal ArticleDOI
TL;DR: This work examines insertion site specificity and global insertion behavior of two mini-transposons previously used for large-scale gene disruption in Saccharomyces cerevisiae: Tn3 and Tn7, and develops a windowed Kolmogorov–Smirnov (K–S) test to analyze transposon insertion distributions in sequence windows of various sizes.
Abstract: Transposons are widely employed as tools for gene disruption. Ideally, they should display unbiased insertion behavior, and incorporate readily into any genomic DNA to which they are exposed. However, many transposons preferentially insert at specific nucleotide sequences. It is unclear to what extent such bias affects their usefulness as mutagenesis tools. Here, we examine insertion site specificity and global insertion behavior of two mini-transposons previously used for large-scale gene disruption in Saccharomyces cerevisiae: Tn3 and Tn7. Using an expanded set of insertion data, we confirm that Tn3 displays marked preference for the AT-rich 5 bp consensus site TA[A/T]TA, whereas Tn7 displays negligible target site preference. On a genome level, both transposons display marked non-uniform insertion behavior: certain sites are targeted far more often than expected, and both distributions depart drastically from Poisson. Thus, to compare their insertion behavior on a genome level, we developed a windowed Kolmogorov–Smirnov (K–S) test to analyze transposon insertion distributions in sequence windows of various sizes. We find that when scored in large windows (>300 bp), both Tn3 and Tn7 distributions appear uniform, whereas in smaller windows, Tn7 appears uniform while Tn3 does not. Thus, both transposons are effective tools for gene disruption, but Tn7 does so with less duplication and a more uniform distribution, better approximating the behavior of the ideal transposon.
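
The windowed K–S test can be sketched as follows: within each fixed-size window, rescale the insertion positions to [0, 1) and test them against the uniform distribution. The window size and the minimum number of insertions per window are assumptions:

```python
import numpy as np
from scipy.stats import kstest

def windowed_ks(insertions, genome_len, window, min_hits=5):
    """Per-window K-S test of insertion positions against uniformity.

    Returns (window_start, D_statistic, p_value) for each window with at
    least min_hits insertions.
    """
    pos = np.sort(np.asarray(insertions))
    results = []
    for start in range(0, genome_len, window):
        inside = pos[(pos >= start) & (pos < start + window)]
        if len(inside) < min_hits:
            continue
        rel = (inside - start) / window            # rescale to [0, 1)
        d, p = kstest(rel, "uniform")
        results.append((start, d, p))
    return results

# Uniform background plus one hotspot: the hotspot window gets a large D.
rng = np.random.default_rng(0)
hits = np.concatenate([rng.integers(0, 10_000, 200),     # background
                       rng.integers(5_000, 5_050, 50)])  # insertion hotspot
for start, d, p in windowed_ks(hits, 10_000, 1_000):
    print(start, round(d, 2), f"{p:.1e}")
```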

Journal ArticleDOI
TL;DR: This work demonstrates how basic molecular networks are distinct yet connected and well coordinated; the long-distance regulation in metabolic networks agrees with its counterpart in social networks (namely, assembly lines).
Abstract: Background: Molecular networks are of current interest, particularly with the publication of many large-scale datasets. Previous analyses have focused on topologic structures of individual networks. Results: Here, we present a global comparison of four basic molecular networks: regulatory, coexpression, interaction, and metabolic. In terms of overall topologic correlation - whether nearby proteins in one network are close in another - we find that the four are quite similar. However, focusing on the occurrence of local features, we introduce the concept of composite hubs, namely hubs shared by more than one network. We find that the three 'action' networks (metabolic, coexpression, and interaction) share the same scaffolding of hubs, whereas the regulatory network uses distinctly different regulator hubs. Finally, we examine the inter-relationship between the regulatory network and the three action networks, focusing on three composite motifs - triangles, trusses, and bridges - involving different degrees of regulation of gene pairs. Our analysis shows that interaction and co-expression networks have short-range relationships, with directly interacting and co-expressed proteins sharing regulators. However, the metabolic network contains many long-distance relationships: far-away enzymes in a pathway often have time-delayed expression relationships, which are well coordinated by bridges connecting their regulators. Conclusion: We demonstrate how basic molecular networks are distinct yet connected and well coordinated. Many of our conclusions can be mapped onto structured social networks, providing intuitive comparisons. In particular, the long-distance regulation in metabolic networks agrees with its counterpart in social networks (namely, assembly lines). Conversely, the segregation of regulator hubs from other hubs diverges from social intuitions (as managers often are centers of interactions).

Book ChapterDOI
TL;DR: This chapter presents some of the most widely used statistical techniques for normalizing and scoring traditional microarray data and indicates their potential utility for analyzing the newer protein and tiling microarray experiments.
Abstract: A credit to microarray technology is its broad application. Two experiments--the tiling microarray experiment and the protein microarray experiment--are exemplars of the versatility of the microarrays. With the technology's expanding list of uses, the corresponding bioinformatics must evolve in step. There currently exists a rich literature developing statistical techniques for analyzing traditional gene-centric DNA microarrays, so the first challenge in analyzing the advanced technologies is to identify which of the existing statistical protocols are relevant and where and when revised methods are needed. A second challenge is making these often very technical ideas accessible to the broader microarray community. The aim of this chapter is to present some of the most widely used statistical techniques for normalizing and scoring traditional microarray data and indicate their potential utility for analyzing the newer protein and tiling microarray experiments. In so doing, we will assume little or no prior training in statistics of the reader. Areas covered include background correction, intensity normalization, spatial normalization, and the testing of statistical significance.
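
As one concrete example of the intensity-normalization techniques covered, here is quantile normalization, which forces every array to share the same intensity distribution; whether such gene-centric methods carry over to protein and tiling arrays is the kind of question the chapter weighs:

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize arrays stored as columns of X.

    Each column is replaced rank-for-rank by the mean of the sorted
    columns, so all arrays end up with an identical distribution.
    """
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)   # per-column ranks
    reference = np.sort(X, axis=0).mean(axis=1)         # shared distribution
    return reference[ranks]

# Three arrays on different scales end up with identical value sets.
X = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 4.5, 6.0],
              [4.0, 2.0, 5.0]])
print(quantile_normalize(X))
```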

Journal ArticleDOI
TL;DR: An automated web tool, COP (COrrelations by Positional artifacts), is developed to detect positional artifacts in microarray experiments; genes that are close on the microarray chips tend to have higher correlations between their expression profiles.
Abstract: Microarray technology is currently one of the most widely-used technologies in biology. Many studies focus on inferring the function of an unknown gene from its co-expressed genes. Here, we are able to show that there are two types of positional artifacts in microarray data introducing spurious correlations between genes. First, we find that genes that are close on the microarray chips tend to have higher correlations between their expression profiles. We call this the ‘chip artifact’. Our calculations suggest that the carry-over during the printing process is one of the major sources of this type of artifact, which is later confirmed by our experiments. Based on our experiments, the measured intensity of a microarray spot contains 0.1% (for fully-hybridized spots) to 93% (for un-hybridized ones) of noise resulting from this artifact. Secondly, we, for the first time, show that genes that are close on the microtiter plates in microarray experiments also tend to have higher correlations. We call this the ‘plate artifact’. Both types of artifacts exist with different severity in all cDNA microarray experiments that we analyzed. Therefore, we develop an automated web tool, COP (COrrelations by Positional artifacts), to detect these artifacts in microarray experiments. COP has been integrated with the microarray data normalization tool, ExpressYourself, which is available at http://bioinfo.mbb.yale.edu/ExpressYourself/. Together, the two can eliminate most of the common noises in microarray data.

Journal ArticleDOI
TL;DR: A comprehensive package of tools for analyzing helix-helix packing in proteins is developed, including quantitative measures of the helix interaction surface area and helix crossing angle, as well as several methods for visualizing the helical interaction.
Abstract: Motivation: In many proteins, helix-helix interactions can be critical to establishing protein conformation (folding) and dynamics, as well as determining associations between protein units. However, the determination of a set of rules that guide helix-helix interaction has been elusive. In order to gain further insight into the helix-helix interface, we have developed a comprehensive package of tools for analyzing helix-helix packing in proteins. These tools are available at http://helix.gersteinlab.org. They include quantitative measures of the helix interaction surface area and helix crossing angle, as well as several methods for visualizing the helical interaction. These methods can be used for analysis of individual protein conformations or to gain insight into dynamic changes in helix interactions. For the latter purpose, a direct interface from entries in the Molecular Motions Database to the HIT site has been provided. Contact: Mark.Gerstein@yale.edu
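
One of the package's quantitative measures, the helix crossing angle, can be sketched from C-alpha coordinates by fitting each helix axis as a principal component. Axis and sign conventions vary between tools, so treat this as one plausible definition rather than the package's exact one:

```python
import numpy as np

def helix_axis(ca_coords):
    """Helix axis as the first principal component of C-alpha positions,
    oriented from the N- to the C-terminus."""
    centered = ca_coords - ca_coords.mean(axis=0)
    axis = np.linalg.svd(centered)[2][0]              # top right-singular vector
    if np.dot(axis, ca_coords[-1] - ca_coords[0]) < 0:
        axis = -axis                                  # enforce N -> C direction
    return axis

def crossing_angle(helix1, helix2):
    """Interhelical angle in degrees between two (n, 3) C-alpha arrays."""
    cos = np.clip(np.dot(helix_axis(helix1), helix_axis(helix2)), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

# Two idealized straight axes 60 degrees apart (toy coordinates).
t = np.linspace(0, 10, 12)[:, None]
h1 = t * np.array([1.0, 0.0, 0.0])
h2 = t * np.array([np.cos(np.pi / 3), np.sin(np.pi / 3), 0.0])
print(round(crossing_angle(h1, h2), 1))   # 60.0
```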

Book ChapterDOI
01 Jan 2006
TL;DR: It is shown that the occurrence of protein folds in 20 completely sequenced genomes follows a power-law distribution, where the number of folds with a given genomic occurrence decays as F(V) = aV^(−b), with a few occurring many times and most occurring infrequently.
Abstract: Motivation: Global surveys of protein folds in genomes measure the usage of essential molecular parts in different organisms. In a recent survey, we showed that the occurrence of protein folds in 20 completely sequenced genomes follows a power-law distribution; i.e., the number of folds (F) with a given genomic occurrence (V) decays as F(V) = aV^(−b), with a few occurring many times and most occurring infrequently. Clearly, such a distribution results from the way in which genomes have evolved into their current states.
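
The power law F(V) = aV^(−b) can be fitted by ordinary least squares in log-log space, since log F = log a − b log V. A sketch (the survey's actual estimator is not specified here):

```python
import numpy as np

def fit_power_law(occurrence, fold_count):
    """Fit F(V) = a * V**(-b) by least squares on log-transformed data.

    occurrence: genomic occurrence V of each fold-count bin;
    fold_count: number of folds F(V) with that occurrence.
    """
    slope, intercept = np.polyfit(np.log(occurrence), np.log(fold_count), 1)
    return np.exp(intercept), -slope                  # a, b

# Synthetic counts following F(V) = 90 * V**-2 recover a ~ 90, b ~ 2.
V = np.arange(1, 20)
F = 90.0 * V ** -2.0
a, b = fit_power_law(V, F)
print(round(a, 1), round(b, 2))   # 90.0 2.0
```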

Journal ArticleDOI
TL;DR: Analysis of the mapping results of RNA isolated from five cell/tissue types (NB4 cells, NB4 cells treated with retinoic acid, NB4 cells treated with TPA, neutrophils, and placenta) throughout the ENCODE region reveals a large number of novel transcribed regions, suggesting that many of them may have a functional role.
Abstract: We have used genomic tiling arrays to identify transcribed regions throughout the human genome. Analysis of the mapping results of RNA isolated from five cell/tissue types, NB4 cells, NB4 cells treated with retinoic acid (RA), NB4 cells treated with 12-O-tetradecanoylphorbol-13 acetate (TPA), neutrophils, and placenta, throughout the ENCODE region reveals a large number of novel transcribed regions. Interestingly, neutrophils exhibit a great deal of novel expression in several intronic regions. Comparison of the hybridization results of NB4 cells treated with different stimuli relative to untreated cells reveals that many new regions are expressed upon cell differentiation. One such region is the Hox locus, which contains a large number of novel regions expressed in a number of cell types. Analysis of the trinucleotide composition of the novel transcribed regions reveals that it is similar to that of known exons. These results suggest that many of the novel transcribed regions may have a functional role.

Journal ArticleDOI
Mark Gerstein
06 Apr 2006-Nature
TL;DR: Although it is believed that progress on scenario development can and will be made, the elements of ‘up-to-date’ economic theory identified as overlooked are either too vague to be meaningful, or are issues the community has been dealing with for years.
Abstract: SIR — Your Special Report “The costs of global warming” (Nature 439, 374–375; 2006) gives an unbalanced picture of the emissions scenarios developed by the Intergovernmental Panel on Climate Change (IPCC). In contrast to the claim that these scenarios are outdated, a recent peer-reviewed assessment has concluded that, with a few notable exceptions, they compare reasonably well to recent data and projections for gross domestic product, population and emissions (D.v.V. and B.O’N. Clim. Change, in the press. doi: 10.1007/s10584-005-9031-0; see www.iiasa.ac.at/Research/PCC/pubs/vanVuuren&ONeill2006_CC_uncorproof.pdf). Although we believe that progress on scenario development can and will be made, the elements of ‘up-to-date’ economic theory identified as overlooked — “how future societies will operate, how fast the population will grow, and how technological progress will change things” — are either too vague to be meaningful, or are issues the community has been dealing with for years. The Energy Modeling Forum has a 30-year history of model comparisons, exploring the implications for climate policy of a range of rates of economic growth and technological change (D. W. Gaskins and J. P. Weyant Am. Econ. Rev. 83, 318–323; 1993, and J. P. Weyant Energy Econ. 26, 501–515; 2004). It is not correct to imply that the scenarios only use market exchange rates, or that they all assume that “the economies of poor countries will quickly catch up with those of rich nations”. Some scenarios are also reported in terms of purchasing-power parity exchange rates in the original 2000 IPCC Special Report. The debate on the emissions impacts of alternative exchange rates in economic modelling is not conclusive, but such impacts are likely to be small compared with the influence of technology, lifestyle and climate policies. And in no scenario do developing countries become as affluent as industrialized ones. The assumed degree of catching up in the scenarios covers a wide range of possibilities. Focusing on a small number of most-likely futures ignores lessons from history: if the world always worked according to best-guess projections, we would now be living with nuclear power too cheap to meter and no ozone hole. Arnulf Grubler*†, Brian O’Neill*‡, Detlef van Vuuren§ *International Institute for Applied Systems Analysis, A-2361 Laxenburg, Austria †School of Forestry & Environmental Studies, Yale University, New Haven, Connecticut 06511, USA ‡Watson Institute for International Studies, Brown University, Providence, Rhode Island 02912, USA §Netherlands Environmental Assessment Agency, PO Box 303, 3720 BA Bilthoven, The Netherlands