Showing papers on "Munich Information Center for Protein Sequences" published in 2004


Journal Article
TL;DR: The Munich Information Center for Protein Sequences (MIPS at the GSF), Neuherberg, Germany, provides resources related to genome information and develops databases covering computable information such as the basic evolutionary relations among all genes.
Abstract: The Munich Information Center for Protein Sequences (MIPS-GSF), Neuherberg, Germany, provides protein sequence-related information based on whole-genome analysis. The main focus of the work is directed toward the systematic organization of sequence-related attributes as gathered by a variety of algorithms, primary information from experimental data together with information compiled from the scientific literature. MIPS maintains automatically generated and manually annotated genome-specific databases, develops systematic classification schemes for the functional annotation of protein sequences and provides tools for the comprehensive analysis of protein sequences. This report updates the information on the yeast genome (CYGD), the Neurospora crassa genome (MNCDB), the database of complete cDNAs (German Human Genome Project, NGFN), the database of mammalian protein-protein interactions (MPPI), the database of FASTA homologies (SIMAP), and the interface for the fast retrieval of protein-associated information (QUIPOS). The Arabidopsis thaliana database, the rice database, the plant EST databases (MATDB, MOsDB, SPUTNIK), as well as the databases for the comprehensive set of genomes (PEDANT genomes) are described elsewhere in the 2003 and 2004 NAR database issues, respectively. All databases described, and the detailed descriptions of our projects can be accessed through the MIPS web server (http://mips.gsf.de).

799 citations


Journal Article
TL;DR: A Markov random field method is applied to the prediction of yeast protein function based on multiple protein-protein interaction datasets, and function is assigned to unknown proteins with a probability representing the confidence of this prediction.
Abstract: Motivation: The Gene Ontology (GO) consortium provides a structured description of protein function that is used as a common language for gene annotation in many organisms. Large-scale techniques have generated many valuable protein-protein interaction datasets that are useful for the study of protein function. Combining GO and protein-protein interaction data allows the prediction of function for unknown proteins. Results: We apply a Markov random field method to the prediction of yeast protein function based on multiple protein-protein interaction datasets. We assign function to unknown proteins with a probability representing the confidence of this prediction. The functions are based on three general categories of cellular component, molecular function and biological process defined in GO. The yeast proteins are defined in the Saccharomyces Genome Database (SGD). The protein-protein interaction datasets are obtained from the Munich Information Center for Protein Sequences (MIPS), including physical interactions and genetic interactions. The efficiency of our prediction is measured by applying the leave-one-out validation procedure to a functional path matching scheme, which compares the prediction with the GO description of a protein's function from the abstract level to the detailed level along the GO structure. For biological process, the leave-one-out validation procedure shows 52% precision and recall for our method, much better than that of the simple guilt-by-association methods. Supplementary material: http://www.cmb.usc.edu/~msms/gomapping
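
The comparison baseline and the validation procedure mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' Markov random field method: it implements only a simple neighbor-voting, guilt-by-association predictor and a leave-one-out loop. The function names, the `top_k` parameter, the toy network and the flat label matching (instead of the paper's functional path matching along the GO structure) are all assumptions made for illustration.

```python
from collections import Counter


def predict_by_neighbors(protein, interactions, annotations):
    """Count how many annotated interaction partners carry each function."""
    votes = Counter()
    for partner in interactions.get(protein, ()):
        votes.update(annotations.get(partner, ()))
    return votes


def leave_one_out(interactions, annotations, top_k=1):
    """Hide each annotated protein in turn, predict it, pool precision and recall."""
    hits = predicted = known = 0
    for protein, true_functions in annotations.items():
        # Hide this protein's own labels; its partners keep theirs.
        visible = {p: f for p, f in annotations.items() if p != protein}
        votes = predict_by_neighbors(protein, interactions, visible)
        guesses = {function for function, _ in votes.most_common(top_k)}
        hits += len(guesses & set(true_functions))
        predicted += len(guesses)
        known += len(true_functions)
    precision = hits / predicted if predicted else 0.0
    recall = hits / known if known else 0.0
    return precision, recall


# Toy, made-up network (hypothetical protein names and labels).
interactions = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}
annotations = {
    "A": {"transport"},
    "B": {"transport"},
    "C": {"transport"},
    "D": {"signaling"},
}
precision, recall = leave_one_out(interactions, annotations)
```

In this toy network, protein D is mispredicted because its only partner carries a different label, which is the kind of error a probabilistic model over the whole network aims to reduce.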

161 citations


Journal Article
TL;DR: Joint application strategies combine the strengths of two microbial gene finders, Critica and Glimmer, to improve overall gene finding performance, yielding a significant improvement in specificity while keeping sensitivity similar to that of Glimmer.
Abstract: Motivation: As a starting point in annotation of bacterial genomes, gene finding programs are used for the prediction of functional elements in the DNA sequence. Due to the faster pace and increasing number of genome projects currently underway, it is becoming especially important to have performant methods for this task. Results: This study describes the development of joint application strategies that combine the strengths of two microbial gene finders to improve the overall gene finding performance. Critica is very specific in the detection of similarity-supported genes as it uses a comparative sequence analysis-based approach. Glimmer employs a very sophisticated model of genomic sequence properties and is sensitive also in the detection of organism-specific genes. Based on a data set of 113 microbial genome sequences, we optimized a combined application approach using different parameters with relevance to the gene finding problem. This results in a significant improvement in specificity while there is similarity in sensitivity to Glimmer. The improvement is especially pronounced for GC rich genomes. The method is currently being applied for the annotation of several microbial genomes. Availability: The methods described have been implemented within the gene prediction component of the GenDB genome annotation system.

82 citations


Journal Article
TL;DR: A scoring system is applied to study yeast protein complexes by using the Saccharomyces cerevisiae protein complexes database of the Munich Information Center for Protein Sequences and links the expression of the Alzheimer's disease hallmark gene APP to the beta-site-cleaving enzymes BACE and BACE2.
Abstract: Statistical similarity analysis has been instrumental in elucidation of the voluminous microarray data. Genes with correlated expression profiles tend to be functionally associated. However, the majority of functionally associated genes turn out to be uncorrelated. One conceivable reason is that the expression of a gene can be sensitively dependent on the often-varying cellular state. The intrinsic state change has to be plastically accommodated by gene-regulatory mechanisms. To capture such dynamic coexpression between genes, a concept termed "liquid association" (LA) has been introduced recently. LA offers a scoring system to guide a genome-wide search for critical cellular players that may interfere with the coexpression of a pair of genes, thereby weakening their overall correlation. Although the LA method works in many cases, a direct extension to more than two genes is hindered by the "curse of dimensionality." Here we introduce a strategy of finding an informative 2D projection to generalize LA for multiple genes. A web site is constructed that performs on-line LA computation for any user-specified group of genes. We apply this scoring system to study yeast protein complexes by using the Saccharomyces cerevisiae protein complexes database of the Munich Information Center for Protein Sequences. Human genes are also investigated by profiling of 60 cancer cell lines of the National Cancer Institute. In particular, our system links the expression of the Alzheimer's disease hallmark gene APP (amyloid-beta precursor protein) to the beta-site-cleaving enzymes BACE and BACE2, the gamma-site-cleaving enzymes presenilin 1 and 2, apolipoprotein E, and other Alzheimer's disease-related genes.
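
The abstract does not state the LA formula. As a rough sketch, assuming the commonly cited form of the original three-gene LA score (the average of the product X·Y·Z after standardizing X and Y and replacing Z by its normal scores), the idea can be illustrated as follows. The function names and toy data are hypothetical, ties in Z are not handled, and the 2D-projection generalization introduced in this paper is not attempted here.

```python
from statistics import NormalDist, mean, pstdev


def standardize(values):
    """Scale values to mean 0 and (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def normal_scores(values):
    """Replace each value by the normal quantile of its rank (ties not handled)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, index in enumerate(order, start=1):
        scores[index] = NormalDist().inv_cdf(rank / (n + 1))
    return scores


def liquid_association(x, y, z):
    """Assumed LA estimate: average of x*y*z on the transformed scales."""
    xs, ys, zs = standardize(x), standardize(y), normal_scores(z)
    return sum(a * b * c for a, b, c in zip(xs, ys, zs)) / len(xs)


# Toy profiles: x and y are anti-correlated when z is low and
# co-expressed when z is high, so their overall correlation is zero.
z = [1, 2, 3, 4, 5, 6, 7, 8]
x = [-1, 1, -1, 1, -1, 1, -1, 1]
y = [1, -1, 1, -1, -1, 1, -1, 1]

overall_correlation = mean(a * b for a, b in zip(standardize(x), standardize(y)))
la_score = liquid_association(x, y, z)
```

Here the overall correlation between x and y vanishes while the LA score is positive, which reflects the situation described above: functionally associated genes can look uncorrelated when their coexpression depends on a third variable.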

77 citations


Journal Article
TL;DR: A method is presented for mining unannotated or annotated genome sequences with proteomic data to identify open reading frames; it is demonstrated on experimental data from Mycobacterium tuberculosis and is also shown to work with eukaryotic organisms.
Abstract: We present a method for mining unannotated or annotated genome sequences with proteomic data to identify open reading frames. The region of a genome coding for a protein sequence is identified by using information from the analysis of proteins and peptides with MALDI-TOF mass spectrometry. The raw genome sequence or any unassembled contigs of an organism are theoretically cleaved into a number of equal sized but overlapping fragments, and these are then translated in all six frames into a series of virtual proteins. Each virtual protein is then subjected to a theoretical enzymatic digestion. Standard proteomic sample preparation methods are used to separate, array, and digest the proteins of interest to peptides. The masses of the resulting peptides are measured using mass spectrometry and compared to the theoretical peptide masses of the virtual proteins. The region of the genome responsible for coding for a particular protein can then be identified when there are a large number of hits between peptides from the protein and peptides from the virtual protein. The method makes no assumptions about the location of a protein in a particular gene sequence or the positions or types of start and stop codons. To illustrate this approach, all 773 proteins of Pseudomonas aeruginosa contained in SWISS-PROT were used to theoretically test the method and optimize parameters. Increasing the size of the virtual proteins results in an overall improvement in the ability to detect the coding region, at the cost of decreasing the sensitivity of the method for smaller proteins. Increasing the minimum number of matching peptides, lowering the mass error tolerance, or increasing the signal-to-noise ratio of the simulated mass spectrum, improves the ability to detect coding regions. The method is further demonstrated on experimental data from Mycobacterium tuberculosis and is also shown to work with eukaryotic organisms (e.g., Homo sapiens).
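
The pipeline described in this abstract (overlapping fragments, six-frame translation, theoretical digestion, peptide mass comparison) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not name the enzyme, so a trypsin-like rule (cleave after K or R unless followed by P) is assumed; splitting at stop codons, the monoisotopic mass table, the tolerance value and all function names are likewise choices made for this sketch.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Monoisotopic residue masses in daltons; one water is added per peptide.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def overlapping_fragments(dna, size, step):
    """Cut the sequence into equal-sized windows that overlap when step < size."""
    return [dna[i:i + size] for i in range(0, max(len(dna) - size, 0) + 1, step)]


def six_frame_translations(dna):
    """Translate three frames on each strand; unknown codons become 'X'."""
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            frames.append("".join(CODON_TABLE.get(c, "X") for c in codons))
    return frames


def tryptic_peptides(protein, min_length=2):
    """Assumed trypsin-like digestion; stop codons ('*') also end a peptide."""
    peptides, current = [], ""
    for i, residue in enumerate(protein):
        if residue == "*":
            peptides.append(current)
            current = ""
            continue
        current += residue
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(current)
            current = ""
    peptides.append(current)
    return [p for p in peptides if len(p) >= min_length and "X" not in p]


def peptide_mass(peptide):
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def count_matches(observed_masses, peptides, tolerance=0.2):
    """Number of observed masses within tolerance of any theoretical peptide."""
    theoretical = [peptide_mass(p) for p in peptides]
    return sum(any(abs(m - t) <= tolerance for t in theoretical)
               for m in observed_masses)


def best_hit(dna, observed_masses, size, step):
    """Return (match count, fragment index, frame index) of the best virtual protein."""
    scores = [
        (count_matches(observed_masses, tryptic_peptides(protein)), f, k)
        for f, fragment in enumerate(overlapping_fragments(dna, size, step))
        for k, protein in enumerate(six_frame_translations(fragment))
    ]
    return max(scores, key=lambda s: s[0])


# Made-up example: a short sequence encoding M-A-G-K-S-L-R followed by a stop.
dna = "ATGGCTGGTAAATCTCTGCGTTAA"
observed = [peptide_mass("MAGK"), peptide_mass("SLR"), 5000.0]
hit = best_hit(dna, observed, size=24, step=12)
```

A fragment and frame with many matching peptide masses points to the likely coding region. The `size`, `step` and `tolerance` parameters correspond to the kinds of settings the abstract reports optimizing (virtual protein size and mass error tolerance), though the values used here are arbitrary.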

29 citations