Showing papers on "UniProt" published in 2006


Journal ArticleDOI
TL;DR: The Universal Protein Resource (UniProt) provides a central resource on protein sequences and functional annotation with three database components, each addressing a key need in protein bioinformatics.
Abstract: The Universal Protein Resource (UniProt) provides a central resource on protein sequences and functional annotation with three database components, each addressing a key need in protein bioinformatics. The UniProt Knowledgebase (UniProtKB), comprising the manually annotated UniProtKB/Swiss-Prot section and the automatically annotated UniProtKB/TrEMBL section, is the preeminent storehouse of protein annotation. The extensive cross-references, functional and feature annotations and literature-based evidence attribution enable scientists to analyse proteins and query across databases. The UniProt Reference Clusters (UniRef) speed similarity searches via sequence space compression by merging sequences that are 100% (UniRef100), 90% (UniRef90) or 50% (UniRef50) identical. Finally, the UniProt Archive (UniParc) stores all publicly available protein sequences, containing the history of sequence data with links to the source databases. UniProt databases continue to grow in size and in availability of information. Recent and upcoming changes to database contents, formats, controlled vocabularies and services are described. New download availability includes all major releases of UniProtKB, sequence collections by taxonomic division and complete proteomes. A bibliography mapping service has been added, and an ID mapping service will be available soon. UniProt databases can be accessed online at http://www.uniprot.org or downloaded at ftp://ftp.uniprot.org/pub/databases/.

1,092 citations


Journal ArticleDOI
TL;DR: The SWISS-MODEL Repository is a database of annotated 3D protein structure models generated by the SWISS-MODEL homology-modelling pipeline that reflects the current state of sequence and structure databases.
Abstract: The SWISS-MODEL Repository is a database of annotated 3D protein structure models generated by the SWISS-MODEL homology-modelling pipeline. As of September 2005, the repository contained 675,000 models for 604,000 different protein sequences of the UniProt database. Regular updates ensure that the content of the repository reflects the current state of sequence and structure databases, integrating new or modified target sequences, and making use of new template structures. Each Repository entry consists of one or more 3D models accompanied by detailed information about the target protein and the model building process: functional annotation, a detailed template selection log, target-template alignment, summary of the model building and model quality assessment. The SWISS-MODEL Repository is freely accessible at http://swissmodel.expasy.org/repository/.

851 citations


Journal ArticleDOI
TL;DR: Although UCSC Known Genes offers the highest genomic and CDS coverage among major human and mouse gene sets, more detailed analysis suggests all of them could be further improved.
Abstract: The University of California Santa Cruz (UCSC) Known Genes dataset is constructed by a fully automated process, based on protein data from Swiss-Prot/TrEMBL (UniProt) and the associated mRNA data from GenBank. The detailed steps of this process are described. Extensive cross-references from this dataset to other genomic and proteomic data were constructed. For each known gene, a details page is provided containing rich information about the gene, together with extensive links to other relevant genomic, proteomic and pathway data. As of July 2005, the UCSC Known Genes are available for human, mouse and rat genomes. The Known Genes dataset serves as a foundation to support several key programs: the Genome Browser, Proteome Browser, Gene Sorter and Table Browser offered at the UCSC website. All the associated data files and program source code are also available. They can be accessed at http://genome.ucsc.edu. The genomic coverage of UCSC Known Genes, RefSeq, Ensembl Genes, H-Invitational and CCDS is analyzed. Although UCSC Known Genes offers the highest genomic and CDS coverage among major human and mouse gene sets, more detailed analysis suggests all of them could be further improved. Contact: fanhsu@soe.ucsc.edu

507 citations


Journal ArticleDOI
TL;DR: The SPODOBASE database provides integrated access to expressed sequence tags (EST) from the lepidopteran insect Spodoptera frugiperda, which will allow identification of a number of genes and comprehensive cloning of gene families of interest for the scientific community.
Abstract: The lepidopteran Spodoptera frugiperda is a pest that causes widespread economic damage on a variety of crop plants. It is also well known through its Sf9 cell line, which is used for numerous heterologous protein productions. Species of the Spodoptera genus are used as models for pesticide resistance and to study virus-host interactions. A genomic approach is now a critical step for further developments in the biology and pathology of these insects, and the results of EST sequencing efforts need to be structured into databases providing an integrated set of tools and information. The ESTs from five independent cDNA libraries, prepared from three different S. frugiperda tissues (hemocytes, midgut and fat body) and from the Sf9 cell line, are deposited in the database. These tissues were chosen because of their importance in biological processes such as immune response, development and plant/insect interaction. So far, the SPODOBASE contains 29,325 ESTs, which are cleaned and clustered into non-redundant sets (2294 clusters and 6103 singletons). The SPODOBASE is constructed in such a way that other ESTs from S. frugiperda or other species may be added. Users can retrieve information using text searches, pre-formatted queries, a query assistant or BLAST searches. Annotation is provided against NCBI, UniProt or Bombyx mori EST databases, and with GO-Slim vocabulary. The SPODOBASE database provides integrated access to expressed sequence tags (EST) from the lepidopteran insect Spodoptera frugiperda. It is a publicly available structured database with insect pest sequences which will allow identification of a number of genes and comprehensive cloning of gene families of interest for the scientific community. SPODOBASE is available from URL: http://bioweb.ensam.inra.fr/spodobase

96 citations


Journal ArticleDOI
TL;DR: The initial release of the LIPID MAPS Proteome Database contains 2959 records, representing human and mouse proteins involved in lipid metabolism, and this LMPD protein list was enhanced with annotations from UniProt, EntrezGene, ENZYME, GO, KEGG and other public resources.
Abstract: The LIPID MAPS Proteome Database (LMPD) is an object-relational database of lipid-associated protein sequences and annotations. The initial release contains 2959 records, representing human and mouse proteins involved in lipid metabolism. UniProt IDs were obtained based on keyword search of KEGG and GO databases, and this LMPD protein list was then enhanced with annotations from UniProt, EntrezGene, ENZYME, GO, KEGG and other public resources. We also assigned associations with general lipid categories, based on GO and KEGG annotations. Users may search LMPD by database ID or keyword, and filter by species and/or lipid class associations; from the search results, one can then access a compilation of data relevant to each protein of interest, cross-linked to external databases. The LIPID MAPS Proteome Database (LMPD) is publicly available from the LIPID MAPS Consortium website (http://www.lipidmaps.org/). The direct URL is http://www.lipidmaps.org/data/proteome/index.cgi.

80 citations


Journal ArticleDOI
TL;DR: Phylogenomic inference of protein (or gene) function attempts to address the question, “What function does this protein perform?” in an evolutionary context by using annotated subfamily groupings to infer function.
Abstract: Phylogenomic inference of protein (or gene) function attempts to address the question, “What function does this protein perform?” in an evolutionary context. As originally outlined by Jonathan Eisen [1–3], phylogenomic inference of protein function is a multistep process involving selection of homologs, multiple sequence alignment (MSA), and phylogenetic tree construction; overlaying annotations on the tree topology; discriminating between orthologs and paralogs; and, finally, inferring the function of a protein based on the orthologs identified by this process and the annotations retrieved. Figure 1 shows an example of using annotated subfamily groupings to infer function, in a manner similar to [1]. One of us, while at Celera Genomics, separately came up with a similar approach for the functional classification of the human genome [4], based on the automated identification of functional subfamilies using the SCI-PHY algorithm and the use of subfamily hidden Markov models (HMMs) to classify novel sequences [5,6]. Our experiences over the past several years in developing computational pipelines for automating phylogenomic inference at the genome scale [7], and the challenges we have faced in this effort, motivate this paper.
[Figure 1: Phylogenomic Analysis of Protein Function Using Subfamily Annotation]
In practice, phylogenomic inference of gene function is not often used. Far from it. The majority of novel sequences are assigned a putative function through the use of annotation transfer from the top hits in a database search. In our analysis of over 300,000 proteins in the UniProt database, only 3% of proteins with informative annotations (i.e., those not labelled as “hypothetical” or “unknown”) had experimental support for their annotations; 97% were annotated using electronic evidence alone. These annotations are uploaded to GenBank, where they persist even if they are eventually determined to be in error.
The systematic errors associated with this annotation protocol have been pointed out by numerous investigators over the years [8–10]. The root causes of these errors are these:
Gene duplication. This enables protein superfamilies to innovate novel functions on the same structural template, so that the top database hit may have a function distinct from the query.
Domain shuffling. Domain fusion and fission events add an additional layer of complexity, as a query and database hit may share only a local region of homology and thus have entirely different molecular functions and structures.
Propagation of existing errors in database annotations. This is particularly pernicious, as existing annotation errors are seldom detected and, even if detected, are not necessarily corrected.
Evolutionary distance. Two proteins can share a common ancestor and domain structure, yet have very different functions simply due to their presence in very divergently related species.
Phylogenomic analysis, properly applied, avoids these errors and provides a mechanism for detecting existing database annotation errors [3,7]. Why then is phylogenomic inference not used more widely? We believe this is due to four reasons. First, the actual frequency of annotation error is not known, so the gravity of the situation is not recognized. Second, phylogenomic inference is a much more complicated endeavor than a simple database search and requires significantly more expertise and computing resources. It is therefore not easily applied at the genome scale. Third, millions of dollars and years of effort have been poured into developing computational annotation systems that depend on annotation transfer from top database hits, perhaps overlaid with domain prediction methods such as PFAM or the NCBI CDD [11,12]. Fourth, phylogenomic approaches to protein function prediction have arisen only in the last few years, while database search methods have been available for much longer.
Revolutions do not normally take place overnight. These four reasons result in phylogenomic inference being applied on a one-off basis, for a few protein superfamilies here and there. This may be about to change. A variety of software tools and algorithms enabling phylogenomic inference have been developed in recent years (see Table 1). Some of these methods have based annotation transfer on the identification of orthologs [13–15] or of functional subfamilies [6,16–21]. Other groups have used whole-tree analyses [22–24]. Still other groups employ expert knowledge to define functional subtypes and then develop statistical models to allow users to classify novel sequences [25,26]; these expert system-based approaches are unfortunately limited by the scarcity of experimental data for most protein families.
[Table 1: Resources for Phylogenomic Analysis]
It is worth examining the assumptions underlying these phylogenomic resources, and phylogenomic inference as a whole.

72 citations


Journal ArticleDOI
TL;DR: A selected combination of programs exploring PubMed abstracts, universal gene/protein databases, and state-of-the-art pathway knowledge bases was assembled to distinguish enzymes with hydrolytic activities that are expressed in the extracellular space of cancer cells.
Abstract: Background: We present an effective, rapid, systematic data mining approach for identifying genes or proteins related to a particular interest. A selected combination of programs exploring PubMed abstracts, universal gene/protein databases (UniProt, InterPro, NCBI Entrez), and state-of-the-art pathway knowledge bases (LSGraph and Ingenuity Pathway Analysis) was assembled to distinguish enzymes with hydrolytic activities that are expressed in the extracellular space of cancer cells. Proteins were identified with respect to six types of cancer occurring in the prostate, breast, lung, colon, ovary, and pancreas. Results: The data mining method identified previously undetected targets. Our combined strategy applied to each cancer type identified a minimum of 375 proteins expressed within the extracellular space and/or attached to the plasma membrane. The method led to the recognition of human cancer-related hydrolases (on average, ~35 per cancer type), among which were prostatic acid phosphatase, prostate-specific antigen, and sulfatase 1. Conclusion: The combined data mining of several databases overcame many of the limitations of querying a single database and enabled the facile identification of gene products. In the case of cancer-related targets, it produced a list of putative extracellular, hydrolytic enzymes that merit additional study as candidates for cancer radioimaging and radiotherapy. The proposed data mining strategy is of a general nature and can be applied to other biological databases for understanding biological functions and diseases.

71 citations


Journal ArticleDOI
TL;DR: The Gene3D release 4 database and web portal provide a combined structural, functional and evolutionary view of the protein world, focussed on providing structural annotation for protein sequences without structural representatives.
Abstract: The Gene3D release 4 database and web portal (http://cathwww.biochem.ucl.ac.uk:8080/Gene3D) provide a combined structural, functional and evolutionary view of the protein world. It is focussed on providing structural annotation for protein sequences without structural representatives--including the complete proteome sets of over 240 different species. The protein sequences have also been clustered into whole-chain families so as to aid functional prediction. The structural annotation is generated using HMM models based on the CATH domain families; CATH is a repository for manually deduced protein domains. Amongst the changes from the last publication are: the addition of over 100 genomes and the UniProt sequence database, domain data from Pfam, metabolic pathway and functional data from COGs, KEGG and GO, and protein-protein interaction data from MINT and BIND. The website has been rebuilt to allow more sophisticated querying and the data returned is presented in a clearer format with greater functionality. Furthermore, all data can be downloaded in a simple XML format, allowing users to carry out complex investigations at their own computers.

66 citations


Journal ArticleDOI
TL;DR: The Evolutionary trace report_maker offers a new type of service for researchers investigating the function of novel proteins that takes a Protein Data Bank identifier or UniProt accession number, and returns a human-readable document in PDF format, supplemented by the original data needed to reproduce the results quoted in the report.
Abstract: Summary: Evolutionary trace report_maker offers a new type of service for researchers investigating the function of novel proteins. It pools, from different sources, information about protein sequence, structure and elementary annotation, and to that background superimposes inference about the evolutionary behavior of individual residues, using real-valued evolutionary trace method. As its only input it takes a Protein Data Bank identifier or UniProt accession number, and returns a human-readable document in PDF format, supplemented by the original data needed to reproduce the results quoted in the report. Availability: Evolutionary trace reports are freely available for academic users at http://mammoth.bcm.tmc.edu/report-maker Contact: {imihalek,ires,lichtarge}@bcm.tmc.edu

65 citations


Journal ArticleDOI
TL;DR: By using a statistical model to measure the functional similarity of genes based on the Gene Ontology directed acyclic graph, a novel Gene Functional Similarity Search Tool (GFSST) is developed to identify genes with related functions from annotated proteome databases.
Abstract: With the completion of the genome sequences of human, mouse, and other species and the advent of high throughput functional genomic research technologies such as biomicroarray chips, more and more genes and their products have been discovered and their functions have begun to be understood. Increasing amounts of data about genes, gene products and their functions have been stored in databases. To facilitate selection of candidate genes for gene-disease research, genetic association studies, biomarker and drug target selection, and animal models of human diseases, it is essential to have search engines that can retrieve genes by their functions from proteome databases. In recent years, the development of Gene Ontology (GO) has established structured, controlled vocabularies describing gene functions, which makes it possible to develop novel tools to search genes by functional similarity. By using a statistical model to measure the functional similarity of genes based on the Gene Ontology directed acyclic graph, we developed a novel Gene Functional Similarity Search Tool (GFSST) to identify genes with related functions from annotated proteome databases. This search engine lets users design their search targets by gene functions. An implementation of GFSST which works on the UniProt (Universal Protein Resource) for the human and mouse proteomes is available at GFSST Web Server. GFSST provides functions not only for similar gene retrieval but also for gene search by one or more GO terms. This represents a powerful new approach for selecting similar genes and gene products from proteome databases according to their functions.

58 citations


Journal ArticleDOI
01 Apr 2006-Proteins
TL;DR: The Protein Kinase Resource is described, an integrated online service that provides access to information relevant to cell signaling and enables kinase researchers to visualize and analyze the data directly in an online environment.
Abstract: The protein kinase superfamily is an important group of enzymes controlling cellular signaling cascades. The increasing amount of available experimental data provides a foundation for deeper understanding of details of signaling systems and the underlying cellular processes. Here, we describe the Protein Kinase Resource, an integrated online service that provides access to information relevant to cell signaling and enables kinase researchers to visualize and analyze the data directly in an online environment. The data set is synchronized with UniProt and Protein Data Bank (PDB) databases and is regularly updated and verified. Additional annotation includes interactive display of domain composition, cross-references between orthologs and functional mapping to OMIM records. The Protein Kinase Resource provides an integrated view of the protein kinase superfamily by linking data with their visual representation. Thus, human kinases can be mapped onto the human kinome tree via an interactive display. Sequence and structure data can be easily displayed using applications developed for the PKR and integrated with the website and the underlying database. Advanced search mechanisms, such as multiparameter lookup, sequence pattern, and BLAST search, enable fast access to the desired information, while statistics tools provide the ability to analyze the relationships among the kinases under study. The integration of data presentation and visualization implemented in the Protein Kinase Resource can be adapted by other online providers of scientific data and should become an effective way to access available experimental information.

Journal ArticleDOI
TL;DR: Biozon is a unified biological database that integrates heterogeneous data types such as proteins, structures, domain families, protein–protein interactions and cellular pathways, and establishes the relationships between them, and results in a highly connected graph structure.
Abstract: Biological entities are strongly related and mutually dependent on each other. Therefore, there is a growing need to corroborate and integrate data from different resources and aspects of biological systems in order to analyze them effectively. Biozon is a unified biological database that integrates heterogeneous data types such as proteins, structures, domain families, protein-protein interactions and cellular pathways, and establishes the relationships between them. All data are integrated on to a single graph schema centered around the non-redundant set of biological objects that are shared by each source. This integration results in a highly connected graph structure that provides a more complete picture of the known context of a given object that cannot be determined from any one source. Currently, Biozon integrates roughly 2 million protein sequences, 42 million DNA or RNA sequences, 32 000 protein structures, 150 000 interactions and more from sources such as GenBank, UniProt, Protein Data Bank (PDB) and BIND. Biozon augments source data with locally derived data such as 5 billion pairwise protein alignments and 8 million structural alignments. The user may form complex cross-type queries on the graph structure, add similarity relations to form fuzzy queries and rank the results based on analysis of the edge structure similar to Google PageRank, online at Biozon.org.

Journal ArticleDOI
TL;DR: The study indicated that names for genes/proteins are highly ambiguous and that there are usually multiple names for the same gene or protein; it also demonstrated that most gene/protein names appearing in text can be found in BioThesaurus.

Journal ArticleDOI
TL;DR: The ProtoNet system is extended in order to assign functional annotations automatically by leveraging the scaffold of the hierarchical classification, and the method is able to overcome some frequent annotation pitfalls.
Abstract: In an era of rapid genome sequencing and high-throughput technology, automatic function prediction for a novel sequence is of utmost importance in bioinformatics. While automatic annotation methods based on local alignment searches can be simple and straightforward, they suffer from several drawbacks, including relatively low sensitivity and assignment of incorrect annotations that are not associated with the region of similarity. ProtoNet is a hierarchical organization of the protein sequences in the UniProt database. Although the hierarchy is constructed in an unsupervised automatic manner, it has been shown to be coherent with several biological data sources. We extend the ProtoNet system in order to assign functional annotations automatically. By leveraging the scaffold of the hierarchical classification, the method is able to overcome some frequent annotation pitfalls.

Journal ArticleDOI
TL;DR: The EVEREST process and its output domain families provide an exhaustive and validated view of the protein domain world that is automatically generated from sequence data and suggests previously unknown domain families with at least 51% fidelity.
Abstract: Proteins are comprised of one or several building blocks, known as domains. Such domains can be classified into families according to their evolutionary origin. Whereas sequencing technologies have advanced immensely in recent years, there are no matching computational methodologies for large-scale determination of protein domains and their boundaries. We provide and rigorously evaluate a novel set of domain families that is automatically generated from sequence data. Our domain family identification process, called EVEREST (EVolutionary Ensembles of REcurrent SegmenTs), begins by constructing a library of protein segments that emerge in an all vs. all pairwise sequence comparison. It then proceeds to cluster these segments into putative domain families. The selection of the best putative families is done using machine learning techniques. A statistical model is then created for each of the chosen families. This procedure is then iterated: the aforementioned statistical models are used to scan all protein sequences, to recreate a library of segments and to cluster them again. Processing the Swiss-Prot section of the UniProt Knowledgebase, release 7.2, EVEREST defines 20,230 domains, covering 85% of the amino acids of the Swiss-Prot database. EVEREST annotates 11,852 proteins (6% of the database) that are not annotated by Pfam A. In addition, in 43,086 proteins (20% of the database), EVEREST annotates a part of the protein that is not annotated by Pfam A. Performance tests show that EVEREST recovers 56% of Pfam A families and 63% of SCOP families with high accuracy, and suggests previously unknown domain families with at least 51% fidelity. EVEREST domains are often a combination of domains as defined by Pfam or SCOP and are frequently sub-domains of such domains. The EVEREST process and its output domain families provide an exhaustive and validated view of the protein domain world that is automatically generated from sequence data.
The EVEREST library of domain families, accessible for browsing and download at [1], provides a complementary view to that provided by other existing libraries. Furthermore, since it is automatic, the EVEREST process is scalable and we will run it in the future on larger databases as well. The EVEREST source files are available for download from the EVEREST web site.

Journal ArticleDOI
TL;DR: This work presents a server software infrastructure that makes it easy to plug in modules for identifying biologically interesting pieces of text, which are then presented to the curator in a web interface.

Journal ArticleDOI
TL;DR: The UniProtKB Sequence/Annotation Version database (UniSave) is a comprehensive archive of UniProtKB/Swiss-Prot and UniProtKB/TrEMBL entry versions that provides access to previous versions of these entries.
Abstract: Summary: The UniProtKB Sequence/Annotation Version database (UniSave) is a comprehensive archive of UniProtKB/Swiss-Prot and UniProtKB/TrEMBL entry versions. All changed Swiss-Prot and TrEMBL entries are loaded into the UniSave as part of the public bi-weekly UniProtKB releases. Unlike the UniProtKB, which contains only the latest Swiss-Prot and TrEMBL entry versions, the UniSave provides access to previous versions of these entries. Availability: http://www.ebi.ac.uk/uniprot/unisave Contact: rolf.apweiler@ebi.ac.uk

Journal ArticleDOI
TL;DR: A database of unique protein sequence identifiers called Sequence Globally Unique Identifiers (SEGUID) derived from primary protein sequences is developed, which serve as a common link between multiple sequence databases and are resilient to annotation changes in either public or private databases throughout the lifetime of a given protein sequence.
Abstract: In proteome studies, identification of proteins requires searching protein sequence databases. The public protein sequence databases (e.g., NCBInr, UniProt) each contain millions of entries, and private databases add thousands more. Although much of the sequence information in these databases is redundant, each database uses distinct identifiers for the identical protein sequence and often contains unique annotation information. Users of one database obtain a database-specific sequence identifier that is often difficult to reconcile with the identifiers from a different database. When multiple databases are used for searches or the databases being searched are updated frequently, interpreting the protein identifications and associated annotations can be problematic. We have developed a database of unique protein sequence identifiers called Sequence Globally Unique Identifiers (SEGUID) derived from primary protein sequences. These identifiers serve as a common link between multiple sequence databases and are resilient to annotation changes in either public or private databases throughout the lifetime of a given protein sequence. The SEGUID Database can be downloaded (http://bioinformatics.anl.gov/SEGUID/) or easily generated at any site with access to primary protein sequence databases. Since SEGUIDs are stable, predictions based on the primary sequence information (e.g., pI, Mr) can be calculated just once; we have generated approximately 500 different calculations for more than 2.5 million sequences. SEGUIDs are used to integrate MS and 2-DE data with bioinformatics information and provide the opportunity to search multiple protein sequence databases, thereby providing a higher probability of finding the most valid protein identifications.
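The abstract says only that SEGUIDs are derived from the primary sequence; it does not give the derivation. As a minimal sketch, assuming the commonly described scheme (a SHA-1 digest of the upper-cased sequence, base64-encoded with the trailing padding removed), a sequence-derived identifier could be computed as follows. The function name `seguid` and the whitespace handling are illustrative choices, not taken from the paper.

```python
import base64
import hashlib


def seguid(sequence: str) -> str:
    """Return a sequence-derived identifier.

    Assumption (not stated in the abstract): SHA-1 over the upper-cased
    sequence, base64-encoded, with the trailing '=' padding stripped.
    """
    # Drop whitespace and line breaks so FASTA-style wrapping does not matter.
    normalized = "".join(sequence.split()).upper()
    digest = hashlib.sha1(normalized.encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


# The same sequence yields the same identifier regardless of case or wrapping,
# so records from different databases can be joined on it.
same = seguid("MKTAYIAKQR") == seguid("mktay\niakqr")
```

Because the identifier depends only on the residues, it stays stable when a source database renames or re-annotates an entry, which is the property the abstract relies on.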

Journal ArticleDOI
TL;DR: A non‐redundant set of 1804 proteins was identified in human brain samples and in the majority of cases splice variants could be unambiguously identified by unique peptides, including matches to several hypothetical transcripts of known as well as predicted genes.
Abstract: The HUPO Brain Proteome Project is an initiative coordinating proteomics studies to characterise human and mouse brain proteomes. Proteins identified in human brain samples during the project's pilot phase were put into biological context through integration with various annotation sources followed by a bioinformatics analysis. The data set was related to the genome sequence via the genes encoding identified proteins including an assessment of splice variant identification as well as an analysis of tissue specificity of the respective transcripts. Proteins were furthermore categorised according to subcellular localisation, molecular function and biological process, grouped into protein families and mapped to biological pathways they are known to act in. Involvement in pathological conditions was examined based on association with entries in the online version of Mendelian Inheritance in Man and an interaction network was derived from curated protein-protein interaction data. Overall a non-redundant set of 1804 proteins was identified in human brain samples. In the majority of cases splice variants could be unambiguously identified by unique peptides, including matches to several hypothetical transcripts of known as well as predicted genes.

Journal ArticleDOI
TL;DR: The feasibility of using the method presented here to classify enzyme classes based on the enzyme domain rules is evident and the rules may be also employed by the protein annotators in manual annotation or implemented in an automatic annotation flowchart.
Abstract: Background: The number of sequences compiled in many genome projects is growing exponentially, but most of them have not been characterized experimentally. An automatic annotation scheme is urgently needed to reduce the gap between the amount of new sequences produced and reliable functional annotation. This work proposes rules for automatically classifying fungal genes. The approach involves elucidating the enzyme classification rules hidden in the UniProt protein knowledgebase and then applying them for classification. The association algorithm Apriori is used to mine the relationship between the enzyme class and significant InterPro entries. The candidate rules are evaluated for their classificatory capacity.
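The idea of mining rules of the form "InterPro entries imply enzyme class" can be sketched with support and confidence counts. The code below is a simplified illustration, not the paper's implementation: it enumerates small itemsets by brute force rather than using Apriori's level-wise candidate pruning, and the identifiers (`IPR_A`, `EC1`, and so on), thresholds, and function name are hypothetical placeholders.

```python
from collections import Counter
from itertools import combinations


def mine_rules(records, min_support=2, min_confidence=0.8, max_size=2):
    """Find rules 'set of InterPro entries -> enzyme class'.

    records: list of (set_of_interpro_ids, enzyme_class) pairs.
    Returns (itemset, enzyme_class, support_count, confidence) tuples.
    """
    itemset_counts = Counter()  # how often each itemset occurs at all
    rule_counts = Counter()     # how often it occurs with a given class
    for entries, enzyme_class in records:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(entries), size):
                itemset_counts[itemset] += 1
                rule_counts[(itemset, enzyme_class)] += 1

    rules = []
    for (itemset, enzyme_class), count in rule_counts.items():
        confidence = count / itemset_counts[itemset]
        if count >= min_support and confidence >= min_confidence:
            rules.append((itemset, enzyme_class, count, confidence))
    # Strongest rules first.
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0]))


if __name__ == "__main__":
    toy_records = [  # hypothetical annotations, for illustration only
        ({"IPR_A", "IPR_B"}, "EC1"),
        ({"IPR_A"}, "EC1"),
        ({"IPR_B", "IPR_C"}, "EC3"),
        ({"IPR_C"}, "EC3"),
        ({"IPR_A", "IPR_C"}, "EC1"),
    ]
    for rule in mine_rules(toy_records):
        print(rule)
```

On the toy data only `IPR_A -> EC1` passes both thresholds, since `IPR_C` also occurs once with `EC1` and its confidence for `EC3` drops to 2/3.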

Book ChapterDOI
04 Sep 2006
TL;DR: The proposed approach enables semi-automatic interoperation among heterogeneous protein data sources and establishment of semantic interoperation over conceptual framework of PO enables a better insight on how information can be integrated systematically and how queries can be composed.
Abstract: Resolving heterogeneity among various protein data sources is a crucial problem if we want to gain more information about proteomic processes. Information from multiple protein databases like PDB, SCOP, and UniProt needs to be integrated to answer user queries. Issues of semantic heterogeneity haven't been addressed so far in protein informatics. This paper outlines a protein data source composition approach based on our existing work on the Protein Ontology (PO). The proposed approach enables semi-automatic interoperation among heterogeneous protein data sources. The establishment of semantic interoperation over the conceptual framework of PO gives us better insight into how information can be integrated systematically and how queries can be composed. The semantic interoperation between protein data sources is based on semantic relationships between concepts of PO. No other such generalized semantic protein data interoperation framework has been considered so far.

Journal ArticleDOI
TL;DR: In this article, the co-evolution of amino acid-based regular expressions and keyword-based logical expressions with genetic programming is used to identify functionally important sequence features and does not require expert knowledge.
Abstract: Methods for predicting protein function directly from amino acid sequences are useful tools in the study of uncharacterised protein families and in comparative genomics. Until now, this problem has been approached using machine learning techniques that attempt to predict membership, or otherwise, to predefined functional categories or subcellular locations. A potential drawback of this approach is that the human-designated functional classes may not accurately reflect the underlying biology, and consequently important sequence-to-function relationships may be missed. We show that a self-supervised data mining approach is able to find relationships between sequence features and functional annotations. No preconceived ideas about functional categories are required, and the training data is simply a set of protein sequences and their UniProt/Swiss-Prot annotations. The main technical aspect of the approach is the co-evolution of amino acid-based regular expressions and keyword-based logical expressions with genetic programming. Our experiments on a strictly non-redundant set of eukaryotic proteins reveal that the strongest and most easily detected sequence-to-function relationships are concerned with targeting to various cellular compartments, which is an area already well studied both experimentally and computationally. Of more interest are a number of broad functional roles which can also be correlated with sequence features. These include inhibition, biosynthesis, transcription and defence against bacteria. Despite substantial overlaps between these functions and their corresponding cellular compartments, we find clear differences in the sequence motifs used to predict some of these functions. For example, the presence of polyglutamine repeats appears to be linked more strongly to the "transcription" function than to the general "nuclear" function/location. We have developed a novel and useful approach for knowledge discovery in annotated sequence data. 
The technique is able to identify functionally important sequence features and does not require expert knowledge. By viewing protein function from a sequence perspective, the approach is also suitable for discovering unexpected links between biological processes, such as the recently discovered role of ubiquitination in transcription.
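To make the pairing of a sequence-level regular expression with an annotation keyword concrete, the sketch below scores how well a fixed pattern predicts a keyword. It illustrates only the evaluation of one candidate pair; the genetic-programming search that evolves such pairs is not reproduced. The pattern `Q{5,}` (a run of at least five glutamines), the toy sequences, and the function name are assumptions made for the example.

```python
import re


def score_pattern(records, pattern, keyword):
    """Return (precision, recall) of a regex as a predictor of a keyword.

    records: list of (sequence, set_of_annotation_keywords) pairs.
    """
    regex = re.compile(pattern)
    hits = [bool(regex.search(seq)) for seq, _ in records]
    labels = [keyword in keywords for _, keywords in records]
    true_positives = sum(h and l for h, l in zip(hits, labels))
    precision = true_positives / sum(hits) if any(hits) else 0.0
    recall = true_positives / sum(labels) if any(labels) else 0.0
    return precision, recall


if __name__ == "__main__":
    toy_records = [  # hypothetical sequences and keywords
        ("MQQQQQQAL", {"Transcription", "Nucleus"}),
        ("MKRKAQQQQQQQ", {"Transcription"}),
        ("MKKLLAV", {"Nucleus"}),
        ("MAQQQQQT", {"Secreted"}),
    ]
    print(score_pattern(toy_records, r"Q{5,}", "Transcription"))
    print(score_pattern(toy_records, r"Q{5,}", "Nucleus"))
```

A search procedure would compare such scores across many candidate patterns and keyword expressions, which is how a motif could turn out to track "transcription" more closely than "nuclear".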

Journal ArticleDOI
TL;DR: Results show that the TACT system is useful for functional annotation and that the prediction of ORFs and protein functions is highly accurate and close to the results of human curation.
Abstract: Transcriptome Auto-annotation Conducting Tool (TACT) is a newly developed web-based automated tool for conducting functional annotation of transcripts by the integration of sequence similarity searches and functional motif predictions. We developed the TACT system by integrating two kinds of similarity searches, FASTY and BLASTX, against protein sequence databases, UniProtKB (Swiss-Prot/TrEMBL) and RefSeq, and a unified motif prediction program, InterProScan, into the ORF-prediction pipeline originally designed for the ‘H-Invitational’ human transcriptome annotation project. This system successively applies these constituent programs to an mRNA sequence in order to predict the most plausible ORF and the function of the protein encoded. In this study, we applied the TACT system to 19 574 non-redundant human transcripts registered in H-InvDB and evaluated its predictive power by the degree of agreement with human-curated functional annotation in H-InvDB. As a result, the TACT system could assign functional description to 12 559 transcripts (64.2%), the remainder being hypothetical proteins. Furthermore, the overall agreement of functional annotation with H-InvDB, including those transcripts annotated as hypothetical proteins, was 83.9% (16 432/19 574). These results show that the TACT system is useful for functional annotation and that the prediction of ORFs and protein functions is highly accurate and close to the results of human curation. TACT is freely available at http://www.jbirc.aist.go.jp/tact/.

Journal ArticleDOI
TL;DR: In this paper, the authors present the INVertebrate HOmologous GENes (INVHOGEN), a database combining the available invertebrate protein genes from UniProt (consisting of Swiss-Prot and TrEMBL) into gene families.
Abstract: Classification of proteins into families of homologous sequences constitutes the basis of functional analysis or of evolutionary studies. Here we present INVertebrate HOmologous GENes (INVHOGEN), a database combining the available invertebrate protein genes from UniProt (consisting of Swiss-Prot and TrEMBL) into gene families. For each family INVHOGEN provides a multiple protein alignment, a maximum likelihood based phylogenetic tree and taxonomic information about the sequences. It is possible to download the corresponding GenBank flatfiles, the alignment and the tree in Newick format. Sequences and related information have been structured in an ACNUC database under a client/server architecture. Thus, complex selections can be performed. An external graphical tool (FamFetch) allows access to the data to evaluate homology relationships between genes and distinguish orthologous from paralogous sequences. Thus, INVHOGEN complements the well-known HOVERGEN database. The databank is available at http://www.bi.uni-duesseldorf.de/~invhogen/invhogen.html.

Journal ArticleDOI
01 Jun 2006-Proteins
TL;DR: The findings presented in this study strongly support the notion that functional significance of protein sets may be captured by short signatures at their termini, and provide a valuable source for testing previously overlooked signatures in protein termini.
Abstract: The two ends of each protein are known as the amino (N-) and carboxyl (C-) termini. Short signatures in a protein's termini often carry vital cellular function. No systematic research has been conducted to address the importance of short signatures (3 to 10 amino acids) in protein termini at the proteomic level. Specifically, it is unknown whether such signatures are evolutionarily conserved, and if so, whether this conservation confers shared biological functions. Current signature detection methods fail to detect such short signatures due to inadequate statistical scores. The findings presented in this study strongly support the notion that functional significance of protein sets may be captured by short signatures at their termini. A positional search method was applied to over one million proteins from the UniProt database. The result is a collection of about a thousand significant signature groups (SIGs) that include previously identified as well as many novel signatures in protein termini. These SIGs represent protein sets with minimal or no overall sequence similarity excepting the similarity at their termini. The most significant SIGs are assigned by their strong correspondence to functional annotations derived from external databases such as Gene Ontology. Each of the SIGs is associated with the statistical significance of its functional association. These SIGs provide a valuable source for testing previously overlooked signatures in protein termini and allow for the investigation of the role played by such signatures throughout evolution. The SIGs archive and advanced search options are available at http://www.proteus.cs.huji.ac.il.
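The grouping step behind such a positional search can be sketched as collecting proteins that share the same short peptide at one terminus. This is a deliberately reduced illustration: the statistical scoring that the study uses to decide which groups are significant is not reproduced, and the function name, parameters, and toy sequences are hypothetical.

```python
from collections import defaultdict


def terminal_signature_groups(proteins, k, terminus="C", min_members=2):
    """Group proteins by the k residues at their N- or C-terminus.

    proteins: dict mapping protein name -> amino acid sequence.
    Returns {signature: set_of_protein_names} for shared signatures.
    """
    if terminus not in ("N", "C"):
        raise ValueError("terminus must be 'N' or 'C'")
    groups = defaultdict(set)
    for name, sequence in proteins.items():
        if len(sequence) < k:
            continue  # too short to carry a k-residue signature
        signature = sequence[-k:] if terminus == "C" else sequence[:k]
        groups[signature].add(name)
    return {s: m for s, m in groups.items() if len(m) >= min_members}


if __name__ == "__main__":
    toy_proteins = {  # made-up sequences; KDEL is a known ER-retention signal
        "P1": "MAAAKDEL",
        "P2": "MGGGGKDEL",
        "P3": "MTTTSKL",
    }
    print(terminal_signature_groups(toy_proteins, k=4, terminus="C"))
```

Running the same function for each length from 3 to 10 and for both termini would produce the candidate groups that a significance test then has to filter.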

Journal ArticleDOI
TL;DR: Similarity scores calculated with the FASTA and CLUSTAL_X programs can be used to choose items of interest, and further information can be obtained by text mining, increasingly with tools such as PathBinderH and the GENIA corpus.
Abstract: With the widespread availability of nucleotide and amino acid sequences, novel methods for extracting biologically and clinically relevant knowledge are feasible. Data is deposited on the Internet on websites such as GeneCards, available at http://www.genecards.org/mirror.shtml. Further information can be obtained from related sites: UniProt (http://www.uniprot.org) and SwissProt (http://www.expasy.org/sprot/). Using the FASTA and CLUSTAL_X programs, similarity scores can be calculated to choose items of interest. Further information can be obtained by mining text, either manually or, increasingly, using text-mining tools such as PathBinderH and the GENIA corpus.

Journal Article
TL;DR: It is shown that it is possible to find new domains or to extend known domains using a semi-automated method; however, the goal of detecting class-specific domains was only partially achieved in the sense that the new domains the authors found were not all insect-specific domains.
Abstract: Automatically finding new protein domains is a challenge when using the complete collection of known proteins (i.e., UniProt). By limiting the taxonomic range to class Insecta, including two full proteomes (A. gambiae and D. melanogaster), we reduced the size of the search space in the hope of finding taxon-specific domains. The MKDOM2 program (http://prodes.toulouse.inra.fr/prodom/xdom/mkdom2.html) was used to cluster the insect proteins into potential domains that were analyzed manually in a second step. We analyzed 219 potential domains, of which 2 were insect-specific. We show that it is possible to find new domains or to extend known domains using a semi-automated method; however, the goal of detecting class-specific domains was only partially achieved in the sense that the new domains we found were not all insect-specific domains. The files used as input and the resulting output files, as well as extensive descriptions of the domains, are available as supplementary data from http://bioinf.ibun.unal.edu.co/insecta/.

01 Jan 2006
TL;DR: StemNet enables the linking of heterogeneous knowledge sources to which the user only had separate access before and employs state-of-the-art text-mining technologies, both to annotate and recognize semantic entities as well as to link them to databases and ontologies.
Abstract: The goal of the StemNet Project (http://www.stemnet.de) is to develop a knowledge management system for Hematopoietic Stem Cell Transplantation and Immunogenetics [1] in order to assist biomedical researchers and practitioners in finding relevant information. The evolving knowledge service goes beyond traditional approaches in that it enables a targeted and fine-grained search for relevant biomedical entities (e.g. various protein functions, different types of immune cells, various sorts of antigens, blood diseases, clinical outcomes, etc.) in unstructured biomedical free text, in particular PubMed abstracts. At the same time, this information is linked to relevant biological database and ontology entries, such as UniProt, Entrez Gene, dbSNP, the Gene Ontology, the Sequence Ontology, and the Cell Ontology (see http://obo.sourceforge.net). Thus, StemNet enables the linking of heterogeneous knowledge sources to which the user only had separate access before. Semantic search and retrieval requires linguistic and semantic processing of documents. For this purpose, we employ state-of-the-art text-mining technologies [2, 3, 4, 5], both to annotate and recognize semantic entities as well as to link them to databases and ontologies. Documents and their semantic metadata are then presented to the user via the Lucene search engine (http://lucene.apache.org/java/docs/index.html). In order to target user queries, annotated semantic entities are kept in special indices.

Journal ArticleDOI
TL;DR: A new methodology to retrieve DNA sequences of domain-encoding regions through automatic database cross-referencing using the Python library PAMIE was developed and applied to extract all the EGF domain-encoding DNA sequences of Homo sapiens for further large-scale proteomic experiments.
Abstract: Recent proteomic studies of protein domains require high-throughput and systematic approaches. Since most experiments using protein domains, the modules of protein-protein interactions, require gene cloning, the first experimental step should be retrieving DNA sequences of domain-encoding regions from databases. For large-scale proteomic research, however, it is a laborious task to extract a large number of domain sequences manually from several inter-linked databases. We present a new methodology to retrieve DNA sequences of domain-encoding regions through automatic database cross-referencing. To extract protein domain-encoding regions, it traverses several interconnected databases with a validation process. We applied this method to retrieve all the EGF domain-encoding DNA sequences of Homo sapiens. This new algorithm was implemented using the Python library PAMIE, which enables automatic cross-referencing across distinct databases.
Bioinformatics and Biosystems 2006, Vol. 2, No. 1, pp. 94-97. Corresponding author: Sanguk Kim (Email: sukim@postech.ac.kr). This work was supported by the Korea Research Foundation Grant by the Korean Government (MOEHRD) (KRF-2005-070-C00095) and POSTECH BSRI research fund-2005.
Introduction: Genome projects are generating vast amounts of data that reveal the existence of thousands of new gene products, especially the list of proteins responsible for cellular regulation. However, this does not immediately reveal what these proteins do, nor how they are assembled into the molecular machines and functional networks that control cellular behavior (Pawson et al., 2003). Cellular processes and overall molecular architectures of all organisms are largely mediated through elaborate scaffolds of protein-protein interactions. Thus, high-throughput strategies to study protein-protein interactions, such as yeast two-hybrid screening, have been developed to describe the protein interaction networks and to construct protein interaction maps in model organisms (Uetz et al., 2000; Li et al., 2004; Ghavidel et al., 2005). However, because proteins interact with more than one partner at a time, it is difficult to interpret large-scale protein-protein interactions (Santonico et al., 2005).
Protein domains represent the modular nature of proteins; they fold independently and often perform specific tasks. While protein domains could interact with several binding partners, they are single binding modules and interact with only one partner at a time (Santonico et al., 2005). Thus, domain knowledge can help to obtain a clearer representation of the protein networks.
Experiments using protein domains need to extract the sequences of domain-encoding regions from distinct databases for gene cloning and protein expression, although this process is often performed manually (Yu et al., 2004). For high-throughput proteomic experiments, however, manual retrieval is daunting for three reasons. First, information on hundreds or thousands of protein domains must be collected for large-scale experiments. Second, domain knowledge is not located in a single source, so one must cross-reference interconnected databases that are updated separately. Third, an iterative extraction process can be error-prone, since databases sometimes contain dubious entries and point to missing links. Thus, proper decision-making policies are essential to eliminate database entry errors and to validate the results. There is therefore a need to develop bioinformatics methodology for retrieving genetic information on domain-encoding regions in order to conduct large-scale proteomic research.
Here we developed a methodology to extract protein domain-encoding DNA sequences automatically from three distinct databases: Pfam, UniProt and GenBank (Finn et al., 2006; Wu et al., 2006; Benson et al., 2006) using the Python library PAMIE. The algorithm also includes a validation process to verify the retrieved data. We applied this method to extract all the EGF domain-encoding regions of Homo sapiens for further large-scale proteomic experiments. The EGF (Epidermal Growth Factor) domain is a widely distributed, independently folding protein module that is thought to play a general role in extracellular events such as adhesion, coagulation, and receptor-ligand interactions (Downing et al., 1996).
Figure 1. The algorithm for retrieving domain-encoding sequences through database cross-referencing.
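One step of such a cross-referencing pipeline, mapping a domain's protein coordinates onto its coding sequence and validating the result by translation, can be sketched as follows. This is not the authors' PAMIE-based implementation: it assumes the coding sequence and the domain boundaries (1-based, inclusive residue positions) have already been retrieved from the databases, it uses the standard genetic code, and the function names and toy sequence are hypothetical.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate a coding DNA string whose length is a multiple of three."""
    dna = dna.upper()
    if len(dna) % 3:
        raise ValueError("DNA length is not a multiple of three")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def domain_dna(cds: str, start: int, end: int, expected_domain: str) -> str:
    """Return the DNA encoding residues start..end (1-based, inclusive).

    Raises ValueError if the slice does not translate to the expected
    domain sequence, mimicking a validation step against the protein record.
    """
    if not 1 <= start <= end <= len(cds) // 3:
        raise ValueError("domain boundaries fall outside the coding sequence")
    region = cds[3 * (start - 1):3 * end].upper()
    if translate(region) != expected_domain.upper():
        raise ValueError("retrieved DNA does not encode the expected domain")
    return region


if __name__ == "__main__":
    toy_cds = "ATGTGTGAATGCAAATAA"  # made-up CDS encoding M-C-E-C-K-stop
    print(domain_dna(toy_cds, start=2, end=4, expected_domain="CEC"))
```

The translation check is what catches mismatched cross-references: if the nucleotide record linked from a protein entry belongs to a different isoform or contains an error, the slice will not reproduce the domain sequence and the entry can be flagged instead of silently accepted.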