Showing papers on "UniProt" published in 2013


Journal ArticleDOI
Rolf Apweiler, Alex Bateman, Maria Jesus Martin, Claire O'Donovan, Michele Magrane, Yasmin Alam-Faruque, Emanuele Alpi, Ricardo Antunes, J Arganiska, EB Casanova, Benoit Bely, M Bingley, Carlos Bonilla, Ramona Britto, Borisas Bursteinas, WM Chan, Gayatri Chavali, Elena Cibrian-Uhalte, A Da Silva, M De Giorgi, Tunca Doğan, F. Fazzini, Paul Gane, Leyla Jael Garcia Castro, Penelope Garmiri, Emma Hatton-Ellis, Reija Hieta, Rachael P. Huntley, Duncan Legge, W Liu, Jie Luo, Alistair MacDougall, Prudence Mutowo, Andrew Nightingale, Sandra Orchard, Klemens Pichler, Diego Poggioli, Sangya Pundir, L Pureza, Guoying Qi, S. Rosanoff, Rabie Saidi, Tony Sawford, Aleksandra Shypitsyna, Edd Turner, Volynkin, Tony Wardell, Xavier Watkins, Hermann Zellner, Matthew Corbett, M Donnelly, P van Rensburg, Mickael Goujon, Hamish McWilliam, Rodrigo Lopez, Ioannis Xenarios, Lydie Bougueleret, Alan Bridge, Sylvain Poux, Nicole Redaschi, Lucila Aimo, Andrea H. Auchincloss, Kristian B. Axelsen, Parit Bansal, Delphine Baratin, P-A Binz, M. C. Blatter, Brigitte Boeckmann, Jerven Bolleman, Emmanuel Boutet, Lionel Breuza, C Casal-Casas, E de Castro, Lorenzo Cerutti, Elisabeth Coudert, Béatrice A. Cuche, M Doche, Dolnide Dornevil, Séverine Duvaud, Anne Estreicher, L Famiglietti, M Feuermann, Elisabeth Gasteiger, Sebastien Gehant, Gerritsen, Arnaud Gos, Nadine Gruaz-Gumowski, Ursula Hinz, Chantal Hulo, J. James, Florence Jungo, Guillaume Keller, Lara, P Lemercier, J Lew, Damien Lieberherr, Thierry Lombardot, Xavier D. Martin, Patrick Masson, Anne Morgat, Teresa Batista Neto, Salvo Paesano, Ivo Pedruzzi, Sandrine Pilbout, Monica Pozzato, Manuela Pruess, Catherine Rivoire, Bernd Roechert, Maria Victoria Schneider, Christian J. A. Sigrist, K Sonesson, S Staehli, Andre Stutz, Shyamala Sundaram, Michael Tognolli, Laure Verbregue, A-L Veuthey, Cathy H. Wu, Cecilia N. Arighi, Leslie Arminski, Chuming Chen, Yongxing Chen, John S. Garavelli, Hongzhan Huang, Kati Laiho, Peter B. McGarvey, Darren A. Natale, Baris E. Suzek, C. R. Vinayaka, Qinghua Wang, Yuqi Wang, L-S Yeh, Yerramalla, Jie Zhang 
TL;DR: The mission of the Universal Protein Resource (UniProt) is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequences and functional annotation.
Abstract: The mission of the Universal Protein Resource (UniProt) (http://www.uniprot.org) is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequences and functional annotation. It integrates, interprets and standardizes data from literature and numerous resources to achieve the most comprehensive catalog possible of protein information. The central activities are the biocuration of the UniProt Knowledgebase and the dissemination of these data through our Web site and web services. UniProt is produced by the UniProt Consortium, which consists of groups from the European Bioinformatics Institute (EBI), the SIB Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR). UniProt is updated and distributed every 4 weeks and can be accessed online for searches or downloads.
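A minimal sketch of the kind of programmatic access described above, assuming the 2013-era URL scheme in which an entry could be downloaded directly in FASTA format (the accession P05067 is an arbitrary example, not one named in the article):

import urllib.request

# Retrieve a single UniProtKB entry in FASTA format.
# Assumes the www.uniprot.org/uniprot/<accession>.<format> URL scheme;
# P05067 (human APP) is used here only as an example accession.
url = "http://www.uniprot.org/uniprot/P05067.fasta"
with urllib.request.urlopen(url) as response:
    fasta = response.read().decode("utf-8")

header, *seq_lines = fasta.splitlines()
sequence = "".join(seq_lines)
print(header)
print(f"Sequence length: {len(sequence)}")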

1,845 citations


Journal ArticleDOI
TL;DR: The update of OrthoDB—the hierarchical catalog of orthologs—is presented, which provides computed evolutionary traits of orthologs, such as gene duplicability and loss profiles, divergence rates and sibling groups, now extended with exon–intron architectures, syntenic orthologs and parent–child trees.
Abstract: The concept of orthology provides a foundation for formulating hypotheses on gene and genome evolution, and thus forms the cornerstone of comparative genomics, phylogenomics and metagenomics. We present the update of OrthoDB-the hierarchical catalog of orthologs (http://www.orthodb.org). From its conception, OrthoDB promoted delineation of orthologs at varying resolution by explicitly referring to the hierarchy of species radiations, now also adopted by other resources. The current release provides comprehensive coverage of animals and fungi representing 252 eukaryotic species, and is now extended to prokaryotes with the inclusion of 1115 bacteria. Functional annotations of orthologous groups are provided through mapping to InterPro, GO, OMIM and model organism phenotypes, with cross-references to major resources including UniProt, NCBI and FlyBase. Uniquely, OrthoDB provides computed evolutionary traits of orthologs, such as gene duplicability and loss profiles, divergence rates, sibling groups, and now extended with exon-intron architectures, syntenic orthologs and parent-child trees. The interactive web interface allows navigation along the species phylogenies, complex queries with various identifiers, annotation keywords and phrases, as well as with gene copy-number profiles and sequence homology searches. With the explosive growth of available data, OrthoDB also provides mapping of newly sequenced genomes and transcriptomes to the current orthologous groups.

372 citations


Journal ArticleDOI
TL;DR: How neXtProt contributes to prioritizing C-HPP efforts and integrates C-HPP results with other research efforts to create a complete human proteome catalog is described.
Abstract: About 5000 (25%) of the ~20400 human protein-coding genes currently lack any experimental evidence at the protein level. For many others, there is only little information relative to their abundance, distribution, subcellular localization, interactions, or cellular functions. The aim of the HUPO Human Proteome Project (HPP, www.thehpp.org ) is to collect this information for every human protein. HPP is based on three major pillars: mass spectrometry (MS), antibody/affinity capture reagents (Ab), and bioinformatics-driven knowledge base (KB). To meet this objective, the Chromosome-Centric Human Proteome Project (C-HPP) proposes to build this catalog chromosome-by-chromosome ( www.c-hpp.org ) by focusing primarily on proteins that currently lack MS evidence or Ab detection. These are termed "missing proteins" by the HPP consortium. The lack of observation of a protein can be due to various factors including incorrect and incomplete gene annotation, low or restricted expression, or instability. neXtProt ( www.nextprot.org ) is a new web-based knowledge platform specific for human proteins that aims to complement UniProtKB/Swiss-Prot ( www.uniprot.org ) with detailed information obtained from carefully selected high-throughput experiments on genomic variation, post-translational modifications, as well as protein expression in tissues and cells. This article describes how neXtProt contributes to prioritize C-HPP efforts and integrates C-HPP results with other research efforts to create a complete human proteome catalog.

118 citations


Journal ArticleDOI
17 Apr 2013-PLOS ONE
TL;DR: This study investigates how fully automated text mining of complex biomolecular events can be augmented with a normalization strategy that identifies biological concepts in text, mapping them to identifiers at varying levels of granularity, ranging from canonicalized symbols to unique genes and proteins and broad gene families.
Abstract: Text mining for the life sciences aims to aid database curation, knowledge summarization and information retrieval through the automated processing of biomedical texts. To provide comprehensive coverage and enable full integration with existing biomolecular database records, it is crucial that text mining tools scale up to millions of articles and that their analyses can be unambiguously linked to information recorded in resources such as UniProt, KEGG, BioGRID and NCBI databases. In this study, we investigate how fully automated text mining of complex biomolecular events can be augmented with a normalization strategy that identifies biological concepts in text, mapping them to identifiers at varying levels of granularity, ranging from canonicalized symbols to unique genes and proteins and broad gene families. To this end, we have combined two state-of-the-art text mining components, previously evaluated on two community-wide challenges, and have extended and improved upon these methods by exploiting their complementary nature. Using these systems, we perform normalization and event extraction to create a large-scale resource that is publicly available, unique in semantic scope, and covers all 21.9 million PubMed abstracts and 460 thousand PubMed Central open access full-text articles. This dataset contains 40 million biomolecular events involving 76 million gene/protein mentions, linked to 122 thousand distinct genes from 5032 species across the full taxonomic tree. Detailed evaluations and analyses reveal promising results for application of this data in database and pathway curation efforts. The main software components used in this study are released under an open-source license. Further, the resulting dataset is freely accessible through a novel API, providing programmatic and customized access (http://www.evexdb.org/api/v001/). Finally, to allow for large-scale bioinformatic analyses, the entire resource is available for bulk download from http://evexdb.org/download/, under the Creative Commons – Attribution – Share Alike (CC BY-SA) license.
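A hedged sketch of programmatic access to the EVEX API mentioned above; the base URL is taken from the abstract, but the endpoint path and parameter names below are hypothetical placeholders, shown only to illustrate the kind of customized access the resource describes:

import json
import urllib.request

# Query the EVEX API for events involving a given gene identifier.
# The base URL comes from the article; "events?gene=..." is a
# hypothetical endpoint, not the documented interface.
BASE = "http://www.evexdb.org/api/v001/"
query = BASE + "events?gene=ESR1&format=json"  # hypothetical endpoint

with urllib.request.urlopen(query) as response:
    payload = json.load(response)

print(type(payload), str(payload)[:200])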

113 citations


Journal ArticleDOI
TL;DR: This work presents a method to cluster large protein sequence databases such as UniProt within days down to 20%-30% maximum pairwise sequence identity and compares favorably to CD-HIT and UCLUST in terms of false discovery rate, sensitivity, and speed.
Abstract: Fueled by rapid progress in high-throughput sequencing, the size of public sequence databases doubles every two years. Searching the ever larger and more redundant databases is getting increasingly inefficient. Clustering can help to organize sequences into homologous and functionally similar groups and can improve the speed, sensitivity, and readability of homology searches. However, because the clustering time is quadratic in the number of sequences, standard clustering methods are becoming impracticable. Here we present a method to cluster large protein sequence databases such as UniProt within days down to 20%-30% maximum pairwise sequence identity. kClust owes its speed and sensitivity to an alignment-free prefilter that calculates the cumulative score of all similar 6-mers between pairs of sequences, and to a dynamic programming algorithm that operates on pairs of similar 4-mers. To increase sensitivity further, kClust can run in profile-sequence comparison mode, with profiles computed from the clusters of a previous kClust iteration. kClust is two to three orders of magnitude faster than clustering based on NCBI BLAST, and on multidomain sequences of 20%-30% maximum pairwise sequence identity it achieves comparable sensitivity and a lower false discovery rate. It also compares favorably to CD-HIT and UCLUST in terms of false discovery rate, sensitivity, and speed. kClust fills the need for a fast, sensitive, and accurate tool to cluster large protein sequence databases to below 30% sequence identity. kClust is freely available under GPL at http://toolkit.lmb.uni-muenchen.de/pub/kClust/.
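The alignment-free prefilter can be illustrated with a simplified sketch that counts shared 6-mers between two sequences; kClust itself scores similar (substitution-matrix-matched) 6-mers, which is omitted here for brevity:

from collections import Counter

def kmer_counts(seq, k=6):
    """Count all overlapping k-mers in a protein sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def prefilter_score(seq_a, seq_b, k=6):
    """Cumulative count of shared k-mers between two sequences.

    kClust scores *similar* k-mers with a substitution matrix; exact
    matching is used here only to keep the sketch short.
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    return sum(min(counts_a[kmer], counts_b[kmer]) for kmer in counts_a)

# Pairs scoring above a threshold would be passed on to the slower
# 4-mer dynamic-programming alignment stage.
print(prefilter_score("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
                      "MKTAYIAKQRQISFVKSHFSRQLEERLGLIE"))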

98 citations


Journal ArticleDOI
TL;DR: The Peptide Match service is designed to quickly retrieve all occurrences of a given query peptide from UniProt Knowledgebase (UniProtKB) with isoforms, and supports queries where isobaric leucine and isoleucine are treated as equivalent, as well as dynamic queries to major proteomic databases.
Abstract: Summary: We have developed a new web application for peptide matching using an Apache Lucene-based search engine. The Peptide Match service is designed to quickly retrieve all occurrences of a given query peptide from UniProt Knowledgebase (UniProtKB) with isoforms. The matched proteins are shown in summary tables with rich annotations, including matched sequence region(s) and links to corresponding proteins in a number of proteomic/peptide spectral databases. The results are grouped by taxonomy and can be browsed by organism, taxonomic group or taxonomy tree. The service supports queries where isobaric leucine and isoleucine are treated as equivalent, and an option for searching UniRef100 representative sequences, as well as dynamic queries to major proteomic databases. In addition to the web interface, we also provide RESTful web services. The underlying data are updated every 4 weeks in accordance with the UniProt releases. Availability: http://proteininformationresource.org/peptide.shtml Contact: chenc@udel.edu Supplementary information: Supplementary data are available at Bioinformatics online.
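The isobaric leucine/isoleucine option can be illustrated with a short sketch (toy sequences and logic, not the actual Peptide Match implementation):

def il_normalize(seq):
    """Collapse isobaric leucine and isoleucine to a single symbol."""
    return seq.replace("I", "J").replace("L", "J")

def peptide_occurrences(peptide, proteins):
    """Find all occurrences of a query peptide in a dict of protein
    sequences, treating I and L as equivalent (they have identical mass
    and are indistinguishable in most MS experiments)."""
    query = il_normalize(peptide)
    hits = []
    for accession, seq in proteins.items():
        norm = il_normalize(seq)
        start = norm.find(query)
        while start != -1:
            hits.append((accession, start + 1, start + len(peptide)))
            start = norm.find(query, start + 1)
    return hits

# Toy example; real queries would run against UniProtKB or UniRef100.
proteins = {"P00001": "MKLIVTSAAKENLIVTK", "P00002": "GGKENLLVTKAA"}
print(peptide_occurrences("ENLIVTK", proteins))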

96 citations


Journal ArticleDOI
01 Jan 2013-Database
TL;DR: Since its first release, the database has been extended to cover 50 known protein–protein interaction drug targets, including protein complexes that can be stabilized by small molecules with therapeutic effect.
Abstract: TIMBAL is a database holding molecules of molecular weight <1200 Daltons that modulate protein-protein interactions. Since its first release, the database has been extended to cover 50 known protein-protein interaction drug targets, including protein complexes that can be stabilized by small molecules with therapeutic effect. The resource contains 14 890 data points for 6896 distinct small molecules. UniProt codes and Protein Data Bank entries are also included. Database URL: http://www-cryst.bioc.cam.ac.uk/timbal

75 citations


Journal ArticleDOI
07 May 2013-PLOS ONE
TL;DR: This work aims to reduce the gap by developing an equivalent resource to UniProt called ‘LipidHome’, providing theoretically generated lipid molecules and useful metadata, and a web application developed to present the information and provide computational access via a web service.
Abstract: Protein sequence databases are the pillar upon which modern proteomics is supported, representing a stable reference space of predicted and validated proteins. One example of such resources is UniProt, enriched with both expertly curated and automatic annotations. Taken largely for granted, similar mature resources such as UniProt are not available yet in some other “omics” fields, lipidomics being one of them. While having a seasoned community of wet lab scientists, lipidomics lies significantly behind proteomics in the adoption of data standards and other core bioinformatics concepts. This work aims to reduce the gap by developing an equivalent resource to UniProt called ‘LipidHome’, providing theoretically generated lipid molecules and useful metadata. Using the ‘FASTLipid’ Java library, a database was populated with theoretical lipids, generated from a set of community agreed upon chemical bounds. In parallel, a web application was developed to present the information and provide computational access via a web service. Designed specifically to accommodate high throughput mass spectrometry based approaches, lipids are organised into a hierarchy that reflects the variety in the structural resolution of lipid identifications. Additionally, cross-references to other lipid related resources and papers that cite specific lipids were used to annotate lipid records. The web application encompasses a browser for viewing lipid records and a ‘tools’ section where an MS1 search engine is currently implemented. LipidHome can be accessed at http://www.ebi.ac.uk/apweiler-srv/lipidhome.

71 citations


Journal ArticleDOI
11 Mar 2013-PLOS ONE
TL;DR: This work identified and collected GO resources including genes, proteins, taxonomy and GO relationships from NCBI, UniProt and GO organisations, and developed a PHP web application based on a Model-View-Control architecture and a Java application to extract data from source files and load them into the database automatically.
Abstract: The primary means of classifying new functions for genes and proteins relies on Gene Ontology (GO), which defines genes/proteins using a controlled vocabulary in terms of their Molecular Function, Biological Process and Cellular Component. The challenge is to present this information to researchers to compare and discover patterns in multiple datasets using visually comprehensible and user-friendly statistical reports. Importantly, while there are many GO resources available for eukaryotes, there are none suitable for simultaneous, graphical and statistical comparison between multiple datasets. In addition, none of them supports comprehensive resources for bacteria. By using Streptococcus pneumoniae as a model, we identified and collected GO resources including genes, proteins, taxonomy and GO relationships from NCBI, UniProt and GO organisations. Then, we designed database tables in a PostgreSQL database server and developed a Java application to extract data from the source files and load them into the database automatically. We developed a PHP web application based on a Model-View-Control architecture, and used a specific data structure as well as current and novel algorithms to estimate GO graph parameters. We designed different navigation and visualization methods on the graphs and integrated these into graphical reports. This tool is particularly significant when comparing GO groups between multiple samples (including those of pathogenic bacteria) from different sources simultaneously. Comparing GO protein distribution among up- or downregulated genes from different samples can improve understanding of biological pathways, and mechanism(s) of infection. It can also aid in the discovery of genes associated with specific function(s) for investigation as novel vaccine or therapeutic targets.

65 citations


Journal ArticleDOI
TL;DR: It is estimated that there is good evidence for protein existence for 69% (n = 13985) of the human protein-coding genes, while 23% have only evidence on the RNA level and 7% still lack experimental evidence.
Abstract: A gene-centric Human Proteome Project has been proposed to characterize the human protein-coding genes in a chromosome-centered manner to understand human biology and disease. Here, we report on the protein evidence for all genes predicted from the genome sequence based on manual annotation from literature (UniProt), antibody-based profiling in cells, tissues and organs and analysis of the transcript profiles using next generation sequencing in human cell lines of different origins. We estimate that there is good evidence for protein existence for 69% (n = 13985) of the human protein-coding genes, while 23% have only evidence on the RNA level and 7% still lack experimental evidence. Analysis of the expression patterns shows few tissue-specific proteins and approximately half of the genes expressed in all the analyzed cells. The status for each gene with regards to protein evidence is visualized in a chromosome-centric manner as part of a new version of the Human Protein Atlas (www.proteinatlas.org).

50 citations


Journal ArticleDOI
01 Jan 2013-Database
TL;DR: This work has compiled HypoxiaDB, a database of hypoxia-regulated proteins, a comprehensive, manually-curated, non-redundant catalog of proteins whose expressions are shown experimentally to be altered at different levels and durations of hypoxia.
Abstract: There has been intense interest in the cellular response to hypoxia, and a large number of differentially expressed proteins have been identified through various high-throughput experiments. These valuable data are scattered, and there have been no systematic attempts to document the various proteins regulated by hypoxia. Compilation, curation and annotation of these data are important in deciphering their role in hypoxia and hypoxia-related disorders. Therefore, we have compiled HypoxiaDB, a database of hypoxia-regulated proteins. It is a comprehensive, manually-curated, non-redundant catalog of proteins whose expressions are shown experimentally to be altered at different levels and durations of hypoxia. The database currently contains 72 000 manually curated entries taken on 3500 proteins extracted from 73 peer-reviewed publications selected from PubMed. HypoxiaDB is distinctive from other generalized databases: (i) it compiles tissue-specific protein expression changes under different levels and duration of hypoxia. Also, it provides manually curated literature references to support the inclusion of the protein in the database and establish its association with hypoxia. (ii) For each protein, HypoxiaDB integrates data on gene ontology, KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway, protein-protein interactions, protein family (Pfam), OMIM (Online Mendelian Inheritance in Man), PDB (Protein Data Bank) structures and homology to other sequenced genomes. (iii) It also provides pre-compiled information on hypoxia-proteins, which otherwise requires tedious computational analysis. This includes information like chromosomal location, identifiers like Entrez, HGNC, Unigene, Uniprot, Ensembl, Vega, GI numbers and Genbank accession numbers associated with the protein. These are further cross-linked to respective public databases augmenting HypoxiaDB to the external repositories. (iv) In addition, HypoxiaDB provides an online sequence-similarity search tool for users to compare their protein sequences with HypoxiaDB protein database. We hope that HypoxiaDB will enrich our knowledge about hypoxia-related biology and eventually will lead to the development of novel hypothesis and advancements in diagnostic and therapeutic activities. HypoxiaDB is freely accessible for academic and non-profit users via http://www.hypoxiadb.com.

Journal ArticleDOI
29 May 2013-PLOS ONE
TL;DR: Focusing on citation of entries in the European Nucleotide Archive, UniProt and Protein Data Bank, Europe (PDBe), it is demonstrated that text mining doubles the number of structured annotations of database record citations supplied in journal articles by publishers.
Abstract: Molecular biology and literature databases represent essential infrastructure for life science research. Effective integration of these data resources requires that there are structured cross-references at the level of individual articles and biological records. Here, we describe the current patterns of how database entries are cited in research articles, based on analysis of the full text Open Access articles available from Europe PMC. Focusing on citation of entries in the European Nucleotide Archive (ENA), UniProt and Protein Data Bank, Europe (PDBe), we demonstrate that text mining doubles the number of structured annotations of database record citations supplied in journal articles by publishers. Many thousands of new literature-database relationships are found by text mining, since these relationships are also not present in the set of articles cited by database records. We recommend that structured annotation of database records in articles is extended to other databases, such as ArrayExpress and Pfam, entries from which are also cited widely in the literature. The very high precision and high-throughput of this text-mining pipeline makes this activity possible both accurately and at low cost, which will allow the development of new integrated data services.
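A minimal sketch of the record-citation mining idea for one of the three resources; the regular expression follows the documented UniProtKB accession format, while ENA and PDBe identifiers would need their own patterns and, as in the article, additional context filtering to keep precision high:

import re

# Official UniProtKB accession number pattern (as documented by UniProt).
UNIPROT_AC = re.compile(
    r"\b(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|"
    r"[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})\b"
)

text = ("The structure of human serum albumin (UniProt P02768) was "
        "compared with that of bovine serum albumin (P02769).")
print(UNIPROT_AC.findall(text))  # ['P02768', 'P02769']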

Journal ArticleDOI
TL;DR: A thorough description and analysis of the data made available by MobiDB is presented, providing descriptive statistics on the various available annotation sources, and a novel consensus annotation calculation and its related weighting scheme are described.
Abstract: Intrinsic protein disorder is becoming an increasingly important topic in protein science. During the last few years, intrinsically disordered proteins (IDPs) have been shown to play a role in many important biological processes, e.g. protein signalling and regulation. This has sparked a need to better understand and characterize different types of IDPs, their functions and roles. Our recently published database, MobiDB, provides a centralized resource for accessing and analysing intrinsic protein disorder annotations. Here, we present a thorough description and analysis of the data made available by MobiDB, providing descriptive statistics on the various available annotation sources. Version 1.2.1 of the database contains annotations for ca. 4,500,000 UniProt sequences, covering all eukaryotic proteomes. In addition, we describe a novel consensus annotation calculation and its related weighting scheme. The comparison between disorder information sources highlights how the MobiDB consensus captures the main features of intrinsic disorder and correlates well with manually curated datasets. Finally, we demonstrate the annotation of 13 eukaryotic model organisms through MobiDB's datasets, and of an example protein through the interactive user interface. MobiDB is a central resource for intrinsic disorder research, containing both experimental data and predictions. In the future it will be expanded to include additional information for all known proteins.
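A minimal sketch of a weighted per-residue consensus over disorder annotations; the sources, weights and threshold below are placeholders rather than MobiDB's actual weighting scheme, which is described in the article:

def disorder_consensus(annotations, weights, threshold=0.5):
    """Per-residue weighted consensus over disorder annotations.

    `annotations` maps a source name to a list of 0/1 calls per residue;
    `weights` maps each source to its weight (e.g. curated experimental
    evidence weighted above individual predictors). Values here are
    placeholders, not MobiDB's published scheme.
    """
    length = len(next(iter(annotations.values())))
    total = sum(weights[src] for src in annotations)
    consensus = []
    for i in range(length):
        score = sum(weights[src] * calls[i] for src, calls in annotations.items())
        consensus.append(1 if score / total >= threshold else 0)
    return consensus

annotations = {
    "experimental": [1, 1, 1, 0, 0, 0],
    "predictor_a":  [1, 1, 0, 0, 1, 0],
    "predictor_b":  [0, 1, 1, 0, 1, 0],
}
weights = {"experimental": 2.0, "predictor_a": 1.0, "predictor_b": 1.0}
print(disorder_consensus(annotations, weights))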

Journal ArticleDOI
09 Aug 2013-PLOS ONE
TL;DR: An application called MutationMapper is developed that facilitates greatly the process of validating a potential point mutation identified in an abstract and is demonstrated against several examples including a single sequence and multiple sequence alignments.
Abstract: There has been a rapid increase in the amount of mutational data due to, amongst other things, an increase in single nucleotide polymorphism (SNP) data and the use of site-directed mutagenesis as a tool to help dissect out functional properties of proteins. Many manually curated databases have been developed to index point mutations, but they are not sustainable with the ever-increasing volume of scientific literature. There have been considerable efforts in the automatic extraction of mutation-specific information from raw text involving the use of various text-mining approaches. However, one of the key problems is to link these mutations with their associated proteins and to present this data in such a way that researchers can immediately contextualize it within a structurally related family of proteins. To aid this process, we have developed an application called MutationMapper. Point mutations are extracted from abstracts and are validated against protein sequences in UniProt as far as possible. Our methodology differs in a fundamental way from the usual text-mining approach. Rather than start with abstracts, we start with protein sequences, which facilitates greatly the process of validating a potential point mutation identified in an abstract. The results are displayed as mutations mapped on to the protein sequence or a multiple sequence alignment. The latter enables one to readily pick up mutations performed at equivalent positions in related proteins. We demonstrate the use of MutationMapper against several examples including a single sequence and multiple sequence alignments. The application is available as a web-service at http://mutationmapper.bioch.ox.ac.uk.
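A minimal sketch of the validation step described above: a candidate mutation mention is accepted only if the stated wild-type residue occurs at the stated position of the protein sequence (toy pattern and sequence, not the MutationMapper code):

import re

# Point-mutation mentions of the form <wild-type><position><mutant>,
# e.g. "G12D". Real text mining needs richer patterns; this shows only
# the validation idea against a candidate protein sequence.
MUTATION = re.compile(r"\b([ACDEFGHIKLMNPQRSTVWY])(\d+)([ACDEFGHIKLMNPQRSTVWY])\b")

def validated_mutations(abstract, sequence):
    hits = []
    for wild_type, pos, mutant in MUTATION.findall(abstract):
        index = int(pos) - 1  # positions in the literature are 1-based
        if 0 <= index < len(sequence) and sequence[index] == wild_type:
            hits.append((wild_type, int(pos), mutant))
    return hits

sequence = "MTEYKLVVVGAGGVGKS"  # toy fragment; a real check uses the UniProt entry
abstract = "The oncogenic G12D substitution, unlike the spurious A99T mention."
print(validated_mutations(abstract, sequence))  # only G12D validates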

Journal ArticleDOI
TL;DR: Comparing chemistry and protein target content between 2010 and 2013 indicates quality improvements, major expansion, increased achiral structures and changes in MW distributions, which emphasise the expanding complementarity of chemistry‐to‐protein relationships between sources.
Abstract: ChEMBL, DrugBank, Human Metabolome Database and the Therapeutic Target Database are resources of curated chemistry-to-protein relationships widely used in the chemogenomic arena. In this work we have extended an earlier analysis (PMID 22821596) by comparing chemistry and protein target content between 2010 and 2013. For the former, details are presented for overlaps and differences, statistics of stereochemistry as well as stereo representation and MW profiles between the four databases. For 2013 our results indicate quality improvements, major expansion, increased achiral structures and changes in MW distributions. An orthogonal comparison of chemical content with different sources inside PubChem highlights further interpretable differences. Expansion of protein content by UniProt IDs is also recorded for 2013 and Gene Ontology comparisons for human-only sets indicate differences. These emphasise the expanding complementarity of chemistry-to-protein relationships between sources, although different criteria are used for their capture.

30 Mar 2013
TL;DR: The epitope-receptor interactions are attributed to the epitope's sequence and suggest that in silico proteolysis products showing the highest degree of sequence identity with an epitope or its part are characteristic of a given protein or a group of cross-reactive homologs.
Abstract: The objective of this study was to analyse allergenic proteins by identifying their molecular biomarkers for detection in food using bioinformatics tools. Protein and epitope sequences were taken from the BIOPEP database; proteolysis was simulated using the BIOPEP program, and the UniProt database was screened with the BLAST and FASTA programs. Biomarkers of food proteins were proposed: for example, for whey proteins, TPEVDDEALEKFDKALKALPMHIR (β-Lg, fragment 141-164); for chicken egg, AAVSVDCSEYPKPDCTAEDRPL (ovomucoid, 156-177); for wheat, KCNGTVEQVESIVNTLNAGQIASTDVVEVVVSPPY (triose phosphate isomerase, 12-46); and for peanuts, QARQLKNNNPFKFFVPPFQQSPRAVA (arachin, 505-530). The results are annotated in the BIOPEP database of allergenic proteins and epitopes, available at http://www.uwm.edu.pl/biochemia. The epitope-receptor interactions are attributed to the epitope's sequence and suggest that in silico proteolysis products showing the highest degree of sequence identity with an epitope or its part are characteristic of a given protein or a group of cross-reactive homologs. The protein markers from basic food groups were proposed based on the above assumption.

Journal ArticleDOI
TL;DR: The Online Protein Processing Resource (TOPPR) is presented, an online database that contains thousands of published proteolytically processed sites in human and mouse proteins and provides an online analysis platform, including methods to analyze protease specificity and substrate-centric analyses.
Abstract: We here present The Online Protein Processing Resource (TOPPR; http://iomics.ugent.be/toppr/), an online database that contains thousands of published proteolytically processed sites in human and mouse proteins. These cleavage events were identified with COmbined FRActional DIagonal Chromatography (COFRADIC) proteomics technologies, and the resulting database is provided with full data provenance. Indeed, TOPPR provides an interactive visual display of the actual fragmentation mass spectrum that led to each identification of a reported processed site, complete with fragment ion annotations and search engine scores. Apart from warehousing and disseminating these data in an intuitive manner, TOPPR also provides an online analysis platform, including methods to analyze protease specificity and substrate-centric analyses. Concretely, TOPPR supports three ways to retrieve data: (i) the retrieval of all substrates for one or more cellular stimuli or assays; (ii) a substrate search by UniProtKB/Swiss-Prot accession number, entry name or description; and (iii) a motif search that retrieves substrates matching a user-defined protease specificity profile. The analysis of the substrates is supported through the presence of a variety of annotations, including predicted secondary structure, known domains and experimentally obtained 3D structure where available. Across substrates, substrate orthologs and conserved sequence stretches can also be shown, with iceLogo visualization provided for the latter.

Journal ArticleDOI
01 Jan 2013-Database
TL;DR: This article describes an organelle-focused, manual curation initiative targeting proteins from the human peroxisome, and illustrates with the use of examples how GO annotations now capture cell and tissue type information and the advantages that such an annotation approach provides to users.
Abstract: The Gene Ontology (GO) is the de facto standard for the functional description of gene products, providing a consistent, information-rich terminology applicable across species and information repositories. The UniProt Consortium uses both manual and automatic GO annotation approaches to curate UniProt Knowledgebase (UniProtKB) entries. The selection of a protein set prioritized for manual annotation has implications for the characteristics of the information provided to users working in a specific field or interested in particular pathways or processes. In this article, we describe an organelle-focused, manual curation initiative targeting proteins from the human peroxisome. We discuss the steps taken to define the peroxisome proteome and the challenges encountered in defining the boundaries of this protein set. We illustrate with the use of examples how GO annotations now capture cell and tissue type information and the advantages that such an annotation approach provides to users. Database URL: http://www.ebi.ac.uk/GOA/ and http://www.uniprot.org.

Journal ArticleDOI
01 Jan 2013-Database
TL;DR: The MisPred database contains a collection of abnormal, incomplete and mispredicted protein sequences from 19 metazoan species identified as erroneous by MisPred quality control tools in the UniProtKB/Swiss-Prot, UniProtKB/TrEMBL, NCBI/RefSeq and EnsEMBL databases.
Abstract: Correct prediction of the structure of protein-coding genes of higher eukaryotes is still a difficult task; therefore, public databases are heavily contaminated with mispredicted sequences. The high rate of misprediction has serious consequences because it significantly affects the conclusions that may be drawn from genome-scale sequence analyses of eukaryotic genomes. Here we present the MisPred database and computational pipeline that provide efficient means for the identification of erroneous sequences in public databases. The MisPred database contains a collection of abnormal, incomplete and mispredicted protein sequences from 19 metazoan species identified as erroneous by MisPred quality control tools in the UniProtKB/Swiss-Prot, UniProtKB/TrEMBL, NCBI/RefSeq and EnsEMBL databases. Major releases of the database are automatically generated and updated regularly. The database (http://www.mispred.com) is easily accessible through a simple web interface coupled to a powerful query engine and a standard web service. The content is completely or partially downloadable in a variety of formats. Database URL: http://www.mispred.com

Journal ArticleDOI
TL;DR: In this article, the authors present SNVDis, a tool that allows evaluation of proteome-wide nsSNV distribution in functional sites, domains and pathways by integrating information of active sites, pathways, binding sites, and domains extracted from a number of different sources.

Journal ArticleDOI
TL;DR: A machine learning approach for accurately identifying the candidate genes for tissue specific/selective expression provides an efficient way to select some interesting genes for developing new biomedical markers and improve knowledge of tissue-specific expression.
Abstract: Understanding how genes are expressed specifically in particular tissues is a fundamental question in developmental biology. Many tissue-specific genes are involved in the pathogenesis of complex human diseases. However, experimental identification of tissue-specific genes is time consuming and difficult. The accurate prediction of tissue-specific gene targets could provide useful information for biomarker development and drug target identification. In this study, we have developed a machine learning approach for predicting the human tissue-specific genes using microarray expression data. The lists of known tissue-specific genes for different tissues were collected from the UniProt database, and the expression data retrieved from the previously compiled dataset according to these lists were used for input vector encoding. Random Forests (RFs) and Support Vector Machines (SVMs) were used to construct accurate classifiers. The RF classifiers were found to outperform SVM models for tissue-specific gene prediction. The results suggest that the candidate genes for brain- or liver-specific expression can provide valuable information for further experimental studies. Our approach was also applied for identifying tissue-selective gene targets for different types of tissues. A machine learning approach has been developed for accurately identifying the candidate genes for tissue-specific/selective expression. The approach provides an efficient way to select some interesting genes for developing new biomedical markers and improve our knowledge of tissue-specific expression.
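A minimal sketch of the Random Forest versus SVM comparison using scikit-learn; the synthetic feature matrix below merely stands in for the microarray expression vectors of genes labelled tissue-specific (positives) or not (negatives) in the study:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Placeholder data standing in for the expression-profile feature vectors.
X, y = make_classification(n_samples=400, n_features=50, n_informative=10,
                           random_state=0)

for name, model in [("Random Forest", RandomForestClassifier(n_estimators=200,
                                                              random_state=0)),
                    ("SVM (RBF)", SVC(kernel="rbf", C=1.0))]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")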

Journal ArticleDOI
TL;DR: This work introduces an approach that structures the protein sequence space at the peptide level using theoretical and empirical information from large-scale proteomic data to generate a mass spectrometry-centric protein sequence database (MScDB).
Abstract: Protein sequence databases are indispensable tools for life science research including mass spectrometry (MS)-based proteomics. In current database construction processes, sequence similarity clustering is used to reduce redundancies in the source data. Albeit powerful, it ignores the peptide-centric nature of proteomic data and the fact that MS is able to distinguish similar sequences. Therefore, we introduce an approach that structures the protein sequence space at the peptide level using theoretical and empirical information from large-scale proteomic data to generate a mass spectrometry-centric protein sequence database (MScDB). The core modules of MScDB are an in-silico proteolytic digest and a peptide-centric clustering algorithm that groups protein sequences that are indistinguishable by mass spectrometry. Analysis of various MScDB use cases against five complex human proteomes resulted in 69 peptide identifications not present in UniProtKB as well as 79 putative single amino acid polymorphisms....
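A minimal sketch of the peptide-centric grouping idea: after a simplified in-silico tryptic digest, proteins with identical observable peptide sets are collapsed into one group (the real MScDB pipeline also uses peptide masses and empirical spectral data):

import re
from collections import defaultdict

def tryptic_peptides(sequence, min_length=6):
    """Very simplified in-silico tryptic digest: cleave after K or R
    unless followed by P, no missed cleavages."""
    peptides = re.split(r"(?<=[KR])(?!P)", sequence)
    return frozenset(p for p in peptides if len(p) >= min_length)

def ms_indistinguishable_groups(proteins):
    """Group protein sequences whose observable peptide sets are identical,
    i.e. that mass spectrometry could not tell apart."""
    groups = defaultdict(list)
    for accession, seq in proteins.items():
        groups[tryptic_peptides(seq)].append(accession)
    return list(groups.values())

proteins = {
    "ISOFORM_1": "MSTAGKVIKCKAATLPR",
    "ISOFORM_2": "MSTAGKVIKCKAATLPR",   # identical peptide content
    "OTHER":     "MDELTAKGGIRNNWLPK",
}
print(ms_indistinguishable_groups(proteins))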

Journal ArticleDOI
TL;DR: It is proved that even remote homologs can be safely endowed with structural templates and GO and/or Pfam terms, provided that annotation is done within clusters of related protein sequences where statistical validation of the shared structural and functional features is possible.
Abstract: Background: In the genomic era, a key issue is protein annotation, namely how to endow protein sequences, upon translation from the corresponding genes, with structural and functional features. Routinely, this operation is done electronically by deriving and integrating information from previous knowledge. The reference database for protein sequences is UniProtKB, divided into two sections: UniProtKB/TrEMBL, which is automatically annotated and not reviewed, and UniProtKB/Swiss-Prot, which is manually annotated and reviewed. The annotation process is essentially based on sequence similarity searches. The question therefore arises as to what extent annotation based on transfer by inheritance is valuable and, specifically, whether it is possible to statistically validate inherited features when little homology exists between the target sequence and its template(s).

Journal ArticleDOI
TL;DR: The Variability Analysis in Networks (VAN) software package is presented: a collection of R functions to streamline this bioinformatics analysis and provides a comprehensive and user-friendly platform for the integrative analysis of -omics data to identify disease-associated network modules.
Abstract: Large-scale molecular interaction networks are dynamic in nature and are of special interest in the analysis of complex diseases, which are characterized by network-level perturbations rather than changes in individual genes/proteins. The methods developed for the identification of differentially expressed genes or gene sets are not suitable for network-level analyses. Consequently, bioinformatics approaches that enable a joint analysis of high-throughput transcriptomics datasets and large-scale molecular interaction networks for identifying perturbed networks are gaining popularity. Typically, these approaches require the sequential application of multiple bioinformatics techniques – ID mapping, network analysis, and network visualization. Here, we present the Variability Analysis in Networks (VAN) software package: a collection of R functions to streamline this bioinformatics analysis. VAN determines whether there are network-level perturbations across biological states of interest. It first identifies hubs (densely connected proteins/microRNAs) in a network and then uses them to extract network modules (comprising a hub and all its interaction partners). The function identifySignificantHubs identifies dysregulated modules (i.e. modules with changes in expression correlation between a hub and its interaction partners) using a single expression and network dataset. The function summarizeHubData identifies dysregulated modules based on a meta-analysis of multiple expression and/or network datasets. VAN also converts protein identifiers present in a MITAB-formatted interaction network to gene identifiers (UniProt identifier to Entrez identifier or gene symbol using the function generatePpiMap) and generates microRNA-gene interaction networks using TargetScan and Microcosm databases (generateMicroRnaMap). The function obtainCancerInfo is used to identify hubs (corresponding to significantly perturbed modules) that are already causally associated with cancer(s) in the Cancer Gene Census database. Additionally, VAN supports the visualization of changes to network modules in R and Cytoscape (visualizeNetwork and obtainPairSubset, respectively). We demonstrate the utility of VAN using gene expression data from metastatic melanoma and a protein-protein interaction network from the Human Protein Reference Database. Our package provides a comprehensive and user-friendly platform for the integrative analysis of -omics data to identify disease-associated network modules. This bioinformatics approach, which is essentially focused on the question of explaining phenotype with a 'network type' and in particular, how regulation is changing among different states of interest, is relevant to many questions including those related to network perturbations across developmental timelines.

Book ChapterDOI
TL;DR: Although the field of glycome informatics has established several methods, standards and technologies for carbohydrate analysis, the analysis of glycoproteins and other glycoconjugates is still in its infancy, and some prospects on this area of research will be given.
Abstract: Although the field of glycome informatics has established several methods, standards and technologies for carbohydrate analysis, the analysis of glycoproteins and other glycoconjugates is still in its infancy. However, from even before the term "glycome informatics" emerged, several groups have developed methods and tools for the analysis of glycosylation sites. In particular, the ExPASy server has provided such tools to aid in the prediction of glycosylation sites of N- and O-glycans, while glycosciences.de has provided tools for the analysis of the amino acid distribution around glycosylation sites in 3D space, based on data from the Protein Data Bank (PDB). In addition to these tools, databases of glycoprotein information are available that may aid in glycoprotein prediction; GlycoProtDB is a database of glycoprotein information characterized by the Japanese Consortium for Glycobiology and Glycotechnology, and UniProt includes glycosylation site information along with its protein sequence data. Furthermore, the providers of the glycosylation tools on ExPASy, the Center for Biological Sequence Analysis, also provide a database of O-glycosylation called O-GlycBase. Such databases may eventually aid in the development of glycoprotein-analysis tools as more consistent data is accumulated, and some prospects on this area of research will be given. This chapter will introduce various tools and methods that are available for the analysis of glycoproteins in general. To date, the majority of these tools pertain to glycosylation site prediction. A few tools provided by the glycosciences.de web portal provide statistical tools for the analysis of amino acids surrounding glycans as found in the data of the Protein Data Bank (PDB). A description of potentially useful databases pertaining to glycoproteins will also be introduced. The URLs for these resources are listed in Table 1. Each resource will be described in different subsections, and in summary, perspectives on future glycoproteomic research will be given.

Dissertation
01 Jan 2013
TL;DR: A new method, the Biological and Statistical Mean (BSM) score, is introduced to calculate functional similarity between gene products (GPs); it can help to extract biologically relevant and statistically robust information from large-scale biomedical, genomic and proteomic data sources.
Abstract: 1.2 billion users on Facebook, 17 million articles in Wikipedia, and 190 million tweets per day have demanded a significant increase in information processing through the Internet in recent years. Similarly, life sciences and bioinformatics have also faced issues of processing big data due to the explosion of publicly available genomic information resulting from the Human Genome Project (HGP) and the increasing usage of high-throughput technology. HGP was completed in 2003 and resulted in identifying 20,000-25,000 genes in human DNA and determining the sequences of three billion human base pairs. This information requires a huge amount of data storage and becomes difficult to process using on-hand database management tools or traditional data processing applications. This thesis introduces a new method, the Biological and Statistical Mean (BSM) score, to calculate functional similarity between gene products (GPs), which can help to extract biologically relevant and statistically robust information from large-scale biomedical, genomic and proteomic data sources. The BSM score is defined by 16 different scoring matrices derived from principles of multi-view learning in machine learning and five different databases including Gene Ontology, UniProt, SCOP, CATH, and KUPS. The proposed method also shows how diverse databases and principles in machine learning theory can be integrated into a simple scoring function, and how this simple concept can have a significant impact on studies in biomedical and human life sciences. Comprehensive evaluations and performance comparisons with other conventional methods show that the BSM score clearly outperforms other methods in terms of sensitivity of clustering similar functional groups and coverage of identifying related genes. As part of potential applications handling large amounts of diverse data sources in the medical domain, this thesis introduces similarity-based drug target identification and disease networks using BSM scores. An application of the BSM score is freely available through http://www.ittc.ku.edu/chenlab/goal/

01 Jan 2013
TL;DR: Bio4j is a bioinformatics graph-based DB including most data available in UniProtKB (Swiss-Prot + TrEMBL), Gene Ontology (GO), UniRef (50, 90, 100), RefSeq, the NCBI taxonomy, and the ExPASy Enzyme DB.
Abstract: Today's biology often involves the use of different omics approaches; in particular, data and information from genomes and proteins are frequently difficult to integrate. In order to manage in an integrated way the information about complete genomes and proteomes in BG7-based projects, we developed the Bio4j platform (www.bio4j.com). Bio4j is a bioinformatics graph-based DB including most data available in UniProtKB (Swiss-Prot + TrEMBL), Gene Ontology (GO), UniRef (50, 90, 100), RefSeq, the NCBI taxonomy, and the ExPASy Enzyme DB. The current version of Bio4j (0.7) includes 530,642,683 relationships and 76,071,411 nodes. Bio4j uses Neo4j technology, another open source project. Since Bio4j is based on the Neo4j graph-based DB, it is highly scalable. New data sources and features can be added and, what is more important, the Java API allows you to easily incorporate your own data into Bio4j so you can make the best out of it. Performance is one of the main advantages of the platform. In Bio4j, data is organized in a way semantically equivalent to what it represents, thanks to the graph structure. That means that queries which would even be impossible to perform with a standard relational DB take just a couple of seconds with Bio4j. Bio4j is an open source platform released under AGPLv3. Future developments: data from MetaCyc are being included in Bio4j. Integration strategies for data from different technologies can take advantage of the Bio4j platform, since it has already integrated data from UniProt, genomes (RefSeq) and the NCBI taxonomy. Bio4j is freely available.

Journal ArticleDOI
TL;DR: MRMPath and MRMutation are platform-independent, web-based bioinformatics tools that facilitate the recovery by biologists of information related to biological pathways and the extraction of information relevant to quantitative mass spectrometry analysis.
Abstract: Quantitative proteomics applications in mass spectrometry depend on the knowledge of the mass-to-charge ratio (m/z) values of proteotypic peptides for the proteins under study and their product ions. MRMPath and MRMutation are platform-independent, web-based bioinformatics tools that facilitate the recovery of this information by biologists. MRMPath utilizes publicly available information related to biological pathways in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. All the proteins involved in pathways of interest are recovered and processed in silico to extract information relevant to quantitative mass spectrometry analysis. Peptides may also be subjected to automated BLAST analysis to determine whether they are proteotypic. MRMutation catalogs and makes available, following processing, known (mutant) variants of proteins from the current UniProtKB database. All these results, available via the web from well-maintained, public databases, are written to an Excel spreadsheet, which the user can download and save. MRMPath and MRMutation can be freely accessed. As a system that seeks to allow two or more resources to interoperate, MRMPath represents an advance in bioinformatics tool development. As a practical matter, the MRMPath automated approach represents significant time savings to researchers.
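A short sketch of the precursor m/z arithmetic that underlies this kind of MRM assay information; the peptide and charge state below are arbitrary examples, not values from the article:

# Monoisotopic residue masses (Da) and constants for computing the
# precursor m/z of an unmodified proteotypic peptide.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276

def precursor_mz(peptide, charge=2):
    """Monoisotopic m/z of an unmodified peptide at the given charge."""
    neutral_mass = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    return (neutral_mass + charge * PROTON) / charge

print(round(precursor_mz("SAMPLER", charge=2), 4))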

Journal Article
TL;DR: In this article, the use of neural networks, Monte Carlo, support vector machine (SVM), and data mining techniques to predict the fold of a query protein from its primary sequence has been discussed.
Abstract: The rapid growth in genomic and proteomic data raises many challenges that need powerful solutions. It is worth noting that the UniProtKB/TrEMBL database release of 28-Nov-2012 contains 28,395,832 protein sequence entries, while the number of stored protein structures in the Protein Data Bank (PDB, 4-12-2012) is 65,643. Thus, the need for extracting structural information through computational analysis of protein sequences has become very important; in particular, the prediction of the fold of a query protein from its primary sequence has become very challenging. Traditional computational methods are not powerful enough to address these challenges. Researchers have examined the use of many techniques such as neural networks, Monte Carlo methods, support vector machines and data mining techniques. This paper puts a spotlight on this growing field and covers the main approaches and perspectives for handling this problem.

01 Jan 2013
TL;DR: It is demonstrated that the three studies have the ability to predict and provide new insights in classifying misannotated proteins, understanding protein binding patterns, and identifying a potentially new model for gene regulation.
Abstract: Proteins are the principal catalytic agents, structural elements, signal transmitters, transporters, and molecular machines in cells. Experimental determination of protein function is expensive in time and resources compared to computational methods. Hence, assigning proteins function, predicting protein binding patterns, and understanding protein regulation are important problems in functional genomics and key challenges in bioinformatics. This dissertation comprises of three studies. In the first two papers, we apply machine-learning methods to (1) identify misannotated sequences and (2) predict the binding patterns of proteins. The third paper is (3) a genome-wide analysis of G4-quadruplex sequences in the maize genome. The first two papers are based on two-stage classification methods. The first stage uses machine-learning approaches that combine composition-based and sequence-based features. We use either a decision trees (HDTree) or support vector machines (SVM) as second-stage classifiers and show that classification performance reaches or outperforms more computationally expensive approaches. For study (1) our method identified potential misannotated sequences within a well-characterized set of proteins in a popular bioinformatics database. We identified misannotated proteins and show the proteins have contradicting AmiGO and UniProt annotations. For study (2), we developed a three-phase approach: Phase I classifies whether a protein binds with another protein. Phase II determines whether a protein-binding protein is a hub. Phase III classifies hub proteins based on the number of binding sites and the number of concurrent binding partners. For study (3), we carried out a computational genome-wide screen to identify non-telomeric G4-quadruplex (G4Q) elements in maize to explore their potential role in gene regulation for flowering plants. Analysis of G4Q-containing genes uncovered a striking tendency for their enrichment in genes of networks and pathways associated with electron transport, sugar degradation, and hypoxia responsiveness. The maize G4Q elements may play a previously unrecognized role in coordinating global regulation of gene expression in response to hypoxia to control carbohydrate metabolism for anaerobic metabolism. We demonstrated that our three studies have the ability to predict and provide new insights in classifying misannotated proteins, understanding protein binding patterns, and identifying a potentially new model for gene regulation.