Journal ArticleDOI

BioLiP: a semi-manually curated database for biologically relevant ligand–protein interactions

18 Oct 2012-Nucleic Acids Research (Oxford University Press)-Vol. 41, pp 1096-1103
TL;DR: To facilitate template-based ligand–protein docking, virtual ligand screening and protein function annotations, a hierarchical procedure for assessing the biological relevance of ligands present in the PDB structures is developed which involves a four-step biological feature filtering followed by careful manual verifications.
Abstract: BioLiP (http://zhanglab.ccmb.med.umich.edu/BioLiP/) is a semi-manually curated database for biologically relevant ligand-protein interactions. Establishing interactions between protein and biologically relevant ligands is an important step toward understanding the protein functions. Most ligand-binding sites prediction methods use the protein structures from the Protein Data Bank (PDB) as templates. However, not all ligands present in the PDB are biologically relevant, as small molecules are often used as additives for solving the protein structures. To facilitate template-based ligand-protein docking, virtual ligand screening and protein function annotations, we develop a hierarchical procedure for assessing the biological relevance of ligands present in the PDB structures, which involves a four-step biological feature filtering followed by careful manual verifications. This procedure is used for BioLiP construction. Each entry in BioLiP contains annotations on: ligand-binding residues, ligand-binding affinity, catalytic sites, Enzyme Commission numbers, Gene Ontology terms and cross-links to the other databases. In addition, to facilitate the use of BioLiP for function annotation of uncharacterized proteins, a new consensus-based algorithm COACH is developed to predict ligand-binding sites from protein sequence or using 3D structure. The BioLiP database is updated weekly and the current release contains 204 223 entries.
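As a rough illustration of the per-entry annotations listed in the abstract, the sketch below models one entry as a Python dataclass. The class name, field names and placeholder values are all hypothetical and chosen only for readability; this is not the official BioLiP file format or schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class BioLiPEntry:
    """Hypothetical record for one receptor chain bound to one ligand.

    Field names are illustrative assumptions, not the BioLiP schema.
    """
    pdb_id: str
    chain: str
    ligand_id: str
    binding_residues: List[str] = field(default_factory=list)    # ligand-binding residues
    binding_affinity: Optional[str] = None                       # free text; None if unreported
    catalytic_residues: List[str] = field(default_factory=list)  # catalytic sites
    ec_numbers: List[str] = field(default_factory=list)          # Enzyme Commission numbers
    go_terms: List[str] = field(default_factory=list)            # Gene Ontology terms
    cross_links: Dict[str, str] = field(default_factory=dict)    # database name -> accession

    def has_affinity(self) -> bool:
        """True when a binding affinity annotation is present."""
        return self.binding_affinity is not None


# Placeholder values only; they do not describe a real PDB structure.
entry = BioLiPEntry(
    pdb_id="XXXX",
    chain="A",
    ligand_id="LIG",
    binding_residues=["H57", "D102", "S195"],
)
print(entry.has_affinity())
```

Keeping the affinity optional reflects that not every entry carries binding affinity data, while list-valued fields default to empty rather than missing.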


Citations
Journal ArticleDOI
TL;DR: A stand-alone I-TASSER Suite for off-line protein structure and function prediction is developed, together with three complementary algorithms to enhance function inferences, the consensus of which is derived by COACH using support vector machines.
Abstract: The lowest free-energy conformations are identified by structure clustering. A second round of assembly simulation is conducted, starting from the centroid models, to remove steric clashes and refine global topology. Final atomic structure models are constructed from the low-energy conformations by a two-step atomic-level energy minimization approach. The correctness of the global model is assessed by the confidence score, which is based on the significance of threading alignments and the density of structure clustering; the residue-level local quality of the structural models and B factor of the target protein are evaluated by a newly developed method, ResQ, built on the variation of modeling simulations and the uncertainty of homologous alignments through support vector regression training. For function annotation, the structure models with the highest confidence scores are matched against the BioLiP (ref. 5) database of ligand-protein interactions to detect homologous function templates. Functional insights on ligand-binding site (LBS), Enzyme Commission (EC) and Gene Ontology (GO) are deduced from the functional templates. We developed three complementary algorithms (COFACTOR, TM-SITE and S-SITE) to enhance function inferences, the consensus of which is derived by COACH (ref. 4) using support vector machines. Detailed instructions for installation, implementation and result interpretation of the Suite can be found in the Supplementary Methods and Supplementary Tables 1 and 2. The I-TASSER Suite pipeline was tested in recent communitywide structure and function prediction experiments, including CASP10 (ref. 1) and CAMEO (ref. 2). Overall, I-TASSER generated the correct fold with a template modeling score (TM-score) >0.5 for 10 out of 36 “New Fold” (NF) targets in CASP10, which have no homologous templates in the Protein Data Bank (PDB). 
Of the 110 template-based modeling targets, 92 had a TM-score >0.5, and 89 had the templates drawn closer to the native with an average r.m.s. deviation improvement of 1.05 Å in the same threading-aligned regions (ref. 6). In CAMEO, COACH generated LBS predictions for 4,271 targets with an average accuracy of 0.86, which was 20% higher than that of the second-best method in the experiment. Here we illustrate I-TASSER Suite–based structure and function modeling using six examples (Fig. 1b–g) from the communitywide blind tests (refs. 1,2). R0006 and R0007 are two NF targets from CASP10, and I-TASSER constructed models of correct fold with a TM-score of 0.62 for both targets (Fig. 1b,c). An illustration of local quality estimation by ResQ is shown for T0652, which has an average error of 0.75 Å compared to the actual deviation of the model from the native (Fig. 1h). The four LBS prediction examples (Fig. 1d–g) are from CASP10 (ref. 1) and CAMEO (ref. 2); COACH generated ligand models all with a ligand r.m.s. deviation below 2 Å. COACH also correctly assigned the three- and four-digit EC numbers to the enzyme targets C0050 and C0046 (Supplementary Table 3). In summary, we developed a stand-alone I-TASSER Suite that can be used for off-line protein structure and function prediction.
Title: The I-TASSER Suite: protein structure and function prediction

4,693 citations

Journal ArticleDOI
TL;DR: Recent developments of the I-TASSER server focus on new methods for atomic-level structure refinement, local structure quality estimation and biological function annotation, designed to address the requirements of the user community and to increase the accuracy of modeling predictions.
Abstract: The I-TASSER server (http://zhanglab.ccmb.med.umich.edu/I-TASSER) is an online resource for automated protein structure prediction and structure-based function annotation. In I-TASSER, structural templates are first recognized from the PDB using multiple threading alignment approaches. Full-length structure models are then constructed by iterative fragment assembly simulations. The functional insights are finally derived by matching the predicted structure models with known proteins in the function databases. Although the server has been widely used for various biological and biomedical investigations, numerous comments and suggestions have been reported from the user community. In this article, we summarize recent developments on the I-TASSER server, which were designed to address the requirements from the user community and to increase the accuracy of modeling predictions. Focuses have been made on the introduction of new methods for atomic-level structure refinement, local structure quality estimation and biological function annotations. We expect that these new developments will improve the quality of the I-TASSER server and further facilitate its use by the community for high-resolution structure and function prediction.

1,698 citations


Cites methods from "BioLiP: a semi-manually curated dat..."

  • ...The major new developments include: (i) a new approach in estimating residue-level local quality of the structural models, which are critical to guide functional studies by the biologist users; (ii) an algorithm for B-factor prediction; (iii) methods for atomic-level structure refinement to improve the hydrogen-bonding networks and physical realism of the I-TASSER models; (iv) a consensus-based ligand-binding site prediction that combines structure and sequence profile comparisons by COACH (10); (v) an integration of the new function library BioLiP (18) to increase the coverage of...

    [...]

  • ...Finally, functional insights of the query protein are obtained by matching the structural model with proteins in the BioLiP function library via structure and sequence profile comparisons (8,10,18)....

    [...]

  • ...Consequently, the procedure was used to construct a comprehensive function database, BioLiP (18), from databases of known protein structure/function and the literature in PubMed....

    [...]

  • ...Starting from the amino acid sequence, I-TASSER constructs 3D structural models by reassembling fragments excised from threading templates, where the biological insights of the target proteins are deduced by matching the structure models to known proteins in the functional databases (18)....

    [...]

Journal ArticleDOI
TL;DR: Two new methods for complementary binding site prediction are developed, one based on binding-specific substructure comparison (TM-SITE) and another on sequence profile alignment (S-SITE); they demonstrate a new robust approach to protein-ligand binding site recognition that is ready for genome-wide structure-based function annotations.
Abstract: Motivation: Identification of protein–ligand binding sites is critical to protein function annotation and drug discovery. However, there is no method that could generate optimal binding site prediction for different protein types. Combination of complementary predictions is probably the most reliable solution to the problem. Results: We develop two new methods, one based on binding-specific substructure comparison (TM-SITE) and another on sequence profile alignment (S-SITE), for complementary binding site predictions. The methods are tested on a set of 500 non-redundant proteins harboring 814 natural, drug-like and metal ion molecules. Starting from low-resolution protein structure predictions, the methods successfully recognize >51% of binding residues with average Matthews correlation coefficient (MCC) significantly higher (with P-value <10⁻⁹ in Student's t-test) than other state-of-the-art methods, including COFACTOR, FINDSITE and ConCavity. When combining TM-SITE and S-SITE with other structure-based programs, a consensus approach (COACH) can increase MCC by 15% over the best individual predictions. COACH was examined in the recent community-wide CAMEO experiment and consistently ranked as the best method in the last 22 individual datasets, with an Area Under the Curve score 22.5% higher than the second best method. These data demonstrate a new robust approach to protein–ligand binding site recognition, which is ready for genome-wide structure-based function annotations.
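The abstract above scores binding-site predictions with the Matthews correlation coefficient. A minimal sketch of that metric for residue-level predictions follows, using the standard MCC definition; the function name and the toy residue sets are assumptions made for illustration and are not taken from the TM-SITE, S-SITE or COACH code.

```python
import math
from typing import Set


def binding_site_mcc(predicted: Set[int], actual: Set[int], n_residues: int) -> float:
    """Matthews correlation coefficient for predicted binding residues.

    Residues are identified by index; both sets are assumed to lie within
    a protein of n_residues residues.
    """
    tp = len(predicted & actual)   # predicted and truly binding
    fp = len(predicted - actual)   # predicted but not binding
    fn = len(actual - predicted)   # binding but missed
    tn = n_residues - tp - fp - fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a marginal count is zero
    return (tp * tn - fp * fn) / denom


# Toy example: 100-residue protein, 4 true binding residues, 3 of them recovered.
actual_site = {10, 11, 12, 13}
predicted_site = {10, 11, 12, 50}
print(round(binding_site_mcc(predicted_site, actual_site, 100), 3))
```

MCC is a natural choice here because binding residues are a small minority of any protein, so plain accuracy would reward a predictor that labels nothing as binding.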

715 citations

Journal ArticleDOI
TL;DR: Large-scale benchmark tests show that the new hybrid COFACTOR approach significantly improves the function annotation accuracy of the former structure-based pipeline and other state-of-the-art functional annotation methods, particularly for targets that have no close homology templates.
Abstract: The COFACTOR web server is a unified platform for structure-based multiple-level protein function predictions. By structurally threading low-resolution structural models through the BioLiP library, the COFACTOR server infers three categories of protein functions including gene ontology, enzyme commission and ligand-binding sites from various analogous and homologous function templates. Here, we report recent improvements of the COFACTOR server in the development of new pipelines to infer functional insights from sequence profile alignments and protein-protein interaction networks. Large-scale benchmark tests show that the new hybrid COFACTOR approach significantly improves the function annotation accuracy of the former structure-based pipeline and other state-of-the-art functional annotation methods, particularly for targets that have no close homology templates. The updated COFACTOR server and the template libraries are available at http://zhanglab.ccmb.med.umich.edu/COFACTOR/.

384 citations


Cites background or methods from "BioLiP: a semi-manually curated dat..."

  • ...Enzymatic homologs are identified by aligning the target structure, using TMalign (13), to a library of 8392 enzyme structures from the BioLiP library (12), with the active site residues mapped from the Catalytic Site Atlas database (21)....

    [...]

  • ...Briefly, the query structure is compared to a non-redundant set of known proteins in the BioLiP library (12) through two sets of local and global structural alignments based on the TM-align algorithm (13), for functional homology detections....

    [...]

  • ...First, functional homologies are identified by matching the query structure through a non-redundant set of the BioLiP library (12), which currently contains 58 416 structure templates harboring in total 76 679 ligand-binding sites for interaction between receptor proteins and small molecule compounds, short peptides and nucleic acids....

    [...]

Journal ArticleDOI
TL;DR: This unit describes how to use the I-TASSER protocol to generate structure and function predictions and how to interpret the prediction results, as well as alternative approaches for further improving the I-TASSER modeling quality for distant-homologous and multi-domain protein targets.
Abstract: I-TASSER is a hierarchical protocol for automated protein structure prediction and structure-based function annotation. Starting from the amino acid sequence of target proteins, I-TASSER first generates full-length atomic structural models from multiple threading alignments and iterative structural assembly simulations followed by atomic-level structure refinement. The biological functions of the protein, including ligand-binding sites, enzyme commission number, and gene ontology terms, are then inferred from known protein function databases based on sequence and structure profile comparisons. I-TASSER is freely available as both an on-line server and a stand-alone package. This unit describes how to use the I-TASSER protocol to generate structure and function prediction and how to interpret the prediction results, as well as alternative approaches for further improving the I-TASSER modeling quality for distant-homologous and multi-domain protein targets.

382 citations

References
Journal ArticleDOI
TL;DR: Reports KEGG Mapper, a collection of tools for KEGG PATHWAY, BRITE and MODULE mapping that enables integration and interpretation of large-scale data sets, and describes recent enhancements to the KEGG content, especially the incorporation of disease and drug information used in practice and in society, to support translational bioinformatics.
Abstract: Kyoto Encyclopedia of Genes and Genomes (KEGG, http://www.genome.jp/kegg/ or http://www.kegg.jp/) is a database resource that integrates genomic, chemical and systemic functional information. In particular, gene catalogs from completely sequenced genomes are linked to higher-level systemic functions of the cell, the organism and the ecosystem. Major efforts have been undertaken to manually create a knowledge base for such systemic functions by capturing and organizing experimental knowledge in computable forms; namely, in the forms of KEGG pathway maps, BRITE functional hierarchies and KEGG modules. Continuous efforts have also been made to develop and improve the cross-species annotation procedure for linking genomes to the molecular networks through the KEGG Orthology system. Here we report KEGG Mapper, a collection of tools for KEGG PATHWAY, BRITE and MODULE mapping, enabling integration and interpretation of large-scale data sets. We also report a variant of the KEGG mapping procedure to extend the knowledge base, where different types of data and knowledge, such as disease genes and drug targets, are integrated as part of the KEGG molecular networks. Finally, we describe recent enhancements to the KEGG content, especially the incorporation of disease and drug information used in practice and in society, to support translational bioinformatics.

4,259 citations


"BioLiP: a semi-manually curated dat..." refers background in this paper

  • ...a ligand is removed from the list if it is found to have biological relevance in the related literature of the structure file or is present in the KEGG database (26)....

    [...]

Journal ArticleDOI
TL;DR: The current models for the complexes of Cro, repressor, and CAP with operator DNA are probably fundamentally correct, but it should be emphasized that model building alone, even when coupled with genetic and biochemical studies, cannot be expected to provide a completely reliable "high-resolution" view of the protein-DNA complex.
Abstract: Several general principles emerge from the studies of Cro, lambda repressor, and CAP. The DNA-binding sites are recognized in a form similar to B-DNA. They do not form cruciforms or other novel DNA structures. There seem to be proteins that bind left-handed Z-DNA (87) and DNA in other conformations, but it remains to be seen how these structures are recognized or how proteins recognize specific sequences in single-stranded DNA. Cro, repressor, and CAP use symmetrically related subunits to interact with two-fold related sites in the operator sequences. Many other DNA-binding proteins are dimers or tetramers and their operator sequences have approximate two-fold symmetry. It seems likely that these proteins will, like Cro, repressor, and CAP, form symmetric complexes. However, there is no requirement for symmetry in protein-DNA interactions. Some sequence-specific DNA-binding proteins, like RNA polymerase, do not have symmetrically related subunits and do not bind to symmetric recognition sequences. Cro, repressor, and CAP use alpha-helices for many of the contacts between side chains and bases in the major groove. An adjacent alpha-helical region contacts the DNA backbone and may help to orient the "recognition" helices. This use of alpha-helical regions for DNA binding appears to be a common mode of recognition. Most of the contacts made by Cro, repressor, and CAP occur on one side of the double helix. However, lambda repressor contacts both sides of the double helix by using a flexible region of protein to wrap around the DNA. Recognition of specific base sequences involves hydrogen bonds and van der Waals interactions between side chains and the edges of base pairs. These specific interactions, together with backbone interactions and electrostatic interactions, stabilize the protein-DNA complexes. 
The current models for the complexes of Cro, repressor, and CAP with operator DNA are probably fundamentally correct, but it should be emphasized that model building alone, even when coupled with genetic and biochemical studies, cannot be expected to provide a completely reliable "high-resolution" view of the protein-DNA complex. For example, the use of standard B-DNA geometry for the operator is clearly an approximation.(ABSTRACT TRUNCATED AT 400 WORDS)

1,480 citations


"BioLiP: a semi-manually curated dat..." refers background in this paper

  • ...Binding MOAD excludes small DNA/RNA molecules and metal ions, which are in fact important ligand molecules in many proteins (8,9)....

    [...]

Journal ArticleDOI
TL;DR: BindingDB is a publicly accessible database currently containing ∼20 000 experimentally determined binding affinities of protein–ligand complexes, for 110 protein targets including isoforms and mutational variants, and ∼11 000 small molecule ligands.
Abstract: BindingDB (http://www.bindingdb.org) is a publicly accessible database currently containing ∼20 000 experimentally determined binding affinities of protein–ligand complexes, for 110 protein targets including isoforms and mutational variants, and ∼11 000 small molecule ligands. The data are extracted from the scientific literature, data collection focusing on proteins that are drug-targets or candidate drug-targets and for which structural data are present in the Protein Data Bank. The BindingDB website supports a range of query types, including searches by chemical structure, substructure and similarity; protein sequence; ligand and protein names; affinity ranges and molecular weight. Data sets generated by BindingDB queries can be downloaded in the form of annotated SDfiles for further analysis, or used as the basis for virtual screening of a compound database uploaded by the user. The data in BindingDB are linked both to structural data in the PDB via PDB IDs and chemical and sequence searches, and to the literature in PubMed via PubMed IDs.

1,381 citations


"BioLiP: a semi-manually curated dat..." refers background in this paper

  • ...For each protein chain (called receptor), the information (if any) to be collected includes the following: (i) ligand-binding affinity from manual survey of the original literature and the existing databases of Binding MOAD (5), PDBbind (4) and BindingDB (6); (ii) catalytic site residues mapped from the Catalytic Site Atlas (21); (iii) annotated EC numbers in the COMPND records; GO terms (22) from the GO Annotation database (23); (iv) UniProt accession code (24) mapped from the SIFTS project (25) and (v) the PubMed abstract of the primary literature citation in the ‘JRNL’ record....

    [...]

  • ...The completeness of the binding affinity data in BioLiP is unprecedented, which includes not only all high-quality annotations from the Binding MOAD (5), PDBbind (4) and BindingDB (6) databases but also data obtained by manual survey of the original literature....

    [...]

  • ...For a ligand– protein complex, when no binding affinity data are reported in the literature, the complex is excluded from the PDBbind and BindingDB databases....

    [...]

  • ...BindingDB (6) is a database that collects binding data directly from scientific literatures....

    [...]

Journal ArticleDOI
TL;DR: Construction of a Thermotoga neapolitana adenylate kinase (AK) library using PERMUTE revealed that this approach produces vectors that express circularly permuted proteins with distinct sequence diversity from existing methods.
Abstract: A simple approach for creating libraries of circularly permuted proteins is described that is called PERMutation Using Transposase Engineering (PERMUTE). In PERMUTE, the transposase MuA is used to randomly insert a minitransposon that can function as a protein expression vector into a plasmid that contains the open reading frame (ORF) being permuted. A library of vectors that express different permuted variants of the ORF-encoded protein is created by: (i) using bacteria to select for target vectors that acquire an integrated minitransposon; (ii) excising the ensemble of ORFs that contain an integrated minitransposon from the selected vectors; and (iii) circularizing the ensemble of ORFs containing integrated minitransposons using intramolecular ligation. Construction of a Thermotoga neapolitana adenylate kinase (AK) library using PERMUTE revealed that this approach produces vectors that express circularly permuted proteins with distinct sequence diversity from existing methods. In addition, selection of this library for variants that complement the growth of Escherichia coli with a temperature-sensitive AK identified functional proteins with novel architectures, suggesting that PERMUTE will be useful for the directed evolution of proteins with new functions.

1,093 citations

Journal ArticleDOI
TL;DR: A systematic examination of primary references led to a collection of binding affinity data (Kd, Ki and IC50) for a total of 1359 complexes, organized into a Web-accessible database named the PDBbind database.
Abstract: We have screened the entire Protein Data Bank (Release No. 103, January 2003) and identified 5671 protein−ligand complexes out of 19 621 experimental structures. A systematic examination of the primary references of these entries has led to a collection of binding affinity data (Kd, Ki, and IC50) for a total of 1359 complexes. The outcomes of this project have been organized into a Web-accessible database named the PDBbind database.

750 citations


"BioLiP: a semi-manually curated dat..." refers background in this paper

  • ...PDBbind (4) is another ligand-binding affinity database that has less strict requirements than Binding MOAD (e.g. lower structure resolution, inclusion...

    [...]

  • ...The completeness of the binding affinity data in BioLiP is unprecedented, which includes not only all high-quality annotations from the Binding MOAD (5), PDBbind (4) and BindingDB (6) databases but also data obtained by manual survey of the original literature....

    [...]

  • ...For a ligand– protein complex, when no binding affinity data are reported in the literature, the complex is excluded from the PDBbind and BindingDB databases....

    [...]

  • ...In total, 20 013 entries have binding affinity data, with 10 445 from Binding MOAD, 13 579 from PDBbind, 7179 from Binding DB and 62 from manual survey of the original literature....

    [...]
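The per-source counts in the last excerpt add up to more than the 20 013 entries that have affinity data, which suggests that the sources overlap. The arithmetic can be checked with a short sketch; the variable names are illustrative, and reading the surplus as overlap assumes each per-source figure counts BioLiP entries.

```python
# Counts quoted in the excerpt above.
per_source = {
    "Binding MOAD": 10445,
    "PDBbind": 13579,
    "Binding DB": 7179,
    "manual literature survey": 62,
}
entries_with_affinity = 20013

# Sum of per-source counts versus the number of distinct annotated entries.
total_annotations = sum(per_source.values())
surplus = total_annotations - entries_with_affinity

print(total_annotations)  # 31265
print(surplus)            # 11252
```

Under that assumption, the surplus of 11 252 means that a substantial share of entries carries affinity data from more than one source.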