scispace - formally typeset
Search or ask a question
Journal ArticleDOI

Proposing a highly accurate protein structural class predictor using segmentation-based features.

24 Jan 2014-BMC Genomics (BioMed Central)-Vol. 15, Iss: 1, pp 1-13
TL;DR: By proposing segmented distribution and segmented auto covariance feature extraction methods to capture local and global discriminatory information from evolutionary profiles and predicted secondary structure of the proteins, this study is able to enhance the protein structural class prediction performance significantly.
Abstract: Prediction of the structural classes of proteins can provide important information about their functionalities as well as their major tertiary structures. It is also considered as an important step towards protein structure prediction problem. Despite all the efforts have been made so far, finding a fast and accurate computational approach to solve protein structural class prediction problem still remains a challenging problem in bioinformatics and computational biology. In this study we propose segmented distribution and segmented auto covariance feature extraction methods to capture local and global discriminatory information from evolutionary profiles and predicted secondary structure of the proteins. By applying SVM to our extracted features, for the first time we enhance the protein structural class prediction accuracy to over 90% and 85% for two popular low-homology benchmarks that have been widely used in the literature. We report 92.2% and 86.3% prediction accuracies for 25PDB and 1189 benchmarks which are respectively up to 7.9% and 2.8% better than previously reported results for these two benchmarks. By proposing segmented distribution and segmented auto covariance feature extraction methods to capture local and global discriminatory information from evolutionary profiles and predicted secondary structure of the proteins, we are able to enhance the protein structural class prediction performance significantly.

Content maybe subject to copyright    Report

Citations
More filters
Journal ArticleDOI
TL;DR: This study proposes two segmentation based feature extraction methods and shows that by applying a Support Vector Machine (SVM) classifier to the extracted features, they are able to enhance Gram-positive and Gram-negative subcellular localization prediction accuracies by up to 6.4% better than previous studies including the studies that used GO for feature extraction.

222 citations


Cites background or methods or result from "Proposing a highly accurate protein..."

  • ...…about the statistical significance of our achieved results, we will report Sensitivity, Specificity, and Matthew's Correlation Coefficient (MCC) for each subcellular location as well as for the overall benchmark (Hu et al., 2012; Yu et al., 2013; Dehzangi et al., 2014b; Marcin et al., 2012)....

    [...]

  • ...These feature groups have attained promising results in similar studies (Dehzangi et al., 2014a, 2013b, 2013c, 2014b)....

    [...]

  • ...They have mainly tried to extract this local information using the protein sequence as a single block which has failed to achieve this goal (Nanni et al., 2013b, 2013a; Dehzangi et al., 2014a)....

    [...]

  • ...It was shown in the literature that using these numbers produce similar output as using the probability values (Dehzangi et al., 2014a, 2014b)....

    [...]

  • ...Despite its simplicity, the occurrence feature group has shown its effectiveness in maintaining global discriminatory information with respect to the length of a protein sequence (Dehzangi et al., 2013a; Taguchi and Gromiha, 2007; Dehzangi et al., 2014b)....

    [...]

Journal ArticleDOI
TL;DR: A powerful method to predict protein structural classes for low-similarity sequences is developed on the basis of a very objective and strict benchmark dataset and will provide an important guide to extract valuable information from protein sequences.
Abstract: Protein structural class could provide important clues for understanding protein fold, evolution and function. However, it is still a challenging problem to accurately predict protein structural classes for low-similarity sequences. This paper was devoted to develop a powerful method to predict protein structural classes for low-similarity sequences. On the basis of a very objective and strict benchmark dataset, we firstly extracted optimal tripeptide compositions (OTC) which was picked out by using feature selection technique to formulate protein samples. And an overall accuracy of 91.1% was achieved in jackknife cross-validation. Subsequently, we investigated the accuracies of three popular features: position-specific scoring matrix (PSSM), predicted secondary structure information (PSSI) and the average chemical shift (ACS) for comparison. Finally, to further improve the prediction performance, we examined all combinations of the four kinds of features and achieved the maximum accuracy of 96.7% in jackknife cross-validation by combining OTC with ACS, demonstrating that the model is efficient and powerful. Our study will provide an important guide to extract valuable information from protein sequences.

175 citations

Journal ArticleDOI
TL;DR: The GenoSkyline-Plus annotations demonstrate that integrated genome annotations at the single tissue level provide a valuable tool for understanding the etiology of complex human diseases.
Abstract: Continuing efforts from large international consortia have made genome-wide epigenomic and transcriptomic annotation data publicly available for a variety of cell and tissue types. However, synthesis of these datasets into effective summary metrics to characterize the functional non-coding genome remains a challenge. Here, we present GenoSkyline-Plus, an extension of our previous work through integration of an expanded set of epigenomic and transcriptomic annotations to produce high-resolution, single tissue annotations. After validating our annotations with a catalog of tissue-specific non-coding elements previously identified in the literature, we apply our method using data from 127 different cell and tissue types to present an atlas of heritability enrichment across 45 different GWAS traits. We show that broader organ system categories (e.g. immune system) increase statistical power in identifying biologically relevant tissue types for complex diseases while annotations of individual cell types (e.g. monocytes or B-cells) provide deeper insights into disease etiology. Additionally, we use our GenoSkyline-Plus annotations in an in-depth case study of late-onset Alzheimer's disease (LOAD). Our analyses suggest a strong connection between LOAD heritability and genetic variants contained in regions of the genome functional in monocytes. Furthermore, we show that LOAD shares a similar localization of SNPs to monocyte-functional regions with Parkinson's disease. Overall, we demonstrate that integrated genome annotations at the single tissue level provide a valuable tool for understanding the etiology of complex human diseases. Our GenoSkyline-Plus annotations are freely available at http://genocanyon.med.yale.edu/GenoSkyline.

96 citations

Journal ArticleDOI
TL;DR: This review highlights the importance of membrane proteins and how computational methods are capable of overcoming challenges associated with their experimental characterization and focuses on computational techniques to determine the 3D structure of MP and characterize their binding interfaces.

77 citations

Journal ArticleDOI
TL;DR: A new predictor called PSSM-Suc is proposed which employs evolutionary information of amino acids for predicting succinylated lysine residues using profile bigrams extracted from position specific scoring matrices and showed a significant improvement in performance over state-of-the-art predictors.

65 citations

References
More filters
Journal ArticleDOI
TL;DR: A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original.
Abstract: The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSIBLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.

70,111 citations

Journal ArticleDOI
TL;DR: Issues such as solving SVM optimization problems theoretical convergence multiclass classification probability estimates and parameter selection are discussed in detail.
Abstract: LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users to easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems theoretical convergence multiclass classification probability estimates and parameter selection are discussed in detail.

40,826 citations

Book
Vladimir Vapnik1
01 Jan 1995
TL;DR: Setting of the learning problem consistency of learning processes bounds on the rate of convergence ofLearning processes controlling the generalization ability of learning process constructing learning algorithms what is important in learning theory?
Abstract: Setting of the learning problem consistency of learning processes bounds on the rate of convergence of learning processes controlling the generalization ability of learning processes constructing learning algorithms what is important in learning theory?.

40,147 citations

Journal ArticleDOI
TL;DR: The goals of the PDB are described, the systems in place for data deposition and access, how to obtain further information and plans for the future development of the resource are described.
Abstract: The Protein Data Bank (PDB; http://www.rcsb.org/pdb/ ) is the single worldwide archive of structural data of biological macromolecules. This paper describes the goals of the PDB, the systems in place for data deposition and access, how to obtain further information, and near-term plans for the future development of the resource.

34,239 citations

Journal ArticleDOI
TL;DR: This database provides a detailed and comprehensive description of the structural and evolutionary relationships of the proteins of known structure and provides for each entry links to co-ordinates, images of the structure, interactive viewers, sequence data and literature references.

6,603 citations