Showing papers on "Pseudo amino acid composition" published in 2013


Journal ArticleDOI
TL;DR: propy (protein in python) is a freely available, open source Python package that calculates the widely used structural and physicochemical features of proteins and peptides from amino acid sequence; it can also compute the same descriptors from user-defined properties, which are automatically available from the AAindex database.
Abstract: Summary: Sequence-derived structural and physicochemical features have been frequently used for analysing and predicting structural, functional, expression and interaction profiles of proteins and peptides. To facilitate extensive studies of proteins and peptides, we developed a freely available, open source Python package called protein in python (propy) for calculating the widely used structural and physicochemical features of proteins and peptides from amino acid sequence. It computes five feature groups composed of 13 features, including amino acid composition, dipeptide composition, tripeptide composition, normalized Moreau–Broto autocorrelation, Moran autocorrelation, Geary autocorrelation, sequence-order-coupling number, quasi-sequence-order descriptors, composition, transition and distribution of various structural and physicochemical properties and two types of pseudo amino acid composition (PseAAC) descriptors. These features could be generally regarded as different modes of Chou’s PseAAC. In addition, it can also easily compute the previous descriptors based on user-defined properties, which are automatically available from the AAindex database. Availability: The Python package, propy, is freely available via http://code.google.com/p/protpy/downloads/list, and it runs on Linux and MS-Windows.

388 citations
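
The two simplest descriptors named in the propy abstract above, amino acid composition and dipeptide composition, can be illustrated with a short sketch. This is not propy's API; the function names below are hypothetical and only the Python standard library is used. Each descriptor is a fixed-length vector of relative frequencies (20 values for single residues, 400 for residue pairs).

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Fraction of each standard residue in seq (hypothetical helper, not propy)."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return {a: counts[a] / total for a in AMINO_ACIDS}


def dipeptide_composition(seq):
    """Fraction of each of the 400 ordered residue pairs among overlapping pairs."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    total = sum(counts[k] for k in keys)
    if total == 0:
        raise ValueError("sequence contains no standard dipeptides")
    return {k: counts[k] / total for k in keys}


if __name__ == "__main__":
    aac = amino_acid_composition("AACD")
    dpc = dipeptide_composition("AACD")
    print(aac["A"], len(dpc), dpc["AA"])
```

Both functions return the same number of features for any input length, which is what makes these descriptors usable as input to standard classifiers.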


Journal ArticleDOI
07 Feb 2013-PLOS ONE
TL;DR: It was observed that the overall cross-validation success rate achieved by iSNO-PseAAC in identifying nitrosylated proteins on an independent dataset was over 90%, indicating that the new predictor is quite promising.
Abstract: Posttranslational modifications (PTMs) of proteins are responsible for sensing and transducing signals to regulate various cellular functions and signaling events. S-nitrosylation (SNO) is one of the most important and universal PTMs. With the avalanche of protein sequences generated in the post-genomic age, it is highly desired to develop computational methods for timely identifying the exact SNO sites in proteins because this kind of information is very useful for both basic research and drug development. Here, a new predictor, called iSNO-PseAAC, was developed for identifying the SNO sites in proteins by incorporating the position-specific amino acid propensity (PSAAP) into the general form of pseudo amino acid composition (PseAAC). The predictor was implemented using the conditional random field (CRF) algorithm. As a demonstration, a benchmark dataset was constructed that contains 731 SNO sites and 810 non-SNO sites. To reduce the homology bias, none of these sites were derived from the proteins that had pairwise sequence identity to any other. It was observed that the overall cross-validation success rate achieved by iSNO-PseAAC in identifying nitrosylated proteins on an independent dataset was over 90%, indicating that the new predictor is quite promising. Furthermore, a user-friendly web-server for iSNO-PseAAC was established at http://app.aporc.org/iSNO-PseAAC/, by which users can easily obtain the desired results without the need to follow the mathematical equations involved during the process of developing the prediction method. It is anticipated that iSNO-PseAAC may become a useful high throughput tool for identifying the SNO sites, or at the very least play a complementary role to the existing methods in this area.

358 citations


Journal ArticleDOI
TL;DR: It was observed that the overall success rate achieved by iHSP-PseRAAAC in identifying the functional types of HSPs among the aforementioned six types was more than 87%, which was derived by the jackknife test on a stringent benchmark dataset.

270 citations


Journal ArticleDOI
TL;DR: A new predictor, called iLoc-Animal, has been developed that can be used to deal with the systems containing both single- and multi-label animal (metazoan except human) proteins and the outcomes achieved were quite encouraging, indicating that the predictor may become a useful tool in this area.
Abstract: Predicting protein subcellular localization is a challenging problem, particularly when query proteins have multi-label features, meaning that they may simultaneously exist at, or move between, two or more different subcellular location sites. Most of the existing methods can only be used to deal with single-label proteins. Actually, multi-label proteins should not be ignored because they usually bear some special function worthy of in-depth studies. By introducing the “multi-label learning” approach, a new predictor, called iLoc-Animal, has been developed that can be used to deal with the systems containing both single- and multi-label animal (metazoan except human) proteins. Meanwhile, to measure the prediction quality of a multi-label system in a rigorous way, five indices were introduced; they are “Absolute-True”, “Absolute-False” (or “Hamming-Loss”), “Accuracy”, “Precision”, and “Recall”. As a demonstration, the jackknife cross-validation was performed with iLoc-Animal on a benchmark dataset of animal proteins classified into the following 20 location sites: (1) acrosome, (2) cell membrane, (3) centriole, (4) centrosome, (5) cell cortex, (6) cytoplasm, (7) cytoskeleton, (8) endoplasmic reticulum, (9) endosome, (10) extracellular, (11) Golgi apparatus, (12) lysosome, (13) mitochondrion, (14) melanosome, (15) microsome, (16) nucleus, (17) peroxisome, (18) plasma membrane, (19) spindle, and (20) synapse, where many proteins belong to two or more locations. For such a complicated system, the outcomes achieved by iLoc-Animal for all the aforementioned five indices were quite encouraging, indicating that the predictor may become a useful tool in this area. It has not escaped our notice that the multi-label approach and the rigorous measurement metrics can also be used to investigate many other multi-label problems in molecular biology.
As a user-friendly web-server, iLoc-Animal is freely accessible to the public at the web-site http://www.jci-bioinfo.cn/iLoc-Animal.

232 citations
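
The iLoc-Animal abstract above names five multi-label indices but does not give their formulas. The sketch below uses the set-based definitions commonly associated with these names; treat them as an assumption rather than the paper's exact equations. Each protein has a true label set and a predicted label set, and `n_labels` is the total number of location classes (20 in the abstract). The function name is hypothetical.

```python
def multilabel_metrics(true_sets, pred_sets, n_labels):
    """Average set-based multi-label scores over all samples (assumed definitions)."""
    if len(true_sets) != len(pred_sets) or not true_sets:
        raise ValueError("need two non-empty lists of equal length")
    totals = {"precision": 0.0, "recall": 0.0, "accuracy": 0.0,
              "absolute_true": 0.0, "absolute_false": 0.0}
    for y, p in zip(true_sets, pred_sets):
        y, p = set(y), set(p)
        inter = len(y & p)   # labels predicted correctly
        union = len(y | p)   # labels that are true or predicted
        totals["precision"] += inter / len(p) if p else 0.0
        totals["recall"] += inter / len(y) if y else 0.0
        totals["accuracy"] += inter / union if union else 1.0
        totals["absolute_true"] += 1.0 if y == p else 0.0
        # Hamming-loss style term: mismatched labels over all possible labels
        totals["absolute_false"] += (union - inter) / n_labels
    n = len(true_sets)
    return {name: value / n for name, value in totals.items()}


if __name__ == "__main__":
    scores = multilabel_metrics([{1, 2}, {3}], [{1}, {3}], n_labels=4)
    for name, value in scores.items():
        print(name, value)
```

Absolute-True is the strictest of the five: a protein counts only when its predicted label set matches the true set exactly.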


Journal ArticleDOI
TL;DR: A novel computational classifier predicts membrane protein types from protein sequences using sequence attributes including the cationic patch sizes and the orientation and topology of transmembrane segments; most of the sequence attributes implemented in the classifier are supported by evidence in the literature.

123 citations


Journal ArticleDOI
TL;DR: An efficient GO method called GOASVM is proposed that exploits the information from the GO term frequencies and distant homologs to represent a protein in the general form of Chou's pseudo-amino acid composition.

104 citations


Journal ArticleDOI
TL;DR: Motivated by the success of the pseudo amino acid composition (PseAAC) proposed by Chou, this approach for protein remote homology detection achieves superior or comparable performance with current state‐of‐the‐art methods.
Abstract: Protein remote homology detection is a key problem in bioinformatics. Currently the discriminative methods, such as Support Vector Machine (SVM) can achieve the best performance. The most efficient approach to improve the performance of SVM-based methods is to find a general protein representation method that is able to convert proteins with different lengths into fixed length vectors and captures the different properties of the proteins for the discrimination. The bottleneck of designing the protein representation method is that native proteins have different lengths. Motivated by the success of the pseudo amino acid composition (PseAAC) proposed by Chou, we applied this approach for protein remote homology detection. Some new indices derived from the amino acid index (AAIndex) database are incorporated into the PseAAC to improve the generalization ability of this method. Finally, the performance is further improved by combining the modified PseAAC with profile-based protein representation containing the evolutionary information extracted from the frequency profiles. Our experiments on a well-known benchmark show this method achieves superior or comparable performance with current state-of-the-art methods.

98 citations
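
The remote homology abstract above relies on PseAAC to turn proteins of different lengths into fixed-length vectors. The following minimal sketch follows the commonly cited type 1 formulation: 20 normalized residue frequencies followed by `lam` sequence-order correlation factors, scaled by a weight. Two choices here are assumptions for illustration: a single property (the Kyte-Doolittle hydropathy scale) is swapped in for the property set used in the original formulation, and the defaults `lam=3` and `weight=0.05` are arbitrary. The function name `pseaac` is hypothetical.

```python
from statistics import mean, pstdev

# Kyte-Doolittle hydropathy values; using this single property is an assumption.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
      "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
      "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}
AMINO_ACIDS = sorted(KD)


def _standardize(table):
    """Rescale property values to zero mean and unit variance over the 20 residues."""
    mu, sd = mean(table.values()), pstdev(table.values())
    return {a: (v - mu) / sd for a, v in table.items()}


def pseaac(seq, lam=3, weight=0.05):
    """Return a (20 + lam)-dimensional PseAAC-style vector for seq."""
    seq = seq.upper()
    if any(r not in KD for r in seq):
        raise ValueError("sequence contains non-standard residues")
    if len(seq) <= lam:
        raise ValueError("sequence must be longer than lam")
    h = _standardize(KD)
    n = len(seq)
    freqs = [seq.count(a) / n for a in AMINO_ACIDS]
    # theta_k: mean squared property difference between residues k positions apart
    thetas = [mean((h[seq[i]] - h[seq[i + k]]) ** 2 for i in range(n - k))
              for k in range(1, lam + 1)]
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]


if __name__ == "__main__":
    short = pseaac("MKTAYIAKQR")
    longer = pseaac("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
    print(len(short), len(longer))
```

Sequences of any length longer than `lam` map to vectors of the same dimension, and the components sum to 1 by construction.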


Journal ArticleDOI
TL;DR: Two different feature extraction methods and two different neural network models were applied to three benchmark datasets of different kinds of proteins, i.e. datasets constructed specially for Gram-positive bacterial proteins, plant proteins and virus proteins; the results show that the RBF neural network is apparently superior to the BP neural network on these datasets no matter which type of feature extraction is chosen.
Abstract: Prediction of protein subcellular location is a meaningful task which has attracted much attention in recent years. Many protein subcellular location predictors have been developed that can only deal with single-location proteins. However, some proteins may belong to two or even more subcellular locations. It is important to develop predictors which are able to deal with multiplex proteins, because these proteins have extremely useful implications in both basic biological research and drug discovery. Considering that the number of methods dealing with multiplex proteins is limited, it is meaningful to explore new methods which can predict the subcellular location of proteins with both single and multiple sites. Applying different feature extraction methods and different prediction algorithms to different benchmark datasets may yield some general results. In this paper, two different feature extraction methods and two different models of neural networks were applied to three benchmark datasets of different kinds of proteins, i.e. datasets constructed specially for Gram-positive bacterial proteins, plant proteins and virus proteins. These benchmark datasets have different numbers of location sites. The results show that the RBF neural network is apparently superior to the BP neural network on these datasets no matter which type of feature extraction is chosen.

70 citations


Journal ArticleDOI
TL;DR: Pseudo–amino acid composition, which has proven to be a very efficient tool in representing protein sequences, and a multilabel KNN algorithm are used to compose this prediction engine.
Abstract: Predicting membrane protein type is a meaningful task because this kind of information is very useful to explain the function of membrane proteins. Due to the explosion of new protein sequences discovered, it is highly desired to develop efficient computation tools for quickly and accurately predicting the membrane type for a given protein sequence. Even though several membrane predictors have been developed, they can only deal with the membrane proteins which belong to the single membrane type. The fact is that there are membrane proteins belonging to two or more than two types. To solve this problem, a system for predicting membrane protein sequences with single or multiple types is proposed. Pseudo–amino acid composition, which has proven to be a very efficient tool in representing protein sequences, and a multilabel KNN algorithm are used to compose this prediction engine. The results of this initial study are encouraging.

61 citations


Journal ArticleDOI
TL;DR: The performance of the prediction models indicates that the proposed methods might be applied as a useful and efficient assistant tool for the prediction of sub-subcellular localizations.

50 citations


Journal ArticleDOI
TL;DR: This survey presents results concerning genetic sequences and Chou's pseudo amino acid composition as well as methodologies developed based on this concept along with elements of fuzzy set theory, and emphasizes fuzzy clustering and its application in the analysis of genetic sequences.
Abstract: The study of genetic sequences is of great importance in biology and medicine. Sequence analysis and taxonomy are two major fields of application of bioinformatics. In this survey, we present results concerning genetic sequences and Chou's pseudo amino acid composition as well as methodologies developed based on this concept along with elements of fuzzy set theory, and emphasize fuzzy clustering and its application in the analysis of genetic sequences.

Journal ArticleDOI
TL;DR: This paper proposes a method to create the 60-dimensional feature vector for protein sequences via the general form of pseudo amino acid composition and compares it with six other recently proposed alignment-free methods to show that the proposed method gives a more consistent biological relationship than the others.
Abstract: In this paper, we propose a method to create the 60-dimensional feature vector for protein sequences via the general form of pseudo amino acid composition. The construction of the feature vector is based on the contents of amino acids, total distance of each amino acid from the first amino acid in the protein sequence and the distribution of 20 amino acids. The obtained cosine distance metric (also called the similarity matrix) is used to construct the phylogenetic tree by the neighbour joining method. In order to show the applicability of our approach, we tested it on three proteins: 1) ND5 protein sequences from nine species, 2) ND6 protein sequences from eight species, and 3) 50 coronavirus spike proteins. The results are in agreement with known history and the output from the multiple sequence alignment program ClustalW, which is widely used. We have also compared our phylogenetic results with six other recently proposed alignment-free methods. These comparisons show that our proposed method gives a more consistent biological relationship than the others. In addition, the time complexity is linear and space required is less as compared with other alignment-free methods that use graphical representation. It should be noted that the multiple sequence alignment method has exponential time complexity.
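
The abstract above builds a pairwise matrix from feature vectors using a cosine distance before applying neighbour joining. The exact construction of the 60-dimensional vectors is not given here, so the sketch below covers only the distance step, taking "cosine distance" to mean one minus the cosine similarity (an assumption). The function names are hypothetical and the input vectors are arbitrary placeholders.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(vectors):
    """Symmetric matrix of pairwise cosine distances."""
    n = len(vectors)
    return [[cosine_distance(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    # Placeholder vectors; real input would be one feature vector per sequence.
    for row in distance_matrix([[1, 0], [0, 1], [2, 0]]):
        print(row)
```

A matrix of this form is the usual input to a neighbour joining implementation, which then produces the tree.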

Journal ArticleDOI
TL;DR: A novel method called auto covariance of averaged chemical shift (acACS) is proposed for extracting structure features from a sequence; combining dipeptide composition, reduced amino acid composition, evolutionary information, and acACS outperformed other feature extraction methods.

Journal ArticleDOI
TL;DR: A novel method predicts subchloroplast locations of proteins from tripeptide compositions, using the binomial distribution to optimize the feature sets; a predictor called ChloPred has been built, which will provide important information for theoretical and experimental research on chloroplast proteins.
Abstract: Chloroplasts are organelles found in plant cells that conduct photosynthesis. The subchloroplast locations of proteins are correlated with their functions. With the availability of a great amount of protein data, it is highly desirable to develop a computational method to predict the subchloroplast locations of chloroplast proteins. In this study, we proposed a novel method to predict subchloroplast locations of proteins using tripeptide compositions. It first used the binomial distribution to optimize the feature sets. Then the support vector machine was selected to perform the prediction of subchloroplast locations of proteins. The proposed method was tested on a reliable and rigorous dataset including 259 chloroplast proteins with sequence identity ≤ 25%. In the jack-knife cross-validation, 92.21% of envelope proteins, 93.20% of thylakoid membrane proteins, 52.63% of thylakoid lumen proteins and 85.00% of stroma proteins can be correctly identified. The overall accuracy reaches 88.03%, which is higher than that of other models. Based on this method, a predictor called ChloPred has been built and is freely available from http://cobi.uestc.edu.cn/people/hlin/tools/ChloPred/. The predictor will provide important information for theoretical and experimental research on chloroplast proteins.
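
The ChloPred abstract above says the binomial distribution was used to optimize the tripeptide feature set but gives no formula. One plausible reading, sketched here as an assumption, scores each tripeptide by how unlikely its count in a class would be if occurrences were spread across classes at random: with `n_i` total occurrences, `n_ij` of them in class j, and `q_j` the fraction of all tripeptides that fall in class j, the confidence is one minus the binomial upper tail. The function names are hypothetical.

```python
from math import comb


def binomial_confidence(n_ij, n_i, q_j):
    """1 - P(X >= n_ij) for X ~ Binomial(n_i, q_j); higher means more class-specific."""
    if not 0 <= n_ij <= n_i:
        raise ValueError("need 0 <= n_ij <= n_i")
    tail = sum(comb(n_i, m) * q_j ** m * (1 - q_j) ** (n_i - m)
               for m in range(n_ij, n_i + 1))
    return 1.0 - tail


def rank_features(counts, class_fraction):
    """Sort features by confidence; counts maps feature -> (count in class, total count)."""
    scored = {feat: binomial_confidence(n_in, n_all, class_fraction)
              for feat, (n_in, n_all) in counts.items()}
    return sorted(scored, key=scored.get, reverse=True)


if __name__ == "__main__":
    # Toy counts for illustration only, not data from the paper.
    toy = {"AAA": (9, 10), "CCC": (2, 10)}
    print(rank_features(toy, class_fraction=0.25))
```

Features would then be added to the classifier in ranked order until cross-validated accuracy stops improving; that stopping rule is likewise an assumption, not a detail stated in the abstract.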

Journal ArticleDOI
TL;DR: A new predictor, called Virus-ECC-mPLoc, has been developed that can be used to deal with the systems containing both singleplex and multiplex proteins by introducing a powerful multi-label learning approach which exploits correlations between subcellular locations and by hybridizing the gene ontology information with the dipeptide composition information.
Abstract: Protein subcellular localization aims at predicting the location of a protein within a cell using computational methods. Knowledge of subcellular localization of viral proteins in a host cell or virus-infected cell is important because it is closely related to their destructive tendencies and consequences. Prediction of viral protein subcellular localization is an important but challenging problem, particularly when proteins may simultaneously exist at, or move between, two or more different subcellular location sites. Most of the existing protein subcellular localization methods specialized for viral proteins can only deal with single-location proteins. To better reflect the characteristics of multiplex proteins, a new predictor, called Virus-ECC-mPLoc, has been developed that can be used to deal with the systems containing both singleplex and multiplex proteins by introducing a powerful multi-label learning approach which exploits correlations between subcellular locations and by hybridizing the gene ontology information with the dipeptide composition information. It can be utilized to identify viral proteins among the following six locations: (1) viral capsid, (2) host cell membrane, (3) host endoplasmic reticulum, (4) host cytoplasm, (5) host nucleus, and (6) secreted. Experimental results show that the overall success rates obtained by Virus-ECC-mPLoc are 86.9% for the jackknife test and 87.2% for the independent data set test, which are significantly higher than those of any of the existing predictors. As a user-friendly web-server, Virus-ECC-mPLoc is freely accessible to the public at the web-site http://levis.tongji.edu.cn:8080/bioinfo/Virus-ECC-mPLoc/.

Journal ArticleDOI
TL;DR: The result indicates that this classifier model can be used for identification of novel prokaryotic essential proteins.
Abstract: Prediction of essential proteins of a pathogenic organism is the key for the potential drug target identification, because inhibition of these would be fatal for the pathogen. Identification of these proteins requires the use of complex experimental techniques which are quite expensive and time consuming. We implemented Support Vector Machine algorithm to develop a classifier model for in silico prediction of prokaryotic essential proteins based on the physico-chemical properties of the amino acid sequences. This classifier was designed based on a set of 10 physico-chemical descriptor vectors (DVs) and 4 hybrid DVs calculated from amino acid sequences using PROFEAT and PseAAC servers. The classifier was trained using data sets consisting of 500 known essential and 500 non-essential proteins (n=1,000) and evaluated using an external validation set consisting of 3,462 essential proteins and 5,538 non-essential proteins (n=9,000). The performances of individual DV sets were evaluated. DV set 13, which is the combination of composition, transition and distribution descriptor set and hybrid autocorrelation descriptor set, provided accuracy of 91.2% in 10-fold cross-validation of the training set and an accuracy of 89.7% in external validation set and of 91.8% and 88.1% using a different yeast protein dataset. Our result indicates that this classification model can be used for identification of novel prokaryotic essential proteins.

Journal ArticleDOI
TL;DR: In this study, entropy from information theory was introduced as another predictive factor in the model and significantly improved the performance of the predictive method.

Journal ArticleDOI
01 Jan 2013-Proteins
TL;DR: The long-range and short-range contacts in proteins were used to derive an extended version of the pseudo amino acid composition based on a sliding window method, capable of predicting protein folding rates from the amino acid sequence alone without the aid of any structural class information.
Abstract: Protein folding is the process by which a protein proceeds from its denatured state to its specific biologically active conformation. Understanding the relationship between sequences and the folding rates of proteins remains an important challenge. Most previous methods of predicting protein folding rate require the tertiary structure of a protein as an input. In this study, the long-range and short-range contacts in proteins were used to derive an extended version of the pseudo amino acid composition based on a sliding window method. This method is capable of predicting protein folding rates from the amino acid sequence alone without the aid of any structural class information. We systematically studied the contributions of individual features to folding rate prediction. The optimal feature selection procedure combines forward feature selection with sequential backward selection. The method was demonstrated on a large dataset using the jackknife cross-validation test. The predictor was built from a multitude of physicochemical and statistical features of proteins using a nonlinear support vector machine (SVM) regression model, and it obtained an excellent agreement between predicted and experimentally observed folding rates of proteins. The correlation coefficient is 0.9313 and the standard error is 2.2692. The prediction server is freely available at http://www.jci-bioinfo.cn/swfrate/input.jsp.

Journal ArticleDOI
TL;DR: This work compared the amino acid compositions of plastid proteins with those of non-plastid ones and found significant differences, which were used as a basis to develop various feature-based prediction models using similarity-search and machine learning.
Abstract: Plastids are an important component of plant cells, being the site of manufacture and storage of chemical compounds used by the cell, and contain pigments such as those used in photosynthesis, starch synthesis/storage, cell color etc. They are essential organelles of the plant cell, also present in algae. Recent advances in genomic technology and sequencing efforts are generating a huge amount of DNA sequence data every day. The predicted proteome of these genomes needs annotation at a faster pace. In view of this, one such annotation need is to develop an automated system that can distinguish between plastid and non-plastid proteins accurately, and further classify plastid-types based on their functionality. We compared the amino acid compositions of plastid proteins with those of non-plastid ones and found significant differences, which were used as a basis to develop various feature-based prediction models using similarity-search and machine learning. In this study, we developed separate Support Vector Machine (SVM) trained classifiers for characterizing the plastids in two steps: first distinguishing the plastid vs. non-plastid proteins, and then classifying the identified plastids into their various types based on their function (chloroplast, chromoplast, etioplast, and amyloplast). Five diverse protein features are used to develop SVM models: amino acid composition, dipeptide composition, the pseudo amino acid composition, N-terminal-Center-C-terminal composition and the protein physicochemical properties. Overall, the dipeptide composition-based module shows the best performance with an accuracy of 86.80% and Matthews Correlation Coefficient (MCC) of 0.74 in phase-I and 78.60% with an MCC of 0.44 in phase-II. On independent test data, this model also performs better with an overall accuracy of 76.58% and 74.97% in phase-I and phase-II, respectively.
The similarity-based PSI-BLAST module shows very low performance with about 50% prediction accuracy for distinguishing plastid vs. non-plastids and only 20% in classifying various plastid-types, indicating the need and importance of machine learning algorithms. The current work is a first attempt to develop a methodology for classifying various plastid-type proteins. The prediction modules have also been made available as a web tool, PLpred available at http://bioinfo.okstate.edu/PLpred/ for real time identification/characterization. We believe this tool will be very useful in the functional annotation of various genomes.
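
The plastid abstract above reports both accuracy and the Matthews Correlation Coefficient (MCC). The sketch below computes both from the four confusion-matrix counts of a binary classifier using the standard definitions. Returning 0.0 when the MCC denominator is zero is a common convention adopted here as an assumption; the function names are hypothetical.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient, ranging from -1 to 1."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Toy counts for illustration only, not results from the paper.
    print(accuracy(40, 45, 5, 10), mcc(40, 45, 5, 10))
```

Unlike accuracy, MCC stays near zero for a classifier that always predicts the majority class, which is why the two are often reported together on imbalanced data.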

Journal ArticleDOI
TL;DR: A new apoptosis proteins localization algorithm, named PSSP, is proposed based on the predicted cleavage sites of primary protein sequences, which demonstrates that the total accuracies by this approach are comparable to existing methods.
Abstract: Apoptosis proteins play an essential role in the development and homeostasis of an organism. The accurate prediction of subcellular location for apoptosis proteins is helpful for understanding the mechanism of programmed cell death and their biological functions. In this article, a new apoptosis proteins localization algorithm, named PSSP, is proposed based on the predicted cleavage sites of primary protein sequences. First, protein chains are divided into N-terminal signal parts and mature protein parts according to their predicted cleavage sites by SignalP. Then, amino acid composition (ACC) of the individual subsequence together with pseudo-ACC and stereochemical properties of whole chain were extracted to represent a given protein sequence. Jackknife test by support vector machine on three broadly used datasets (ZD98, ZW225, and CL317 datasets) of apoptosis proteins demonstrated that the total accuracies by this approach are 93.9, 87.6, and 91.5%, respectively. In addition, an independent nonapoptosis benchmark dataset (NNPSL) was also used to evaluate the performance of this method, and predictive accuracies for eukaryotic and prokaryotic proteins are also comparable to existing methods. © 2013 Wiley Periodicals, Inc.

Book ChapterDOI
TL;DR: A hybrid feature extraction strategy is shown to be suitable to represent GPCRs and to be able to exploit GPCR amino acid sequence discrimination capability in spatial as well as transform domain.
Abstract: G-protein-coupled receptors (GPCRs) initiate signaling pathways via trimeric guanine nucleotide-binding proteins. GPCRs are classified based on their ligand-binding properties and molecular phylogenetic analyses. Nonetheless, these latter analyses are in most cases dependent on multiple sequence alignments, themselves dependent on human intervention and expertise. Alignment-free classifications of GPCR sequences, in addition to being unbiased, have many applications, from uncovering hidden physicochemical parameters shared among specific groups of receptors to being used in automated workflows for large-scale molecular modeling. Current alignment-free classification methods, however, do not reach full accuracy. This chapter discusses how GPCR amino acid sequences can be classified using pseudo amino acid composition and multiscale energy representation of different physicochemical properties of amino acids. A hybrid feature extraction strategy is shown to be suitable to represent GPCRs and to be able to exploit GPCR amino acid sequence discrimination capability in the spatial as well as the transform domain. Classification strategies such as support vector machine and probabilistic neural network are then discussed with regard to GPCR classification. The workings of the GPCR-Hybrid web predictor are also discussed.

Journal ArticleDOI
TL;DR: MitProt-Pred is developed that utilizes Bi-profile Bayes, Pseudo Average Chemical Shift, Split Amino Acid Composition, and Pseudo Amino acid Composition based features of the protein sequences to achieve significantly improved prediction performance for two standard datasets.

Journal ArticleDOI
TL;DR: Structural classification with the aid of a support vector machine (SVM) classifier on amino acid composition and pseudo amino acid composition features was applied in different variants to avoid redundancy and to ensure a maximal amount of available data.

Patent
25 Sep 2013
TL;DR: In this article, a membrane protein sub-cell positioning method based on complex space multi-view feature fusion is proposed, where features of pseudo amino acid composition of a protein sequence and features of a position-specific scoring matrix based on autocorrelation transform are extracted, and two kinds of features are combined into a feature vector in a complex space in a parallel mode.
Abstract: The invention discloses a membrane protein sub-cell positioning method based on complex space multi-view feature fusion. Firstly, features of pseudo amino acid composition of a protein sequence and features of a position-specific scoring matrix based on autocorrelation transform are extracted; secondly, the two kinds of features are combined into a feature vector in a complex space in a parallel mode; thirdly, dimension reduction is conducted on the complex features after parallel combination through general principal component analysis method so as to remove noise; finally, the fused features are classified through an optimization evidence theory based K nearest neighbor classifier, and the position of a sub-cell is determined. The membrane protein sub-cell positioning method has the advantages that the complex space multi-view feature fusion technology is adopted, so that the diagnostic features of the protein sequence are extracted effectively; the K nearest neighbor classifier based on the optimization evidence theory is used, so that the accuracy of the membrane protein sub-cell positioning is improved.

Journal ArticleDOI
TL;DR: In this paper, the occurrence frequencies of the 20 amino acids and a new numerical characteristic of a 2D graphical representation, based on three physicochemical property indexes, were taken into account as pseudo amino acid components.

Journal ArticleDOI
TL;DR: An ensemble classification approach is developed using K-nearest neighbor and Probabilistic Neural Network as the basic learning mechanisms; the success rate obtained on all the tests (self-consistency, jackknife, and independent dataset) is quite promising, indicating that the ensemble classifier may become a useful and high performance tool in identifying membrane proteins and their types.
Abstract: Predicting membrane protein types is an important and challenging research topic in current molecular and cellular biology. Knowledge of membrane protein types often provides crucial hints for determining the function of uncharacterized membrane proteins. It is thus highly desirable to develop an automated method that can serve as a high throughput tool in identifying the types of newly found membrane proteins from their primary sequence information only. In this paper, features are extracted from membrane protein sequences using pseudo-amino acid (PseAA) composition. An ensemble classification approach is developed using K-nearest neighbor and Probabilistic Neural Network as the basic learning mechanisms. Each basic classifier is trained using PseAA composition with different tiers. The success rate obtained by the ensemble classifier on all the tests (self-consistency, jackknife, and independent dataset) is quite promising, indicating that the ensemble classifier may become a useful and high performance tool in identifying membrane proteins and their types.

Proceedings ArticleDOI
07 Dec 2013
TL;DR: This study divides a protein sequence into two parts according to its N-terminal sorting signals, extracts pseudo amino acid composition features from each part, and uses multi-label KNN (ML-KNN) to handle proteins that have two, three or even more locations.
Abstract: Subcellular localization of a protein is an important attribute in bioinformatics, closely related to the protein's functions, signal transduction and biological processes. Great progress has been made in this research field in recent years. However, some shortcomings still exist in the prediction methods. For example, the extracted feature information is not complete enough to achieve a higher prediction accuracy, and some important protein information and the correlations within the amino acid sequence are usually ignored. Some proteins have not one location but two, three or even more, yet they have been treated as having only one. In this study, we divide a protein sequence into two parts according to its N-terminal sorting signals and extract pseudo amino acid composition features from each part. We then use multi-label KNN (ML-KNN) to handle the proteins that have two, three or even more locations. The results of the jackknife test are satisfactory.
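
The two-part feature extraction described above can be sketched as follows. This is only an illustration under stated assumptions: the cut position `cut` is a hypothetical parameter (the abstract does not say how the N-terminal sorting signal boundary is found), plain amino acid composition stands in for the pseudo amino acid composition features, and the ML-KNN classifier is not shown.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(segment):
    """Occurrence frequency of each of the 20 standard amino acids."""
    n = len(segment)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [segment.count(a) / n for a in AMINO_ACIDS]

def split_features(seq, cut):
    """Describe the N-terminal part and the remainder separately,
    then concatenate the two descriptors into one 40-dimensional vector."""
    seq = seq.upper()
    cut = max(0, min(cut, len(seq)))   # keep the cut inside the sequence
    return composition(seq[:cut]) + composition(seq[cut:])
```

For instance, `split_features(sequence, 30)` would describe the first 30 residues and the remainder separately, so that composition differences between the signal region and the mature protein are not averaged away.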

Book ChapterDOI
19 Dec 2013
TL;DR: Chou's pseudo amino acid composition, along with the amphiphilic correlation factor and the spectral characteristics of the protein, has been used to represent protein data and create a statistical framework for structural class prediction.
Abstract: During the last few decades, accurate prediction of protein structural class has been a challenging problem. An efficient and meaningful representation of the protein molecule plays a significant role. In this paper, Chou's pseudo amino acid composition, along with the amphiphilic correlation factor and the spectral characteristics of the protein, has been used to represent protein data. A protein sample is thus represented by a set of discrete components that incorporate both the sequence order and the sequence length effects. On the basis of this statistical framework, a simple functionally linked artificial neural network has been used for structural class prediction.
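
To make the "set of discrete components" concrete, the following is a minimal sketch of a type 1 pseudo amino acid composition vector: 20 normalized occurrence frequencies followed by `lam` sequence-order correlation factors. It is not the chapter's implementation. The function name `pseaac`, the defaults `lam=3` and `w=0.05`, and the use of a single property (the Kyte-Doolittle hydropathy scale, standardized over the 20 amino acids) are assumptions of this example; the amphiphilic correlation factors and spectral features mentioned above are not included.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values, used here as the only property (assumption).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def pseaac(seq, lam=3, w=0.05):
    """Return a (20 + lam)-dimensional pseudo amino acid composition vector."""
    seq = seq.upper()
    if len(seq) <= lam:
        raise ValueError("sequence must be longer than lam")
    # Standardize the property to zero mean and unit deviation over the 20 residues.
    vals = np.array([KD[a] for a in AMINO_ACIDS])
    h = dict(zip(AMINO_ACIDS, (vals - vals.mean()) / vals.std()))
    # Non-standard residue letters raise KeyError in this sketch.
    x = np.array([h[r] for r in seq])
    freq = np.array([seq.count(a) for a in AMINO_ACIDS], dtype=float) / len(seq)
    # theta_k: mean squared property difference between residues k positions apart.
    theta = np.array([np.mean((x[:-k] - x[k:]) ** 2) for k in range(1, lam + 1)])
    denom = freq.sum() + w * theta.sum()
    return np.concatenate([freq, w * theta]) / denom
```

The first 20 components carry the conventional composition, while the last `lam` components carry sequence-order information; the weight `w` controls how much the order terms contribute relative to the composition terms.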

Journal ArticleDOI
TL;DR: A 20-dimension CGR-walk mode is proposed for representing protein samples, and the comparison results indicate that the present method may at least serve as an alternative to the existing predictors in this field.
Abstract: Information on the subcellular localization of proteins plays a vitally important role in molecular cell biology, proteomics and drug discovery. In this field, finding the most suitable representation for a protein sample is one of the most crucial procedures. Inspired by the modes of pseudo amino acid composition (PAA), the cellular automaton image (CAI) for proteins and the chaos game representation (CGR) for DNA sequences, a 20-dimension CGR-walk mode for representing protein samples is proposed. In the proposed model, the sequence order effect is discussed and manifested as a point in the 20-dimension space. The track of the protein sample is then projected onto each of the twenty amino acids; in other words, a protein sample is expressed by a 20-dimension vector. After this preparation work, the proposed mode is applied to four protein datasets. The comparison results indicate that the present method may at least serve as an alternative to the existing predictors in this field.

01 Jan 2013
TL;DR: Chou's pseudo amino acid composition along with the amphiphilic correlation factor has been used to represent protein data, and a simple functionally linked artificial neural network has been used for structural class prediction.
Abstract: During the last few decades, accurate prediction of protein structural class has been a challenging problem. An efficient and meaningful representation of the protein molecule plays a significant role. In this paper, Chou's pseudo amino acid composition along with the amphiphilic correlation factor has been used to represent protein data. A simple functionally linked artificial neural network has been used for structural class prediction.