Predicting antimicrobial peptides with improved accuracy by incorporating the compositional, physico-chemical and structural features into Chou's general PseAAC.

TL;DR: This study developed a support vector machine (SVM) based computational approach for prediction of AMPs and achieved higher accuracy than several existing approaches when compared on a benchmark dataset.
Abstract: Antimicrobial peptides (AMPs) are important components of the innate immune system that have been found to be effective against disease-causing pathogens. Identification of AMPs through wet-lab experiments is expensive. Therefore, development of an efficient computational tool is essential to identify the best candidate AMPs prior to in vitro experimentation. In this study, we made an attempt to develop a support vector machine (SVM) based computational approach for prediction of AMPs with improved accuracy. Initially, compositional, physico-chemical and structural features of the peptides were generated, which were subsequently used as input to the SVM for prediction of AMPs. The proposed approach achieved higher accuracy than several existing approaches when compared using a benchmark dataset. Based on the proposed approach, an online prediction server, iAMPpred, has also been developed to help the scientific community in predicting AMPs; it is freely accessible at http://cabgrid.res.in:8080/amppred/. The proposed approach is believed to supplement the tools and techniques that have been developed in the past for prediction of AMPs.

Scientific REPORTS | 7:42362 | DOI: 10.1038/srep42362
www.nature.com/scientificreports
Prabina Kumar Meher¹, Tanmaya Kumar Sahu², Varsha Saini²,³ & Atmakuri Ramakrishna Rao²
¹Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi-110012, India.
²Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi-110012, India.
³Department of Bioinformatics, Janta Vedic College, Baraut, Baghpat-250611, Uttar Pradesh, India.
Correspondence and requests for materials should be addressed to A.R.R. (email: rao.cshl.work@gmail.com)
Received: 24 October 2016. Accepted: 09 January 2017. Published: 13 February 2017.
Antimicrobial peptides (AMPs) are important innate immune molecules that have been found to be effective against several pathogenic micro-organisms such as bacteria, viruses, fungi and parasites [1]. AMPs constitute the first line of host defense against microbes [2], killing them by disrupting either the cell membrane or intracellular functions [3,4]. Due to the growing resistance of microbial pathogens to chemical antibiotics, AMPs have received attention as an alternative in recent years [5]. Specifically, owing to their broad spectrum of activity and low propensity for developing resistance, AMPs are gaining popularity in clinical applications [6].
Sequence-based computational tools can help in designing effective antimicrobial agents by identifying the best candidate AMPs prior to synthesis and testing against pathogens in the wet lab [7]. In this direction, computational tools such as AntiBP [1], AMPER [8], CAMP [3], AntiBP2 [9], AVPpred [10], ClassAMP [11], iAMP-2L [7] and EFC-FCBF [12] have been developed for the prediction of AMPs. Binary (0, 1) and compositional features were used in AntiBP and AntiBP2, respectively, to map the peptide sequences onto numeric feature vectors, which were used as input to an artificial neural network (ANN) [13] and a support vector machine (SVM) [14], respectively, for prediction of antibacterial peptides. In CAMP, random forest (RF) [15], SVM and ANN supervised learning techniques were employed for prediction of AMPs, based on different physico-chemical (PHYC) features of peptides. In AVPpred, four different models, viz. AVPmotif, AVPalign, AVPcompo and AVPphysico, were developed for prediction of antiviral peptides only. The ClassAMP [11] tool was developed for predicting the propensity of a peptide sequence to be an antibacterial, antiviral or antifungal peptide, using SVM and RF machine learning techniques. In another study, a two-level multi-class predictor was developed for identification of AMPs, based on Chou's pseudo amino acid composition [16] and fuzzy k-nearest neighbor [7]. Recently, Veltri et al. [12] developed a machine learning based computational approach for improved recognition of AMPs.
The above-mentioned methods have their own advantages in generating knowledge for the prediction of AMPs. However, further improvement in prediction accuracy is required to minimize the number of false positives. In this study, we made an attempt to develop a computational approach for prediction of antibacterial, antiviral and antifungal peptides with higher accuracy. In this approach, combinations of compositional, PHYC and structural (STRL) features were used to map the peptide sequences onto numeric feature vectors, which were subsequently used as input to the SVM for prediction. The proposed approach was found to perform better than several existing approaches for predicting AMPs when compared on a benchmark dataset.
Material and Methods
As summarized and demonstrated by a series of recent publications [17–22], in compliance with Chou's 5-step rule [23], to establish a really useful sequence-based statistical predictor for a biological system, the following five guidelines should be followed: (a) construct or select a valid benchmark dataset to train and test the predictor; (b) formulate the biological sequence samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (c) introduce or develop a powerful algorithm (or engine) to operate the prediction; (d) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; (e) establish a user-friendly web-server for the predictor that is freely accessible to the public. The following sections describe how we dealt with these steps one by one.
Dataset. Positive. To construct the positive dataset, antibacterial, antiviral and antifungal peptide sequences were collected from publicly available databases (or datasets). Specifically, antibacterial peptides were collected from CAMP, APD3 [24] and AntiBP2; antiviral peptides from CAMP, APD3, LAMP [25] and AVPpred; and antifungal peptides from CAMP, LAMP and APD3. Sequences containing non-standard amino acids were removed, followed by removal of redundant sequences, as in earlier studies [7,12,26]. Since AMPs are mostly 10–100 amino acids long [1], sequences with fewer than 10 amino acids were also excluded from further analysis. A summary of the positive datasets is given in Table 1.
Negative. The non-antibacterial and non-antiviral peptides were collected from AntiBP2 and AVPpred, respectively, and were used as the negative datasets against the antibacterial and antiviral peptides, respectively. Further, these non-antibacterial and non-antiviral peptides were pooled to form the negative dataset against the antifungal peptides. The sequences of the negative datasets were processed in the same way as those of the positive datasets. A summary of the negative datasets is also given in Table 1.
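The processing steps described above (dropping sequences with non-standard residues, removing redundant sequences and excluding peptides shorter than 10 residues) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `clean_peptides` and the example sequences are hypothetical, and redundancy is assumed here to mean exact duplicates, which the text does not specify.

```python
# The 20 standard amino acids in one-letter code.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def clean_peptides(sequences, min_length=10):
    """Keep unique, standard-residue sequences with at least min_length residues.

    Assumption: "redundant" is treated as an exact duplicate after upper-casing.
    """
    seen = set()
    kept = []
    for seq in sequences:
        seq = seq.strip().upper()
        if len(seq) < min_length:
            continue  # shorter than the minimum peptide length
        if not set(seq) <= STANDARD_AA:
            continue  # contains a non-standard residue such as X or B
        if seq in seen:
            continue  # exact duplicate of a sequence already kept
        seen.add(seq)
        kept.append(seq)
    return kept


# Hypothetical input: a duplicate, a sequence containing X, and a short sequence.
raw = [
    "GLFDIVKKVVGALGSL",
    "glfdivkkvvgalgsl",
    "KWKLFKKIXK",
    "ACDEF",
    "KWKLFKKIEKVGQNIRDG",
]
print(clean_peptides(raw))
```

Only the first and last example sequences survive the three filters. A similarity-based redundancy filter would need a clustering tool and is outside the scope of this sketch.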
Feature generation. Since peptide sequences are strings of amino acids, they need to be mapped onto numeric feature vectors before being used as input to supervised learning classifiers. In this study, three different categories of features, i.e., compositional, PHYC and STRL, were considered. In particular, 3 compositional (amino acid composition, AAC; pseudo amino acid composition, PAAC; and normalized amino acid composition, NAAC), 3 PHYC (hydrophobicity, net-charge and iso-electric point) and 3 STRL (α-helix propensity, β-sheet propensity and turn propensity) features were considered (Table 2) for prediction of AMPs. The compositional and PHYC features were computed using the "Peptide" package [27] of R software [28], whereas the STRL features were computed using the TANGO software [29] available at http://tango.crg.es/. The TANGO server was first used by Torrent et al. [30] for recognition of AMPs. Furthermore, to assess the importance of each feature in predicting the antibacterial, antiviral and antifungal peptides, information gain was computed for all 66 features [AAC (20) + PAAC (20) + NAAC (20) + PHYC (3) + STRL (3)]. To compute the information gain, the InfoGainAttributeEval function available in the RWeka [31] package was used.
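To make the mapping from a peptide string to a numeric vector concrete, the sketch below computes a 20-dimensional amino acid composition in Python. It is an illustration only: the study used an R package, and the function name `amino_acid_composition`, the residue ordering and the choice of fractions (rather than percentages) are assumptions made here.

```python
from collections import Counter

# Fixed residue order so that every peptide maps to the same 20 positions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Return the fraction of each standard amino acid in seq (20 values)."""
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq.upper())
    n = len(seq)
    return [counts[aa] / n for aa in AMINO_ACIDS]


# Hypothetical peptide: half alanine, half lysine.
vector = amino_acid_composition("AAKK")
print(len(vector), vector[AMINO_ACIDS.index("A")], vector[AMINO_ACIDS.index("K")])
```

Vectors of this kind for AAC, PAAC and NAAC, concatenated with the three PHYC and three STRL values, would give the 66 attributes counted in the text.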
SVM-based prediction. We used SVM for prediction of AMPs because it is non-parametric (it makes no assumption about the underlying probability distribution of the input dataset) and is among the most widely used supervised learning techniques in bioinformatics, owing to its sound statistical background [32]. The predictive ability of SVM depends mainly upon the type of kernel function that maps the input data to a high-dimensional feature space, where observations belonging to different classes are linearly separable by an optimal separating hyperplane. In this work, the radial basis function (RBF) was used as the kernel, due to its wide and successful application in most AMP prediction studies [1,9–10,33]. Further, for the RBF kernel, the default values of the parameters gamma (gamma = 1/number of attributes) and cost (C = 1) were used to train and test the prediction model. The svm function available in the e1071 package [34] of R software was used to execute the SVM model. The scaling option was kept as TRUE in the svm function while training the model.
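As an illustration of the kernel setting described above, the following Python sketch evaluates the RBF kernel, K(x, y) = exp(−gamma · ||x − y||²), with gamma defaulting to 1/number of attributes. It is not the e1071 implementation; the function name `rbf_kernel` and the toy vectors are hypothetical, and feature scaling is assumed to have been applied beforehand.

```python
import math


def rbf_kernel(x, y, gamma=None):
    """RBF kernel value for two equal-length feature vectors.

    If gamma is None, it defaults to 1 / number of attributes,
    mirroring the default described in the text.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if gamma is None:
        gamma = 1.0 / len(x)
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Identical vectors give 1.0; the value decays as the vectors move apart.
print(rbf_kernel([0.2, 0.5], [0.2, 0.5]))
print(rbf_kernel([1.0, 0.0], [0.0, 0.0]))
```

With 66 attributes the default gamma would be 1/66, so the kernel decays slowly with distance unless the features are on large scales, which is one reason scaling matters.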
Dataset | Bacterial | Viral | Fungal
Positive | CAMP [3], APD3 [24], AntiBP2 [9] {3417} | CAMP, APD3, LAMP [25], AVPpred [10] {739} | CAMP, LAMP, APD3 {1496}
Negative | AntiBP2 {984} | AVPpred {893} | AntiBP2, AVPpred {1384}
Table 1. Summary of the positive and negative datasets. The value inside braces {} is the number of sequences collected in that category.

Performance evaluation. We considered different performance metrics, viz. sensitivity (Sn), specificity (Sp), accuracy (Ac) and Matthews correlation coefficient (MCC), to evaluate the performance of the proposed approach. Since the conventional formulae of these metrics are not quite intuitive, particularly that of MCC, Chen et al. [35] derived a new set of equations for these metrics based on Chou's symbols used in studying protein signal peptide cleavage sites [36]. The new formulae for these metrics are given in equation (1):
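$$
\begin{cases}
\mathrm{Sn} = \left(1 - \dfrac{N^{+}_{-}}{N^{+}}\right) \times 100 \\[8pt]
\mathrm{Sp} = \left(1 - \dfrac{N^{-}_{+}}{N^{-}}\right) \times 100 \\[8pt]
\mathrm{Ac} = \left(1 - \dfrac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}}\right) \times 100 \\[8pt]
\mathrm{MCC} = \dfrac{1 - \left(\dfrac{N^{+}_{-}}{N^{+}} + \dfrac{N^{-}_{+}}{N^{-}}\right)}{\sqrt{\left(1 + \dfrac{N^{-}_{+} - N^{+}_{-}}{N^{+}}\right)\left(1 + \dfrac{N^{+}_{-} - N^{-}_{+}}{N^{-}}\right)}}
\end{cases}
\tag{1}
$$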
where $N^{+}$ represents the total number of AMPs investigated, $N^{+}_{-}$ represents the number of AMPs incorrectly predicted as non-AMPs, $N^{-}$ represents the total number of non-AMPs investigated and $N^{-}_{+}$ represents the number of non-AMPs incorrectly predicted as AMPs. The formulae given in equation (1) make the meaning of Sn, Sp, Ac and MCC much more intuitive and easier to understand, particularly for MCC, as concurred by a series of recently published studies [19–20,37–41]. These formulae are valid only for single-label systems; for multi-label systems, which have become more frequent in system biology [42–43] and system medicine [22,44–45], a different set of metrics is needed, as elaborated in Chou [46].
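The four metrics of equation (1) depend only on four counts, so they can be evaluated directly. The Python sketch below does so and checks the MCC value against the conventional confusion-matrix formula; the function names and the example counts are hypothetical and serve only to illustrate the arithmetic.

```python
import math


def chou_metrics(n_pos, n_pos_missed, n_neg, n_neg_missed):
    """Sn, Sp, Ac (in %) and MCC from the counts used in equation (1).

    n_pos        : total number of AMPs investigated
    n_pos_missed : AMPs incorrectly predicted as non-AMPs
    n_neg        : total number of non-AMPs investigated
    n_neg_missed : non-AMPs incorrectly predicted as AMPs
    """
    sn = (1 - n_pos_missed / n_pos) * 100
    sp = (1 - n_neg_missed / n_neg) * 100
    ac = (1 - (n_pos_missed + n_neg_missed) / (n_pos + n_neg)) * 100
    numerator = 1 - (n_pos_missed / n_pos + n_neg_missed / n_neg)
    denominator = math.sqrt(
        (1 + (n_neg_missed - n_pos_missed) / n_pos)
        * (1 + (n_pos_missed - n_neg_missed) / n_neg)
    )
    return sn, sp, ac, numerator / denominator


def conventional_mcc(tp, tn, fp, fn):
    """MCC from the usual confusion-matrix entries, for comparison."""
    return (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )


# Hypothetical counts: 100 AMPs (10 missed) and 100 non-AMPs (5 misclassified).
sn, sp, ac, mcc = chou_metrics(100, 10, 100, 5)
print(round(sn, 2), round(sp, 2), round(ac, 2), round(mcc, 4))
print(round(conventional_mcc(tp=90, tn=95, fp=5, fn=10), 4))
```

The two MCC values agree because the true positives equal the AMPs investigated minus those missed, and the true negatives equal the non-AMPs investigated minus those misclassified. The denominator is zero when every sequence is assigned to one class, a case this sketch does not handle.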
Training and validation. With an unbalanced dataset (i.e., the numbers of AMPs and non-AMPs are not the same), a machine learning based classifier may produce results biased towards the major class [47] (the class with more sequences). Therefore, the number of sequences of the major class was kept the same as the number of sequences in the minor class to train the prediction model effectively. Here, sequences of the major class were randomly drawn from the available sequences. Since one random set from the major class may not be adequate to judge the generalized predictive ability of the classifier, one thousand random samples (drawn without replacement from the major class) were used. Further, in each sample (consisting of AMPs and non-AMPs), a 10-fold cross-validation [48] procedure was employed to assess the performance of the predictor. Furthermore, to assess the impact of dataset size (number of sequences), three datasets with different sample sizes were used (Table 3).
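The balancing and cross-validation scheme described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' R code: the function names, the toy sequence labels and the fixed random seed are hypothetical, and only one of the one thousand random samples is drawn.

```python
import random


def balanced_sample(major, minor, rng):
    """Draw, without replacement, as many major-class items as the minor class has."""
    return rng.sample(major, len(minor)), list(minor)


def k_fold_indices(n_items, k, rng):
    """Shuffle the indices 0..n_items-1 and deal them into k disjoint folds."""
    indices = list(range(n_items))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
positives = [f"pos{i}" for i in range(30)]   # hypothetical major class
negatives = [f"neg{i}" for i in range(12)]   # hypothetical minor class

sampled_major, minor = balanced_sample(positives, negatives, rng)
folds = k_fold_indices(len(sampled_major) + len(minor), 10, rng)

# Each fold serves once as the test set while the other nine train the model.
for test_fold in folds:
    train = [i for fold in folds if fold is not test_fold for i in fold]
    assert len(train) + len(test_fold) == 24
print(len(sampled_major), len(minor), [len(f) for f in folds])
```

Repeating the `balanced_sample` step many times and averaging the per-sample metrics would yield the mean and standard error of the kind reported in the tables.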
Comparison with existing methods. Performance of the proposed approach was compared with that of the latest AMP prediction tools, viz. CAMP [3], iAMP-2L [7], EFC-FCBF [12] and EFC + 307-FCBF [12]. The comparison was made using the Xiao et al. benchmark dataset [7] (http://www.jci-bioinfo.cn/iAMP/data.html). In this dataset, the training set contains 770 antibacterial peptides and 2405 non-AMPs, and the test set contains 920 AMPs and 920 non-AMPs. The same datasets were used by Veltri et al. [12] to evaluate the performance of the EFC-FCBF and EFC + 307-FCBF approaches. Further, performances of the methods were compared in terms of area under the receiver operating characteristic curve [49] (AUC-ROC), area under the precision-recall curve [50] (AUC-PR) and MCC. For a binary classifier, recall is the same as Sn (as defined in equation (1)) and precision is defined as $(N^{+} - N^{+}_{-})/(N^{+} - N^{+}_{-} + N^{-}_{+})$.
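For readers who want to see how two of these comparison measures are obtained, the Python sketch below computes precision from the counts defined for equation (1) and AUC-ROC as the probability that a randomly chosen AMP receives a higher score than a randomly chosen non-AMP. The function names and the toy scores are hypothetical; how the cited tools computed their curves is not described in the text.

```python
def precision(n_pos, n_pos_missed, n_neg_missed):
    """Correctly predicted AMPs divided by all sequences predicted as AMPs."""
    true_positives = n_pos - n_pos_missed
    return true_positives / (true_positives + n_neg_missed)


def auc_roc(scores, labels):
    """AUC-ROC via pairwise comparison of positive and negative scores.

    labels use 1 for AMP and 0 for non-AMP; tied scores count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predicted probabilities for two AMPs and two non-AMPs.
print(auc_roc([0.9, 0.8, 0.7, 0.4], [1, 0, 1, 0]))
print(precision(n_pos=100, n_pos_missed=10, n_neg_missed=5))
```

The pairwise form is quadratic in the number of sequences, which is acceptable for a test set of this size; a rank-based formulation gives the same value more efficiently.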
Development of prediction server. An online prediction server was also developed using hypertext markup language (HTML) and hypertext preprocessor (PHP), in which R code is executed in the backend upon submission of peptide sequences in FASTA format. The user can submit single or multiple sequences containing only standard amino acid residues. This web server can be used to predict the probabilities with which a candidate peptide sequence can be classified into the antiviral, antibacterial and antifungal categories.
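The input handling implied above (one or more FASTA records, standard residues only) can be sketched in Python as follows. The server itself is described as PHP with an R backend, so this is only an illustration of the input format; the function names and the example records are hypothetical.

```python
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.upper())  # sequence lines may wrap
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def has_only_standard_residues(seq):
    """True when seq is non-empty and uses only the 20 standard residues."""
    return bool(seq) and set(seq) <= STANDARD_RESIDUES


# Hypothetical submission with one valid and one invalid peptide.
submission = ">pep1\nGLFDIV\nKKVV\n>pep2\nkwklx\n"
for name, seq in parse_fasta(submission):
    print(name, seq, has_only_standard_residues(seq))
```

A record failing the residue check would be rejected before any features are computed, since composition-based features are defined only over the standard residues.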
Feature category | Features in each category | #Features
Compositional | Amino acid composition (AAC) | 20
Compositional | Pseudo AAC (PAAC) | 20
Compositional | Normalized AAC (NAAC) | 20
Structural (STRL) | α-helix propensity | 1
Structural (STRL) | β-sheet propensity | 1
Structural (STRL) | Turn propensity | 1
Physico-chemical (PHYC) | Iso-electric point | 1
Physico-chemical (PHYC) | Hydrophobicity | 1
Physico-chemical (PHYC) | Net-charge | 1
Table 2. Summary of the feature sets.

Results
Performance analysis for predicting the antibacterial peptides. Three different sample sizes (100, 500 and 983) were used for prediction of antibacterial peptides. Prediction accuracies for the sample size 983 are given in Table 4, whereas those for the sample sizes 100 and 500 are provided in Supplementary Table S1. The prediction accuracies are more precise (lower standard error) for the sample size 983 than for the sample sizes 100 and 500. Further, lower prediction accuracies are observed with the compositional features alone, whereas increments of 2–6%, ~1%, 2–4% and 4–5% in sensitivity, specificity, accuracy and MCC, respectively, are observed when the compositional, PHYC and STRL features are used together (Table 4 and Supplementary Table S1).
Performance analysis for predicting the antiviral peptides. For the sample size 738, performance metrics of the proposed approach in predicting the antiviral peptides are given in Table 5, whereas those for the sample sizes 100 and 500 are provided in Supplementary Table S2. The prediction models based on the sample size 738 are more stable (lower standard error) than those based on the sample sizes 100 and 500. As with antibacterial peptides, lower prediction accuracies are observed when only compositional features are used, whereas sensitivity, specificity, accuracy and MCC increase by 1–3%, 1%, ~1% and 1–3%, respectively, when all three feature categories are used together (Table 5 and Supplementary Table S2). Besides, the accuracies in predicting the antiviral peptides are lower than those for the antibacterial peptides.
Dataset | Bacterial #ABP | Bacterial #nonABP | Viral #AVP | Viral #nonAVP | Fungal #AFP | Fungal #nonAFP
1st set | 100 | 100 | 100 | 100 | 100 | 100
2nd set | 500 | 500 | 500 | 500 | 500 | 500
3rd set | 983 | 983 | 738 | 738 | 1383 | 1383
Table 3. Number of sequences present (sample size) in three different datasets used for prediction of antibacterial, antiviral and antifungal peptides. #ABP: Number of antibacterial peptides, #nonABP: Number of non-antibacterial peptides, #AVP: Number of antiviral peptides, #nonAVP: Number of non-antiviral peptides, #AFP: Number of antifungal peptides, #nonAFP: Number of non-antifungal peptides. In all the cases the instances were randomly drawn (without replacement) from the available number of instances present in the respective classes.
Features | Sn ± SE | Sp ± SE | Ac ± SE | MCC ± SE
AAC + PAAC | 91.16 ± 0.71 | 93.41 ± 0.49 | 92.29 ± 0.36 | 0.85 ± 0.007
AAC + NAAC | 91.29 ± 0.79 | 93.44 ± 0.49 | 92.37 ± 0.45 | 0.85 ± 0.009
PAAC + NAAC | 91.29 ± 0.65 | 93.37 ± 0.51 | 92.33 ± 0.37 | 0.85 ± 0.007
AAC + PAAC + NAAC | 91.35 ± 0.69 | 93.48 ± 0.52 | 92.41 ± 0.41 | 0.85 ± 0.008
AAC + PAAC + PHYC + STRL | 93.81 ± 0.55 | 94.96 ± 0.40 | 94.39 ± 0.35 | 0.89 ± 0.007
AAC + NAAC + PHYC + STRL | 93.87 ± 0.61 | 94.85 ± 0.39 | 94.36 ± 0.36 | 0.89 ± 0.007
PAAC + NAAC + PHYC + STRL | 93.86 ± 0.65 | 94.91 ± 0.38 | 94.39 ± 0.35 | 0.89 ± 0.007
AAC + PAAC + NAAC + PHYC + STRL | 93.85 ± 0.59 | 94.98 ± 0.36 | 94.69 ± 0.38 | 0.89 ± 0.008
Table 4. Performance metrics of SVM in predicting antibacterial peptides for the sample size 983. SE: Standard Error.
Features | Sn ± SE | Sp ± SE | Ac ± SE | MCC ± SE
AAC + PAAC | 85.60 ± 0.56 | 90.72 ± 0.61 | 88.16 ± 0.38 | 0.76 ± 0.008
AAC + NAAC | 85.42 ± 0.58 | 90.59 ± 0.69 | 88.00 ± 0.41 | 0.76 ± 0.008
PAAC + NAAC | 85.47 ± 0.61 | 90.68 ± 0.59 | 88.08 ± 0.40 | 0.76 ± 0.008
AAC + PAAC + NAAC | 85.49 ± 0.61 | 90.77 ± 0.62 | 88.13 ± 0.40 | 0.76 ± 0.008
AAC + PAAC + PHYC + STRL | 88.67 ± 0.56 | 91.49 ± 0.68 | 90.08 ± 0.42 | 0.80 ± 0.008
AAC + NAAC + PHYC + STRL | 88.46 ± 0.59 | 91.57 ± 0.64 | 90.01 ± 0.39 | 0.80 ± 0.008
PAAC + NAAC + PHYC + STRL | 88.69 ± 0.59 | 91.49 ± 0.57 | 90.09 ± 0.34 | 0.80 ± 0.007
AAC + PAAC + NAAC + PHYC + STRL | 88.65 ± 0.65 | 91.42 ± 0.67 | 90.08 ± 0.40 | 0.80 ± 0.008
Table 5. Performance metrics of SVM in predicting antiviral peptides for the sample size 738. SE: Standard Error.

Performance analysis for predicting the antifungal peptides. For antifungal peptides, prediction accuracies for the sample size 1383 are given in Table 6 and those for the sample sizes 100 and 500 are provided in Supplementary Table S3. The accuracies are more precise for the sample size 1383 than for the sample sizes 100 and 500. As with antibacterial and antiviral peptides, accuracies decrease for all the sample sizes when the PHYC and STRL features are not included in prediction. In particular, sensitivity, specificity, accuracy and MCC increase by 1–2%, ~1%, ~1% and 1–2%, respectively, when compositional features are used along with the PHYC and STRL features (Table 6 and Supplementary Table S3). Furthermore, the accuracies for predicting the antifungal peptides are higher than those for antiviral peptides and lower than those for antibacterial peptides.
Feature importance. Based on the top model (AAC + PAAC + NAAC + STRL + PHYC), information gain for all the features was computed using the largest sample size; the values are shown in Fig. 1. From the figure, it can be seen that the values of information gain are almost the same for the AAC and NAAC features. Further, the information gain is highest for net-charge, followed by iso-electric point, in predicting the antibacterial and antifungal peptides. On the other hand, the highest information gain is observed for the composition of amino acid C in predicting the antiviral peptides. Furthermore, the STRL features are found to be less important (lower information gain) than the PHYC features and several compositional features. In particular, values of information gain are ≥ 0.05 for the compositions of amino acids K, E, G, P, C and I in the case of antibacterial and antifungal peptides, and for those of R, K, W, S, T, P, H, C and I in the case of antiviral peptides. Besides, values of information gain are close to zero for the compositions of amino acids {N, W, V, L, M, F, H, Y}, {N, E, L, F} and {A, Y, N} in predicting the antibacterial, antiviral and antifungal peptides, respectively. The values of information gain for the other amino acids lie between 0 and 0.05.
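Information gain measures how much the class entropy drops once a feature value is known. The Python sketch below illustrates the idea for a numeric feature split at a single threshold. This is a deliberate simplification and an assumption of this example: the RWeka function used in the study chooses its own discretization of numeric attributes, and the function names, threshold and toy data here are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Class entropy minus the entropy remaining after splitting at threshold."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    remaining = sum(
        len(part) / n * entropy(part) for part in (left, right) if part
    )
    return entropy(labels) - remaining


# Hypothetical net-charge values: AMPs (label 1) carry the higher charges.
charge = [-1.0, 0.0, 3.0, 5.0]
is_amp = [0, 0, 1, 1]
print(information_gain(charge, is_amp, threshold=1.0))

# A feature unrelated to the class gives a gain of zero.
print(information_gain([1.0, 2.0, 3.0, 4.0], [0, 1, 0, 1], threshold=2.5))
```

In this toy case the charge feature separates the classes perfectly, so its gain equals the full class entropy of one bit, whereas the uninformative feature leaves the entropy unchanged.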
Performance analysis for predicting the AMPs. For prediction of AMPs in general, the positive dataset of AMPs was constructed by combining the antibacterial, antiviral and antifungal peptides, whereas the negative dataset (non-AMPs) was constructed by combining the non-antibacterial and non-antiviral peptides collected from AntiBP2 and AVPpred, respectively. Besides, AMPs available in LAMP were also included in the positive dataset. Finally, a dataset consisting of 5155 AMPs and 1384 non-AMPs was prepared. As with antibacterial, antiviral and antifungal peptides, prediction of AMPs was made with three different sample sizes, i.e., 100, 500 and 1383. Moreover, the prediction was made only for the AAC + PAAC + PHYC + STRL and PAAC + NAAC + PHYC + STRL feature combinations, as slightly higher accuracies were obtained with these combinations in the earlier predictions. The values of the different performance metrics (averaged over 10 folds) are given in Table 7. From the table, it is seen that the sensitivity, specificity and accuracy are > 90% for all the sample sizes.
In addition, the performance of SVM with the above-mentioned feature sets was also assessed using the Xiao benchmark training dataset, based on three different sample sizes (100, 500 and 769). The values of the different performance metrics (averaged over 10 folds) are given in Table 8. From the table, it is observed that the sensitivity, specificity and accuracy are ~94%, whereas MCC is ~88%. It is further seen that the prediction accuracies are more precise (lower standard error) for the sample size 769.
Comparative analysis. To further assess its predictive ability relative to existing approaches, the performance of SVM with the PAAC + NAAC + PHYC + STRL feature set (which we call iAMPpred) was compared with the performances of the latest AMP prediction tools using the Xiao benchmark dataset [7]. The results are given in Table 9. The accuracies of iAMPpred are much higher than those of all four models of CAMP. In particular, the AUC-ROC, AUC-PR and MCC values of iAMPpred are ~15%, ~20% and ~30% higher, respectively, than those of all four CAMP models. Though iAMPpred and iAMP-2L performed on par in terms of MCC, the AUC-ROC of iAMPpred is ~3% higher than that of iAMP-2L. Further, the prediction accuracies (AUC-ROC, AUC-PR and MCC) of iAMPpred are also higher than those of EFC-FCBF and EFC + 307-FCBF (Table 9).
Comparison of iAMPpred with AntiBP2. The performance of iAMPpred was also compared with that of AntiBP2 (http://www.imtech.res.in/raghava/antibp2/) using the same dataset as AntiBP2, which contains 999 antibacterial peptides and 999 non-antibacterial peptides. Since 5 sequences in the negative
Features | Sn ± SE | Sp ± SE | Ac ± SE | MCC ± SE
AAC + PAAC | 90.71 ± 0.29 | 93.14 ± 0.24 | 91.93 ± 0.16 | 0.84 ± 0.003
AAC + NAAC | 90.82 ± 0.32 | 93.22 ± 0.25 | 92.02 ± 0.19 | 0.84 ± 0.004
PAAC + NAAC | 90.76 ± 0.35 | 93.16 ± 0.25 | 91.96 ± 0.23 | 0.84 ± 0.005
AAC + PAAC + NAAC | 90.77 ± 0.32 | 93.22 ± 0.21 | 92.00 ± 0.18 | 0.84 ± 0.004
AAC + PAAC + PHYC + STRL | 92.33 ± 0.37 | 94.36 ± 0.22 | 93.35 ± 0.22 | 0.87 ± 0.004
AAC + NAAC + PHYC + STRL | 92.32 ± 0.32 | 94.36 ± 0.23 | 93.34 ± 0.20 | 0.87 ± 0.004
PAAC + NAAC + PHYC + STRL | 92.25 ± 0.29 | 94.38 ± 0.25 | 93.31 ± 0.17 | 0.87 ± 0.003
AAC + PAAC + NAAC + PHYC + STRL | 92.30 ± 0.27 | 94.41 ± 0.25 | 93.35 ± 0.18 | 0.87 ± 0.004
Table 6. Performance metrics of SVM in predicting antifungal peptides for the sample size 1383. SE: Standard Error.

References
Journal ArticleDOI
01 Oct 2001
TL;DR: Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the forest, and are also applicable to regression.
Abstract: Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, aaa, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.

79,257 citations
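
The two ingredients named in the abstract above (bootstrap resampling per tree and a random selection of features at each split) can be illustrated with a deliberately simplified sketch. It assumes depth-one trees (stumps), Gini impurity as the split criterion, and majority voting; the function names and the toy data are hypothetical, and this is not the reference implementation of random forests.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_stump(X, y, feature_ids):
    """Best single split among the candidate features only.

    Returns (feature, threshold, left_label, right_label); feature is None
    when no split is possible and both labels are the majority class.
    """
    best = None
    for f in feature_ids:
        values = sorted({row[f] for row in X})
        # Candidate thresholds are midpoints between consecutive values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [lab for row, lab in zip(X, y) if row[f] <= thr]
            right = [lab for row, lab in zip(X, y) if row[f] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, thr,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:
        majority = Counter(y).most_common(1)[0][0]
        return (None, None, majority, majority)
    return best[1:]


def fit_forest(X, y, n_trees=25, n_features=1, seed=0):
    """Fit stumps, each on a bootstrap sample with a random feature subset."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        feats = rng.sample(range(d), n_features)     # random feature subset
        forest.append(best_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Majority vote over all stumps."""
    votes = Counter()
    for f, thr, left_label, right_label in forest:
        if f is None or row[f] <= thr:
            votes[left_label] += 1
        else:
            votes[right_label] += 1
    return votes.most_common(1)[0][0]


# Toy, made-up data: two well-separated classes in two features.
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.8, 0.9], [0.9, 0.7], [0.7, 0.8]]
y = [0, 0, 0, 1, 1, 1]
forest = fit_forest(X, y)
```

Because each stump sees a different bootstrap sample and feature subset, the individual trees are decorrelated, which is the property the abstract ties to the forest's generalization error.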

Book
16 Jul 1998
TL;DR: Thorough, well-organized, and completely up to date, this book examines all the important aspects of this emerging technology, including the learning process, back-propagation learning, radial-basis function networks, self-organizing systems, modular networks, temporal processing and neurodynamics, and VLSI implementation of neural networks.
Abstract: From the Publisher: This book represents the most comprehensive treatment available of neural networks from an engineering perspective. Thorough, well-organized, and completely up to date, it examines all the important aspects of this emerging technology, including the learning process, back-propagation learning, radial-basis function networks, self-organizing systems, modular networks, temporal processing and neurodynamics, and VLSI implementation of neural networks. Written in a concise and fluid manner, by a foremost engineering textbook author, to make the material more accessible, this book is ideal for professional engineers and graduate students entering this exciting field. Computer experiments, problems, worked examples, a bibliography, photographs, and illustrations reinforce key concepts.

29,130 citations

01 Jan 1998
TL;DR: Presenting a method for determining the necessary and sufficient conditions for consistency of learning process, the author covers function estimates from small data pools, applying these estimations to real-life problems, and much more.
Abstract: A comprehensive look at learning and generalization theory. The statistical theory of learning and generalization concerns the problem of choosing desired functions on the basis of empirical data. Highly applicable to a variety of computer science and robotics fields, this book offers lucid coverage of the theory as a whole. Presenting a method for determining the necessary and sufficient conditions for consistency of learning process, the author covers function estimates from small data pools, applying these estimations to real-life problems, and much more.

26,531 citations

Journal ArticleDOI
TL;DR: This review presents the different models of antimicrobial-peptide-induced pore formation and cell killing, and notes several observations suggesting that translocated peptides can alter cytoplasmic membrane septum formation, inhibit cell-wall synthesis, inhibit nucleic-acid synthesis, inhibit protein synthesis or inhibit enzymatic activity.
Abstract: Antimicrobial peptides are an abundant and diverse group of molecules that are produced by many tissues and cell types in a variety of invertebrate, plant and animal species. Their amino acid composition, amphipathicity, cationic charge and size allow them to attach to and insert into membrane bilayers to form pores by 'barrel-stave', 'carpet' or 'toroidal-pore' mechanisms. Although these models are helpful for defining mechanisms of antimicrobial peptide activity, their relevance to how peptides damage and kill microorganisms still needs to be clarified. Recently, there has been speculation that transmembrane pore formation is not the only mechanism of microbial killing. In fact, several observations suggest that translocated peptides can alter cytoplasmic membrane septum formation, inhibit cell-wall synthesis, inhibit nucleic-acid synthesis, inhibit protein synthesis or inhibit enzymatic activity. In this review the different models of antimicrobial-peptide-induced pore formation and cell killing are presented.

5,102 citations

Proceedings ArticleDOI
25 Jun 2006
TL;DR: It is shown that a deep connection exists between ROC space and PR space, such that a curve dominates in ROC space if and only if it dominates in PR space.
Abstract: Receiver Operator Characteristic (ROC) curves are commonly used to present results for binary decision problems in machine learning. However, when dealing with highly skewed datasets, Precision-Recall (PR) curves give a more informative picture of an algorithm's performance. We show that a deep connection exists between ROC space and PR space, such that a curve dominates in ROC space if and only if it dominates in PR space. A corollary is the notion of an achievable PR curve, which has properties much like the convex hull in ROC space; we show an efficient algorithm for computing this curve. Finally, we also note that differences in the two types of curves are significant for algorithm design. For example, in PR space it is incorrect to linearly interpolate between points. Furthermore, algorithms that optimize the area under the ROC curve are not guaranteed to optimize the area under the PR curve.

5,063 citations
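
The remark in the abstract above that linear interpolation is incorrect in PR space can be checked with a small numerical sketch. It assumes a skewed dataset of 20 positives and 2000 negatives and two illustrative operating points; all counts and names are assumptions chosen for the example. Interpolating the true-positive and false-positive counts linearly (a straight line in ROC space) and then converting to precision gives a lower value than averaging the two precisions directly.

```python
def pr_point(tp, fp, n_pos):
    """Return (recall, precision) for an operating point given TP and FP counts."""
    recall = tp / n_pos
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    return recall, precision


def interpolate_counts(a, b, t):
    """Linearly interpolate (tp, fp) counts between points a and b, 0 <= t <= 1.

    Linear interpolation of counts corresponds to a straight segment in ROC space.
    """
    (tp_a, fp_a), (tp_b, fp_b) = a, b
    return tp_a + t * (tp_b - tp_a), fp_a + t * (fp_b - fp_a)


# Assumed, illustrative setting: 20 positives, 2000 negatives.
n_pos = 20
point_a = (5, 5)     # (TP, FP) at a strict threshold
point_b = (10, 30)   # (TP, FP) at a looser threshold

recall_a, precision_a = pr_point(*point_a, n_pos)
recall_b, precision_b = pr_point(*point_b, n_pos)

# Interpolate counts first, then convert to PR space.
mid_tp, mid_fp = interpolate_counts(point_a, point_b, 0.5)
recall_mid, precision_mid = pr_point(mid_tp, mid_fp, n_pos)

# Naive approach: draw a straight line between the two PR points.
naive_precision_mid = (precision_a + precision_b) / 2

print(f"recall at midpoint:         {recall_mid:.3f}")
print(f"precision via counts:       {precision_mid:.3f}")
print(f"precision via straight line: {naive_precision_mid:.3f}")
```

In this setting the straight line in PR space overstates precision at the midpoint, so an area computed that way would be optimistic.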