Predicting Secretory Proteins with SignalP


Predicting Secretory Proteins with SignalP
Nielsen, Henrik
Published in:
Methods in Molecular Biology
Link to article, DOI:
10.1007/978-1-4939-7015-5_6
Publication date:
2017
Document Version
Peer reviewed version
Link back to DTU Orbit
Citation (APA):
Nielsen, H. (2017). Predicting Secretory Proteins with SignalP. Methods in Molecular Biology, 1611, 59-73.
https://doi.org/10.1007/978-1-4939-7015-5_6

Predicting secretory proteins with SignalP
Henrik Nielsen
Abstract
SignalP is the currently most widely used program for prediction of signal peptides from amino acid
sequences. Proteins with signal peptides are targeted to the secretory pathway, but are not necessarily
secreted. After a brief introduction to the biology of signal peptides and the history of signal peptide
prediction, this chapter will describe all the options of the current version of SignalP and the details of the
output from the program. The chapter includes a case study where the scores of SignalP were used in a novel
way to predict the functional effects of amino acid substitutions in signal peptides.
Key words
Signal peptides, prediction, secretion, protein sorting, protein subcellular location
1. Introduction
A signal peptide (SP) is the N-terminal part of a protein that is targeted to the secretory pathway in both pro-
and eukaryotes [1] (see, however, Note 1). In eukaryotes, a protein with an SP will be targeted to the
endoplasmic reticulum (ER) membrane and be co-translationally translocated across the membrane. In
prokaryotes, translocation takes place across the cytoplasmic membrane (inner membrane in Gram-negative
bacteria), and the process can happen during or after translation. The SP-carrying protein is threaded through
a protein complex known as the translocon, comprising the subunits SecY, E, and G in bacteria and Sec61 α,
β, and γ in eukaryotes [2]. During translocation, the SP is cleaved off by an enzyme known as signal
peptidase I or leader peptidase (Lep) in bacteria or signal peptidase complex in eukaryotes [3]. See Notes 2-4
for exceptions to this general picture.
It is important to stress that the presence of an SP does not necessarily mean that the protein is secreted to the
extracellular environment—it only means that it enters the secretory pathway. In all kinds of organisms, the
protein could have one or more transmembrane helices downstream of the SP and therefore be retained in the
membrane [4]. In eukaryotes, the protein could also be retained in one of the compartments that belong to the
secretory pathway: the ER, the Golgi apparatus, or the lysosome/vacuole [5]; or it could be anchored to the
outer face of the cytoplasmic membrane by a glycophosphatidylinositol (GPI) group [6]. In Gram-negative
bacteria, the protein could be retained in the periplasm, or be inserted into the outer membrane as a β-barrel
transmembrane protein [7]. In Gram-positive bacteria, the protein could be attached to the cell wall [8].
SPs are generally described as having three regions: an N-terminal n-region of variable length characterized
by positive charge, a central h-region of at least 7 hydrophobic residues, and a C-terminal c-region of
typically 3-7 polar residues. Positions –1 and –3 relative to the cleavage site are occupied by small
uncharged residues; in bacteria predominantly Alanine. SPs of Gram-positive bacteria tend to be longer than
those of Gram-negative bacteria, which in turn tend to be longer than eukaryotic SPs [1].

The SP is among the earliest prediction targets for bioinformatic algorithms, with the first simple prediction
methods published as early as the 1980s [9–11]. In the early 1990s, a few machine learning methods
were published [12, 13], but SignalP version 1.0 [14, 15] was in 1996 the first machine learning method for
SP prediction to be made into a publicly available web server. SignalP 1.0 and 1.1 were based on artificial
neural networks (ANNs), while SignalP 2.0 from 1998 [16] added a hidden Markov model (HMM)
prediction in order to better distinguish between SPs and signal anchors (transmembrane helices close to the
N-terminus). SignalP 3.0 from 2004 [17] introduced the D-score for better discrimination between SPs and
other sequences and retained the HMM option, while SignalP 4.0 from 2011 [18] is again purely ANN-based.
While constructing SignalP 4.0, we did retrain the HMM part, but we found that it did not perform
better than the ANNs in any of the performance parameters we tested. The most important new feature of
SignalP 4.0 is the improved discrimination between signal peptides and transmembrane regions.
SignalP was updated to version 4.1 in 2012 with an option to set the D-score cutoff values so that the
sensitivity is the same as that of SignalP 3.0, and an option to set the minimum cleavage site position in the
sequence (the minimum SP length). More details about these options are given in Section 3.1. In addition,
the documentation on the website was completely rewritten, and a FAQ was added.
Earlier versions of SignalP have repeatedly been reported as the best performing method in independent
benchmarks [19–22]. SignalP 4 has not yet been independently evaluated, but in the SignalP 4.0 paper [18]
we compared the performance to ten other methods and found that it was superior. The best competing
methods were the combined SP and transmembrane helix predictors Phobius [23], Philius [24], and
SPOCTOPUS [25]. Interestingly, the advantage of SignalP 4.0 over these three programs was larger for
bacteria than for eukaryotes. This may be due to the fact that these three methods did not divide their training
data into different organism groups but pooled them all together, resulting in methods that are optimized for
the most abundant organism group in the data, the eukaryotes.
The performance values for SignalP 3.0 and 4.0 and the ten competing methods can be found in Table E of
the supplementary materials of the SignalP 4.0 paper, which is available on the SignalP web site (click on
“Article abstracts” and then “Update to SignalP v. 4.0”). It should be noted that those values are calculated
by cross-validation on a homology-reduced data set, i.e. they are the performances you should expect when
submitting proteins that are unrelated to anything in the SignalP 4.0 data set. When submitting close
homologs to proteins in the SignalP 4.0 data set, a higher performance should be expected (compare the
aforementioned Table E with the table on the “performance” page of the website documentation).
2. Materials
1. Input data: Amino acid sequences in FASTA format. Note that any letters not corresponding to the
twenty standard amino acids, e.g. ‘U’, ‘B’, or ‘Z’, will be converted to ‘X’ and treated as unknown
amino acids. See also Notes 5 and 6.
2. Website: SignalP 4.1 is available at http://www.cbs.dtu.dk/services/SignalP/, see Figure 1. The
previous versions are also kept online; just click “version history” near the top of the page.
3. Downloadable package: For those who prefer running SignalP on their own computers, there is an
option to download a software package for command line use. The package is free for academic
institutions, while there is a license fee for commercial users. Academic users can go to the page
http://www.cbs.dtu.dk/cgi-bin/nph-sw_request?signalp to fill out the details and accept the license,
while commercial users are asked to contact software@cbs.dtu.dk. The package is available for
Linux, IRIX, Darwin (Mac OS X), and from March 2016 also for Windows computers via the free
Unix-like environments provided by Cygwin [26] or MobaXterm [27].
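The handling of non-standard letters described in item 1 can be previewed before submission. The following Python sketch is only an illustration, not part of SignalP: the function names `read_fasta` and `mask_nonstandard` are hypothetical, and the case-insensitive treatment of letters is an assumption.

```python
# Illustrative sketch only; these helpers are hypothetical and not part of SignalP.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def mask_nonstandard(sequence):
    """Replace every letter outside the twenty standard amino acids with 'X'.

    Assumption: letters are compared case-insensitively.
    """
    return "".join(
        c if c in STANDARD_AMINO_ACIDS else "X" for c in sequence.upper()
    )


if __name__ == "__main__":
    for name, seq in read_fasta(">example\nMKUBZA\n"):
        print(name, mask_nonstandard(seq))
```

Running such a check first shows how many positions in each sequence would be treated as unknown amino acids.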
3. Methods
Running SignalP with the default options is straightforward: On the website, you paste or upload the
sequences and click “submit”; on the command line you write “signalp input.fasta”. The output
will tell you, for each sequence, whether there is an SP predicted, and if yes, where the cleavage site is
predicted to be. However, as seen in Figure 1, there are a number of options, of which especially “Organism
group” and “Cutoff” are important to know about, and there are details of the output format that will help
interpret the predictions.
3.1. Options
Organism group: It is important to choose the correct organism group (Eukaryotes, Gram-negative
bacteria, or Gram-positive bacteria); otherwise, the predictive performance will suffer. In this context,
Gram-positive bacteria are defined as the phyla Actinobacteria (high G+C Gram-positive bacteria) and
Firmicutes. Gram-negative bacteria are defined as all bacteria having both a plasma membrane and an outer
membrane—basically all other bacteria except for the phylum Tenericutes (Mycoplasma and related genera).
SignalP probably should not be used for Tenericutes at all, since they seem to lack a type I signal peptidase
completely [28]. On the command line, organism group is chosen with one of the options “-t euk” (the
default), “-t gram-”, or “-t gram+”. Concerning organism groups, see also Notes 7-11.
Output format: There are four levels of detail possible: “short”, “standard”, “long”, and “all”. The first two
formats report scores and conclusion at the sequence level; “short” in a one-line format and “standard” in a
more human-readable format. “Standard” is the default on the web server, and “short” on the command line.
The “long” and “all” formats additionally report scores for each position in each sequence (for an
explanation of the scores, see next section). The difference between “long” and “all” is that “long” reports
scores for the chosen ANN method only, while “all” reports scores for both ANN methods (SignalP-noTM
and SignalP-TM, see “Method” below for an explanation). On the command line, output format is chosen
with the “-f” option; note that “standard” is chosen with “-f summary”.
Graphics output: SignalP can make a plot of the scores for each position in each sequence in portable
network graphics (PNG) format and optionally also in encapsulated postscript (EPS) format. The default on
the web is to make PNG graphics, while the default on the command line is no graphics. If you want
graphics from the command line, use the option “-g png” or “-g png+eps”.
Method: SignalP 4 has two sets of ANNs: SignalP-noTM is trained with only cytosolic and nuclear proteins
in the negative set, while SignalP-TM is trained with a negative set that also includes transmembrane
proteins. During training, we found that the two methods SignalP-TM and SignalP-noTM were to some
extent complementary, i.e. SignalP-TM did not yield as good results as SignalP-noTM when there were no
transmembrane sequences involved. As a compromise, SignalP 4 by default uses a heuristic to decide which
of the two sets of networks is used for prediction of each sequence. If the user is positive that all proteins in
the input are soluble, it is possible to override this heuristic and get a slightly better performance by using
only the SignalP-noTM networks. This is done in the web interface by selecting “Input sequences do not
include TM regions” and on the command line by including the option “-s notm”.
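For scripted use, the command-line options for organism group, output format, graphics, and method can be assembled programmatically. The Python sketch below only builds the argument list and does not run SignalP. The function name `build_signalp_command` is hypothetical, the flag values “short”, “long”, and “all” for “-f” are inferred from the format names, and placing the options before the file name is an assumption.

```python
# Illustrative sketch only; the helper is hypothetical and does not run SignalP.
import shlex

ORGANISM_GROUPS = ("euk", "gram-", "gram+")
# "standard" is selected with "-f summary"; the other values are assumed
# to match the format names.
FORMAT_FLAGS = {"short": "short", "standard": "summary", "long": "long", "all": "all"}
GRAPHICS_CHOICES = (None, "png", "png+eps")


def build_signalp_command(fasta_path, organism="euk", output_format="short",
                          graphics=None, soluble_only=False):
    """Return the argument list for a SignalP 4.1 command-line call."""
    if organism not in ORGANISM_GROUPS:
        raise ValueError(f"unknown organism group: {organism!r}")
    if output_format not in FORMAT_FLAGS:
        raise ValueError(f"unknown output format: {output_format!r}")
    if graphics not in GRAPHICS_CHOICES:
        raise ValueError(f"unknown graphics choice: {graphics!r}")
    command = ["signalp", "-t", organism, "-f", FORMAT_FLAGS[output_format]]
    if graphics is not None:
        command += ["-g", graphics]
    if soluble_only:
        # Use only the SignalP-noTM networks.
        command += ["-s", "notm"]
    command.append(fasta_path)  # assumption: options precede the input file
    return command


if __name__ == "__main__":
    args = build_signalp_command("input.fasta", organism="gram-",
                                 output_format="standard", graphics="png")
    print(shlex.join(args))
```

The resulting list can be passed to `subprocess.run` on a machine where the downloadable package is installed.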

Cutoff: The D-score (see next section) is used for determining whether each input sequence contains an SP or
not. The user can set cutoff values (for SignalP-TM and SignalP-noTM separately) if a different balance
between sensitivity and specificity is desired. The web interface offers two sets of predefined cutoff values,
“Default” and “Sensitive”. The “Default” cutoffs, corresponding to SignalP 4.0, are optimized to give the
best Matthews correlation coefficient (see the “Performance” page on the website for definition), but they
result in a quite conservative prediction with a sensitivity that is actually lower than that of SignalP 3.0. The
“Sensitive” cutoffs, introduced in SignalP 4.1, are set to reproduce the sensitivity of SignalP 3.0. This of
course results in a slightly higher false positive rate, but one that is still significantly lower than that of SignalP 3.0
when measured on the whole data set (with transmembrane proteins included in the negative set). Our
recommendation is to use the “Sensitive” setting if it is important to avoid false negatives, but use the
“Default” setting for estimating the proportion of SPs in an organism. The estimation by the “Default” cutoff
was found to be in accordance with an estimate of the number of SPs in Escherichia coli by a recent
proteogenomics study [29]. At the website, you can see the preset cutoff values change when you select
“Default” or “Sensitive” or change the organism group. On the command line, the “Sensitive” cutoffs are
selected by including the options “-U 0.34 -u 0.34” for organism group “euk” or “-U 0.42 -u
0.42” for organism groups “gram+” and “gram-”.
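The trade-off between the two cutoff settings can be illustrated with a small Python sketch. The “Sensitive” values are those quoted above; everything else is an assumption made for illustration: the function names are hypothetical, the D-scores and labels in the usage example are invented toy data, the stricter cutoff of 0.45 is an arbitrary example value, and a sequence is taken to be predicted positive when its D-score is strictly above the cutoff.

```python
# Illustrative sketch only; helpers and toy data are hypothetical.
import math

# D-score cutoffs of the "Sensitive" setting, as quoted in the text.
SENSITIVE_CUTOFFS = {"euk": 0.34, "gram+": 0.42, "gram-": 0.42}


def sensitive_cutoff_args(organism):
    """Return the command-line arguments selecting the "Sensitive" cutoffs."""
    cutoff = SENSITIVE_CUTOFFS[organism]
    return ["-U", str(cutoff), "-u", str(cutoff)]


def confusion_counts(d_scores, has_sp, cutoff):
    """Count (tp, fp, tn, fn) when predicting an SP for D-score > cutoff.

    Assumption: the comparison is strict.
    """
    tp = fp = tn = fn = 0
    for score, positive in zip(d_scores, has_sp):
        predicted = score > cutoff
        if predicted and positive:
            tp += 1
        elif predicted:
            fp += 1
        elif positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def matthews_cc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 if the denominator is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


if __name__ == "__main__":
    # Invented toy data: three proteins with an SP, three without.
    scores = [0.9, 0.6, 0.4, 0.38, 0.3, 0.1]
    labels = [True, True, True, False, False, False]
    for cutoff in (0.45, SENSITIVE_CUTOFFS["euk"]):
        counts = confusion_counts(scores, labels, cutoff)
        print(cutoff, counts, round(matthews_cc(*counts), 3))
```

In the toy data, lowering the cutoff recovers one more true SP at the cost of one additional false positive, which is the kind of balance the two presets are meant to control.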
Truncation of input sequence: By default, SignalP truncates every sequence to 70 amino acids before
prediction. This gives enough included sequence after the cleavage site to give the optimal prediction for the
vast majority of SPs. If you want to predict extremely long signal peptides, you can try a higher value, or
disable truncation completely by entering 0 (zero). Note that the neural networks are trained with sequences
with a maximal length of 70, and they include the relative position in the sequence in their input. Therefore,
general performance may deteriorate if you change this setting. On the command line, truncation is changed
with the “-c” option.
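The effect of the truncation setting can be mimicked in a few lines of Python. This is a sketch under assumptions, not the program's own code: the function name `truncate_sequence` is hypothetical, and it is assumed that the N-terminal residues are the ones kept.

```python
# Illustrative sketch only; not taken from the SignalP source.
def truncate_sequence(sequence, max_length=70):
    """Keep the first max_length residues; 0 disables truncation.

    Assumption: truncation keeps the N-terminal part of the sequence.
    """
    if max_length < 0:
        raise ValueError("max_length must be zero or positive")
    if max_length == 0:
        return sequence
    return sequence[:max_length]


if __name__ == "__main__":
    protein = "M" + "A" * 119  # hypothetical 120-residue sequence
    print(len(truncate_sequence(protein)))       # default of 70
    print(len(truncate_sequence(protein, 0)))    # truncation disabled
```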
Minimal predicted signal peptide length: SignalP 4.0 could, in rare cases, erroneously predict extremely
short signal peptides. These errors have in SignalP 4.1 been eliminated by imposing a lower limit on the
cleavage site position (SP length). The minimum length is by default 10, but you can adjust it. Signal
peptides shorter than 15 residues are very rare; at the time of writing, there are 17 experimentally confirmed
cases in UniProt that are not fragments. If you want to disable this length restriction completely, enter 0
(zero). On the command line, minimal SP length is changed with the “-M” option.
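One plausible way to picture a lower limit on the cleavage site position is to restrict the search for the best-scoring cleavage position to positions that leave an SP of at least the minimum length. The Python sketch below illustrates that idea only; the text does not describe how SignalP 4.1 imposes the limit internally, the function name `pick_cleavage_position` is hypothetical, and the per-position scores in the usage example are invented.

```python
# Illustrative sketch only; the internal mechanism in SignalP 4.1 may differ.
def pick_cleavage_position(position_scores, min_sp_length=10):
    """Return the 1-based position of the first mature residue, or None.

    position_scores[i] is the cleavage site score at 1-based position i + 1,
    where a high score marks the first residue after the cleavage site.
    A cleavage position p implies an SP of length p - 1, so only positions
    with p - 1 >= min_sp_length are considered. Passing 0 disables the limit.
    """
    candidates = range(min_sp_length, len(position_scores))
    if not candidates:
        return None
    best_index = max(candidates, key=lambda i: position_scores[i])
    return best_index + 1


if __name__ == "__main__":
    scores = [0.0] * 30
    scores[4] = 0.9    # spurious early peak: would imply a 4-residue SP
    scores[20] = 0.6   # peak implying a 20-residue SP
    print(pick_cleavage_position(scores))      # limit of 10 applied
    print(pick_cleavage_position(scores, 0))   # limit disabled
```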
3.2. Output
The neural networks in SignalP produce three output scores for each position in the input sequence:
C-score (raw cleavage site score): The output from the cleavage site networks, which are trained to
distinguish SP cleavage sites from everything else. Note the position numbering of the cleavage site:
The C-score is trained to be high at the position immediately after the cleavage site (the first residue
in the mature protein).
S-score (signal peptide score): The output from the signal peptide networks, which are trained to
distinguish positions within SPs from positions in the mature part of the proteins and from proteins
without SPs.
Y-score (combined cleavage site score): A combination (geometric average) of the C-score and the
slope of the S-score, resulting in a better cleavage site prediction than the raw C-score alone. This is
due to the fact that multiple high-peaking C-scores can be found in one sequence, where only one is
the true cleavage site. The Y-score distinguishes between C-score peaks by choosing the one where
the slope of the S-score is steep.
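The way the Y-score combines the two other scores can be sketched in Python. This is a minimal illustration under stated assumptions, not the SignalP implementation: the slope of the S-score is estimated here as the mean S-score over `window` positions before a position minus the mean over `window` positions from that position onward, negative slopes are set to zero, and the window size and function name `y_scores` are hypothetical choices.

```python
# Illustrative sketch only; window size and slope estimate are assumptions.
import math


def y_scores(c_scores, s_scores, window=2):
    """Geometric average of the C-score and an estimated S-score slope."""
    if len(c_scores) != len(s_scores):
        raise ValueError("C-score and S-score lists must have equal length")
    result = []
    for i, c in enumerate(c_scores):
        before = s_scores[max(0, i - window):i]
        after = s_scores[i:i + window]
        if before:
            slope = sum(before) / len(before) - sum(after) / len(after)
        else:
            slope = 0.0  # no residues upstream of the first position
        slope = max(slope, 0.0)  # only a falling S-score supports a cleavage site
        result.append(math.sqrt(c * slope))
    return result


if __name__ == "__main__":
    # Invented toy scores: two C-score peaks, but the S-score drops only
    # at the second one.
    c = [0.0, 0.0, 0.7, 0.0, 0.8, 0.0, 0.0, 0.0]
    s = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1]
    y = y_scores(c, s)
    print(max(range(len(y)), key=lambda i: y[i]) + 1)  # 1-based position
```

In the toy data, the first C-score peak receives a Y-score of zero because the S-score is flat around it, so only the peak that coincides with the drop in S-score remains.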
