Predicting Secretory Proteins with SignalP

doi:10.1007/978-1-4939-7015-5_6

Nielsen, Henrik

Published in:

Methods in Molecular Biology

Link to article, DOI:

10.1007/978-1-4939-7015-5_6

Publication date:

2017

Document Version

Peer reviewed version

Link back to DTU Orbit

Citation (APA):

Nielsen, H. (2017). Predicting Secretory Proteins with SignalP. Methods in Molecular Biology, 1611, 59-73.

https://doi.org/10.1007/978-1-4939-7015-5_6

Predicting secretory proteins with SignalP

Henrik Nielsen

Abstract

SignalP is currently the most widely used program for prediction of signal peptides from amino acid

sequences. Proteins with signal peptides are targeted to the secretory pathway, but are not necessarily

secreted. After a brief introduction to the biology of signal peptides and the history of signal peptide

prediction, this chapter will describe all the options of the current version of SignalP and the details of the

output from the program. The chapter includes a case study where the scores of SignalP were used in a novel

way to predict the functional effects of amino acid substitutions in signal peptides.

Key words

Signal peptides, prediction, secretion, protein sorting, protein subcellular location

1. Introduction

A signal peptide (SP) is the N-terminal part of a protein that is targeted to the secretory pathway in both pro-

and eukaryotes [1] (see, however, Note 1). In eukaryotes, a protein with an SP will be targeted to the

endoplasmic reticulum (ER) membrane and be co-translationally translocated across the membrane. In

prokaryotes, translocation takes place across the cytoplasmic membrane (inner membrane in Gram-negative

bacteria), and the process can happen during or after translation. The SP-carrying protein is threaded through

a protein complex known as the translocon, comprising the subunits SecY, E, and G in bacteria and Sec61 α,

β, and γ in eukaryotes [2]. During translocation, the SP is cleaved off by an enzyme known as signal

peptidase I or leader peptidase (Lep) in bacteria or signal peptidase complex in eukaryotes [3]. See Notes 2-4

for exceptions to this general picture.

It is important to stress that the presence of an SP does not necessarily mean that the protein is secreted to the

extracellular environment—it only means that it enters the secretory pathway. In all kinds of organisms, the

protein could have one or more transmembrane helices downstream of the SP and therefore be retained in the

membrane [4]. In eukaryotes, the protein could also be retained in one of the compartments that belong to the

secretory pathway: the ER, the Golgi apparatus, or the lysosome/vacuole [5]; or it could be anchored to the

outer face of the cytoplasmic membrane by a glycophosphatidylinositol (GPI) group [6]. In Gram-negative

bacteria, the protein could be retained in the periplasm, or be inserted into the outer membrane as a β-barrel

transmembrane protein [7]. In Gram-positive bacteria, the protein could be attached to the cell wall [8].

SPs are generally described as having three regions: an N-terminal n-region of variable length characterized

by positive charge, a central h-region of at least 7 hydrophobic residues, and a C-terminal c-region of

typically 3-7 polar residues. Positions –1 and –3 relative to the cleavage site are occupied by small

uncharged residues; in bacteria, predominantly alanine. SPs of Gram-positive bacteria tend to be longer than

those of Gram-negative bacteria, which in turn tend to be longer than eukaryotic SPs [1].
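
The pattern at positions –1 and –3 can be checked with a few lines of code. The following is a toy sketch only, not part of SignalP: the set of "small uncharged" residues (A, G, S, C, T) is an assumption made for illustration, and the function name and example sequence are hypothetical.

```python
# Assumption: residues counted as "small and uncharged" for this illustration.
SMALL_UNCHARGED = frozenset("AGSCT")


def satisfies_minus3_minus1(sequence: str, sp_length: int) -> bool:
    """Check positions -1 and -3 relative to a proposed cleavage site.

    sp_length is the number of residues in the proposed signal peptide,
    so the cleavage site lies between residues sp_length and sp_length + 1
    (1-based numbering).
    """
    if sp_length < 3 or sp_length > len(sequence):
        raise ValueError("sp_length must be between 3 and the sequence length")
    minus1 = sequence[sp_length - 1]  # last residue of the signal peptide
    minus3 = sequence[sp_length - 3]  # two residues further upstream
    return minus1 in SMALL_UNCHARGED and minus3 in SMALL_UNCHARGED


# Hypothetical example sequence: a 21-residue N-terminal stretch plus 4 more residues.
example = "MKKTAIAIAVALAGFATVAQAAPKD"
print(satisfies_minus3_minus1(example, 21))
```

A check like this only tests one necessary feature of a cleavage site; it says nothing about the n- and h-regions and cannot replace a trained predictor.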

The SP is among the earliest prediction targets for bioinformatic algorithms, with the first simple prediction

methods published as early as the 1980s [9–11]. In the early 1990s, a few machine learning methods

were published [12, 13], but SignalP version 1.0 [14, 15] became, in 1996, the first machine learning method for

SP prediction to be made into a publicly available web server. SignalP 1.0 and 1.1 were based on artificial

neural networks (ANNs), while SignalP 2.0 from 1998 [16] added a hidden Markov model (HMM)

prediction in order to better distinguish between SPs and signal anchors (transmembrane helices close to the

N-terminus). SignalP 3.0 from 2004 [17] introduced the D-score for better discrimination between SPs and

other sequences and retained the HMM option, while SignalP 4.0 from 2011 [18] is again purely ANN-based.

While constructing SignalP 4.0, we did retrain the HMM part, but we found that it did not perform

better than the ANNs in any of the performance parameters we tested. The most important new feature of

SignalP 4.0 is the improved discrimination between signal peptides and transmembrane regions.

SignalP was updated to version 4.1 in 2012 with an option to set the D-score cutoff values so that the

sensitivity is the same as that of SignalP 3.0, and an option to set the minimum cleavage site position in the

sequence (the minimum SP length). More details about these options are given in Section 3.1. In addition,

the documentation on the website was completely rewritten, and a FAQ was added.

Earlier versions of SignalP have repeatedly been reported as the best performing method in independent

benchmarks [19–22]. SignalP 4 has not yet been independently evaluated, but in the SignalP 4.0 paper [18]

we compared the performance to ten other methods and found that it was superior. The best competing

methods were the combined SP and transmembrane helix predictors Phobius [23], Philius [24], and

SPOCTOPUS [25]. Interestingly, the advantage of SignalP 4.0 over these three programs was larger for

bacteria than for eukaryotes. This may be due to the fact that these three methods did not divide their training

data into different organism groups but pooled them all together, resulting in methods that are optimized for

the most abundant organism group in the data, the eukaryotes.

The performance values for SignalP 3.0 and 4.0 and the ten competing methods can be found in Table E of

the supplementary materials of the SignalP 4.0 paper, which is available on the SignalP web site (click on

“Article abstracts” and then “Update to SignalP v. 4.0”). It should be noted that those values are calculated

by cross-validation on a homology-reduced data set, i.e. they are the performances you should expect when

submitting proteins that are unrelated to anything in the SignalP 4.0 data set. When submitting close

homologs to proteins in the SignalP 4.0 data set, a higher performance should be expected (compare the

aforementioned Table E with the table on the “performance” page of the website documentation).

2. Materials

1. Input data: Amino acid sequences in FASTA format. Note that any letters not corresponding to the

twenty standard amino acids, e.g. ‘U’, ‘B’, or ‘Z’, will be converted to ‘X’ and treated as unknown

amino acids. See also Notes 5 and 6.

2. Website: SignalP 4.1 is available at http://www.cbs.dtu.dk/services/SignalP/, see Figure 1. The

previous versions are also kept online; just click “version history” near the top of the page.

3. Downloadable package: For those who prefer running SignalP on their own computers, there is an

option to download a software package for command line use. The package is free for academic

institutions, while there is a license fee for commercial users. Academic users can go to the page

http://www.cbs.dtu.dk/cgi-bin/nph-sw_request?signalp to fill out the details and accept the license,

while commercial users are asked to contact software@cbs.dtu.dk. The package is available for

Linux, IRIX, Darwin (Mac OS X), and from March 2016 also for Windows computers via the free

Unix-like environments provided by Cygwin [26] or MobaXterm [27].
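
The conversion of non-standard letters described in item 1 can be reproduced before submission, for example to see how many positions will be treated as unknown. This is a minimal sketch and not code from the SignalP package; the function name is hypothetical, and converting lowercase input to uppercase first is an assumption.

```python
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def normalize_residues(sequence: str) -> str:
    """Replace every letter outside the twenty standard amino acids with 'X'.

    Assumptions: the input contains only letters, and lowercase letters
    are uppercased before the check.
    """
    return "".join(
        residue if residue in STANDARD_AMINO_ACIDS else "X"
        for residue in sequence.upper()
    )


# 'U', 'B', and 'Z' are the examples of non-standard letters named in item 1.
cleaned = normalize_residues("MKUBZLA")
print(cleaned, cleaned.count("X"))
```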

3. Methods

Running SignalP with the default options is straightforward: On the website, you paste or upload the

sequences and click “submit”; on the command line you write “signalp input.fasta”. The output

will tell you, for each sequence, whether there is an SP predicted, and if yes, where the cleavage site is

predicted to be. However, as seen in Figure 1, there are a number of options, of which especially “Organism

group” and “Cutoff” are important to know about, and there are details of the output format that will help

interpret the predictions.

3.1. Options

Organism group: It is important to choose the correct organism group (Eukaryotes, Gram-negative

bacteria, or Gram-positive bacteria); otherwise, the predictive performance will suffer. In this context,

Gram-positive bacteria are defined as the phyla Actinobacteria (high G+C Gram-positive bacteria) and

Firmicutes. Gram-negative bacteria are defined as all bacteria having both a plasma membrane and an outer

membrane—basically all other bacteria except for the phylum Tenericutes (Mycoplasma and related genera).

SignalP probably should not be used for Tenericutes at all, since they seem to lack a type I signal peptidase

completely [28]. On the command line, organism group is chosen with one of the options “-t euk” (the

default), “-t gram-”, or “-t gram+”. Concerning organism groups, see also Notes 7-11.
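
When many genomes are processed in batch, the rules above can be written down as a small lookup. The sketch below is only an illustration of the definitions given in this paragraph; the function and constant names are hypothetical, the phylum is assumed to be known in advance, and the special cases discussed in Notes 7-11 are not handled.

```python
GRAM_POSITIVE_PHYLA = frozenset({"Actinobacteria", "Firmicutes"})
# Tenericutes seem to lack a type I signal peptidase, so no option is returned.
UNSUPPORTED_PHYLA = frozenset({"Tenericutes"})


def organism_option(phylum: str, is_eukaryote: bool = False) -> str:
    """Return the value for the "-t" option: "euk", "gram+", or "gram-"."""
    if is_eukaryote:
        return "euk"
    if phylum in UNSUPPORTED_PHYLA:
        raise ValueError(f"SignalP should probably not be used for {phylum}")
    if phylum in GRAM_POSITIVE_PHYLA:
        return "gram+"
    # All other bacteria are treated as Gram-negative in this context.
    return "gram-"


print(organism_option("Firmicutes"))
print(organism_option("Proteobacteria"))
```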

Output format: There are four levels of detail possible: “short”, “standard”, “long”, and “all”. The first two

formats report scores and conclusion at the sequence level; “short” in a one-line format and “standard” in a

more human-readable format. “Standard” is the default on the web server, and “short” on the command line.

The “long” and “all” formats additionally report scores for each position in each sequence (for an

explanation of the scores, see next section). The difference between “long” and “all” is that “long” reports

scores for the chosen ANN method only, while “all” reports scores for both ANN methods (SignalP-noTM

and SignalP-TM, see “Method” below for an explanation). On the command line, output format is chosen

with the “-f” option; note that “standard” is chosen with “-f summary”.

Graphics output: SignalP can make a plot of the scores for each position in each sequence in portable

network graphics (PNG) format and optionally also in encapsulated postscript (EPS) format. The default on

the web is to make PNG graphics, while the default on the command line is no graphics. If you want

graphics from the command line, use the options “-g png” or “-g png+eps”.

Method: SignalP 4 has two sets of ANNs: SignalP-noTM is trained with only cytosolic and nuclear proteins

in the negative set, while SignalP-TM is trained with a negative set that also included transmembrane

proteins. During training, we found that the two methods SignalP-TM and SignalP-noTM were to some

extent complementary, i.e. SignalP-TM did not yield as good results as SignalP-noTM when there were no

transmembrane sequences involved. As a compromise, SignalP 4 by default uses a heuristic to decide which

of the two sets of networks is used for prediction of each sequence. If the user is positive that all proteins in

the input are soluble, it is possible to override this heuristic and get a slightly better performance by using

only the SignalP-noTM networks. This is done in the web interface by selecting “Input sequences do not

include TM regions” and on the command line by including the option “-s notm”.

Cutoff: The D-score (see next section) is used for determining whether each input sequence contains an SP or

not. The user can set cutoff values (for SignalP-TM and SignalP-noTM separately) if a different balance

between sensitivity and specificity is desired. The web interface offers two sets of predefined cutoff values,

“Default” and “Sensitive”. The “Default” cutoffs, corresponding to SignalP 4.0, are optimized to give the

best Matthews correlation coefficient (see the “Performance” page on the website for definition), but they

result in a quite conservative prediction with a sensitivity that is actually lower than that of SignalP 3.0. The

“Sensitive” cutoffs, introduced in SignalP 4.1, are set to reproduce the sensitivity of SignalP 3.0. This of

course results in a slightly higher false positive rate, but one that is still significantly lower than that of SignalP 3.0

when measured on the whole data set (with transmembrane proteins included in the negative set). Our

recommendation is to use the “Sensitive” setting if it is important to avoid false negatives, but use the

“Default” setting for estimating the proportion of SPs in an organism. The estimation by the “Default” cutoff

was found to be in accordance with an estimate of the number of SPs in Escherichia coli by a recent

proteogenomics study [29]. At the website, you can see the preset cutoff values change when you select

“Default” or “Sensitive” or change the organism group. On the command line, the “Sensitive” cutoffs are

selected by including the options “-U 0.34 -u 0.34” for organism group “euk” or “-U 0.42 -u

0.42” for organism groups “gram+” and “gram-”.
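
The trade-off between the two cutoff settings can be made concrete with the standard definitions of sensitivity and the Matthews correlation coefficient, MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The following sketch is not taken from SignalP: the function names are hypothetical, and treating a D-score strictly above the cutoff as a positive prediction is an assumption.

```python
import math
from typing import Sequence, Tuple


def confusion_counts(
    d_scores: Sequence[float], has_sp: Sequence[bool], cutoff: float
) -> Tuple[int, int, int, int]:
    """Count (TP, TN, FP, FN) for a given D-score cutoff."""
    tp = tn = fp = fn = 0
    for score, truth in zip(d_scores, has_sp):
        predicted = score > cutoff  # assumption: strictly above the cutoff means "SP"
        if predicted and truth:
            tp += 1
        elif predicted:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true SPs that are predicted as SPs."""
    return tp / (tp + fn) if tp + fn else 0.0


def matthews_cc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0
```

Evaluating both functions at two cutoffs on a labelled test set shows how a lower cutoff raises the sensitivity while the Matthews correlation coefficient may drop because of additional false positives.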

Truncation of input sequence: By default, SignalP truncates every sequence to 70 amino acids before

prediction. This leaves enough sequence after the cleavage site to give the optimal prediction for the

vast majority of SPs. If you want to predict extremely long signal peptides, you can try a higher value, or

disable truncation completely by entering 0 (zero). Note that the neural networks are trained with sequences

with a maximal length of 70, and they include the relative position in the sequence in their input. Therefore,

general performance may deteriorate if you change this setting. On the command line, truncation is changed

with the “-c” option.

Minimal predicted signal peptide length: SignalP 4.0 could, in rare cases, erroneously predict extremely

short signal peptides. SignalP 4.1 eliminates these errors by imposing a lower limit on the

cleavage site position (SP length). The minimum length is by default 10, but you can adjust it. Signal

peptides shorter than 15 residues are very rare; at the time of writing, there are 17 experimentally confirmed

cases in UniProt that are not fragments. If you want to disable this length restriction completely, enter 0

(zero). On the command line, minimal SP length is changed with the “-M” option.
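
For scripted runs, the command line options described in this section can be assembled programmatically. The helper below is a sketch under stated assumptions, not part of the SignalP package: the function name and its parameters are hypothetical, it is assumed that the "short", "long", and "all" formats are passed to “-f” under those names, and the resulting list is meant to be handed to a process launcher such as subprocess.run.

```python
from typing import List, Optional

# "Sensitive" D-score cutoffs as given in the Cutoff paragraph.
SENSITIVE_CUTOFF = {"euk": "0.34", "gram+": "0.42", "gram-": "0.42"}
OUTPUT_FORMATS = {"short", "summary", "long", "all"}  # "summary" = "standard"
GRAPHICS_MODES = {"png", "png+eps"}


def build_signalp_command(
    fasta_path: str,
    organism: str = "euk",
    output_format: str = "short",
    graphics: Optional[str] = None,
    notm_only: bool = False,
    sensitive: bool = False,
    truncate: Optional[int] = None,
    min_sp_length: Optional[int] = None,
) -> List[str]:
    """Build an argument list for a local SignalP 4.1 installation."""
    if organism not in SENSITIVE_CUTOFF:
        raise ValueError(f"unknown organism group: {organism}")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unknown output format: {output_format}")
    command = ["signalp", "-t", organism, "-f", output_format]
    if graphics is not None:
        if graphics not in GRAPHICS_MODES:
            raise ValueError(f"unknown graphics mode: {graphics}")
        command += ["-g", graphics]
    if notm_only:
        command += ["-s", "notm"]  # use only the SignalP-noTM networks
    if sensitive:
        cutoff = SENSITIVE_CUTOFF[organism]
        command += ["-U", cutoff, "-u", cutoff]
    if truncate is not None:
        command += ["-c", str(truncate)]  # 0 disables truncation
    if min_sp_length is not None:
        command += ["-M", str(min_sp_length)]  # 0 disables the length limit
    command.append(fasta_path)
    return command


print(" ".join(build_signalp_command("input.fasta", organism="gram-", sensitive=True)))
```

Leaving truncate and min_sp_length unset keeps the program defaults (70 and 10, respectively), which is advisable unless there is a specific reason to change them.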

3.2. Output

The neural networks in SignalP produce three output scores for each position in the input sequence:

• C-score (raw cleavage site score): The output from the cleavage site networks, which are trained to

distinguish SP cleavage sites from everything else. Note the position numbering of the cleavage site:

The C-score is trained to be high at the position immediately after the cleavage site (the first residue

in the mature protein).

• S-score (signal peptide score): The output from the signal peptide networks, which are trained to

distinguish positions within SPs from positions in the mature part of the proteins and from proteins

without SPs.

• Y-score (combined cleavage site score): A combination (geometric average) of the C-score and the

slope of the S-score, resulting in a better cleavage site prediction than the raw C-score alone. This is

due to the fact that multiple high-peaking C-scores can be found in one sequence, where only one is

the true cleavage site. The Y-score distinguishes between C-score peaks by choosing the one where

the slope of the S-score is steep.
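
The way the Y-score combines the other two scores can be sketched in code. The example below is an illustration, not the SignalP implementation: the slope of the S-score is approximated here as the mean S-score in a window before the position minus the mean in a window starting at the position, and the window length, the handling of sequence ends, and the clipping of negative slopes to zero are all assumptions. Positions are 0-based.

```python
import math
from typing import List, Sequence


def s_score_slope(s_scores: Sequence[float], position: int, window: int) -> float:
    """Mean S-score before the position minus mean S-score from the position on."""
    before = s_scores[max(0, position - window):position]
    after = s_scores[position:position + window]
    if not before or not after:
        return 0.0  # assumption: no slope is defined at the very first position
    return sum(before) / len(before) - sum(after) / len(after)


def y_scores(
    c_scores: Sequence[float], s_scores: Sequence[float], window: int = 3
) -> List[float]:
    """Geometric average of the C-score and the (non-negative) S-score slope."""
    if len(c_scores) != len(s_scores):
        raise ValueError("C-score and S-score lists must have the same length")
    return [
        math.sqrt(c_scores[i] * max(0.0, s_score_slope(s_scores, i, window)))
        for i in range(len(c_scores))
    ]


# Made-up scores: the C-score peaks at index 2, but the S-score drops at index 3.
c = [0.0, 0.1, 0.8, 0.7, 0.0, 0.0]
s = [0.9, 0.9, 0.9, 0.1, 0.1, 0.1]
y = y_scores(c, s)
print(y.index(max(y)))
```

In this made-up example the highest C-score and the highest Y-score fall on different positions, which is the situation the Y-score is designed to resolve.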

References

A method and server for predicting damaging missense mutations.

Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes

SignalP 4.0: discriminating signal peptides from transmembrane regions

Improved Prediction of Signal Peptides: SignalP 3.0

Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites.
