FFPred 3: feature-based function prediction for all Gene Ontology domains.

doi:10.1038/SREP31865


Scientific Reports | 6:31865 | DOI: 10.1038/srep31865

www.nature.com/scientificreports


Domenico Cozzetto*, Federico Minneci*, Hannah Currant & David T. Jones

Predicting protein function has been a major goal of bioinformatics for several decades, and it has gained fresh momentum thanks to recent community-wide blind tests aimed at benchmarking available tools on a genomic scale. Sequence-based predictors, especially those performing homology-based transfers, remain the most popular, but increasing understanding of their limitations has stimulated the development of complementary approaches, which mostly exploit machine learning. Here we present FFPred 3, which is intended for assigning Gene Ontology terms to human protein chains when homology with characterized proteins can provide little aid. Predictions are made by scanning the input sequences against an array of Support Vector Machines (SVMs), each examining the relationship between protein function and biophysical attributes describing secondary structure, transmembrane helices, intrinsically disordered regions, signal peptides and other motifs. This update features a larger SVM library that extends its coverage to the cellular component sub-ontology for the first time, prompted by the establishment of a dedicated evaluation category within the Critical Assessment of Functional Annotation. The effectiveness of this approach is demonstrated through benchmarking experiments, and its usefulness is illustrated by analysing the potential functional consequences of alternative splicing in humans and their relationship to patterns of biological features.

Thanks to a combination of experimental assays and computational studies, knowledge about protein function has been steadily accumulating in public databases, where it is commonly described through the Gene Ontology [1] (GO). On the one hand, hypothesis-driven research has traditionally led to the thorough characterization of one or a few proteins at a time. On the other hand, high-throughput technologies have opened the way to very large-scale exploratory surveys to study biological processes, identify binding partners, or establish subcellular locations. Meanwhile, some homology-based approaches for annotation transfers have developed enough to produce fairly confident results. The GO consortium, for instance, makes wide use of a semi-automated tool for phylogenetic analysis and functional inference [2], and of mappings from protein domain families to GO terms that are valid for all their members [3]. Despite these multi-pronged efforts, however, a substantial fraction of deposited sequences still have no functional annotation at all, and the remaining ones usually lack assignments for at least one GO domain. When available, this information may not be at the finest level of detail possible, not only because of the way some electronically inferred annotations are generated, but also because of the varying levels of resolution characterizing experimental results [4,5]. Finally, nature can still spring surprises: protein moonlighting demonstrates that novel functions can await discovery even for well-researched proteins [6].

Bioinformatics Group, Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, UK. *These authors contributed equally to the work. Correspondence and requests for materials should be addressed to D.T.J. (email: d.t.jones@ucl.ac.uk)

Received: 05 May 2016. Accepted: 25 July 2016. Published: 26 August 2016.

One way to fill in some of these gaps employs machine learning to examine diverse biological data types separately or in combination, and to provide functional hypotheses that complement homology-based annotation transfers [7–9]. In particular, over the years several supervised methods have been devised for function prediction from amino acid sequences, which are easier to collect than structural data or genome-wide measurements of gene expression or protein-protein interactions. GOStruct [10] and FANN-GO [11], for instance, make GO term assignments by analysing the patterns of BLAST [12] E-values to experimentally characterized proteins using structured Support Vector Machines (SVMs) and multioutput neural networks, respectively. Given the computational complexity of training classifiers with multiple correlated outputs, it is difficult to learn the relationship between the input features and the whole GO; the proponents have therefore adopted workarounds such as reducing the number of output terms and ensemble modelling. Rather than tackling this complex structured learning problem, other researchers have successfully tested the possibility of converting it into a set of simpler binary classification tasks. This approach has recently allowed our group to train GO term-specific neural networks from features describing the results of profile-profile comparisons [13].
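The conversion of the structured GO prediction problem into independent binary tasks can be illustrated with a minimal sketch. It is not taken from FFPred or from the methods cited above: the function name `binary_tasks` and the protein identifiers are hypothetical, and only the two GO identifiers are terms that appear in this paper. Each GO term receives its own 0/1 labelling of the training proteins, on which a separate classifier could then be trained.

```python
from typing import Dict, List, Set


def binary_tasks(annotations: Dict[str, Set[str]], terms: List[str]) -> Dict[str, Dict[str, int]]:
    """Turn multi-label GO annotations into one binary labelling per GO term."""
    return {
        term: {protein: int(term in gos) for protein, gos in annotations.items()}
        for term in terms
    }


# Toy annotations: hypothetical proteins, GO identifiers mentioned in this paper.
annotations = {
    "P1": {"GO:0008083", "GO:0019901"},
    "P2": {"GO:0019901"},
    "P3": set(),
}
tasks = binary_tasks(annotations, ["GO:0008083", "GO:0019901"])
print(tasks["GO:0008083"])  # {'P1': 1, 'P2': 0, 'P3': 0}
```

Because every term is handled separately, the resulting tasks are typically imbalanced (few positives, many negatives), and consistency of the predictions with the GO graph is not enforced by the decomposition itself.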

Alignment-derived features, such as similarity scores, sequence coverage and E-values, can help learn which sequence similarity patterns correlate with the conservation of individual annotations, thus allowing more effective control of homology-based annotation transfers. Complementary efforts have investigated the usefulness of biophysical attributes to make homology-free inferences, under the assumption that proteins with similar functions would have similar biological features despite the lack of significant sequence similarities. For example, the occurrence of signal peptides gives useful hints about protein subcellular location, and also limits the number of possible molecular functions and of the biological processes the proteins can take part in. The idea was first implemented in ProtFun, which is based on neural networks trained for the functional classification of protein sequences from similarities in amino acid composition and content of signal peptides, transmembrane helices, post-translationally modified residues as well as other biological features [14,15]. The observation that the length and position of intrinsically disordered protein regions strongly correlate with some molecular activities and biological processes led to an expanded set of sequence-derived features, which FFPred scans through a library of GO term-specific SVMs to annotate protein chains [16,17]. A more recent study has confirmed the effectiveness of this feature-based approach with the use of random forests for supervised learning [18].

In this paper, we describe the latest FFPred release, which updates the previous one with an extended vocabulary spanning all three GO domains, reflecting the increasing attention to cellular component annotations, as evidenced by recent experiments in the Critical Assessment of Functional Annotation initiative. We evaluate FFPred 3 prediction accuracy using two complementary approaches and describe its improvements over the previous version. Finally, we show how its predictions can offer a glimpse into the effects of alternative splicing on human protein function. The results show patterns of functional conservation and variation consistent with the presence or absence of particular biophysical attributes and with general biological knowledge.

Results and Discussion

Summary of tool updates. Thanks to the continued growth of annotation databases, the latest FFPred release features a GO term vocabulary that spans all three GO domains for the first time and is almost twice the size of that in the previous update. Supplementary Data file 1 lists the 868 GO terms for which a dedicated SVM is available, along with the classification accuracy estimated from the validation experiments following the training procedures. The new release still makes use of SVMs, which are known to successfully handle imbalanced classification tasks (typical in computational biology), where it is extremely important to allow for error control and avoid overfitting to known observations. Subcellular localization prediction has been the subject of many previous studies, which mostly focused on the well-known compartments of eukaryotic cells, such as the nucleus, cytosol, endoplasmic reticulum, Golgi apparatus, mitochondrion and other organelles. The newly added cellular component terms in FFPred 3 also include some of the numerous macromolecular complexes found in them. The extensions to the other two sub-ontologies provide more specific descriptions for functional categories previously covered, and they reflect the increasing body of knowledge in areas such as organelle localization, immune system and reproductive processes, response to stimuli and chromosome segregation. A small fraction of molecular function and biological process terms have been removed (Fig. 1a,b), because they no longer occur in curated databases, mostly after the GO consortium made them obsolete. The majority of functional categories that have been retained can be predicted with negligible changes in expected accuracy, though some exceptions exist. As a consequence of the extended knowledge about human protein function since the last update, the patterns of biophysical attributes linked to terms such as sulfur compound metabolic process (GO:0006790), neurotrophin TRK receptor signaling pathway (GO:0048011), growth factor activity (GO:0008083) and protein kinase binding (GO:0019901) can be more easily identified and modelled. For other functions, such as calcium ion transport (GO:0006816), single organismal cell-cell adhesion (GO:0016337), ATPase activity (GO:0016887), and nuclease activity (GO:0004518), SVM performance has dropped, suggesting that their relationships to sequence-derived features are more complex than previously appreciated (Fig. 1c,d).

Figure 1. Comparison between FFPred 2 and FFPred 3. Extent of the overlap between FFPred 2 and FFPred 3 GO term lists in the MF (a) and BP (b) domains. Most common terms in the MF (c) and BP (d) sub-ontologies are expected to be predicted with similar accuracy, as measured by the Matthews correlation coefficient (MCC).
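Per-term SVM accuracy is reported here as the Matthews correlation coefficient (MCC). This section does not spell out the formula, so the sketch below uses the standard definition from a binary confusion matrix; the function name `mcc` and the counts are illustrative assumptions, not values from Supplementary Data file 1.

```python
import math


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Conventional choice when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts for one GO term with few positives among 1000 proteins.
imbalanced = mcc(tp=40, fp=10, tn=930, fn=20)
# A classifier that never predicts the term is 94% accurate on the same
# data, yet its MCC is 0.
always_negative = mcc(tp=0, fp=0, tn=940, fn=60)
print(round(imbalanced, 3), always_negative)
```

The second call shows why MCC suits imbalanced tasks better than plain accuracy: predicting the majority class for every protein earns no credit.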

The tool is designed with a focus on the function of human proteins, and so annotations curated for other organisms are never used for training. To learn effectively the relationship between biophysical attributes and GO terms, sufficiently large numbers of positive instances are needed, thus limiting the specificity of the functional categories that can currently be predicted. While this feature may not be desirable for all applications, its benefits in overcoming some well-known limitations of homology-based annotation transfers have already been reported [15,17]. Interestingly, previous work showed that the tool can also help annotate protein function for other eukaryotic organisms. The updated tool is publicly available on the web at http://bioinf.cs.ucl.ac.uk/pred.

Performance evaluation. The accuracy estimates in Supplementary Data file 1 are GO term-specific and point out the usefulness of FFPred 3 to prioritize human genes for downstream experimental screening when homology offers little or no help. To complement this analysis and gauge how well protein function as a whole can be predicted for such difficult cases, a timed experiment similar to the Critical Assessment of Functional Annotation challenge was conducted, by training a separate SVM library using the public databases released in November 2013. The resulting 597 classifiers were then used to assign GO terms to human proteins with no experimentally verified biological roles at that time, and their accuracy was finally measured against the UniProtKB-GOA data as of March 2016. For comparison purposes under difficult working conditions with limited or completely missing homology information, additional predictions were generated by a baseline method (Naïve), which ranks GO terms by prevalence in UniProtKB-GOA, and by a sequence similarity-based approach (BLAST), which can transfer annotations only from distantly related and experimentally characterized proteins as detailed in Methods. Other machine learning-based tools for GO term prediction from patterns of biological features could not be included in the study: ProtFun [15] has not been updated in a very long time and only covers a handful of currently valid GO terms, whereas ProFET [18] requires classifiers to be trained from scratch for all GO categories of interest.
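The Naïve baseline can be sketched in a few lines. The exact scoring used in the paper is given in Methods and is not reproduced here; the sketch assumes that the confidence score of a term is simply its relative frequency among annotated proteins, and the function name `naive_scores` and the toy annotation set are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def naive_scores(annotations: Dict[str, Set[str]]) -> List[Tuple[str, float]]:
    """Rank GO terms by the fraction of annotated proteins that carry them."""
    counts = Counter(term for terms in annotations.values() for term in terms)
    n_proteins = len(annotations)
    scored = [(term, count / n_proteins) for term, count in counts.items()]
    # Highest prevalence first; ties broken alphabetically for reproducibility.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Toy reference annotations (hypothetical proteins).
reference = {
    "P1": {"GO:0005488", "GO:0009987"},
    "P2": {"GO:0005488", "GO:0009987", "GO:0043226"},
    "P3": {"GO:0005488"},
    "P4": set(),
}
ranking = naive_scores(reference)
print(ranking)
```

The same ranked list is assigned to every target protein, which is why such a baseline favours shallow, frequently used terms regardless of the input sequence.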

The precision-recall plots in Fig. 2 and the data in Table 1 provide graphical and numerical reports on the evaluation results for the three separate GO domains, according to standard practice in the field. At high levels of recall (i.e. above roughly 40% for molecular function and 20% for the other two sub-ontologies), FFPred 3 predictions achieve higher precision values than the baseline approaches do, and the maximum F-scores in Table 1 clearly back up this observation. However, the highest scoring predictions made by BLAST for subcellular locations and by Naïve for all sub-ontologies attain higher precision than the corresponding ones by FFPred 3. This result surprisingly suggests that these less sophisticated approaches are more useful than FFPred 3 when only a handful of assays can be run on each protein. Or are they?
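The threshold sweep behind a maximum F-score can be sketched as follows. Equations (4) and (5) are not part of this section, so the code assumes the protein-centric definitions commonly used in the Critical Assessment of Functional Annotation: precision is averaged over the proteins with at least one prediction at or above the threshold, and recall over all annotated targets. All names and the toy data are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Predictions = Dict[str, Dict[str, float]]  # protein -> {GO term: confidence score}
Truth = Dict[str, Set[str]]                # protein -> non-empty set of validated terms


def precision_recall(pred: Predictions, truth: Truth, tau: float) -> Tuple[float, float]:
    """Protein-centric average precision and recall at decision threshold tau."""
    precision_sum, recall_sum, n_predicted = 0.0, 0.0, 0
    for protein, true_terms in truth.items():
        called = {t for t, s in pred.get(protein, {}).items() if s >= tau}
        hits = len(called & true_terms)
        if called:  # precision counts only proteins with a prediction
            precision_sum += hits / len(called)
            n_predicted += 1
        recall_sum += hits / len(true_terms)  # recall counts every target
    precision = precision_sum / n_predicted if n_predicted else 0.0
    return precision, recall_sum / len(truth)


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def max_f1(pred: Predictions, truth: Truth, thresholds: Iterable[float]) -> Tuple[float, float]:
    """Return (best F1, threshold at which it is reached)."""
    return max((f1(*precision_recall(pred, truth, tau)), tau) for tau in thresholds)


pred = {"P1": {"A": 0.9, "B": 0.4}, "P2": {"A": 0.6, "C": 0.3}}
truth = {"P1": {"A", "B"}, "P2": {"C"}}
best_f1, best_tau = max_f1(pred, truth, [0.3, 0.5, 0.8])
print(best_f1, best_tau)
```

As a consistency check on the published numbers, applying `f1` to the rounded FFPred values for molecular function in Table 1 (precision 0.390, recall 0.461) gives about 0.4225, which agrees with the reported 0.422 to within the rounding of the inputs.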

Figure 2. Graphical summary of the precision-recall analysis. The three panels show the evaluation results for the MF (left), BP (centre) and CC (right) domains, respectively. The full triangles mark the points associated with the maximum F-measure.

It is widely accepted that an obvious pitfall of precision-recall analysis is the total disregard of how informative predictions are. Indeed, the most confident GO term assignments made by Naïve for each test protein, namely GO:0005488 (binding), GO:0043226 (organelle) and GO:0009987 (cellular process), are far from useful in cutting down the options for the design of experiments. Nonetheless, their very shallow nature guarantees that they will eventually be confirmed for most, if not all, proteins. Furthermore, comparing the precision values achieved by different methods and plotted against the same level of recall could be more ambiguous than it looks at first sight. If the recall is less than 1.0, the predictors are evaluated on non-identical sets of target proteins, which can even be disjoint. Another confounding aspect is the number of GO term predictions above a given decision threshold made for individual proteins: predictors based on high-throughput functional data aim at high recall and generally produce longer lists of assignments than those generated by methods based on homology transfers, which tend to achieve higher precision. Finally, correctly assigning the term t to distinct proteins p and q can pose prediction challenges of diverse nature, depending on how many proteins are annotated with t, and on how closely p and q follow the patterns of features used to build the classifiers (e.g. sequence similarity, domain architecture, biological attributes, gene expression and so on). Therefore, it is useful to look at method performance from a different angle, by considering both the accuracy and the informativeness of equal numbers of high scoring predictions for each target and sub-ontology, thus reducing the above biases and yielding results that can be interpreted more clearly and more easily by non-specialists, too.
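Assessing an equal number of top-ranked predictions per target can be sketched as below. This is one possible reading rather than the authors' code: it assumes that a target is included for a given k only if it has at least k predictions and at least k validated annotations, and that the F-measure is computed per protein and then averaged. The function name `top_k_f1` and the toy data are hypothetical.

```python
from typing import Dict, Optional, Set


def top_k_f1(pred: Dict[str, Dict[str, float]], truth: Dict[str, Set[str]], k: int) -> Optional[float]:
    """Average per-protein F1 of the k highest-scoring predictions."""
    scores = []
    for protein, true_terms in truth.items():
        ranked = sorted(pred.get(protein, {}).items(), key=lambda kv: (-kv[1], kv[0]))
        if len(ranked) < k or len(true_terms) < k:
            continue  # assumed inclusion rule: k predictions and k annotations
        top = {term for term, _ in ranked[:k]}
        hits = len(top & true_terms)
        precision, recall = hits / k, hits / len(true_terms)
        scores.append(2 * precision * recall / (precision + recall) if hits else 0.0)
    return sum(scores) / len(scores) if scores else None


pred = {"P1": {"A": 0.9, "B": 0.8, "C": 0.1}, "P2": {"A": 0.7, "D": 0.6}}
truth = {"P1": {"A", "C"}, "P2": {"D", "E"}}
for k in (1, 2, 3):
    print(k, top_k_f1(pred, truth, k))
```

Fixing k for every method and target removes the effect of differing prediction list lengths, at the cost of ignoring the confidence scores beyond their rank order.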

GO domain  Method  Threshold  TP    FP     FN     NP   Precision  Recall  F1
MF         FFPred  0.581      1443  3457   1818   427  0.390      0.461   0.422
MF         BLAST   0.210      952   5740   2309   216  0.266      0.282   0.274
MF         Naïve   0.152      1081  1643   2180   454  0.397      0.391   0.394
BP         FFPred  0.576      5792  13013  14469  655  0.353      0.331   0.342
BP         BLAST   0.203      5272  83543  14989  345  0.173      0.271   0.211
BP         Naïve   0.273      4136  8423   16125  661  0.329      0.241   0.278
CC         FFPred  0.730      3800  7424   4576   985  0.369      0.500   0.425
CC         BLAST   0.204      2030  15655  6346   422  0.215      0.251   0.232
CC         Naïve   0.579      2869  3077   5507   991  0.483      0.340   0.399

Table 1. Performance comparison between FFPred 3 and the baseline prediction methods. For each method, the table reports the total numbers of true positives (TP), false positives (FP) and false negatives (FN) achieved at the decision threshold that maximises the F1 score for each GO domain. NP is the number of proteins with at least one prediction with a confidence score greater than or equal to the corresponding threshold value, which is used to calculate the average precision of each method according to equation (4) in the main text. The average recall is calculated with equation (5) using the number of proteins with annotations in the GO domain under consideration, which can be found in the section “Methods”. The latter two values are used to locate the full triangles in the precision-recall space shown in Fig. 2.

Figure 3. Comparison of the prediction accuracy and informativeness against number of top ranked predictions. The graphs on the top row compare the average F-measure of the highest scoring GO term assignments made by FFPred 3, Naïve and BLAST for the MF (left), BP (centre) and CC (right) domains, respectively. The bottom row shows the average information content of the true positives for the same predictions in the top row. Data are plotted only when there are at least 25 targets with x ∈ {1, 2, 3, 4, 5} predictions and x validated annotations or more. The label n represents the case where for each protein the number of predictions assessed equals the number of experimentally supported functions.

The top row panels in Fig. 3 summarize prediction quality in terms of the F1 measure, and the underlying precision and recall values are plotted in Figure S1. It is quite clear that FFPred 3 is superior to both Naïve and BLAST across all three GO domains, because it achieves higher recall than the other predictors do, in combination with intermediate values of precision. The data also clearly confirm the expectation that Naïve predictions are generally highly precise, but not deep enough in the GO graph to outperform the other approaches in terms of recall. The results for the CC sub-ontology are an interesting exception: the low numbers of false negatives most likely arise from the relatively shorter distances between nodes associated with experimental annotations and nodes associated with the most frequent terms in UniProtKB-GOA. The plots also clearly illustrate the limits of homology-based transfers in such challenging situations. When the evolutionary distances from previously annotated proteins are large, only the most general functional aspects are retained (e.g. catalytic or transporter activity), while the finer details diverge (e.g. the nature of the substrates and the chemistry of the reactions), thus resulting in high numbers of both false positives and false negatives, and ultimately negatively affecting precision, recall and F-measure values.

As mentioned above, the design and implementation of FFPred 3 produced a list of GO terms with varying levels of detail, so it could be questioned how informative its predictions are and how helpful they can be to experimenters. In Fig. 3, the plots in the bottom row show the average amount of useful information the highest scoring predictions would actually provide. For this purpose, the analysis only considers true positive predictions, which, however, are not regarded as equally valuable as they are in the standard precision-recall analysis. Rather, they are weighted according to their information content, which estimates their specificity and informativeness from their occurrence in the UniProtKB/SwissProt database, so that more frequent functional categories are down-weighted, and vice versa. The plots show that FFPred 3 correct predictions are consistently more specific than those generated by BLAST, which in turn are more specific than those made by Naïve. Therefore, despite the relatively low levels of term specificity, FFPred 3 can give useful hints to drive the experimental characterization of proteins, when routes alternative to homology transfers are needed. Table S1 gives some clear examples of how well FFPred 3 top-ranked predictions compare with the validated GO term assignments, which some proteins with no prior experimental functional data have recently acquired.
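The weighting of true positives by information content can be sketched as follows. The precise definition used in the paper is given in Methods; the sketch assumes the common form IC(t) = -log2 of the relative frequency of term t in the reference database, and averages it over the true positives of one protein. The function names, term counts and database size are hypothetical.

```python
import math
from typing import Dict, Set


def information_content(term_counts: Dict[str, int], n_proteins: int) -> Dict[str, float]:
    """IC(t) = -log2(frequency of t): rarer terms are more informative."""
    return {t: -math.log2(c / n_proteins) for t, c in term_counts.items() if c > 0}


def mean_tp_information(predicted: Set[str], validated: Set[str], ic: Dict[str, float]) -> float:
    """Average information content of the correctly predicted terms."""
    true_positives = predicted & validated
    if not true_positives:
        return 0.0
    return sum(ic[t] for t in true_positives) / len(true_positives)


# Hypothetical counts in a reference set of 1600 annotated proteins.
ic = information_content({"GO:0005488": 800, "GO:0008083": 25, "GO:0016887": 100}, 1600)
score = mean_tp_information(
    predicted={"GO:0005488", "GO:0008083", "GO:0016887"},
    validated={"GO:0005488", "GO:0008083"},
    ic=ic,
)
print(ic["GO:0005488"], ic["GO:0008083"], score)  # 1.0 6.0 3.5
```

With these toy counts a correct prediction of the broad term contributes 1 bit and the specific one 6 bits, so a method that only recovers broad terms scores low even when its precision is high.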

Insights into the functional consequences of alternative splicing in humans. Experimentally supported functional information for individual splice variants is generally scarce: only a handful of isoform-level GO term annotations have been reviewed and included in public databases. Even when some isoforms encoded by the same gene have been assayed, the data are still largely incomplete, because the experiments are usually focussed on a particular functional aspect. Within this active area of research, FFPred 3 and similar methods for protein function prediction have the opportunity to help investigate the functional ramifications of alternative splicing. Indeed, very often comparative sequence analysis can only suggest that the relatively small sequence changes between splice isoforms cause more or less pronounced structural and functional differences. In other words, this approach is typically unable to put forward more detailed testable hypotheses. This opens up the possibility that alternative splicing products may not encode biochemically active molecules, but rather constitute a reservoir for natural selection [19–21], a conjecture that is also hard to verify. Nevertheless, experimental evidence shows that the functional divergence between alternative splice variants can vary from subtle modulations of biochemical activities to completely antagonistic regulatory roles [22]. It is therefore interesting to investigate: i) which functional aspects tend to be more robust to splicing, and consequently conserved across splice variants of the same gene; and ii) whether canonical isoforms tend to be enriched in functions that are different from those over-represented in their alternative variants (see Methods for further details on the conservation and primarity scores).

To examine these patterns, a large-scale survey was carried out on 9,214 human proteins and their recorded splice variants using FFPred 3, under the assumption that eventually they all fulfil a physiological role in the cell. The analysis was restricted to the GO term predictions compatible with the manually curated assignments existing in UniProtKB/SwissProt, so as to reduce the effects of spurious results on the biological interpretation. The summary data in Supplementary Data file 2 indicate that the GO terms used in this study display varying levels of conservation across sets of alternatively spliced transcripts, even though it is difficult to assess the statistical significance of the observed differences. Only five predicted (and admittedly broad) functions appear to be consistently assigned to all the variants of a gene, and very few terms are highly conserved when the focus is on the most reliably predicted GO terms, i.e. those for which the SVM Matthews correlation coefficient value is in the top 50% of the distribution recorded for the corresponding sub-ontology. For instance, only six such terms annotate all isoforms of a gene in 90% or more of the cases examined. Therefore, despite the use of a consolidated set of predictions, the findings support the expectation that alternative splicing plays a role in diversifying the cellular functional repertoire. Support for this theory is strengthened by the differential associations of individual biological roles with canonical or alternative splice isoforms, as gauged by the GO term primarity scores. Supplementary Data file 3 indicates that there are many more GO categories preferentially associated with principal variants than with alternative ones, partly because these analyses are restricted to predicted functions in line with available annotations in UniProtKB/SwissProt. Nevertheless, the GO terms with high primarity scores tend to represent more constitutive cellular functions, and those with negative scores appear to be mostly associated with larger sets of alternatively spliced genes or to be induced by changes in the environment or in the cellular conditions. As mentioned above, it is difficult to draw statistically sound conclusions from this initial study: identifying the canonical isoform of each gene is still an open question, and here a rather simple and pragmatic approach was taken, just as in previous studies.
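The statement that a term annotates all isoforms of a gene in a given fraction of cases can be made concrete with a small sketch. It is not the conservation score defined in Methods, which is not reproduced in this section; it only computes, for each term, the share of genes in which the term is predicted for every isoform among the genes where it is predicted for at least one. The function name, gene names and term labels are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def fraction_fully_conserved(gene_isoform_terms: Dict[str, List[Set[str]]]) -> Dict[str, float]:
    """Per term: genes where all isoforms carry it / genes where any isoform does."""
    seen, conserved = Counter(), Counter()
    for isoforms in gene_isoform_terms.values():
        if not isoforms:
            continue  # skip genes without recorded isoform predictions
        seen.update(set.union(*isoforms))           # predicted for >= 1 isoform
        conserved.update(set.intersection(*isoforms))  # predicted for all isoforms
    return {term: conserved[term] / seen[term] for term in seen}


# Toy predictions: each gene maps to one set of predicted terms per isoform.
genes = {
    "GENE1": [{"A", "B"}, {"A"}],
    "GENE2": [{"A"}, {"A", "C"}, {"A"}],
}
fractions = fraction_fully_conserved(genes)
print(fractions)
```

Terms with a fraction of 0.9 or more under this simplified measure would correspond to the kind of threshold quoted in the text, although the published figures rest on the authors' own score and filtering.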

To emphasize the unique advantages that analyzing biological features can offer, Fig. 4 gives some insight into their relationship with some of the most conserved functions in each GO domain (see Methods for more details). The heatmap makes it possible to link the over- and under-representation of specific biophysical attributes with the conservation of particular functional aspects. Similarly, Figs 5 and 6 show the extent of positive or negative correlation between sequence-derived feature groups and the GO terms that are preferentially associated with principal or alternative splice variants, respectively. The results generally reflect well-established trends between functional categories and the occurrence or lack of intrinsically disordered residues, transmembrane helices and
