FFPred 3: feature-based function prediction for all Gene Ontology domains.

Abstract
Predicting protein function has been a major goal of bioinformatics for several decades, and it has gained fresh momentum thanks to recent community-wide blind tests aimed at benchmarking available tools on a genomic scale. Sequence-based predictors, especially those performing homology-based transfers, remain the most popular but increasing understanding of their limitations has stimulated the development of complementary approaches, which mostly exploit machine learning. Here we present FFPred 3, which is intended for assigning Gene Ontology terms to human protein chains, when homology with characterized proteins can provide little aid. Predictions are made by scanning the input sequences against an array of Support Vector Machines (SVMs), each examining the relationship between protein function and biophysical attributes describing secondary structure, transmembrane helices, intrinsically disordered regions, signal peptides and other motifs. This update features a larger SVM library that extends its coverage to the cellular component sub-ontology for the first time, prompted by the establishment of a dedicated evaluation category within the Critical Assessment of Functional Annotation. The effectiveness of this approach is demonstrated through benchmarking experiments, and its usefulness is illustrated by analysing the potential functional consequences of alternative splicing in human and their relationship to patterns of biological features.

Scientific Reports | 6:31865 | DOI: 10.1038/srep31865
www.nature.com/scientificreports
Domenico Cozzetto*, Federico Minneci*, Hannah Currant & David T. Jones
Thanks to a combination of experimental assays and computational studies, knowledge about protein function has been steadily accumulating in public databases, where it is commonly described through the Gene Ontology (GO) [1]. On the one hand, hypothesis-driven research has traditionally led to the thorough characterization of one or a few proteins at a time. On the other hand, high-throughput technologies have opened the way to very large-scale exploratory surveys to study biological processes, identify binding partners, or establish subcellular locations. Meanwhile, some homology-based approaches for annotation transfers have developed enough to produce fairly confident results. The GO consortium, for instance, makes wide use of a semi-automated tool for phylogenetic analysis and functional inference [2], and of mappings from protein domain families to GO terms that are valid for all their members [3]. Despite these multi-pronged efforts, however, a substantial fraction of deposited sequences still have no functional annotation at all, and the remaining ones usually lack assignments for at least one GO domain. When available, this information may not be at the finest level of detail possible, not only because of the way some electronically inferred annotations are generated, but also because of the varying levels of resolution characterizing experimental results [4,5]. Finally, nature can still spring surprises: protein moonlighting demonstrates that novel functions can still await discovery even for well-researched proteins [6].
One way to fill in some of these gaps employs machine learning to examine diverse biological data types separately or in combination, and to provide functional hypotheses that complement homology-based annotation transfers [7–9]. In particular, over the years several supervised methods have been devised for function prediction from amino acid sequences, which are easier to collect than structural data or genome-wide measurements of gene expression or protein-protein interactions. GOStruct [10] and FANN-GO [11], for instance, make GO term assignments by analysing the patterns of BLAST [12] E-values to experimentally characterized proteins using structured Support Vector Machines (SVM) and multioutput neural networks, respectively. Given the computational complexity of training classifiers with multiple correlated outputs, it is difficult to learn the relationship between the input features and the whole GO; the proponents have therefore adopted workarounds such as reducing the number of output terms and ensemble modelling. Rather than tackling this complex structured learning problem, other researchers have tested with success the possibility of converting it into a set of simpler binary classification tasks. This approach has recently allowed our group to train GO term-specific neural networks from features describing the results of profile-profile comparisons [13].

Bioinformatics Group, Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, UK. * These authors contributed equally to the work. Correspondence and requests for materials should be addressed to D.T.J. (email: d.t.jones@ucl.ac.uk)
Received: 05 May 2016
Accepted: 25 July 2016
Published: 26 August 2016

Alignment-derived features, such as similarity scores, sequence coverage and E-values, can help learn which sequence similarity patterns correlate with the conservation of individual annotations, thus allowing more effective control over homology-based annotation transfers. Complementary efforts have investigated the usefulness of biophysical attributes to make homology-free inferences, under the assumption that proteins with similar functions would have similar biological features despite the lack of significant sequence similarities. For example, the occurrence of signal peptides gives useful hints about protein subcellular location, and also limits the number of their molecular functions and of the biological processes they take part in. The idea was first implemented in ProtFun, which is based on neural networks trained for the functional classification of protein sequences from similarities in amino acid composition, and content of signal peptides, trans-membrane helices, post-translationally modified residues as well as other biological features [14,15]. The observation that the length and position of intrinsically disordered protein regions strongly correlate with some molecular activities and biological processes led to an expanded set of sequence-derived features, which FFPred scans through a library of GO term-specific SVMs to annotate protein chains [16,17]. A more recent study has confirmed the effectiveness of this feature-based approach with the use of random forests for supervised learning [18].
In this paper, we describe the latest FFPred release, which updates the previous one with an extended vocabulary spanning all three GO domains, reflecting the increasing attention to cellular component annotations, as evidenced by recent experiments in the Critical Assessment of Functional Annotation initiative. We evaluate FFPred 3 prediction accuracy using two complementary approaches and describe its improvements over the previous version. Finally, we show how its predictions can help get a glimpse into the effects of alternative splicing on human protein function. The results show patterns of functional conservation and variation consistent with the presence or absence of particular biophysical attributes and with general biological knowledge.
Results and Discussion
Summary of tool updates. Thanks to the continued growth of annotation databases, the latest FFPred release features a GO term vocabulary, which spans all three GO domains for the first time and is almost twice the size of that in the previous update. Supplementary Data file 1 lists the 868 GO terms for which a dedicated SVM is available, along with the classification accuracy estimated from the validation experiments following the training procedures. The new release still makes use of SVMs, which are known to successfully handle imbalanced classification tasks (typical in computational biology), where it is extremely important to allow for error control and avoid overfitting to known observations. Subcellular localization prediction has been the focus of many previous studies, which mostly concentrated on the well-known compartments of eukaryotic cells, such as nucleus, cytosol, endoplasmic reticulum, Golgi apparatus, mitochondrion and other organelles. The newly added cellular component terms in FFPred 3 also include some of the numerous macromolecular complexes found in them. The extensions to the other two sub-ontologies provide more specific descriptions for functional categories previously covered, and they reflect the increasing body of knowledge in areas such as organelle localization, immune system and reproductive processes, response to stimuli and chromosome segregation. A small fraction of molecular function and biological process terms have been removed (Fig. 1a,b), because they no longer occur in curated databases, mostly after the GO consortium made them obsolete. The majority of functional categories that have been retained can be predicted with negligible changes in expected accuracy, though some exceptions exist. As a consequence of the extended knowledge about human protein function since the last update, the patterns of biophysical attributes linked to terms such as sulfur compound metabolic process (GO:0006790), neurotrophin TRK receptor signaling pathway (GO:0048011), growth factor activity (GO:0008083) and protein kinase binding (GO:0019901) can be more easily identified and modelled. For other functions, such as calcium ion transport (GO:0006816), single organismal cell-cell adhesion (GO:0016337), ATPase activity (GO:0016887), and nuclease activity (GO:0004518), SVM performance has dropped, suggesting that their relationships to sequence-derived features are more complex than previously appreciated (Fig. 1c,d).

Figure 1. Comparison between FFPred 2 and FFPred 3. Extent of the overlap between FFPred 2 and FFPred 3 GO term lists in the MF (a) and BP (b) domains. Most common terms in the MF (c) and BP (d) sub-ontologies are expected to be predicted with similar accuracy, as measured by the MCC.

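Because the accuracy of each term-specific SVM is summarised by the Matthews correlation coefficient (MCC), a short Python sketch of the standard MCC formula may help fix ideas. It is not taken from the FFPred code base: the function name, the zero-denominator convention and the toy counts are illustrative assumptions.

```python
import math

def matthews_corrcoef(tp, fp, tn, fn):
    """Standard MCC computed from the four confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: return 0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical counts for one GO term: few positives among many negatives.
example_mcc = matthews_corrcoef(tp=40, fp=20, tn=920, fn=20)
print(round(example_mcc, 3))
```

Unlike plain accuracy, the MCC stays low when a classifier simply predicts the majority (negative) class, which is why it suits the imbalanced per-term tasks discussed above.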
The tool is designed with a focus on the function of human proteins, and so annotations curated for other organisms are never used for training. To learn effectively the relationship between biophysical attributes and GO terms, sufficiently large numbers of positive instances are needed, thus limiting the specificity of the functional categories that can be currently predicted. While this feature may not be desirable for all applications, its benefits in overcoming some well-known limitations of homology-based annotation transfers have already been reported [15,17]. Interestingly, previous work showed that the tool can also help annotate protein function for other eukaryotic organisms. The updated tool is publicly available on the web at http://bioinf.cs.ucl.ac.uk/pred.
Performance evaluation. The accuracy estimates in Supplementary Data file 1 are GO term-specific and point out the usefulness of FFPred 3 to prioritize human genes for downstream experimental screening when homology offers little or no help. To complement this analysis and gauge how well protein function as a whole can be predicted for such difficult cases, a timed experiment similar to the Critical Assessment of Functional Annotation challenge was conducted, by training a separate SVM library using the public databases released in November 2013. The resulting 597 classifiers were then used to assign GO terms to human proteins with no experimentally verified biological roles at that time, and their accuracy was finally measured against the UniProtKB-GOA data as of March 2016. For comparison purposes under difficult working conditions with limited or completely missing homology information, additional predictions were generated by a baseline method (Naïve), which ranks GO terms by prevalence in UniProtKB-GOA, and by a sequence similarity-based approach (BLAST), which can transfer annotations only from distantly related and experimentally characterized proteins as detailed in Methods. Other machine-learning based tools for GO term prediction from patterns of biological features could not be included in the study: ProtFun [15] has not been updated in a very long time and only covers a handful of currently valid GO terms, whereas ProFET [18] requires training classifiers from scratch for all GO categories of interest.
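The Naïve baseline described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy annotation table, the function names and the alphabetical tie-breaking are hypothetical, and the sketch does not claim to reproduce the procedure given in Methods (for instance, how annotations are propagated to ancestor terms).

```python
from collections import Counter

def naive_scores(annotations):
    """Score each GO term by the fraction of annotated proteins that carry it."""
    counts = Counter(term for terms in annotations.values() for term in terms)
    n_proteins = len(annotations)
    return {term: count / n_proteins for term, count in counts.items()}

def naive_predict(scores, top_k=3):
    """Return the same ranked list for every target, whatever its sequence."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]

# Hypothetical reference annotations (protein -> set of GO terms).
annotations = {
    "P1": {"GO:0005488", "GO:0009987"},
    "P2": {"GO:0005488"},
    "P3": {"GO:0005488", "GO:0009987", "GO:0016887"},
}
scores = naive_scores(annotations)
top_terms = naive_predict(scores, top_k=2)
print(top_terms)
```

Since the ranking ignores the query protein altogether, such a baseline can only ever return the most frequent, and hence least specific, terms first.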
The precision-recall plots in Fig. 2 and the data in Table 1 provide graphical and numerical reports on the evaluation results for the three separate GO domains, according to standard practice in the field. At high levels of recall (i.e. above roughly 40% for molecular function and 20% for the other two sub-ontologies), FFPred 3 predictions achieve higher precision values than the baseline approaches do, and the maximum F-scores in Table 1 clearly back up this observation. However, the highest scoring predictions made by BLAST for subcellular locations and by Naïve for all sub-ontologies attain higher precision than the corresponding ones by FFPred 3. This result surprisingly suggests that these less sophisticated approaches are more useful than FFPred 3, when only a handful of assays can be run on each protein. Or are they?
Figure 2. Graphical summary of the precision–recall analysis. The three panels show the evaluation results for the MF (left), BP (centre) and CC (right) domains, respectively. The full triangles mark the points associated with the maximum F-measure.

It is widely accepted that an obvious pitfall of precision-recall analysis is the total disregard of how informative predictions are. The most confident GO term assignments made by Naïve for each test protein, namely GO:0005488 (binding), GO:0043226 (organelle) and GO:0009987 (cellular process), are indeed far from useful in cutting down the options for the design of experiments. Nonetheless, their very shallow nature guarantees that they will eventually be confirmed for most, if not all, proteins. Furthermore, comparing the precision values achieved by different methods and plotted against the same level of recall could be more ambiguous than it looks at first sight. If the recall is less than 1.0, the predictors are evaluated on non-identical sets of target proteins, which can even be disjoint. Another confounding aspect is the number of GO term predictions above a given decision threshold made for individual proteins: predictors based on high-throughput functional data aim at high recall and generally produce longer lists of assignments than those generated by methods based on homology transfers, which tend to achieve higher precision. Finally, correctly assigning the term t to distinct proteins p and q can pose prediction challenges of diverse nature, depending on how many proteins are annotated with t, and on how closely p and q follow the patterns of features used to build the classifiers (e.g. sequence similarity, domain architecture, biological attributes, gene expression and so on). Therefore, it is useful to look at method performance from a different angle, by considering both the accuracy and the informativeness of equal numbers of high scoring predictions for each target and sub-ontology, thus reducing the above biases and yielding results that can be interpreted more clearly and more easily by non-specialists, too.
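The idea of assessing equal numbers of top-ranked predictions per target can be illustrated with a short Python sketch. The data layout, the function name and the rule of skipping targets without predictions or annotations are assumptions made for this example; the selection criteria actually used for the plots are those stated in the caption of Fig. 3.

```python
def top_k_f1(predictions, truth, k):
    """Average precision, recall and F1 over proteins, using each one's k best terms.

    predictions: {protein: {go_term: score}}; truth: {protein: set of go_terms}.
    """
    precisions, recalls, f1s = [], [], []
    for protein, true_terms in truth.items():
        scored = predictions.get(protein, {})
        # Rank by decreasing score; break ties alphabetically for reproducibility.
        ranked = sorted(scored, key=lambda term: (-scored[term], term))[:k]
        if not ranked or not true_terms:
            continue  # assumption: skip targets lacking predictions or annotations
        tp = len(set(ranked) & true_terms)
        precision = tp / len(ranked)
        recall = tp / len(true_terms)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(2 * precision * recall / (precision + recall) if tp else 0.0)
    n = len(f1s)
    if n == 0:
        return 0.0, 0.0, 0.0
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Hypothetical scores and validated annotations for two targets.
predictions = {
    "P1": {"A": 0.9, "B": 0.8, "C": 0.1},
    "P2": {"A": 0.7, "D": 0.6},
}
truth = {"P1": {"A", "C"}, "P2": {"D"}}
for k in (1, 2):
    print(k, top_k_f1(predictions, truth, k))
```

Fixing k for every method means that all predictors are compared on the same targets and the same number of assignments, which removes two of the confounding factors listed above.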
e top row panels in Fig.3 summarize prediction quality in terms of F
1
measure and the underlying pre-
cision and recall values are plotted in Figure S1. It is quite clear that FFPred 3 is superior to both Naïve and
BLAST across all three GO domains, because it achieves higher recall than the other predictors do, in combina-
tion with intermediate values of precision. e data also clearly conrm the expectation that Naïve predictions
GO
domain Method reshold TP FP FN NP Precision Recall F
1
MF
FFPred 0.581 1443 3457 1818 427 0.390 0.461 0.422
BLAST 0.210 952 5740 2309 216 0.266 0.282 0.274
Naïve 0.152 1081 1643 2180 454 0.397 0.391 0.394
BP
FFPred 0.576 5792 13013 14469 655 0.353 0.331 0.342
BLAST 0.203 5272 83543 14989 345 0.173 0.271 0.211
Naïve 0.273 4136 8423 16125 661 0.329 0.241 0.278
CC
FFPred 0.730 3800 7424 4576 985 0.369 0.500 0.425
BLAST 0.204 2030 15655 6346 422 0.215 0.251 0.232
Naïve 0.579 2869 3077 5507 991 0.483 0.340 0.399
Table 1. Performance comparison between FFPred 3 and the baseline prediction methods. For each
method, the table reports the total numbers of true positives (TP), false positives (FP) and false negatives
(FN) each method achieves at the decision threshold that maximises the F
1
score for each GO domain. NP
is the number of proteins with at least one prediction with a condence score greater than or equal to the
corresponding threshold value, which is used to calculate the average precision of each method according to
equation(4) in the main text. e average recall is calculated using equation(5) using the number of proteins
with annotations in the GO domain under consideration, which can be found in the section “Methods”. e
latter two values are used to locate the full triangles in the precision-recall space shown in Fig.2.
Figure 3. Comparison of the prediction accuracy and informativeness against number of top ranked
predictions. e graphs on the top row compare the average F-measure of the highest scoring GO term
assignments made by FFPred 3, Naïve and BLAST for the MF (le), BP (centre) and CC (right) domains,
respectively. e bottom row shows the average information content of the true positives for the same
predictions in the top row. Data are plotted only when there are at least 25 targets with x {1, 2, 3, 4, 5}
predictions and
x
validated annotations or more. e label n represents the case where for each protein the
number of predictions assessed equals the number of experimentally supported functions.

www.nature.com/scientificreports/
5
Scientific RepoRts | 6:31865 | DOI: 10.1038/srep31865
generally are highly precise, but not deep enough in the GO graph to outperform the other approaches in terms
of recall. e results for the CC sub-ontology are an interesting exception: the low numbers of false negatives
most likely arise from the relatively shorter distances between nodes associated with experimental annotations
and nodes associated with the most frequent terms in UniProtKB-GOA. e plots also clearly illustrate the limits
of homology-based transfers in such challenging situations. When the evolutionary distances from previously
annotated proteins are large, only the most general functional aspects are retained (e.g. catalytic or transporter
activity), while the ner details diverge (e.g. the nature of the substrates and the chemistry of the reactions), thus
resulting in high numbers of both false positives and false negatives, and ultimately aecting negatively precision,
recall and F-measure values.
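The caption of Table 1 describes protein-centric averages: precision is averaged over the NP proteins with at least one prediction at or above the threshold, whereas recall is averaged over all proteins with annotations in the domain. The Python sketch below follows that wording; equations (4) and (5) are not reproduced in this section, so the code is an approximation under that reading, and the function names and toy data are hypothetical.

```python
def protein_centric_pr(predictions, truth, threshold):
    """Return (average precision, average recall, NP) at one decision threshold.

    predictions: {protein: {go_term: score}}; truth: {protein: non-empty set of terms}.
    """
    precision_sum, recall_sum, n_predicted = 0.0, 0.0, 0
    for protein, true_terms in truth.items():
        scored = predictions.get(protein, {})
        called = {term for term, score in scored.items() if score >= threshold}
        tp = len(called & true_terms)
        if called:
            # Only proteins with at least one call contribute to precision (NP).
            n_predicted += 1
            precision_sum += tp / len(called)
        # Every annotated protein contributes to recall, predicted or not.
        recall_sum += tp / len(true_terms)
    precision = precision_sum / n_predicted if n_predicted else 0.0
    recall = recall_sum / len(truth)
    return precision, recall, n_predicted

def f_max(predictions, truth, thresholds):
    """Return (best F1, threshold) over the candidate thresholds."""
    best_f, best_threshold = 0.0, None
    for threshold in thresholds:
        p, r, _ = protein_centric_pr(predictions, truth, threshold)
        f = 2 * p * r / (p + r) if p + r else 0.0
        if f > best_f:
            best_f, best_threshold = f, threshold
    return best_f, best_threshold

# Hypothetical scores and validated annotations for two targets.
predictions = {
    "P1": {"A": 0.9, "B": 0.8, "C": 0.1},
    "P2": {"A": 0.7, "D": 0.6},
}
truth = {"P1": {"A", "C"}, "P2": {"D"}}
print(protein_centric_pr(predictions, truth, 0.75))
print(f_max(predictions, truth, [0.05, 0.5, 0.75]))
```

At the 0.75 threshold only one of the two toy proteins receives any prediction, so precision and recall are averaged over different numbers of proteins, which is the asymmetry that the NP column of Table 1 makes explicit.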
As mentioned above, the design and implementation of FFPred 3 produced a list of GO terms with varying levels of detail, so it could be questioned how informative its predictions are and how helpful they can be to experimenters. In Fig. 3, the plots in the bottom row show the average amount of useful information the highest scoring predictions would actually provide. For this purpose, the analysis only considers true positive predictions, which, unlike in the standard precision-recall analysis, are not regarded as equally valuable. They are rather weighted according to their information content, which estimates their specificity and informativeness from their occurrence in the UniProtKB/SwissProt database, so that more frequent functional categories are down-weighted, and vice versa. The plots show that correct FFPred 3 predictions are consistently more specific than those generated by BLAST, which in turn are more specific than those made by Naïve. Therefore, despite the relatively low levels of term specificity, FFPred 3 can give useful hints to drive the experimental characterization of proteins, when routes alternative to homology transfers are needed. Table S1 gives some clear examples of how well FFPred 3 top-ranked predictions compare with the validated GO term assignments, which some proteins with no prior experimental functional data have recently acquired.
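The weighting of true positives by information content can be sketched in Python as follows. The sketch assumes the common definition IC(t) = −log2 of the fraction of annotated proteins carrying term t; the exact formula used in the paper is given in Methods and may differ, and the function names and toy annotations are hypothetical.

```python
import math

def information_content(annotations):
    """IC(t) = -log2(fraction of annotated proteins that carry term t)."""
    n_proteins = len(annotations)
    counts = {}
    for terms in annotations.values():
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
    return {term: -math.log2(count / n_proteins) for term, count in counts.items()}

def mean_tp_information(predicted, true_terms, ic):
    """Average IC over the predicted terms that are confirmed (true positives)."""
    hits = [ic[term] for term in predicted if term in true_terms and term in ic]
    return sum(hits) / len(hits) if hits else 0.0

# Hypothetical reference annotations: term A is ubiquitous, term C is rare.
annotations = {
    "P1": {"A", "B", "C"},
    "P2": {"A", "B"},
    "P3": {"A"},
    "P4": {"A"},
}
ic = information_content(annotations)
print(mean_tp_information(["A", "C"], {"A", "C"}, ic))
```

A term carried by every protein has zero information content, so a method that only confirms such terms scores well on precision but contributes nothing on this scale.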
Insights into the functional consequences of alternative splicing in humans. Experimentally supported functional information for individual splice variants is generally scarce: only a handful of isoform-level GO term annotations have been reviewed and included in public databases. Even when some isoforms encoded by the same gene have been assayed, the data are still largely incomplete, because the experiments are usually focussed on a particular functional aspect. Within this active area of research, FFPred 3 and similar methods for protein function prediction have the opportunity to help investigate the functional ramifications of alternative splicing. Indeed, very often comparative sequence analysis can only suggest that the relatively small sequence changes between splice isoforms cause more or less pronounced structural and functional differences. In other words, this approach is typically unable to put forward more detailed testable hypotheses. This opens up the possibility that alternative splicing products may not encode biochemically active molecules, but rather constitute a reservoir for natural selection [19–21], a conjecture that is also hard to verify. Nevertheless, experimental evidence shows that the functional divergence between alternative splice variants can vary from subtle modulations of biochemical activities to completely antagonistic regulatory roles [22]. It is therefore interesting to investigate: i) which functional aspects tend to be more robust to splicing, and consequently conserved across splice variants of the same gene; and ii) whether canonical isoforms tend to be enriched in functions that are different from those over-represented in their alternative variants (see Methods for further details on the conservation and primarity scores).
To examine these patterns, a large-scale survey was carried out on 9,214 human proteins and their recorded splice variants using FFPred 3, under the assumption that eventually they all fulfil a physiological role in the cell. The analysis was restricted to the GO term predictions compatible with the manually curated assignments existing in UniProtKB/SwissProt, so as to reduce the effects of spurious results on the biological interpretation. The summary data in Supplementary Data file 2 indicate that the GO terms used in this study display varying levels of conservation across sets of alternatively spliced transcripts, even though it is difficult to assess the statistical significance of the observed differences. Only five predicted (and admittedly broad) functions appear to be consistently assigned to all the variants of a gene, and very few of them are highly conserved, when the focus is on the most reliably predicted GO terms, i.e. those whose SVM Matthews correlation coefficient value is in the top 50% of the distribution recorded for the corresponding sub-ontology. For instance, only six of such terms annotate all isoforms of a gene in 90% or more of the cases examined. Therefore, despite the use of a consolidated set of predictions, the findings support the expectation that alternative splicing plays a role in diversifying the cellular functional repertoire. Support for this theory is strengthened by the differential associations of individual biological roles with canonical or alternative splice isoforms, as gauged by the GO term primarity scores. The data in Supplementary Data file 3 indicate that there are many more GO categories preferentially associated with principal variants than with alternative ones, partly because these analyses are restricted to predicted functions in line with available annotations in UniProtKB/SwissProt. Nevertheless, the GO terms with high primarity scores tend to represent more constitutive cellular functions, and those with negative scores appear to be mostly associated with larger sets of alternatively spliced genes or to be induced by changes in the environment or in the cellular conditions. As mentioned above, it is difficult to draw statistically sound conclusions from this initial study: identifying the canonical isoform of each gene is still an open question, and here a rather simple and pragmatic approach was taken just like in previous studies.
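One simple way to quantify how often a term "annotates all isoforms of a gene" is sketched below in Python. It is explicitly not the conservation score of the paper, which is defined in Methods and not reproduced here: the sketch merely counts, for each GO term, the fraction of multi-isoform genes carrying the term in at least one isoform for which every isoform carries it. The function name, data layout and toy predictions are assumptions.

```python
def term_conservation(isoform_terms):
    """Fraction of genes in which a term is predicted for every isoform.

    isoform_terms: {gene: {isoform_id: set of predicted GO terms}}.
    """
    genes_with_term, genes_with_term_in_all = {}, {}
    for isoforms in isoform_terms.values():
        term_sets = list(isoforms.values())
        if len(term_sets) < 2:
            continue  # genes with a single isoform carry no splicing signal
        in_any = set().union(*term_sets)
        in_all = set.intersection(*term_sets)
        for term in in_any:
            genes_with_term[term] = genes_with_term.get(term, 0) + 1
            if term in in_all:
                genes_with_term_in_all[term] = genes_with_term_in_all.get(term, 0) + 1
    return {
        term: genes_with_term_in_all.get(term, 0) / n
        for term, n in genes_with_term.items()
    }

# Hypothetical predictions for three genes and their isoforms.
isoform_terms = {
    "gene1": {"iso1": {"A", "B"}, "iso2": {"A"}},
    "gene2": {"iso1": {"A", "B"}, "iso2": {"A", "B"}, "iso3": {"A", "B", "C"}},
    "gene3": {"iso1": {"A"}},
}
conservation = term_conservation(isoform_terms)
print(sorted(conservation.items()))
```

Under this reading, a term with a value of 0.9 or more would correspond to one that annotates all isoforms of a gene in at least 90% of the cases examined.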
To emphasize the unique advantages that analyzing biological features can offer, Fig. 4 gives some insight into their relationship with some of the most conserved functions in each GO domain (see Methods for more details). The heatmap makes it possible to link the over- and under-representation of specific biophysical attributes with the conservation of particular functional aspects. Similarly, Figs 5 and 6 show the extent of positive or negative correlation between sequence-derived feature groups and the GO terms that are preferentially associated with principal or alternative splice variants, respectively. The results generally reflect well-established trends between functional categories and the occurrence or lack of intrinsically disordered residues, transmembrane helices and
