
FFPred 3: feature-based function prediction for all Gene Ontology domains.

26 Aug 2016-Scientific Reports (NATURE PUBLISHING GROUP)-Vol. 6, Iss: 1, pp 31865-31865
TL;DR: This update features a larger SVM library that extends its coverage to the cellular component sub-ontology for the first time, prompted by the establishment of a dedicated evaluation category within the Critical Assessment of Functional Annotation.
Abstract: Predicting protein function has been a major goal of bioinformatics for several decades, and it has gained fresh momentum thanks to recent community-wide blind tests aimed at benchmarking available tools on a genomic scale. Sequence-based predictors, especially those performing homology-based transfers, remain the most popular but increasing understanding of their limitations has stimulated the development of complementary approaches, which mostly exploit machine learning. Here we present FFPred 3, which is intended for assigning Gene Ontology terms to human protein chains, when homology with characterized proteins can provide little aid. Predictions are made by scanning the input sequences against an array of Support Vector Machines (SVMs), each examining the relationship between protein function and biophysical attributes describing secondary structure, transmembrane helices, intrinsically disordered regions, signal peptides and other motifs. This update features a larger SVM library that extends its coverage to the cellular component sub-ontology for the first time, prompted by the establishment of a dedicated evaluation category within the Critical Assessment of Functional Annotation. The effectiveness of this approach is demonstrated through benchmarking experiments, and its usefulness is illustrated by analysing the potential functional consequences of alternative splicing in human and their relationship to patterns of biological features.


Scientific Reports | 6:31865 | DOI: 10.1038/srep31865
www.nature.com/scientificreports
Domenico Cozzetto*, Federico Minneci*, Hannah Currant & David T. Jones
Thanks to a combination of experimental assays and computational studies, knowledge about protein function has been steadily accumulating in public databases, where it is commonly described through the Gene Ontology [1] (GO). On the one hand, hypothesis-driven research has traditionally led to the thorough characterization of one or few proteins at a time. On the other hand, high-throughput technologies have opened the way to very large-scale exploratory surveys to study biological processes, identify binding partners, or establish subcellular locations. Meanwhile, some homology-based approaches for annotation transfers have developed enough to produce fairly confident results. The GO consortium, for instance, makes wide use of a semi-automated tool for phylogenetic analysis and functional inference [2], and of mappings from protein domain families to GO terms that are valid for all their members [3]. Despite these multi-pronged efforts, however, a substantial fraction of deposited sequences still have no functional annotation at all, and the remaining ones usually lack assignments for at least one GO domain. When available, this information may not be at the finest level of detail possible, not only because of the way some electronically inferred annotations are generated, but also because of the varying levels of resolution characterizing experimental results [4,5]. Finally, nature can still spring surprises: protein moonlighting demonstrates that novel functions can still await discovery even for well-researched proteins [6].
One way to ll in some of these gaps employs machine learning to examine diverse biological data types
separately or in combination, and to provide functional hypotheses that complement homology-based annota-
tion transfers
7–9
. In particular, over the years several supervised methods have been devised for function predic-
tion from amino acid sequences, which are easier to collect than structural data or genome-wide measurements
of gene expression or protein-protein interactions. GOStruct
10
and FANN-GO
11
, for instance, make GO term
assignments by analysing the patterns of BLAST
12
E-values to experimentally characterized proteins using struc-
tured Support Vector Machines (SVM) and multioutput neural networks, respectively. Given the computational
complexity of training classiers with multiple correlated outputs, it is dicult to learn the relationship between
the input features and the whole GO; the proponents have therefore adopted workarounds such as reducing the
number of output terms and ensemble modelling. Rather than tackling this complex structured learning problem,
Bioinformatics Group, Department of Computer Science, University College London, Gower Street, London, WC1E
6BT, UK. * These authors contributed equally to the work. Correspondence and requests for materials should be
addressed to D.T.J. (email: d.t.jones@ucl.ac.uk)
received: 05 May 2016
accepted: 25 July 2016
Published: 26 August 2016
OPEN

www.nature.com/scientificreports/
2
Scientific RepoRts | 6:31865 | DOI: 10.1038/srep31865
other researchers have tested with success the possibility of converting it into a set of simpler binary classication
tasks. is approach has recently allowed our group to train GO term-specic neural networks from features
describing the results of prole-prole comparisons
13
.
Alignment-derived features, such as similarity scores, sequence coverage and E-values, can help learn which sequence similarity patterns correlate with the conservation of individual annotations, thus allowing more effective control on homology-based annotation transfers. Complementary efforts have investigated the usefulness of biophysical attributes to make homology-free inferences, under the assumption that proteins with similar functions would have similar biological features despite the lack of significant sequence similarities. For example, the occurrence of signal peptides gives useful hints about protein subcellular location, and also limits the range of molecular functions the proteins can have and of the biological processes they take part in. The idea was first implemented in ProtFun, which is based on neural networks trained for the functional classification of protein sequences from similarities in amino acid composition, and content of signal peptides, trans-membrane helices, post-translationally modified residues as well as other biological features [14,15]. The observation that the length and position of intrinsically disordered protein regions strongly correlate with some molecular activities and biological processes led to an expanded set of sequence-derived features, which FFPred scans through a library of GO term-specific SVMs to annotate protein chains [16,17]. A more recent study has confirmed the effectiveness of this feature-based approach with the use of random forests for supervised learning [18].
In this paper, we describe the latest FFPred release, which updates the previous one with an extended vocabulary spanning all three GO domains, reflecting the increasing attention to cellular component annotations, as evidenced by recent experiments in the Critical Assessment of Functional Annotation initiative. We evaluate FFPred 3 prediction accuracy using two complementary approaches and describe its improvements over the previous version. Finally, we show how its predictions can help get a glimpse into the effects of alternative splicing on human protein function. The results show patterns of functional conservation and variation consistent with the presence or absence of particular biophysical attributes and with general biological knowledge.
Results and Discussion
Summary of tool updates. Thanks to the continued growth of annotation databases, the latest FFPred release features a GO term vocabulary that spans all three GO domains for the first time and is almost twice the size of that in the previous update. Supplementary Data file 1 lists the 868 GO terms for which a dedicated SVM is available, along with the classification accuracy estimated from the validation experiments following the training procedures. The new release still makes use of SVMs, which are known to successfully handle imbalanced classification tasks (typical in computational biology), where it is extremely important to allow for error control and avoid overfitting to known observations. Subcellular localization prediction has been the focus of many previous studies, which mostly dealt with the well-known compartments of eukaryotic cells, such as nucleus, cytosol, endoplasmic reticulum, Golgi apparatus, mitochondrion and other organelles. The newly added cellular component terms in FFPred 3 also include some of the numerous macromolecular complexes found in them. The extensions to the other two sub-ontologies provide more specific descriptions for functional categories previously covered, and they reflect the increasing body of knowledge in areas such as organelle localization, immune system and reproductive processes, response to stimuli and chromosome segregation. A small fraction of molecular function and biological process terms have been removed (Fig. 1a,b), because they no longer occur in curated databases, mostly after the GO consortium made them obsolete. The majority of functional categories that have been retained can be predicted with negligible changes in expected accuracy, though some exceptions exist. As a consequence of the extended knowledge about human protein function since the last update, the patterns of biophysical attributes linked to terms such as sulfur compound metabolic process (GO:0006790), neurotrophin TRK receptor signaling pathway (GO:0048011), growth factor activity (GO:0008083) and protein kinase binding (GO:0019901) can be more easily identified and modelled. For other functions, such as calcium ion transport (GO:0006816), single organismal cell-cell adhesion (GO:0016337), ATPase activity (GO:0016887), and nuclease activity (GO:0004518), SVM performance has dropped, suggesting that their relationships to sequence-derived features are more complex than previously appreciated (Fig. 1c,d).

Figure 1. Comparison between FFPred 2 and FFPred 3. Extent of the overlap between FFPred 2 and FFPred 3 GO term lists in the MF (a) and BP (b) domains. Most common terms in the MF (c) and BP (d) sub-ontologies are expected to be predicted with similar accuracy, as measured by the MCC.
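The per-term accuracy comparison above relies on the Matthews correlation coefficient (MCC). As a minimal sketch, assuming the standard confusion-matrix definition of the MCC, the following Python function shows how the score could be computed for a single GO term-specific classifier. The function name, the zero-denominator convention and the counts are illustrative assumptions, not values from the paper.

```python
import math

def matthews_corrcoef(tp, fp, tn, fn):
    """Return the MCC of one binary classifier from its confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Convention assumed here: report 0.0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator

# Hypothetical counts for one GO term: few positives among many negatives.
print(round(matthews_corrcoef(tp=30, fp=20, tn=930, fn=20), 3))
```

Because the MCC uses all four cells of the confusion matrix, it stays informative when positives are rare, which is the typical situation for individual GO terms.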
The tool is designed with a focus on the function of human proteins, and so annotations curated for other organisms are never used for training. To learn effectively the relationship between biophysical attributes and GO terms, sufficiently large numbers of positive instances are needed, thus limiting the specificity of the functional categories that can be currently predicted. While this feature may not be desirable for all applications, its benefits in overcoming some well-known limitations of homology-based annotation transfers have already been reported [15,17]. Interestingly, previous work showed that the tool can also help annotate protein function for other eukaryotic organisms. The updated tool is publicly available on the web at http://bioinf.cs.ucl.ac.uk/pred.
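The need for enough positive instances per GO term can be illustrated with a minimal Python sketch that converts multi-label annotations into one binary task per term. The function name, the toy data and the `min_positives` cut-off are hypothetical; the actual selection criteria and negative-set construction used for FFPred 3 are described in Methods and are not reproduced here.

```python
from collections import Counter

def build_binary_tasks(annotations, min_positives):
    """Turn multi-label GO annotations into one binary labelling per GO term.

    annotations: dict mapping protein id -> set of GO term ids
    Returns a dict mapping each retained term -> {protein id: 0 or 1}.
    Terms with fewer than `min_positives` annotated proteins are dropped.
    """
    counts = Counter(term for terms in annotations.values() for term in terms)
    kept = sorted(term for term, count in counts.items() if count >= min_positives)
    return {
        term: {protein: int(term in terms) for protein, terms in annotations.items()}
        for term in kept
    }

# Toy annotations (hypothetical proteins, real GO identifiers from the text).
toy_annotations = {
    "P1": {"GO:0019901", "GO:0016887"},
    "P2": {"GO:0019901"},
    "P3": {"GO:0006816"},
}
tasks = build_binary_tasks(toy_annotations, min_positives=2)
print(tasks)
```

Raising `min_positives` discards the rarer, more specific terms first, which is the trade-off between trainability and specificity described above.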
Performance evaluation. The accuracy estimates in Supplementary Data file 1 are GO term-specific and point out the usefulness of FFPred 3 to prioritize human genes for downstream experimental screening when homology offers little or no help. To complement this analysis and gauge how well protein function as a whole can be predicted for such difficult cases, a timed experiment similar to the Critical Assessment of Functional Annotation challenge was conducted, by training a separate SVM library using the public databases released in November 2013. The resulting 597 classifiers were then used to assign GO terms to human proteins with no experimentally verified biological roles at that time, and their accuracy was finally measured against the UniProtKB-GOA data as of March 2016. For comparison purposes under difficult working conditions with limited or completely missing homology information, additional predictions were generated by a baseline method (Naïve), which ranks GO terms by prevalence in UniProtKB-GOA, and by a sequence similarity-based approach (BLAST), which can transfer annotations only from distantly related and experimentally characterized proteins as detailed in Methods. Other machine-learning based tools for GO term prediction from patterns of biological features could not be included in the study: ProtFun [15] has not been updated in a very long time and only covers a handful of currently valid GO terms, whereas ProFET [18] requires training classifiers from scratch for all GO categories of interest.
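A prevalence-based baseline such as Naïve can be sketched in a few lines of Python. This is only an illustration of the idea of ranking GO terms by how often they occur in a reference annotation set: the function names and toy data are hypothetical, and the exact scoring used in the study is given in Methods.

```python
from collections import Counter

def naive_scores(annotations):
    """Score each GO term by the fraction of reference proteins annotated with it."""
    counts = Counter(term for terms in annotations.values() for term in terms)
    total = len(annotations)
    return {term: count / total for term, count in counts.items()}

def naive_ranking(scores, top_k):
    """Return the top_k (term, score) pairs; ties are broken by term id."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]

# Toy reference annotations (hypothetical proteins).
toy_reference = {
    "P1": {"GO:0005488", "GO:0009987"},
    "P2": {"GO:0005488", "GO:0043226"},
    "P3": {"GO:0005488", "GO:0009987", "GO:0016887"},
}
scores = naive_scores(toy_reference)
print(naive_ranking(scores, top_k=2))
```

Note that such a baseline returns the same ranked list for every target protein, so its top predictions are always the most frequent, and therefore the most general, terms.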
The precision-recall plots in Fig. 2 and the data in Table 1 provide graphical and numerical reports on the evaluation results for the three separate GO domains, according to standard practice in the field. At high levels of recall (i.e. above roughly 40% for molecular function and 20% for the other two sub-ontologies), FFPred 3 predictions achieve higher precision values than the baseline approaches do, and the maximum F-scores in Table 1 back up this observation. However, the highest scoring predictions made by BLAST for subcellular locations and by Naïve for all sub-ontologies attain higher precision than the corresponding ones by FFPred 3. This result might suggest that these less sophisticated approaches are more useful than FFPred 3 when only a handful of assays can be run on each protein. Or are they?
A well-known pitfall of precision-recall analysis is that it disregards how informative predictions are. The most confident GO term assignments made by Naïve for each test protein, GO:0005488 (binding), GO:0043226 (organelle) and GO:0009987 (cellular process), are indeed far from useful in cutting down the options for the design of experiments. Nonetheless, their very shallow nature guarantees that they will eventually be confirmed for most, if not all, proteins. Furthermore, comparing the precision values achieved by different methods and plotted against the same level of recall could be more ambiguous than it looks at first sight. If the recall is less than 1.0, the predictors may be evaluated on non-identical sets of target proteins, which can even be disjoint. Another confounding aspect is the number of GO term predictions above a given decision threshold made for individual proteins: predictors based on high-throughput functional data aim at high recall and generally produce longer lists of assignments than those generated by methods based on homology transfers, which tend to achieve higher precision. Finally, correctly assigning the term t to distinct proteins p and q can pose prediction challenges of diverse nature, depending on how many proteins are annotated with t, and on how closely p and q follow the patterns of features used to build the classifiers (e.g. sequence similarity, domain architecture, biological attributes, gene expression and so on). Therefore, it is useful to look at method performance from a different angle, by considering both the accuracy and the informativeness of equal numbers of high scoring predictions for each target and sub-ontology, thus reducing the above biases and yielding results that can be interpreted more clearly and more easily by non-specialists, too.

Figure 2. Graphical summary of the precision–recall analysis. The three panels show the evaluation results for the MF (left), BP (centre) and CC (right) domains, respectively. The full triangles mark the points associated with the maximum F-measure.
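The threshold-based analysis can be made concrete with a short Python sketch of protein-centric precision, recall and maximum F-measure. It follows the verbal description in the Table 1 caption (precision averaged over proteins with at least one prediction at or above the threshold, recall averaged over all annotated targets); the function names and toy data are hypothetical, and the paper's own equations (4) and (5) in Methods remain the authoritative definition.

```python
def precision_recall(predictions, truth, threshold):
    """Protein-centric average precision and recall at one decision threshold.

    predictions: dict protein -> dict GO term -> confidence score
    truth: dict protein -> non-empty set of validated GO terms
    """
    precisions = []
    recall_sum = 0.0
    for protein, true_terms in truth.items():
        scored = predictions.get(protein, {})
        called = {term for term, score in scored.items() if score >= threshold}
        hits = len(called & true_terms)
        if called:
            # Only proteins with at least one prediction contribute to precision.
            precisions.append(hits / len(called))
        recall_sum += hits / len(true_terms)
    precision = sum(precisions) / len(precisions) if precisions else 0.0
    return precision, recall_sum / len(truth)

def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def f_max(predictions, truth, thresholds):
    """Return (best F1, threshold achieving it) over the given thresholds."""
    best_score, best_threshold = 0.0, None
    for threshold in thresholds:
        score = f1(*precision_recall(predictions, truth, threshold))
        if score > best_score:
            best_score, best_threshold = score, threshold
    return best_score, best_threshold

# Toy data with hypothetical proteins and terms.
toy_predictions = {"A": {"t1": 0.9, "t2": 0.4}, "B": {"t1": 0.6}}
toy_truth = {"A": {"t1"}, "B": {"t2"}}
print(f_max(toy_predictions, toy_truth, [0.3, 0.5, 0.8]))
```

As a consistency check, `f1(0.390, 0.461)` evaluates to about 0.4225, which agrees with the F1 of 0.422 reported for FFPred in the MF domain to within the rounding of the tabulated precision and recall.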
The top row panels in Fig. 3 summarize prediction quality in terms of the F1 measure, and the underlying precision and recall values are plotted in Figure S1. FFPred 3 outperforms both Naïve and BLAST across all three GO domains, because it achieves higher recall than the other predictors do, in combination with intermediate values of precision. The data also confirm the expectation that Naïve predictions generally are highly precise, but not deep enough in the GO graph to outperform the other approaches in terms of recall. The results for the CC sub-ontology are an interesting exception: the low numbers of false negatives most likely arise from the relatively shorter distances between nodes associated with experimental annotations and nodes associated with the most frequent terms in UniProtKB-GOA. The plots also illustrate the limits of homology-based transfers in such challenging situations. When the evolutionary distances from previously annotated proteins are large, only the most general functional aspects are retained (e.g. catalytic or transporter activity), while the finer details diverge (e.g. the nature of the substrates and the chemistry of the reactions), thus resulting in high numbers of both false positives and false negatives, and ultimately lowering precision, recall and F-measure values.

GO domain  Method  Threshold  TP    FP     FN     NP   Precision  Recall  F1
MF         FFPred  0.581      1443  3457   1818   427  0.390      0.461   0.422
MF         BLAST   0.210      952   5740   2309   216  0.266      0.282   0.274
MF         Naïve   0.152      1081  1643   2180   454  0.397      0.391   0.394
BP         FFPred  0.576      5792  13013  14469  655  0.353      0.331   0.342
BP         BLAST   0.203      5272  83543  14989  345  0.173      0.271   0.211
BP         Naïve   0.273      4136  8423   16125  661  0.329      0.241   0.278
CC         FFPred  0.730      3800  7424   4576   985  0.369      0.500   0.425
CC         BLAST   0.204      2030  15655  6346   422  0.215      0.251   0.232
CC         Naïve   0.579      2869  3077   5507   991  0.483      0.340   0.399

Table 1. Performance comparison between FFPred 3 and the baseline prediction methods. For each method, the table reports the total numbers of true positives (TP), false positives (FP) and false negatives (FN) achieved at the decision threshold that maximises the F1 score for each GO domain. NP is the number of proteins with at least one prediction with a confidence score greater than or equal to the corresponding threshold value, which is used to calculate the average precision of each method according to equation (4) in the main text. The average recall is calculated according to equation (5) using the number of proteins with annotations in the GO domain under consideration, which can be found in the section “Methods”. The latter two values are used to locate the full triangles in the precision-recall space shown in Fig. 2.

Figure 3. Comparison of the prediction accuracy and informativeness against number of top ranked predictions. The graphs on the top row compare the average F-measure of the highest scoring GO term assignments made by FFPred 3, Naïve and BLAST for the MF (left), BP (centre) and CC (right) domains, respectively. The bottom row shows the average information content of the true positives for the same predictions in the top row. Data are plotted only when there are at least 25 targets with x ∈ {1, 2, 3, 4, 5} predictions and x validated annotations or more. The label n represents the case where for each protein the number of predictions assessed equals the number of experimentally supported functions.
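The comparison based on equal numbers of top-ranked predictions can be sketched as follows. This minimal Python example assumes that an F1 value is computed per protein from its k highest scoring terms and then averaged over eligible targets (those with at least k predictions and at least k validated annotations, as in the Figure 3 caption). Whether the paper averages per-protein F1 values or derives F1 from averaged precision and recall is not stated in this section, so the aggregation here is an assumption, as are all names and data.

```python
def top_k_f1(scored_terms, true_terms, k):
    """F1 of the k highest scoring GO terms predicted for one protein."""
    ranked = sorted(scored_terms.items(), key=lambda item: (-item[1], item[0]))[:k]
    called = {term for term, _ in ranked}
    hits = len(called & true_terms)
    if hits == 0:
        return 0.0
    precision = hits / len(called)
    recall = hits / len(true_terms)
    return 2 * precision * recall / (precision + recall)

def mean_top_k_f1(predictions, truth, k):
    """Average top-k F1 over targets with >= k predictions and >= k true terms."""
    values = [
        top_k_f1(predictions[protein], true_terms, k)
        for protein, true_terms in truth.items()
        if len(predictions.get(protein, {})) >= k and len(true_terms) >= k
    ]
    return sum(values) / len(values) if values else None

# Toy data with hypothetical proteins and terms.
toy_preds = {"A": {"t1": 0.9, "t2": 0.8, "t3": 0.1}, "B": {"t1": 0.7}}
toy_true = {"A": {"t1", "t3"}, "B": {"t2"}}
print(mean_top_k_f1(toy_preds, toy_true, k=1), mean_top_k_f1(toy_preds, toy_true, k=2))
```

Fixing k per target means every method is judged on the same number of calls per protein, which removes the list-length bias discussed earlier.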
As mentioned above, the design and implementation of FFPred 3 produced a list of GO terms with varying levels of detail, so it could be questioned how informative its predictions are and how helpful they can be to experimenters. In Fig. 3, the plots in the bottom row show the average amount of useful information the highest scoring predictions would actually provide. For this purpose, the analysis only considers true positive predictions, which, unlike in the standard precision-recall analysis, are not regarded as equally valuable. They are rather weighted according to their information content, which estimates their specificity and informativeness from their occurrence in the UniProtKB/SwissProt database, so that more frequent functional categories are down-weighted, and vice versa. The plots show that FFPred 3 correct predictions are consistently more specific than those generated by BLAST, which in turn are more specific than those made by Naïve. Therefore, despite the relatively low levels of term specificity, FFPred 3 can give useful hints to drive the experimental characterization of proteins, when routes alternative to homology transfers are needed. Table S1 gives some clear examples of how well FFPred 3 top-ranked predictions compare with the validated GO term assignments, which some proteins with no prior experimental functional data have recently acquired.
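The weighting of true positives by information content can be sketched in Python as below. The sketch assumes the common definition of information content as the negative base-2 logarithm of a term's relative frequency in a reference annotation set; the exact formula used in the paper is given in Methods and may differ. Function names and toy data are hypothetical.

```python
import math
from collections import Counter

def information_content(annotations):
    """Assumed definition: IC(term) = -log2(fraction of proteins annotated with term)."""
    counts = Counter(term for terms in annotations.values() for term in terms)
    total = len(annotations)
    return {term: -math.log2(count / total) for term, count in counts.items()}

def mean_true_positive_ic(called, true_terms, ic):
    """Average information content of the correctly predicted terms for one protein."""
    hits = called & true_terms
    if not hits:
        return 0.0
    return sum(ic[term] for term in hits) / len(hits)

# Toy reference set: one term annotates every protein, the others are rarer.
toy_reference_set = {
    "P1": {"GO:0005488", "GO:0016887"},
    "P2": {"GO:0005488", "GO:0019901"},
    "P3": {"GO:0005488", "GO:0019901"},
    "P4": {"GO:0005488"},
}
ic = information_content(toy_reference_set)
print(ic)
```

Under this definition a term carried by every protein scores zero, so predicting it correctly adds nothing to the average, whereas rarer terms contribute more.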
Insights into the functional consequences of alternative splicing in humans. Experimentally supported functional information for individual splice variants is generally scarce: only a handful of isoform-level GO term annotations have been reviewed and included in public databases. Even when some isoforms encoded by the same gene have been assayed, the data are still largely incomplete, because the experiments are usually focussed on a particular functional aspect. Within this active area of research, FFPred 3 and similar methods for protein function prediction have the opportunity to help investigate the functional ramifications of alternative splicing. Indeed, very often comparative sequence analysis can only suggest that the relatively small sequence changes between splice isoforms cause more or less pronounced structural and functional differences. In other words, this approach is typically unable to put forward more detailed testable hypotheses. This opens up the possibility that alternative splicing products may not encode biochemically active molecules, but rather constitute a reservoir for natural selection [19–21], a conjecture that is also hard to verify. Nevertheless, experimental evidence shows that the functional divergence between alternative splice variants can vary from subtle modulations of biochemical activities to completely antagonistic regulatory roles [22]. It is therefore interesting to investigate: i) which functional aspects tend to be more robust to splicing, and consequently conserved across splice variants of the same gene; and ii) whether canonical isoforms tend to be enriched in functions that are different from those over-represented in their alternative variants (see Methods for further details on the conservation and primarity scores).
To examine these patterns, a large-scale survey was carried out on 9,214 human proteins and their recorded splice variants using FFPred 3, under the assumption that eventually they all fulfil a physiological role in the cell. The analysis was restricted to the GO term predictions compatible with the manually curated assignments existing in UniProtKB/SwissProt, so as to reduce the effects of spurious results on the biological interpretation. The summary data in Supplementary Data file 2 indicate that the GO terms used in this study display varying levels of conservation across sets of alternatively spliced transcripts, even though it is difficult to assess the statistical significance of the observed differences. Only five predicted (and admittedly broad) functions appear to be consistently assigned to all the variants of a gene, and very few of them are highly conserved, when the focus is on the most reliably predicted GO terms, i.e. those for which the SVM Matthews correlation coefficient value is in the top 50% of the distribution recorded for the corresponding sub-ontology. For instance, only six of such terms annotate all isoforms of a gene in 90% or more of the cases examined. Therefore, despite the use of a consolidated set of predictions, the findings support the expectation that alternative splicing plays a role in diversifying the cellular functional repertoire. Support for this view is strengthened by the differential associations of individual biological roles with canonical or alternative splice isoforms, as gauged by the GO term primarity scores. The data in Supplementary Data file 3 indicate that there are many more GO categories preferentially associated with principal variants than with alternative ones, partly because these analyses are restricted to predicted functions in line with available annotations in UniProtKB/SwissProt. Nevertheless, the GO terms with high primarity scores tend to represent more constitutive cellular functions, and those with negative scores appear to be mostly associated with larger sets of alternatively spliced genes or to be induced by changes in the environment or in the cellular conditions. As mentioned above, it is difficult to draw statistically sound conclusions from this initial study: identifying the canonical isoform of each gene is still an open question, and here a rather simple and pragmatic approach was taken, just like in previous studies.
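The notion of a term being assigned to all isoforms of a gene can be illustrated with a small Python sketch. The conservation and primarity scores actually used in the study are defined in Methods and are not reproduced in this section, so the quantity below is only an illustrative proxy under stated assumptions: for a given GO term, it is the fraction of genes in which the term is predicted for every isoform, among the genes in which it is predicted for at least one. All names and data are hypothetical.

```python
def fraction_fully_conserved(gene_isoform_terms, term):
    """Illustrative proxy, not the paper's conservation score.

    gene_isoform_terms: dict gene -> dict isoform id -> set of predicted GO terms
    Returns the share of genes where `term` is predicted for all isoforms,
    among genes where it is predicted for at least one isoform.
    """
    genes_with_term = 0
    genes_fully_conserved = 0
    for isoforms in gene_isoform_terms.values():
        flags = [term in terms for terms in isoforms.values()]
        if any(flags):
            genes_with_term += 1
            if all(flags):
                genes_fully_conserved += 1
    return genes_fully_conserved / genes_with_term if genes_with_term else 0.0

# Toy predictions for three hypothetical genes and their isoforms.
toy_genes = {
    "G1": {"iso1": {"x", "y"}, "iso2": {"x"}},
    "G2": {"iso1": {"x"}, "iso2": {"y"}},
    "G3": {"iso1": {"z"}},
}
print(fraction_fully_conserved(toy_genes, "x"))
```

With a measure of this kind, a statement such as "a term annotates all isoforms of a gene in 90% or more of the cases examined" corresponds to a value of at least 0.9.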
To emphasize the unique advantages that analyzing biological features can offer, Fig. 4 gives some insight into their relationship with some of the most conserved functions in each GO domain (see Methods for more details). The heatmap links the over- and under-representation of specific biophysical attributes with the conservation of particular functional aspects. Similarly, Figs 5 and 6 show the extent of positive or negative correlation between sequence-derived feature groups and the GO terms that are preferentially associated with principal or alternative splice variants, respectively. The results generally reflect well-established trends between functional categories and the occurrence or lack of intrinsically disordered residues, transmembrane helices and

Citations
Journal ArticleDOI
TL;DR: The work to update the PSIPRED Protein Analysis Workbench and make it ready for the next 20 years is presented and updates to some of the key predictive algorithms available through the website are surveyed.
Abstract: The PSIPRED Workbench is a web server offering a range of predictive methods to the bioscience community for 20 years. Here, we present the work we have completed to update the PSIPRED Protein Analysis Workbench and make it ready for the next 20 years. The main focus of our recent website upgrade work has been the acceleration of analyses in the face of increasing protein sequence database size. We additionally discuss any new software, the new hardware infrastructure, our webservices and web site. Lastly we survey updates to some of the key predictive algorithms available through our website.

858 citations

Journal ArticleDOI
TL;DR: Machine learning is becoming a widely used tool for the analysis of biological data; however, proper use of machine learning methods can be challenging for experimentalists. This Review discusses best practices and points to consider when embarking on experiments involving machine learning.
Abstract: The expanding scale and inherent complexity of biological data have encouraged a growing use of machine learning in biology to build informative and predictive models of the underlying biological processes. All machine learning techniques fit models to data; however, the specific methods are quite varied and can at first glance seem bewildering. In this Review, we aim to provide readers with a gentle introduction to a few key machine learning techniques, including the most recently developed and widely used techniques involving deep neural networks. We describe how different techniques may be suited to specific types of biological data, and also discuss some best practices and points to consider when one is embarking on experiments involving machine learning. Some emerging directions in machine learning methodology are also discussed. Machine learning is becoming a widely used tool for the analysis of biological data. However, for experimentalists, proper use of machine learning methods can be challenging. This Review provides an overview of machine learning techniques and provides guidance on their applications in biology.

325 citations

Journal ArticleDOI
TL;DR: This work has developed a novel method to predict protein function from sequence that uses deep learning to learn features from protein sequences as well as a cross-species protein–protein interaction network.
Abstract: Motivation A large number of protein sequences are becoming available through the application of novel high-throughput sequencing technologies. Experimental functional characterization of these proteins is time-consuming and expensive, and is often only done rigorously for few selected model organisms. Computational function prediction approaches have been suggested to fill this gap. The functions of proteins are classified using the Gene Ontology (GO), which contains over 40 000 classes. Additionally, proteins have multiple functions, making function prediction a large-scale, multi-class, multi-label problem. Results We have developed a novel method to predict protein function from sequence. We use deep learning to learn features from protein sequences as well as a cross-species protein-protein interaction network. Our approach specifically outputs information in the structure of the GO and utilizes the dependencies between GO classes as background information to construct a deep learning model. We evaluate our method using the standards established by the Computational Assessment of Function Annotation (CAFA) and demonstrate a significant improvement over baseline methods such as BLAST, in particular for predicting cellular locations. Availability and implementation Web server: http://deepgo.bio2vec.net, Source code: https://github.com/bio-ontology-research-group/deepgo. Contact robert.hoehndorf@kaust.edu.sa. Supplementary information Supplementary data are available at Bioinformatics online.

309 citations

Journal ArticleDOI
TL;DR: DeepFRI as mentioned in this paper is a graph convolutional network for predicting protein functions by leveraging sequence features extracted from a protein language model and protein structures, which scales to the size of current sequence repositories.
Abstract: The rapid increase in the number of proteins in sequence databases and the diversity of their functions challenge computational approaches for automated function prediction. Here, we introduce DeepFRI, a Graph Convolutional Network for predicting protein functions by leveraging sequence features extracted from a protein language model and protein structures. It outperforms current leading methods and sequence-based Convolutional Neural Networks and scales to the size of current sequence repositories. Augmenting the training set of experimental structures with homology models allows us to significantly expand the number of predictable functions. DeepFRI has significant de-noising capability, with only a minor drop in performance when experimental structures are replaced by protein models. Class activation mapping allows function predictions at an unprecedented resolution, allowing site-specific annotations at the residue-level in an automated manner. We show the utility and high performance of our method by annotating structures from the PDB and SWISS-MODEL, making several new confident function predictions. DeepFRI is available as a webserver at https://beta.deepfri.flatironinstitute.org/ .

158 citations

Journal ArticleDOI
TL;DR: This review covers the success, promise and pitfalls of applying NLP algorithms to the study of proteins, and presents methods for encoding protein information as text and analyzing it with NLP methods, including classic concepts such as bag-of-words, k-mers/n-grams and text search.
Abstract: Natural language processing (NLP) is a field of computer science concerned with automated text and language analysis. In recent years, following a series of breakthroughs in deep and machine learning, NLP methods have shown overwhelming progress. Here, we review the success, promise and pitfalls of applying NLP algorithms to the study of proteins. Proteins, which can be represented as strings of amino-acid letters, are a natural fit to many NLP methods. We explore the conceptual similarities and differences between proteins and language, and review a range of protein-related tasks amenable to machine learning. We present methods for encoding the information of proteins as text and analyzing it with NLP methods, reviewing classic concepts such as bag-of-words, k-mers/n-grams and text search, as well as modern techniques such as word embedding, contextualized embedding, deep learning and neural language models. In particular, we focus on recent innovations such as masked language modeling, self-supervised learning and attention-based models. Finally, we discuss trends and challenges in the intersection of NLP and protein research.

144 citations


Cites background from "FFPred 3: feature-based function pr..."

  • ...gene ontology, GO) such as antiviral activity [79,64,41,84,5,25]....

    [...]

References
Journal ArticleDOI
TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.

88,255 citations

Journal ArticleDOI
TL;DR: The goal of the Gene Ontology Consortium is to produce a dynamic, controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing.
Abstract: Genomic sequencing has made it clear that a large fraction of the genes specifying the core biological functions are shared by all eukaryotes. Knowledge of the biological role of such shared proteins in one organism can often be transferred to other organisms. The goal of the Gene Ontology Consortium is to produce a dynamic, controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing. To this end, three independent ontologies accessible on the World-Wide Web (http://www.geneontology.org) are being constructed: biological process, molecular function and cellular component.

35,225 citations

Posted ContentDOI
TL;DR: SVMlight is an implementation of an SVM learner that addresses large learning tasks with many training examples, making large-scale SVM training more practical.
Abstract: Training a support vector machine (SVM) leads to a quadratic optimization problem with bound constraints and one linear equality constraint. Despite the fact that this type of problem is well understood, there are many issues to be considered in designing an SVM learner. In particular, for large learning tasks with many training examples, off-the-shelf optimization techniques for general quadratic programs quickly become intractable in their memory and time requirements. SVMlight is an implementation of an SVM learner which addresses the problem of large tasks. This chapter presents algorithmic and computational results developed for SVMlight V2.0, which make large-scale SVM training more practical. The results give guidelines for the application of SVMs to large domains.

4,145 citations
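
For orientation, the "quadratic optimization problem with bound constraints and one linear equality constraint" mentioned in the abstract above is usually written as the soft-margin SVM dual shown below. The notation is an assumption chosen for illustration and is not taken from this page: n training examples (x_i, y_i) with labels y_i in {-1, +1}, a kernel k, a regularization constant C, and dual variables α_i.

```latex
\begin{aligned}
\min_{\alpha}\quad & W(\alpha) = -\sum_{i=1}^{n} \alpha_i
  + \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i \, y_j \, \alpha_i \, \alpha_j \, k(x_i, x_j) \\
\text{subject to}\quad & \sum_{i=1}^{n} y_i \, \alpha_i = 0, \\
 & 0 \le \alpha_i \le C, \qquad i = 1, \dots, n.
\end{aligned}
```

The double sum involves an n-by-n kernel matrix, which is why general-purpose quadratic programming solvers become costly in memory and time as the number of training examples grows.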

Journal ArticleDOI
Alex Bateman, Maria Jesus Martin, Claire O'Donovan, Michele Magrane, Rolf Apweiler, Emanuele Alpi, Ricardo Antunes, Joanna Arganiska, Benoit Bely, Mark Bingley, Carlos Bonilla, Ramona Britto, Borisas Bursteinas, Gayatri Chavali, Elena Cibrian-Uhalte, Alan Wilter Sousa da Silva, Maurizio De Giorgi, Tunca Doğan, Francesco Fazzini, Paul Gane, Leyla Jael Garcia Castro, Penelope Garmiri, Emma Hatton-Ellis, Reija Hieta, Rachael P. Huntley, Duncan Legge, W Liu, Jie Luo, Alistair MacDougall, Prudence Mutowo, Andrew Nightingale, Sandra Orchard, Klemens Pichler, Diego Poggioli, Sangya Pundir, Luis Pureza, Guoying Qi, Steven Rosanoff, Rabie Saidi, Tony Sawford, Aleksandra Shypitsyna, Edward Turner, Vladimir Volynkin, Tony Wardell, Xavier Watkins, Hermann Zellner, Andrew Peter Cowley, Luis Figueira, Weizhong Li, Hamish McWilliam, Rodrigo Lopez, Ioannis Xenarios, Lydie Bougueleret, Alan Bridge, Sylvain Poux, Nicole Redaschi, Lucila Aimo, Ghislaine Argoud-Puy, Andrea H. Auchincloss, Kristian B. Axelsen, Parit Bansal, Delphine Baratin, Marie Claude Blatter, Brigitte Boeckmann, Jerven Bolleman, Emmanuel Boutet, Lionel Breuza, Cristina Casal-Casas, Edouard de Castro, Elisabeth Coudert, Béatrice A. Cuche, M Doche, Dolnide Dornevil, Séverine Duvaud, Anne Estreicher, L Famiglietti, Marc Feuermann, Elisabeth Gasteiger, Sebastien Gehant, Vivienne Baillie Gerritsen, Arnaud Gos, Nadine Gruaz-Gumowski, Ursula Hinz, Chantal Hulo, Florence Jungo, Guillaume Keller, Vicente Lara, P Lemercier, Damien Lieberherr, Thierry Lombardot, Xavier D. Martin, Patrick Masson, Anne Morgat, Teresa Batista Neto, Nevila Nouspikel, Salvo Paesano, Ivo Pedruzzi, Sandrine Pilbout, Monica Pozzato, Manuela Pruess, Catherine Rivoire, Bernd Roechert, Michel Schneider, Christian J. A. Sigrist, K Sonesson, S Staehli, Andre Stutz, Shyamala Sundaram, Michael Tognolli, Laure Verbregue, Anne Lise Veuthey, Cathy H. Wu, Cecilia N. Arighi, Leslie Arminski, Chuming Chen, Yongxing Chen, John S. Garavelli, Hongzhan Huang, Kati Laiho, Peter B. McGarvey, Darren A. Natale, Baris E. Suzek, C. R. Vinayaka, Qinghua Wang, Yuqi Wang, Lai-Su L. Yeh, Meher Shruti Yerramalla, Jian Zhang
TL;DR: An annotation score for all entries in UniProt is introduced to represent the relative amount of knowledge known about each protein to help identify which proteins are the best characterized and most informative for comparative analysis.
Abstract: UniProt is an important collection of protein sequences and their annotations, which has doubled in size to 80 million sequences during the past year. This growth in sequences has prompted an extension of UniProt accession number space from 6 to 10 characters. An increasing fraction of new sequences are identical to a sequence that already exists in the database with the majority of sequences coming from genome sequencing projects. We have created a new proteome identifier that uniquely identifies a particular assembly of a species and strain or subspecies to help users track the provenance of sequences. We present a new website that has been designed using a user-experience design process. We have introduced an annotation score for all entries in UniProt to represent the relative amount of knowledge known about each protein. These scores will be helpful in identifying which proteins are the best characterized and most informative for comparative analysis. All UniProt data is provided freely and is available on the web at http://www.uniprot.org/.

4,050 citations

Related Papers (5)
Predrag Radivojac, Wyatt T. Clark, Tal Ronnen Oron, Alexandra M. Schnoes, Tobias Wittkop, Artem Sokolov, Kiley Graim, Christopher S. Funk, Karin Verspoor, Asa Ben-Hur, Gaurav Pandey, Jeffrey M. Yunes, Ameet Talwalkar, Susanna Repo, Michael L Souza, Damiano Piovesan, Rita Casadio, Zheng Wang, Jianlin Cheng, Hai Fang, Julian Gough, Patrik Koskinen, Petri Törönen, Jussi Nokso-Koivisto, Liisa Holm, Domenico Cozzetto, Daniel W. A. Buchan, Kevin Bryson, David T. Jones, Bhakti Limaye, Harshal Inamdar, Avik Datta, Sunitha K Manjari, Rajendra Joshi, Meghana Chitale, Daisuke Kihara, Andreas Martin Lisewski, Serkan Erdin, Eric Venner, Olivier Lichtarge, Robert Rentzsch, Haixuan Yang, Alfonso E. Romero, Prajwal Bhat, Alberto Paccanaro, Tobias Hamp, Rebecca Kaßner, Stefan Seemayer, Esmeralda Vicedo, Christian Schaefer, Dominik Achten, Florian Auer, Ariane Boehm, Tatjana Braun, Maximilian Hecht, Mark Heron, Peter Hönigschmid, Thomas A. Hopf, Stefanie Kaufmann, Michael Kiening, Denis Krompass, Cedric Landerer, Yannick Mahlich, Manfred Roos, Jari Björne, Tapio Salakoski, Andrew Wong, Hagit Shatkay, Fanny Gatzmann, Ingolf Sommer, Mark N. Wass, Michael J.E. Sternberg, Nives Škunca, Fran Supek, Matko Bošnjak, Panče Panov, Sašo Džeroski, Tomislav Šmuc, Yiannis A. I. Kourmpetis, Aalt D. J. van Dijk, Cajo J. F. ter Braak, Yuanpeng Zhou, Qingtian Gong, Xinran Dong, Weidong Tian, Marco Falda, Paolo Fontana, Enrico Lavezzo, Barbara Di Camillo, Stefano Toppo, Liang Lan, Nemanja Djuric, Yuhong Guo, Slobodan Vucetic, Amos Marc Bairoch, Michal Linial, Patricia C. Babbitt, Steven E. Brenner, Christine A. Orengo, Burkhard Rost, Sean D. Mooney, Iddo Friedberg