
Showing papers by "Mikael Bodén" published in 2005


Journal ArticleDOI
TL;DR: This work contrasts the use of feed forward models as employed by the popular TargetP/SignalP predictors with a sequence-biased recurrent network model, and demonstrates that recurrent networks improve the overall prediction performance.
Abstract: Motivation: Targeting peptides direct nascent proteins to their specific subcellular compartment. Knowledge of targeting signals enables informed drug design and reliable annotation of gene products. However, due to the low similarity of such sequences and the dynamical nature of the sorting process, the computational prediction of subcellular localization of proteins is challenging. Results: We contrast the use of feed forward models as employed by the popular TargetP/SignalP predictors with a sequence-biased recurrent network model. The models are evaluated in terms of performance at the residue level and at the sequence level, and the results demonstrate that recurrent networks improve the overall prediction performance. Compared to the original results reported for TargetP, an ensemble of the tested models increases the accuracy by 6% and 5% on non-plant and plant data, respectively. Availability: The Protein Prowler incorporating the recurrent network predictor described in this paper is available online at http://pprowler.imb.uq.edu.au/ Contact: mikael@itee.uq.edu.au

146 citations


Journal ArticleDOI
TL;DR: This paper argues that recurrent neural networks have a natural bias toward a problem domain of which biological sequence analysis tasks are a subset, and demonstrates that this bias can be exploited using a data set of protein sequences containing several classes of subcellular localization targeting peptides.
Abstract: Selection of machine learning techniques requires a certain sensitivity to the requirements of the problem. In particular, the problem can be made more tractable by deliberately using algorithms that are biased toward solutions of the requisite kind. In this paper, we argue that recurrent neural networks have a natural bias toward a problem domain of which biological sequence analysis tasks are a subset. We use experiments with synthetic data to illustrate this bias. We then demonstrate that this bias can be exploited using a data set of protein sequences containing several classes of subcellular localization targeting peptides. The results show that, compared with feed forward networks, recurrent neural networks will generally perform better on sequence analysis tasks. Furthermore, as the patterns within the sequence become more ambiguous, the choice of specific recurrent architecture becomes more critical.

45 citations


Proceedings ArticleDOI
01 Jan 2005
TL;DR: A comparison of several standard encoding methods shows that, for cleavage site prediction, the frequently used orthonormal encoding is inferior to other methods.
Abstract: Research on cleavage site prediction for signal peptides has focused mainly on the application of different classification algorithms to achieve improved prediction accuracies. This paper addresses the fundamental issue of amino acid encoding to present amino acid sequences in the most beneficial way for machine learning algorithms. A comparison of several standard encoding methods shows that, for cleavage site prediction, the frequently used orthonormal encoding is inferior to other methods. The best results are achieved with a new encoding method named BLOMAP, based on the BLOSUM62 substitution matrix, using a Naive Bayes classifier.
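For readers unfamiliar with the baseline that the abstract above compares against, the following is a minimal sketch of orthonormal (one-hot) encoding of a residue window, in which each amino acid becomes a 20-dimensional unit vector. The function name `orthonormal_encode`, the residue ordering, and the all-zero treatment of unknown symbols are assumptions made for illustration; the BLOMAP values themselves are not given in the abstract and are not reproduced here.

```python
# Orthonormal (one-hot) encoding of an amino acid window: a sketch, not the paper's code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues; this ordering is an assumption
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def orthonormal_encode(window):
    """Return a flat list of 0/1 values, 20 per residue in `window`."""
    encoded = []
    for residue in window:
        bits = [0] * len(AMINO_ACIDS)
        position = AA_INDEX.get(residue.upper())
        if position is not None:
            bits[position] = 1
        # Unknown symbols (for example 'X') are left as all zeros (an assumption).
        encoded.extend(bits)
    return encoded
```

A window of n residues therefore yields 20n inputs, every pair of distinct residues is equally far apart, and no notion of biochemical similarity is carried by the representation.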

30 citations


Journal ArticleDOI
TL;DR: By experimentation, it is shown that the bias of recurrent neural networks, recently analyzed by Tino et al. and Hammer and Tino, offers superior access to motifs compared to the standardly used feedforward neural networks.
Abstract: For many biological sequence problems the available data occupies only sparse regions of the problem space. To use machine learning effectively for the analysis of sparse data we must employ architectures with an appropriate bias. By experimentation we show that the bias of recurrent neural networks, recently analyzed by Tino et al. and Hammer and Tino, offers superior access to motifs (sequential patterns) compared to the feedforward neural networks that are standard in bioinformatics.

12 citations


Book ChapterDOI
06 Jul 2005
TL;DR: A range of machine learning algorithms are benchmarked, and it is shown that a classifier – based on the Support Vector Machine – produces more accurate results when dependencies between the conserved motif and the preceding section are exploited.
Abstract: Prediction of peroxisomal matrix proteins generally depends on the presence of one of two distinct motifs at the end of the amino acid sequence. PTS1 peroxisomal proteins have a well conserved tripeptide at the C-terminal end. However, the preceding residues in the sequence arguably play a crucial role in targeting the protein to the peroxisome. Previous work in applying machine learning to the prediction of peroxisomal matrix proteins has failed to capitalize on the full extent of these dependencies. We benchmark a range of machine learning algorithms, and show that a classifier – based on the Support Vector Machine – produces more accurate results when dependencies between the conserved motif and the preceding section are exploited. We publish an updated and rigorously curated data set that results in increased prediction accuracy of most tested models.

5 citations


Proceedings ArticleDOI
01 Jan 2005
TL;DR: This paper reports on the development of an SVM classifier with a separately trained logistic output function that uses an input window containing 12 consecutive residues at the C-terminus and the amino acid composition of the full sequence to predict peroxisomal proteins.
Abstract: PTS1 proteins are peroxisomal matrix proteins that have a well conserved targeting motif at the C-terminal end. However, this motif is present in many non-peroxisomal proteins as well, so predicting peroxisomal proteins involves differentiating fake PTS1 signals from actual ones. In this paper we report on the development of an SVM classifier with a separately trained logistic output function. The model uses an input window containing 12 consecutive residues at the C-terminus and the amino acid composition of the full sequence. The final model gives a Matthews Correlation Coefficient of 0.77, representing an increase of 54% compared with the well-known PeroxiP predictor. We test the model by applying it to several proteomes of eukaryotes for which there is no evidence of a peroxisome, producing a false positive rate of 0.088%.
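The two input components and the evaluation measure named in the abstract above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the helper names `cterm_window`, `composition`, and `matthews_cc`, the left-padding of short sequences with `-`, and returning 0.0 when the coefficient is undefined are all choices made here. The SVM and its logistic output function are not shown.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues


def cterm_window(sequence, size=12, pad="-"):
    """Last `size` residues; shorter sequences are left-padded (padding is an assumption)."""
    tail = sequence[-size:]
    return pad * (size - len(tail)) + tail


def composition(sequence):
    """Fraction of each standard residue over the full sequence."""
    total = len(sequence)
    return {aa: (sequence.count(aa) / total if total else 0.0) for aa in AMINO_ACIDS}


def matthews_cc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention chosen here for the undefined case
    return (tp * tn - fp * fn) / denominator
```

The coefficient ranges from -1 to 1, with 0 corresponding to chance-level prediction, which makes it a reasonable summary when positive and negative examples are unevenly represented.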

5 citations



Proceedings ArticleDOI
01 Jan 2005
TL;DR: A new heuristic algorithm is proposed to compute the reversal distance between two genomes with multigene families via the concept of binary integer programming without removing gene duplicates.
Abstract: Hannenhalli and Pevzner developed the first polynomial-time algorithm for the combinatorial problem of sorting signed genomic data. Their algorithm computes the minimum number of reversals required to rearrange one genome into another when there are no gene duplications. In this paper, we show how to extend the Hannenhalli-Pevzner approach to genomes with multigene families. We propose a new heuristic algorithm to compute the reversal distance between two genomes with multigene families via the concept of binary integer programming without removing gene duplicates. The experimental results on simulated and real biological data demonstrate that the proposed algorithm is able to find the reversal distance accurately.
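To make the quantity in the abstract above concrete, here is a small sketch of a signed reversal and a brute-force breadth-first search for the reversal distance. It is neither the Hannenhalli-Pevzner algorithm nor the paper's integer-programming heuristic; it is exponential in genome length and only usable on toy inputs. The function names and the representation of a genome as a sequence of signed integers are assumptions made for illustration.

```python
from collections import deque


def signed_reversal(genome, start, end):
    """Reverse genome[start..end] (inclusive) and flip the sign of each gene in it."""
    genome = tuple(genome)
    segment = tuple(-gene for gene in reversed(genome[start:end + 1]))
    return genome[:start] + segment + genome[end + 1:]


def reversal_distance_bruteforce(source, target):
    """Minimum number of signed reversals turning `source` into `target` (toy sizes only)."""
    source, target = tuple(source), tuple(target)
    if sorted(map(abs, source)) != sorted(map(abs, target)):
        raise ValueError("genomes must contain the same genes")
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        genome, steps = queue.popleft()
        if genome == target:
            return steps
        # Try every contiguous segment as the next reversal.
        for start in range(len(genome)):
            for end in range(start, len(genome)):
                candidate = signed_reversal(genome, start, end)
                if candidate not in seen:
                    seen.add(candidate)
                    queue.append((candidate, steps + 1))
```

Because genomes are compared as plain tuples, repeated gene labels (a stand-in for multigene families) are handled without removing duplicates, at the cost of exhaustive search.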

1 citation


Proceedings ArticleDOI
01 Jan 2005
TL;DR: This work can be seen as building upon the currently popular series of predictors SignalP and TargetP, by exploiting the inherent bias for sequential pattern recognition exhibited by recurrent networks.
Abstract: Knowledge of targeting signals is of immense importance for understanding the cellular processes by which proteins are sorted and transported. This paper presents a system of recurrent neural networks which demonstrate an ability to detect residues belonging to specific targeting peptides with greater accuracy than current feed forward models. The system can subsequently be used for determining sub-cellular localisation of proteins and for understanding the factors underlying translocation. The work can be seen as building upon the currently popular series of predictors SignalP and TargetP, by exploiting the inherent bias for sequential pattern recognition exhibited by recurrent networks.

1 citation


Proceedings ArticleDOI
01 Jan 2005
TL;DR: A machine learning model is presented that predicts a structural disruption score from a protein’s primary structure using a two-step approach and indicates the feasibility of replacing SCHEMA with little loss of precision.
Abstract: We present a machine learning model that predicts a structural disruption score from a protein’s primary structure. SCHEMA was introduced by Frances Arnold and colleagues as a method for determining putative recombination sites of a protein on the basis of the full (PDB) description of its structure. The present method provides an alternative to SCHEMA that is able to determine the same score from sequence data only. Circumventing the need for resolving the full structure enables the exploration of as yet unresolved and even hypothetical sequences for protein design efforts. Deriving the SCHEMA score from a primary structure is achieved using a two-step approach: first predicting a secondary structure from the sequence and then predicting the SCHEMA score from the predicted secondary structure. The correlation coefficient for the prediction is 0.88 and indicates the feasibility of replacing SCHEMA with little loss of precision.
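The abstract above does not state which correlation coefficient is reported. Assuming it is the Pearson coefficient between predicted and structure-derived scores, the measure could be computed along these lines; the function name `pearson_r` and its error handling are illustrative choices, not taken from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equally long sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)
```

Under that assumption, a value of 0.88 would mean the predicted scores track the structure-derived scores closely but not perfectly.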