Journal ArticleDOI

ChloroP, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites.

01 Jan 1999-Protein Science (Wiley-Blackwell)-Vol. 8, Iss: 5, pp 978-984
TL;DR: An analysis of 715 Arabidopsis thaliana sequences from SWISS‐PROT suggests that the ChloroP method should be useful for the identification of putative transit peptides in genome‐wide sequence data.
Abstract: We present a neural network-based method (ChloroP) for identifying chloroplast transit peptides and their cleavage sites. Using cross-validation, 88% of the sequences in our homology-reduced training set were correctly classified as transit peptides or nontransit peptides. This performance level is well above that of the publicly available chloroplast localization predictor PSORT. Cleavage sites are predicted using a scoring matrix derived by an automatic motif-finding algorithm. Approximately 60% of the known cleavage sites in our sequence collection were predicted to within ±2 residues from the cleavage sites given in SWISS-PROT. An analysis of 715 Arabidopsis thaliana sequences from SWISS-PROT suggests that the ChloroP method should be useful for the identification of putative transit peptides in genome-wide sequence data. The ChloroP predictor is available as a web-server at http://www.cbs.dtu.dk/services/ChloroP/.
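
The cleavage-site step described in the abstract can be illustrated with a minimal Python sketch: a position-specific scoring matrix is slid along the N-terminal sequence, the best-scoring window gives the predicted cleavage position, and a prediction counts as correct if it falls within ±2 residues of the annotated site. The matrix values, the window layout (3 residues on each side of the cut), and all function names are illustrative assumptions; this is not the ChloroP matrix or its code.

```python
# Hypothetical log-odds scores, one dict per window column.
# These numbers are made up for illustration; they are NOT the ChloroP matrix.
TOY_MATRIX = [
    {"V": 1.0, "I": 0.8},  # position -3 relative to the cut
    {"R": 0.9, "K": 0.4},  # position -2
    {"A": 1.2, "C": 0.6},  # position -1
    {"A": 0.7, "S": 0.3},  # position +1
    {"A": 0.5},            # position +2
    {"V": 0.4},            # position +3
]
LEFT = 3  # number of window columns that lie before the cut (assumption)


def score_window(window, matrix=TOY_MATRIX, default=-0.5):
    """Sum the column scores for one window; unlisted residues get `default`."""
    return sum(col.get(res, default) for col, res in zip(matrix, window))


def predict_cleavage_site(seq, matrix=TOY_MATRIX, left=LEFT):
    """Return (number of residues before the predicted cut, best score)."""
    width = len(matrix)
    best_pos, best_score = None, float("-inf")
    for start in range(len(seq) - width + 1):
        s = score_window(seq[start:start + width], matrix)
        if s > best_score:
            best_pos, best_score = start + left, s
    return best_pos, best_score


def within_tolerance(predicted, annotated, tol=2):
    """True if the prediction is within +/- tol residues of the annotation."""
    return abs(predicted - annotated) <= tol
```

For example, `predict_cleavage_site("MSSLLGGVRAAAVTTT")` places the cut after the tenth residue under this toy matrix, and `within_tolerance(10, 12)` would count that as a hit against an annotated site at position 12.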


Citations
Journal ArticleDOI
TL;DR: A neural network-based tool, TargetP, for large-scale subcellular location prediction of newly identified proteins has been developed and it is estimated that 10% of all plant proteins are mitochondrial and 14% chloroplastic, and that the abundance of secretory proteins, in both Arabidopsis and Homo, is around 10%.

4,268 citations


Cites background or methods from "ChloroP, a neural network-based met..."

  • ...The full-size sets were also tested on PSORT (Nakai & Kanehisa, 1992; Horton & Nakai, 1997) and MitoProt (Claros, 1995; Claros & Vincens, 1996) as well as on TargetP's predecessors SignalP (Nielsen et al., 1997) and ChloroP (Emanuelsson et al., 1999)....

    [...]

  • ...tested contains no sequences that participated in the development of teolytic activity in the stroma (Emanuelsson et al., 1999)....

    [...]

  • ...The TargetP predictor has neural networks in two layers (Figure 2), roughly in the same manner as for ChloroP (Emanuelsson et al., 1999), with the first layer consisting of one network for each type of presequence (i....

    [...]

  • ..., 1997) or cTPs (ChloroP) (Emanuelsson et al., 1999) in a protein sequence....

    [...]

  • ...…network architecture and training The TargetP predictor has neural networks in two layers (Figure 2), roughly in the same manner as for ChloroP (Emanuelsson et al., 1999), with the first layer consisting of one network for each type of presequence (i.e. three in the plant version and two in the…...

    [...]

Journal ArticleDOI
TL;DR: The properties of three well-known N-terminal sequence motifs directing proteins to the secretory pathway, mitochondria and chloroplasts are described and a brief history of methods to predict subcellular localization based on these sorting signals and other sequence properties are sketched.
Abstract: Determining the subcellular localization of a protein is an important first step toward understanding its function. Here, we describe the properties of three well-known N-terminal sequence motifs directing proteins to the secretory pathway, mitochondria and chloroplasts, and sketch a brief history of methods to predict subcellular localization based on these sorting signals and other sequence properties. We then outline how to use a number of internet-accessible tools to arrive at a reliable subcellular localization prediction for eukaryotic and prokaryotic proteins. In particular, we provide detailed step-by-step instructions for the coupled use of the amino-acid sequence-based predictors TargetP, SignalP, ChloroP and TMHMM, which are all hosted at the Center for Biological Sequence Analysis, Technical University of Denmark. In addition, we describe and provide web references to other useful subcellular localization predictors. Finally, we discuss predictive performance measures in general and the performance of TargetP and SignalP in particular.

3,235 citations


Cites background from "ChloroP, a neural network-based met..."

  • ...A motif, VRA↓AAV, has been detected around the cleavage site (↓...

    [...]

Journal ArticleDOI
15 Aug 2006-Proteins
TL;DR: An approach based on a two‐level support vector machine (SVM) system, which performs well down to 30% sequence identity, although its performance deteriorates considerably for sequences sharing lower sequence identity and when compared with other approaches, this approach performed significantly better.
Abstract: Because the protein's function is usually related to its subcellular localization, the ability to predict subcellular localization directly from protein sequences will be useful for inferring protein functions. Recent years have seen a surging interest in the development of novel computational tools to predict subcellular localization. At present, these approaches, based on a wide range of algorithms, have achieved varying degrees of success for specific organisms and for certain localization categories. A number of authors have noticed that sequence similarity is useful in predicting subcellular localization. For example, Nair and Rost (Protein Sci 2002;11:2836-2847) have carried out extensive analysis of the relation between sequence similarity and identity in subcellular localization, and have found a close relationship between them above a certain similarity threshold. However, many existing benchmark data sets used for the prediction accuracy assessment contain highly homologous sequences-some data sets comprising sequences up to 80-90% sequence identity. Using these benchmark test data will surely lead to overestimation of the performance of the methods considered. Here, we develop an approach based on a two-level support vector machine (SVM) system: the first level comprises a number of SVM classifiers, each based on a specific type of feature vectors derived from sequences; the second level SVM classifier functions as the jury machine to generate the probability distribution of decisions for possible localizations. We compare our approach with a global sequence alignment approach and other existing approaches for two benchmark data sets-one comprising prokaryotic sequences and the other eukaryotic sequences. Furthermore, we carried out all-against-all sequence alignment for several data sets to investigate the relationship between sequence homology and subcellular localization. 
Our results, which are consistent with previous studies, indicate that the homology search approach performs well down to 30% sequence identity, although its performance deteriorates considerably for sequences sharing lower sequence identity. A data set of high homology levels will undoubtedly lead to biased assessment of the performances of the predictive approaches-especially those relying on homology search or sequence annotations. Our two-level classification system based on SVM does not rely on homology search; therefore, its performance remains relatively unaffected by sequence homology. When compared with other approaches, our approach performed significantly better. Furthermore, we also develop a practical hybrid method, which combines the two-level SVM classifier and the homology search method, as a general tool for the sequence annotation of subcellular localization.

1,303 citations

Journal ArticleDOI
16 Dec 1999-Nature
TL;DR: The sequence of chromosome 2 from the Columbia ecotype is reported in two gap-free assemblies (contigs) of 3.6 and 16 megabases, which represents the longest published stretch of uninterrupted DNA sequence assembled from any organism to date.
Abstract: Arabidopsis thaliana (Arabidopsis) is unique among plant model organisms in having a small genome (130-140 Mb), excellent physical and genetic maps, and little repetitive DNA. Here we report the sequence of chromosome 2 from the Columbia ecotype in two gap-free assemblies (contigs) of 3.6 and 16 megabases (Mb). The latter represents the longest published stretch of uninterrupted DNA sequence assembled from any organism to date. Chromosome 2 represents 15% of the genome and encodes 4,037 genes, 49% of which have no predicted function. Roughly 250 tandem gene duplications were found in addition to large-scale duplications of about 0.5 and 4.5 Mb between chromosomes 2 and 1 and between chromosomes 2 and 4, respectively. Sequencing of nearly 2 Mb within the genetically defined centromere revealed a low density of recognizable genes, and a high density and diverse range of vestigial and presumably inactive mobile elements. More unexpected is what appears to be a recent insertion of a continuous stretch of 75% of the mitochondrial genome into chromosome 2.

792 citations

Journal ArticleDOI
TL;DR: This method uses the support vector machines trained by multiple feature vectors based on n‐peptide compositions to predict subcellular localization for Gram‐negative bacteria, and achieves the highest prediction rate ever reported.
Abstract: Gram-negative bacteria have five major subcellular localization sites: the cytoplasm, the periplasm, the inner membrane, the outer membrane, and the extracellular space. The subcellular location of a protein can provide valuable information about its function. With the rapid increase of sequenced genomic data, the need for an automated and accurate tool to predict subcellular localization becomes increasingly important. We present an approach to predict subcellular localization for Gram-negative bacteria. This method uses the support vector machines trained by multiple feature vectors based on n-peptide compositions. For a standard data set comprising 1443 proteins, the overall prediction accuracy reaches 89%, which, to the best of our knowledge, is the highest prediction rate ever reported. Our prediction is 14% higher than that of the recently developed multimodular PSORT-B. Because of its simplicity, this approach can be easily extended to other organisms and should be a useful tool for the high-throughput and large-scale analysis of proteomic and genomic data.
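
As a rough illustration of the n-peptide composition features mentioned in this abstract, the Python sketch below turns a protein sequence into a fixed-length vector of n-peptide frequencies (400 values for n = 2). The function name, the choice to skip fragments containing non-standard residues, and the normalization are assumptions; the abstract does not specify the exact feature definitions, and the SVM training step is omitted.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def npeptide_composition(seq, n=2):
    """Return the frequency of every possible n-peptide in `seq`.

    The vector has 20**n entries in lexicographic order of AMINO_ACIDS.
    Fragments with non-standard letters are skipped (an assumption).
    """
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
    counts = dict.fromkeys(keys, 0)
    total = 0
    for i in range(len(seq) - n + 1):
        frag = seq[i:i + n]
        if frag in counts:
            counts[frag] += 1
            total += 1
    if total == 0:
        return [0.0] * len(keys)
    return [counts[k] / total for k in keys]
```

Vectors for several values of n could then be fed to separate classifiers, which is one plausible reading of "multiple feature vectors" in the abstract.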

746 citations


Cites methods from "ChloroP, a neural network-based met..."

  • ...There are methods (Nakai and Kanehisa 1992; Nielsen et al. 1997; Emanuelsson et al. 1999, 2000; Nakai 2000) based on the observation that sequences targeted to specific locations rely on the N-terminal sorting or signal sequences....

    [...]

References
Journal ArticleDOI
TL;DR: This final installment of the paper considers the case where the signals or the messages or both are continuously variable, in contrast with the discrete nature assumed until now.
Abstract: In this final installment of the paper we consider the case where the signals or the messages or both are continuously variable, in contrast with the discrete nature assumed until now. To a considerable extent the continuous case can be obtained through a limiting process from the discrete case by dividing the continuum of messages and signals into a large but finite number of small regions and calculating the various parameters involved on a discrete basis. As the size of the regions is decreased these parameters in general approach as limits the proper values for the continuous case. There are, however, a few new effects that appear and also a general change of emphasis in the direction of specialization of the general results to particular cases.

65,425 citations

Book ChapterDOI
01 Jan 1988
TL;DR: This chapter contains sections titled: The Problem, The Generalized Delta Rule, Simulation Results, Some Further Generalizations, Conclusion.
Abstract: This chapter contains sections titled: The Problem, The Generalized Delta Rule, Simulation Results, Some Further Generalizations, Conclusion

17,604 citations


"ChloroP, a neural network-based met..." refers methods in this paper

  • ...…what is annotated in SWISS-PROT (negative values), 5 are correctly predicted, and only 3 predicted to be longer (positive values). neurons (Minsky & Papert, 1968) and zero or one layer of hidden units, trained using error backpropagation (Rumelhart et al., 1986), but with different error functions....

    [...]

Journal ArticleDOI
TL;DR: A new method for the identification of signal peptides and their cleavage sites based on neural networks trained on separate sets of prokaryotic and eukaryotic sequence that performs significantly better than previous prediction schemes and can easily be applied on genome-wide data sets.
Abstract: We have developed a new method for the identification of signal peptides and their cleavage sites based on neural networks trained on separate sets of prokaryotic and eukaryotic sequence. The method performs significantly better than previous prediction schemes and can easily be applied on genome-wide data sets. Discrimination between cleaved signal peptides and uncleaved N-terminal signal-anchor sequences is also possible, though with lower precision. Predictions can be made on a publicly available WWW server.

5,480 citations


"ChloroP, a neural network-based met..." refers background or methods in this paper

  • ...The initial set was screened for thylakoid transfer domains using the signal peptide predictor SignalP (Nielsen et al., 1997), and homology reduction was carried out using the Hobohm algorithm 2 (Hobohm et al., 1992)....

    [...]

  • ...(Nielsen et al., 1997), and those that were assigned a cleavage site within ±5 residues from the SWISS-PROT annotated cleavage site were excluded, since they most likely represent bi-partite stroma-thylakoid targeting sequences....

    [...]

  • ...In several neural network applications, this has been handled by monitoring test set performance during training and picking the network where performance on the test set was optimal (Qian & Sejnowski, 1988; Brunak et al., 1991; Nielsen et al., 1997)....

    [...]
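
The homology reduction mentioned in the first excerpt above can be sketched in Python. The Hobohm algorithm 2 is commonly described as "remove until done": build each sequence's list of similar neighbours, repeatedly drop the sequence with the most neighbours, and stop when no similar pair remains. The `similar` predicate is a hypothetical stand-in for whatever alignment-based threshold is used; the excerpt does not state it.

```python
def hobohm2(names, similar):
    """Greedy homology reduction in the style of Hobohm algorithm 2.

    names   -- iterable of sequence identifiers
    similar -- hypothetical symmetric predicate, similar(a, b) -> bool
    Returns the identifiers that survive, in their original order.
    """
    names = list(names)
    # Neighbour list: for each sequence, the set of sequences similar to it.
    neighbours = {
        a: {b for b in names if b != a and similar(a, b)} for a in names
    }
    while True:
        # Pick the sequence with the most remaining neighbours.
        worst = max(neighbours, key=lambda k: len(neighbours[k]), default=None)
        if worst is None or not neighbours[worst]:
            break  # no similar pairs left
        # Remove it and erase it from its neighbours' lists.
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
    return list(neighbours)
```

With sequences a, b, c, d where only the pairs (a, b) and (b, c) are similar, removing b alone resolves both pairs, so a, c and d are kept.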

01 Jan 1997
TL;DR: A new method for the identification of signal peptides and their cleavage sites, based on neural networks trained on separate sets of prokaryotic and eukaryotic sequence.
Abstract: We have developed a new method for the identification of signal peptides and their cleavage sites based on neural networks trained on separate sets of prokaryotic and eukaryotic sequence. The method performs significantly better than previous prediction schemes and can easily be applied on genome-wide data sets. Discrimination between cleaved signal peptides and uncleaved N-terminal signal-anchor sequences is also possible, though with lower precision. Predictions can be made on a publicly available WWW server.

5,191 citations