Hybrid biogeography based simultaneous feature selection and prediction of n-myristoylation substrate proteins using support vector machines and random forest classifiers
20 Dec 2012 · pp. 364-371
TL;DR: The simulations indicate that N-myristoylation sites can be identified with high accuracy using hybrid BBO wrappers in combination with weighted filter methods.
Abstract: The majority of proteins undergo important post-translational modifications (PTMs) that may alter the physical and chemical properties of the protein and, most importantly, its functions. Laboratory methods for determining PTM sites in proteins are laborious and expensive. Computational approaches, by contrast, are far swifter and more economical, and models for prediction of PTMs can be quite accurate as well. Among the PTMs, protein N-terminal N-myristoylation by myristoyl-CoA:protein N-myristoyltransferase (NMT) is an important lipid-anchor modification of eukaryotic and viral proteins, with NMT substrates accounting for about 0.5% of encoded proteins. Reliable recognition of myristoylation capability from the substrate amino acid sequence is useful for proteomic functional annotation projects as well as for building therapeutics targeting NMT. Using computational techniques, prediction-based models can be developed and new functions of protein substrates can be identified.
In this study, we employ Biogeography-Based Optimization (BBO) for feature selection, along with Support Vector Machines (SVM) and Random Forest classifiers, for classification of N-myristoylation sequences. The simulations indicate that N-myristoylation sites can be identified with high accuracy using hybrid BBO wrappers in combination with weighted filter methods.
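The study's pipeline pairs a BBO wrapper with an SVM/RF classifier. As a rough illustration of how such a wrapper operates, the sketch below evolves binary feature masks ("habitats") whose suitability is the cross-validated accuracy of an SVM on the selected features; the migration and mutation rates, population size, and synthetic data are illustrative assumptions, not the paper's actual operators, filters, or encodings.

```python
# Minimal BBO-style wrapper for feature selection (illustrative sketch only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

def fitness(mask):
    # Habitat suitability index = CV accuracy of an SVM on selected features.
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(SVC(kernel="rbf"), X[:, mask], y, cv=3).mean()

pop = rng.random((10, X.shape[1])) < 0.5           # habitats = feature masks
for gen in range(5):
    fit = np.array([fitness(m) for m in pop])
    order = np.argsort(-fit)                       # best habitat first
    pop = pop[order]
    mu = np.linspace(1, 0, len(pop))               # emigration rate (best emigrates most)
    lam = 1 - mu                                   # immigration rate (worst immigrates most)
    new_pop = pop.copy()
    for i in range(len(pop)):
        for j in range(pop.shape[1]):
            if rng.random() < lam[i]:              # migrate a feature bit in
                src = rng.choice(len(pop), p=mu / mu.sum())
                new_pop[i, j] = pop[src, j]
            if rng.random() < 0.02:                # small mutation probability
                new_pop[i, j] = ~new_pop[i, j]
    new_pop[0] = pop[0]                            # elitism: keep the best habitat
    pop = new_pop

best = pop[0]
print("selected features:", np.flatnonzero(best))
print("CV accuracy: %.3f" % fitness(best))
```

In a full implementation the fitness would also penalize subset size, and the filter scores mentioned in the abstract would bias the initial population.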
Citations
TL;DR: This work highlights the use of conformational similarity, a feature that reflects amino acid flexibility, and of hydrophobicity for predicting phase-separating proteins, and argues that such "interpretable" features obtained from the ever-growing knowledgebase of phase separation are likely to improve prediction performances further.
Abstract: Phase separation of proteins plays key roles in cellular physiology, including bacterial division, tumorigenesis, etc. Consequently, understanding the molecular forces that drive phase separation has gained considerable attention, and several factors, including hydrophobicity and protein dynamics, have been implicated in phase separation. Data-driven identification of new phase-separating proteins can enable an in-depth understanding of cellular physiology and may pave the way towards developing novel methods of tackling disease progression. In this work, we exploit the existing wealth of data on phase-separating proteins to develop a sequence-based machine learning method for prediction of phase-separating proteins. We use reduced alphabet schemes based on hydrophobicity and conformational similarity, along with distributed representations of protein sequences and biochemical properties, as input features to Support Vector Machine (SVM) and Random Forest (RF) machine learning algorithms. We used both curated and balanced datasets for building the models. RF trained on the balanced dataset with hydropathy and conformational similarity embeddings and biochemical properties achieved an accuracy of 97%. Our work highlights the use of conformational similarity, a feature that reflects amino acid flexibility, and of hydrophobicity for predicting phase-separating proteins. Use of such "interpretable" features obtained from the ever-growing knowledgebase of phase separation is likely to improve prediction performances further.
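The reduced-alphabet idea above can be sketched simply: residues are mapped into a few physicochemical classes and the class composition becomes a fixed-length feature vector suitable for SVM or RF input. The three-class hydropathy grouping below is an illustrative assumption; the paper's actual alphabet schemes and distributed embeddings are not reproduced here.

```python
# Sketch of a reduced-alphabet hydropathy encoding (illustrative grouping).
from collections import Counter

HYDRO_GROUPS = {                      # assumed 3-class hydropathy alphabet
    "H": "AVLIMFWC",                  # hydrophobic
    "P": "STNQYH",                    # polar
    "C": "DEKRGP",                    # charged / special
}
AA_TO_GROUP = {aa: g for g, aas in HYDRO_GROUPS.items() for aa in aas}

def reduced_composition(seq):
    """Fraction of residues in each hydropathy class -> fixed-length feature."""
    groups = [AA_TO_GROUP.get(aa, "P") for aa in seq.upper()]
    counts = Counter(groups)
    n = len(seq)
    return [counts[g] / n for g in "HPC"]

print(reduced_composition("MGQSLTA"))   # 3-element composition vector
```

Vectors like this, concatenated with other biochemical descriptors, are what the SVM/RF models would consume.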
References
TL;DR: Internal estimates monitor error, strength, and correlation; these are used to show the response to increasing the number of features used in the splitting, and are also applicable to regression.
Abstract: Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to AdaBoost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International Conference, 1996, 148-156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation, and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
58,232 citations
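The "internal estimates" the abstract describes are exposed by modern implementations. A minimal sketch using scikit-learn's RandomForestClassifier (a stand-in, not Breiman's original code) shows the out-of-bag error estimate and variable importances on synthetic data:

```python
# Out-of-bag (OOB) error estimate and variable importance with a random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=200,
                            max_features="sqrt",   # random feature selection per split
                            oob_score=True,        # enable the internal error estimate
                            random_state=0).fit(X, y)

print("OOB accuracy: %.3f" % rf.oob_score_)
print("importances:", rf.feature_importances_.round(3))
```

The OOB score is computed from samples left out of each tree's bootstrap, so no separate validation set is needed.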
TL;DR: Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
Abstract: LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
37,868 citations
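Two of the issues the article covers, probability estimates and parameter selection, can be sketched through scikit-learn's SVC, which is built on LIBSVM; the parameter grid below is an arbitrary illustration, not a recommended search space:

```python
# Parameter selection via cross-validated grid search, plus probability output,
# using scikit-learn's SVC (a wrapper around LIBSVM).
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
grid = GridSearchCV(SVC(probability=True),               # Platt-style probabilities
                    {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]},
                    cv=5).fit(X, y)

print("best params:", grid.best_params_)
probs = grid.predict_proba(X[:2])                        # each row sums to 1
print(probs.round(3))
```

Enabling `probability=True` fits an extra calibration step internally, which is why it is optional and slightly slower.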
TL;DR: High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated, and the performance of the support-vector network is compared to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.
Abstract: The support-vector network is a new learning machine for two-group classification problems. The machine conceptually implements the following idea: input vectors are non-linearly mapped to a very high-dimensional feature space. In this feature space a linear decision surface is constructed. Special properties of the decision surface ensure high generalization ability of the learning machine. The idea behind the support-vector network was previously implemented for the restricted case where the training data can be separated without errors. We here extend this result to non-separable training data.
High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated. We also compare the performance of the support-vector network to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.
35,157 citations
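The core idea, non-linearly mapping inputs so that a linear decision surface becomes sufficient, can be sketched with a polynomial kernel on XOR-style data that defeats a linear boundary (the data and hyperparameters below are illustrative assumptions):

```python
# A degree-2 polynomial kernel separates XOR-style data; a linear SVM cannot.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)       # label = sign of x1 * x2 (XOR-like)

lin = SVC(kernel="linear", C=10).fit(X, y)
poly = SVC(kernel="poly", degree=2, coef0=0, C=10).fit(X, y)

print("linear training accuracy: %.2f" % lin.score(X, y))
print("poly   training accuracy: %.2f" % poly.score(X, y))
```

The degree-2 kernel's implicit feature space contains the cross-term x1*x2, so the classes become linearly separable there, which is exactly the mapping argument of the abstract.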
Book
TL;DR: This book presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects, and provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.
Abstract: The increasing volume of data in modern business and science calls for more complex and sophisticated tools. Although advances in data mining technology have made extensive data collection much easier, the field is still evolving and there is a constant need for new techniques and tools that can help us transform this data into useful information and knowledge. Since the previous edition's publication, great advances have been made in the field of data mining. Not only does the third edition of Data Mining: Concepts and Techniques continue the tradition of equipping you with an understanding and application of the theory and practice of discovering patterns hidden in large data sets, it also focuses on new, important topics in the field: data warehouses and data cube technology, stream mining, mining social networks, and mining spatial, multimedia and other complex data. Each chapter is a stand-alone guide to a critical topic, presenting proven algorithms and sound implementations ready to be used directly or with strategic modification against live data. This is the resource you need if you want to apply today's most powerful data mining techniques to meet real business challenges.
* Presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects.
* Addresses advanced topics such as mining object-relational databases, spatial databases, multimedia databases, time-series databases, text databases, the World Wide Web, and applications in several fields.
* Provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.
23,590 citations
TL;DR: This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.
Abstract: More than twelve years have elapsed since the first public release of WEKA. In that time, the software has been rewritten entirely from scratch, evolved substantially, and now accompanies a text on data mining [35]. These days, WEKA enjoys widespread acceptance in both academia and business, has an active community, and has been downloaded more than 1.4 million times since being placed on SourceForge in April 2000. This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.
18,835 citations