Hybrid feature selection and peptide binding affinity prediction using an EDA based algorithm
20 Jun 2013, pp. 2384-2389
TL;DR: This study employs a hybrid Estimation of Distribution Algorithm (EDA)-based filter-wrapper methodology to simultaneously extract informative feature subsets and build robust QSAR models.
Abstract: Protein function prediction is an important problem in functional genomics. Typically, protein sequences are represented by feature vectors. A major problem with protein datasets, one that increases the complexity of classification models, is their large number of features. The process of drug discovery often involves the use of quantitative structure-activity relationship (QSAR) models to identify chemical structures that could have good inhibitory effects on specific targets and low toxicity (non-specific activity). QSAR models are regression or classification models used in the chemical and biological sciences. Because of this high dimensionality, feature selection becomes essential. In this study, we therefore employ a hybrid Estimation of Distribution Algorithm (EDA)-based filter-wrapper methodology to simultaneously extract informative feature subsets and build robust QSAR models. The performance of the algorithm was tested on the benchmark classification challenge datasets from the CoEPrA competition platform, developed in 2006. Our results clearly demonstrate the efficacy of the hybrid EDA filter-wrapper algorithm in comparison with previously reported results.
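The EDA-driven subset search described in the abstract can be illustrated with a compact sketch. This is not the authors' implementation: the univariate EDA (UMDA) variant, the population settings, and the toy scoring function (a stand-in for the wrapper's cross-validated QSAR model) are all illustrative assumptions.

```python
import random

def umda_feature_selection(score, n_features, pop_size=40, top_frac=0.5,
                           generations=30, seed=0):
    """Univariate EDA (UMDA): evolve per-feature inclusion probabilities
    toward feature subsets (bit masks) with a high wrapper score."""
    rng = random.Random(seed)
    probs = [0.5] * n_features            # start with no feature preference
    best_mask, best_score = None, float("-inf")
    for _ in range(generations):
        # Sample a population of candidate subsets from the current model.
        pop = [[1 if rng.random() < p else 0 for p in probs]
               for _ in range(pop_size)]
        pop.sort(key=score, reverse=True)
        elite = pop[:max(1, int(top_frac * pop_size))]
        if score(pop[0]) > best_score:
            best_mask, best_score = pop[0], score(pop[0])
        # Re-estimate marginal inclusion probabilities from the elite,
        # clamped away from 0/1 to keep exploring.
        probs = [min(0.95, max(0.05, sum(m[i] for m in elite) / len(elite)))
                 for i in range(n_features)]
    return best_mask, best_score

# Toy wrapper score: features 0-2 are informative; every other selected
# feature incurs a penalty (mimicking a validation-accuracy criterion).
def toy_score(mask):
    return sum(mask[:3]) - 0.4 * sum(mask[3:])

mask, s = umda_feature_selection(toy_score, n_features=10)
```

In a real filter-wrapper hybrid, `toy_score` would be replaced by a filter pre-ranking plus a learner trained on the masked features.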
Citations
[...]
TL;DR: This paper presents a comprehensive survey of the state-of-the-art work on EC for feature selection, which identifies the contributions of these different algorithms.
Abstract: Feature selection is an important task in data mining and machine learning to reduce the dimensionality of the data and increase the performance of an algorithm, such as a classification algorithm. However, feature selection is a challenging task due mainly to the large search space. A variety of methods have been applied to solve feature selection problems, where evolutionary computation (EC) techniques have recently gained much attention and shown some success. However, there are no comprehensive guidelines on the strengths and weaknesses of alternative approaches. This leads to a disjointed and fragmented field with ultimately lost opportunities for improving performance and successful applications. This paper presents a comprehensive survey of the state-of-the-art work on EC for feature selection, which identifies the contributions of these different algorithms. In addition, current issues and challenges are also discussed to identify promising areas for future research.
855 citations
Cites methods from "Hybrid feature selection and peptid..."
[...]
TL;DR: A hybrid genetic algorithm with a wrapper-embedded feature selection approach (HGAWE), which combines a genetic algorithm (global search) with embedded regularization approaches (local search), together with a novel chromosome representation for the global and local optimization procedures in HGAWE.
Abstract: Feature selection is an important research area for big data analysis. In recent years, various feature selection approaches have been developed, which can be divided into four categories: filter, wrapper, embedded, and combined methods. In the combined category, many hybrid genetic approaches from evolutionary computation combine filter and wrapper measures of feature evaluation to implement population-based global optimization with efficient local search. However, there are limitations to existing combined methods, such as two-stage and inconsistent feature evaluation measures, difficulties in analyzing data with high feature interaction, and challenges in handling large-scale features and instances. Focusing on these three limitations, we propose a hybrid genetic algorithm with a wrapper-embedded feature selection approach (HGAWE), which combines a genetic algorithm (global search) with embedded regularization approaches (local search). We also propose a novel chromosome representation (intron+exon) for the global and local optimization procedures in HGAWE. Based on this “intron+exon” encoding, the regularization method can select the relevant features and construct the learning model simultaneously, while the genetic operations aim to globally optimize the control parameters in the above non-convex regularization. We note that any efficient regularization approach can serve as the embedded method in HGAWE, and a hybrid $L_{1/2}+L_{2}$ regularization approach is investigated as an example in this paper. An empirical study of the HGAWE approach on simulation data and five gene microarray data sets indicates that it outperforms the existing combined methods in terms of feature selection and classification accuracy.
48 citations
Cites methods from "Hybrid feature selection and peptid..."
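The "embedded local search plus genetic global search" idea behind HGAWE can be sketched in miniature. Everything below is a hedged toy, not the paper's method: soft-thresholding stands in for the $L_{1/2}+L_{2}$ regularization, the chromosome is reduced to a single regularization strength, and the relevance scores and fitness are invented.

```python
import random

def soft_threshold(c, lam):
    # Embedded step: L1-style shrinkage of per-feature scores; features
    # surviving the threshold are the ones "selected" by the local learner.
    return [max(abs(x) - lam, 0.0) * (1 if x >= 0 else -1) for x in c]

def hgawe_sketch(feature_scores, fitness, pop_size=20, generations=25, seed=1):
    """Toy HGAWE-style loop: each chromosome is one regularization strength
    (the control parameter); the embedded shrinkage performs local feature
    selection while the GA globally tunes the strength."""
    rng = random.Random(seed)
    pop = [rng.uniform(0.0, 1.0) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda lam: fitness(soft_threshold(feature_scores, lam)),
                 reverse=True)
        parents = pop[:pop_size // 2]          # elitist truncation selection
        # Blend crossover plus Gaussian mutation of the control parameter.
        pop = parents + [max(0.0, (rng.choice(parents) + rng.choice(parents)) / 2
                             + rng.gauss(0, 0.05))
                         for _ in range(pop_size - len(parents))]
    best = max(pop, key=lambda lam: fitness(soft_threshold(feature_scores, lam)))
    sel = [j for j, w in enumerate(soft_threshold(feature_scores, best)) if w != 0.0]
    return best, sel

scores = [0.9, 0.8, 0.7, 0.1, 0.05]   # hypothetical per-feature relevance
strong = {0, 1, 2}                     # features a good model should keep
fitness = lambda w: sum(1 if j in strong else -1
                        for j, x in enumerate(w) if x != 0.0)
lam, selected = hgawe_sketch(scores, fitness)
```

The design point this illustrates is HGAWE's division of labor: the embedded step selects features and fits the model in one pass, while the GA only searches over control parameters of the non-convex regularizer.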
[...]
TL;DR: This work proposes a new Evolutionary Algorithm for the Auto-ML task of automatically selecting the best ensemble of classifiers and their hyper-parameter settings for an input dataset; in experiments, the proposed EA obtained significantly smaller classification error rates than a comparable Auto-WEKA version.
Abstract: Automated Machine Learning (Auto-ML) is an emerging area of ML which consists of automatically selecting the best ML algorithm and its best hyper-parameter settings for a given input dataset, by doing a search in a large space of candidate algorithms and settings. In this work we propose a new Evolutionary Algorithm (EA) for the Auto-ML task of automatically selecting the best ensemble of classifiers and their hyper-parameter settings for an input dataset. The proposed EA was compared against a version of the well-known Auto-WEKA method adapted to search in the same space of algorithms and hyper-parameter settings as the EA. In general, the EA obtained significantly smaller classification error rates than that Auto-WEKA version in experiments with 15 classification datasets.
9 citations
Additional excerpts
[...]
TL;DR: A knowledge management overview of evolutionary feature selection approaches, state-of-the-art cooperative co-evolution and MapReduce-based feature selection techniques, and future research directions is presented.
Abstract: The term “big data” characterizes the massive amounts of data generated by advanced technologies in different domains, using the 4Vs – volume, velocity, variety, and veracity – to indicate the amount of data that can only be processed via computationally intensive analysis, the speed of its creation, the different types of data, and its accuracy. High-dimensional financial data, such as time-series and space-time data, contain a large number of features (variables) while having a small number of samples, which are used to measure various real-time business situations for financial organizations. Such datasets are normally noisy, complex correlations may exist between their features, and many domains, including finance, lack the analytical tools to mine the data for knowledge discovery because of the high dimensionality. Feature selection is an optimization problem: finding a minimal subset of relevant features that maximizes classification accuracy and reduces the computation. Traditional statistics-based feature selection approaches are not adequate to deal with the curse of dimensionality associated with big data. Cooperative co-evolution, a meta-heuristic algorithm and a divide-and-conquer approach, decomposes high-dimensional problems into smaller sub-problems. Further, MapReduce, a programming model, offers a ready-to-use distributed, scalable, and fault-tolerant infrastructure for parallelizing the developed algorithm. This article presents a knowledge management overview of evolutionary feature selection approaches, state-of-the-art cooperative co-evolution and MapReduce-based feature selection techniques, and future research directions.
A N M Bazlur Rashid (Australia), Tonmoy Choudhury (Australia)
5 citations
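The divide-and-conquer idea behind cooperative co-evolution can be sketched for feature selection. This is an illustrative simplification under stated assumptions: the grouping is a naive strided partition, the per-group optimizer is plain random search rather than an evolved subpopulation, and the objective is invented.

```python
import random

def cc_feature_selection(score, n_features, n_groups=2, rounds=6, tries=60,
                         seed=3):
    """Cooperative co-evolution sketch: split the feature indices into
    groups, then repeatedly improve each group's bits by random search
    while freezing the other groups at the current best mask."""
    rng = random.Random(seed)
    groups = [list(range(g, n_features, n_groups)) for g in range(n_groups)]
    best = [0] * n_features
    for _ in range(rounds):
        for idx in groups:          # optimize one sub-problem at a time
            for _ in range(tries):
                cand = best[:]
                for j in idx:       # resample only this group's bits
                    cand[j] = rng.randint(0, 1)
                if score(cand) > score(best):
                    best = cand
    return best, score(best)

# Toy objective: features 0-2 help; every other selected feature costs 0.4.
mask, s = cc_feature_selection(lambda m: sum(m[:3]) - 0.4 * sum(m[3:]), 10)
```

In the MapReduce setting the article surveys, each group's search would run as an independent task, with the shared best mask playing the role of the collaborator exchanged between sub-populations.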
[...]
TL;DR: An improved version of the previous Evolutionary Algorithm (EA) – more precisely, an Estimation of Distribution Algorithm – for the Auto-ML task of automatically selecting the best classifier ensemble and its best hyper-parameter settings for an input dataset.
Abstract: A large number of classification algorithms have been proposed in the machine learning literature. These algorithms have different pros and cons, and no algorithm is the best for all datasets. Hence, a challenging problem consists of choosing the best classification algorithm with its best hyper-parameter settings for a given input dataset. In the last few years, Automated Machine Learning (Auto-ML) has emerged as a promising approach for tackling this problem, by doing a heuristic search in a large space of candidate classification algorithms and their hyper-parameter settings. In this work we propose an improved version of our previous Evolutionary Algorithm (EA) – more precisely, an Estimation of Distribution Algorithm – for the Auto-ML task of automatically selecting the best classifier ensemble and its best hyper-parameter settings for an input dataset. The new version of this EA was compared against its previous version, as well as against a random forest algorithm (a strong ensemble algorithm) and a version of the well-known Auto-ML method Auto-WEKA adapted to search in the same space of classifier ensembles as the proposed EA. In general, in experiments with 21 datasets, the new EA version obtained the best results among all methods in terms of four popular predictive accuracy measures: error rate, precision, recall and F-measure.
2 citations
Cites methods from "Hybrid feature selection and peptid..."
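The "EDA over candidate configurations" idea can be sketched on a toy Auto-ML search space. The configuration names and validation scores below are invented placeholders, and the smoothed categorical model is a simplification of what a full Auto-ML EDA over ensembles and hyper-parameters would maintain.

```python
import random

def eda_model_selection(candidates, evaluate, pop_size=30, generations=15,
                        seed=2):
    """Categorical EDA: keep a probability per candidate configuration,
    sample a population, and refit the distribution to the best half."""
    rng = random.Random(seed)
    probs = {name: 1.0 / len(candidates) for name in candidates}
    best = None
    for _ in range(generations):
        names = list(probs)
        pop = rng.choices(names, weights=[probs[n] for n in names], k=pop_size)
        pop.sort(key=evaluate, reverse=True)
        elite = pop[:pop_size // 2]
        if best is None or evaluate(elite[0]) > evaluate(best):
            best = elite[0]
        # Smoothed re-estimation so no configuration's probability hits zero.
        for n in names:
            probs[n] = 0.9 * (elite.count(n) / len(elite)) + 0.1 / len(names)
    return best

# Hypothetical validation accuracies for each (algorithm, hyper-param) pair.
scores = {"knn_k3": 0.81, "knn_k7": 0.84, "svm_c1": 0.90, "svm_c10": 0.87}
best = eda_model_selection(scores, scores.get)
```

In practice `evaluate` would be a cross-validated fit of the sampled classifier ensemble rather than a lookup table.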
[...]
References
[...]
TL;DR: Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
Abstract: LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
37,868 citations
"Hybrid feature selection and peptid..." refers methods in this paper
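At prediction time, an SVM of the kind LIBSVM trains evaluates a kernel expansion over its support vectors. A minimal sketch of that decision function with an RBF kernel follows; the support vectors, coefficients, and gamma are made-up toy values, not output from LIBSVM itself.

```python
import math

def rbf(x, z, gamma):
    # Gaussian (RBF) kernel: exp(-gamma * ||x - z||^2).
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def svm_decision(x, support_vecs, dual_coefs, rho, gamma=0.5):
    """Decision value f(x) = sum_i (alpha_i * y_i) K(sv_i, x) - rho,
    the form an SVM evaluates at prediction time; sign(f) gives the class."""
    return sum(a * rbf(sv, x, gamma)
               for a, sv in zip(dual_coefs, support_vecs)) - rho

# Hypothetical trained model: two support vectors of opposite classes.
svs = [[0.0, 0.0], [2.0, 2.0]]
coefs = [1.0, -1.0]   # each entry is alpha_i * y_i
```

A query near the first support vector gets a positive decision value, one near the second gets a negative value; LIBSVM stores exactly these pieces (support vectors, dual coefficients, rho) in its model files.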
[...]
Book
[...]
TL;DR: This book presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects, and provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.
Abstract: The increasing volume of data in modern business and science calls for more complex and sophisticated tools. Although advances in data mining technology have made extensive data collection much easier, the field is still evolving, and there is a constant need for new techniques and tools that can help us transform this data into useful information and knowledge. Since the previous edition's publication, great advances have been made in the field of data mining. Not only does the third edition of Data Mining: Concepts and Techniques continue the tradition of equipping you with an understanding and application of the theory and practice of discovering patterns hidden in large data sets, it also focuses on new, important topics in the field: data warehouses and data cube technology, mining data streams, mining social networks, and mining spatial, multimedia and other complex data. Each chapter is a stand-alone guide to a critical topic, presenting proven algorithms and sound implementations ready to be used directly or with strategic modification against live data. This is the resource you need if you want to apply today's most powerful data mining techniques to meet real business challenges. * Presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects. * Addresses advanced topics such as mining object-relational databases, spatial databases, multimedia databases, time-series databases, text databases, the World Wide Web, and applications in several fields. * Provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.
23,590 citations
[...]
21,676 citations
[...]
TL;DR: This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.
Abstract: More than twelve years have elapsed since the first public release of WEKA. In that time, the software has been rewritten entirely from scratch, has evolved substantially, and now accompanies a text on data mining [35]. These days, WEKA enjoys widespread acceptance in both academia and business, has an active community, and has been downloaded more than 1.4 million times since being placed on SourceForge in April 2000. This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.
18,835 citations
"Hybrid feature selection and peptid..." refers methods in this paper
[...]
TL;DR: The contributions of this special issue cover a wide range of aspects of variable selection: providing a better definition of the objective function, feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment methods.
Abstract: Variable and feature selection have become the focus of much research in areas of application for which datasets with tens or hundreds of thousands of variables are available. These areas include text processing of internet documents, gene expression array analysis, and combinatorial chemistry. The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data. The contributions of this special issue cover a wide range of aspects of such problems: providing a better definition of the objective function, feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment methods.
13,554 citations
"Hybrid feature selection and peptid..." refers methods or result in this paper
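One of the simplest instances of the feature-ranking methods this survey covers is a univariate filter: score each feature against the target and sort. The correlation criterion and the tiny dataset below are illustrative assumptions, not taken from the paper.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def rank_features(X, y):
    """Filter-style univariate ranking: score each column of X by the
    absolute correlation with the target and return indices, best first."""
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)

X = [[1, 5, 0], [2, 3, 1], [3, 6, 0], [4, 2, 1]]   # toy data, 3 features
y = [1, 2, 3, 4]
ranking = rank_features(X, y)   # feature 0 tracks y exactly, so it ranks first
```

As the abstract notes, such univariate rankings are fast but blind to feature interactions, which is what motivates the multivariate and search-based selection methods in the rest of the issue.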
[...]
Related Papers (5)
[...]