A Survey on Evolutionary Computation Approaches to Feature Selection
Summary (5 min read)
I. INTRODUCTION
- In data mining and machine learning, real-world problems often involve a large number of features.
- The task is becoming more challenging as the number of features n grows in many areas, driven by advances in data collection techniques and the increasing complexity of real-world problems.
- Feature selection has been used to improve the quality of the feature set in many machine learning tasks, such as classification, clustering, regression, and time series prediction [1] .
- Section V presents the applications of EC based feature selection approaches.
II. BACKGROUND
- Feature selection is a process that selects a subset of relevant features from the original large set of features [9] .
- Based on the evaluation criteria, feature selection algorithms are generally classified into two categories: filter approaches and wrapper approaches [1] , [2] .
- Filters ignore the performance of the selected features on a classification algorithm, while wrappers evaluate feature subsets based on classification performance, which usually results in better classification accuracy (Fig. 2).
- Removing or selecting such features individually, without considering their interactions, may miss the optimal feature subset(s).
- Feature selection involves two main objectives, which are to maximise the classification accuracy and minimise the number of features.
1) Search techniques:
- Both floating search methods (sequential forward and backward floating selection) are claimed to be better than the static sequential methods.
- Feature interaction leads to individually relevant features becoming redundant or individually weakly relevant features becoming highly relevant when combined with other features.
- Many studies show that filter methods do not scale well to problems with more than tens of thousands of features [13] .
- Most existing feature selection methods aim to maximise the classification performance only during the search process, or aggregate the classification performance and the number of features into a single objective function (a common weighted-sum formulation is sketched below).
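A common single-objective aggregation weights the classification error against the fraction of features kept. The sketch below is a minimal illustration, assuming a scikit-learn-style classifier; the function name and the weight alpha are illustrative, not taken from the survey.

```python
import numpy as np

def aggregated_fitness(mask, X, y, classifier, alpha=0.9):
    """Weighted sum of classification error and subset size (both minimised).

    mask: boolean vector over the original features (True = selected).
    alpha: trade-off weight between error and subset size (illustrative).
    """
    if not mask.any():                   # an empty subset is invalid
        return np.inf
    # In practice the error would be estimated on a validation set or
    # via cross-validation rather than on the training data.
    classifier.fit(X[:, mask], y)
    error = 1.0 - classifier.score(X[:, mask], y)
    ratio = mask.sum() / mask.size       # fraction of features kept
    return alpha * error + (1 - alpha) * ratio
```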
B. Detailed Coverage of This Paper
- As shown in Fig. 3, EC based feature selection approaches are classified into different categories according to three criteria: the EC paradigm, the evaluation criterion, and the number of objectives.
- Based on the evaluation criteria, the authors review both filter and wrapper approaches, and also include another group of approaches named "Combined".
- Wrapper approaches are not further categorised according to their measures because the classification algorithm in wrappers is used as a "black box" during the feature selection process such that it can often be easily replaced by another classification algorithm.
- The reviewed literature is organised as follows.
- In addition, Section IV discusses the research on EC based filter approaches for feature selection.
A. GAs for Feature Selection
- GAs are most likely the first EC technique widely applied to feature selection problems.
- To address this limitation, Yahya et al. [112] developed a variable-length representation, where each chromosome encoded only the selected features, so different chromosomes may have different lengths (the fixed-length bit-string baseline is sketched at the end of this subsection).
- Winkler et al. [81] proposed a new representation that included both feature selection and parameter optimisation of a certain classification algorithm, e.g. an SVM.
- They also proposed several fitness functions, which considered the number of features, the overall classification performance, the class-specific accuracy, and the classification accuracy using all the original features.
- In summary, GAs have been applied to feature selection for around 25 years and have achieved reasonably good performance on problems with hundreds of features.
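The classic GA encoding referred to above represents each candidate subset as a fixed-length bit-string over the n original features. A minimal sketch of that encoding with standard variation operators; all names and the mutation rate are illustrative:

```python
import random

def random_chromosome(n_features):
    """A bit-string over all features: 1 = selected, 0 = not selected."""
    return [random.randint(0, 1) for _ in range(n_features)]

def decode(chromosome):
    """Indices of the selected features."""
    return [i for i, bit in enumerate(chromosome) if bit]

def one_point_crossover(a, b):
    """Swap tails of two parents at a random cut point."""
    point = random.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def bit_flip_mutation(chromosome, rate=0.01):
    """Flip each bit independently with the given probability."""
    return [1 - bit if random.random() < rate else bit for bit in chromosome]
```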
TABLE II: Categorisation of GP approaches (single- and multi-objective)
- Compared with GAs and PSO, there are a much smaller number of works on GP for feature selection.
- GP is used more often in feature construction than feature selection because of its flexible representation.
- GP may, however, suffer from high computational cost.
- Two-stage approaches have been investigated in GP for feature selection.
- Venkatraman et al. [124] used a mutual information measure to rank individual features and remove weakly relevant or irrelevant ones in the first stage; GP was then applied to select a subset of the remaining features (a pre-filtering sketch follows).
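A minimal sketch of such stage-1 pre-filtering, assuming NumPy arrays and any per-feature relevance function (e.g. mutual information); the function name and the cut-off value are illustrative:

```python
import numpy as np

def prefilter(X, y, relevance, keep=100):
    """Return the indices of the `keep` highest-scoring features."""
    scores = np.array([relevance(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:keep]   # top-ranked feature indices
```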
TABLE III: Categorisation of PSO approaches (single- and multi-objective)
- The representation of each particle in PSO for feature selection is typically a bit-string, whose dimensionality equals the total number of features in the dataset (a standard binary update over this encoding is sketched at the end of this subsection).
- The dimensionality of the new representation is much smaller than that of the typical representation; however, it is not easy to determine the desired number of features.
- Tran et al. [156] used the gbest resetting mechanism in [140] to reduce the number of features and performed a local search process on pbest to increase the classification performance.
- The fitness function plays an important role in PSO for feature selection.
- Research on PSO for multi-objective feature selection started only in the last two years, where Xue et al. [29] , [161] conducted the first work to optimise the classification performance and the number of features as two separate objectives.
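For the bit-string encoding above, the widely used sigmoid-based binary PSO update flips each bit with a probability derived from its velocity. A minimal sketch for one particle; the parameter values (w, c1, c2) are illustrative, not taken from the survey:

```python
import numpy as np

def bpso_step(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=np.random):
    """One velocity/position update; x, pbest, gbest are 0/1 vectors."""
    r1, r2 = rng.random(x.size), rng.random(x.size)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    prob = 1.0 / (1.0 + np.exp(-v))            # sigmoid transfer function
    x = (rng.random(x.size) < prob).astype(int)  # resample each bit
    return x, v
```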
D. ACO for Feature Selection
- Table IV shows typical works on ACO for feature selection, where the earliest work was proposed around 2003 [183] .
- Khushaba et al. [47] combined ACO and DE for feature selection, where DE was used to search for the optimal feature subset based on the solutions obtained by ACO.
- In most ACO based algorithms [188], [16], features/nodes are fully connected to each other in the graph, but in [189], each feature was connected only to two features.
- At the end of a tour, each ant had a binary vector whose length equals the total number of features, where "1" indicated selecting and "0" indicated removing the corresponding feature (see the construction sketch after this list).
- The fitness functions in [187] , [16] included both the classification performance and the number of features.
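A minimal sketch of how one ant could build such a binary vector, under a simplified pheromone model (one select/skip desirability pair per feature, no heuristic information); all names are illustrative:

```python
import random

def construct_solution(tau_select, tau_skip):
    """tau_*: per-feature pheromone for selecting / skipping that feature."""
    solution = []
    for sel, skip in zip(tau_select, tau_skip):
        p_select = sel / (sel + skip)   # probability of choosing the feature
        solution.append(1 if random.random() < p_select else 0)
    return solution
```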
E. Other EC Techniques for Feature Selection
- DE was introduced to solve feature selection problems in recent years, mainly since 2008.
- As mentioned above, Khushaba et al. [47] combined DE with ACO for feature selection, where DE was used to search for the optimal feature subset based on the solutions obtained by ACO.
- Experiments showed that the proposed algorithm achieved better performance than other traditional feature selection algorithms on EEG brain-computer interface tasks.
- In most memetic based feature selection approaches, an EC technique was used for wrapper feature selection and a local search algorithm was used for filter feature selection.
- Almost all of these approaches are wrapper based methods.
IV. MEASURES IN FILTER APPROACHES
- Feature selection measures have previously been classified into five categories [1] : information measures, consistency measures, dependency (or correlation) measures, distance measures, and precision measures (i.e. wrapper approaches).
- Rough set theory has attracted much attention in ACO for feature selection [183], [196], [204], [206], as discussed in Section III-D.
- Tallón-Ballesteros and Riquelme [203] tested a correlation measure, a consistency measure, and their combination with information gain in ACO for feature selection.
- In summary, different types of filter measures have been adopted in EC for feature selection.
- Among these measures, information measures, correlation measures, and distance measures are computationally relatively cheap while consistency, rough set, and fuzzy set theories based measures may handle noisy data better.
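As a concrete example of the cheap information measures mentioned above, information gain scores a discrete feature by how much it reduces class entropy. A minimal sketch, assuming NumPy arrays of discrete values:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a discrete label vector, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, class_labels):
    """H(class) - H(class | feature) for discrete inputs."""
    h_cond = 0.0
    for value in np.unique(feature_values):
        subset = class_labels[feature_values == value]
        h_cond += (subset.size / class_labels.size) * entropy(subset)
    return entropy(class_labels) - h_cond
```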
V. APPLICATIONS
- Table VII shows the applications of EC for feature selection.
- Generally, the major applications can be grouped into the following five categories: (1) Image and signal processing including image analysis, face recognition, human action recognition, EEG brain-computer-interface, speaker recognition, handwritten digit recognition, personal identification, and music instrument recognition.
- (2) Biological and biomedical tasks including gene analysis, biomarker detection, and disease diagnosis, where selecting the key features and reducing the dimensionality can significantly reduce the cost of clinic validation, disease diagnosis and other related procedures.
- (3) Business and financial problems including financial crisis, credit card issuing in bank systems, and customer churn prediction.
- All the above areas are important to society and daily life.
VI. ISSUES AND CHALLENGES
A. Scalability
- The most pressing issue stems from the trend towards "big data" [13]: the size of the data is becoming increasingly large.
- Nowadays the number of features in many areas, such as gene analysis, can easily reach thousands or even millions.
- This increases the computational cost and requires more advanced search mechanisms; since both aspects have their own issues, the problem cannot be solved by simply increasing computational power.
- Other computational intelligence based techniques have been introduced for feature selection tasks with dimensionality in the range of millions [13], [36].
- The first stage of such approaches removes lowly-ranked features without considering their interactions with other features.
B. Computational Cost
- Most feature selection methods suffer from the problem of being computationally expensive, which is a particularly serious issue in EC for feature selection since they often involve a large number of evaluations.
- Filter approaches are generally more efficient than wrapper approaches, but experiments have shown that this is not always true [234] .
- To reduce the computational cost, two main factors, an efficient search technique and a fast evaluation measure, need to be considered [1] .
- A fast evaluation criterion may produce a greater influence than the search technique, since in current approaches the evaluation procedure takes the majority of the computational cost.
C. Search Mechanisms
- Feature selection is an NP-hard problem and has a large complex solution space [239] .
- A related issue is that the new search mechanisms should be stable on feature selection tasks.
- EC algorithms are stochastic approaches, which may produce different solutions when using different starting points.
- Even when the fitness values of the solutions are the same, they may select different individual features.
- Proposing new search algorithms with high stability is therefore also an important task.
D. Measures
- The evaluation measure, which forms the fitness function, is one of the key factors in EC for feature selection.
- Ignoring interactions between features results in subsets with redundancy and a lack of complementary features [2], [242], which in turn cannot achieve optimal classification performance in most domains of interest.
- For feature selection problems, multiple different solutions may have the same fitness values.
- This makes the problem even more challenging.
E. Representation
- A good representation scheme can help to reduce the size of the search space.
- This, in turn, helps in designing new search mechanisms to improve the search ability.
- Another issue is that the current representations usually reflect only whether a feature is selected or not, but the feature interaction information is not shown.
- Furthermore, the interpretation of the solution is also an important issue closely related to the representation.
- Most EC methods are not good at this task except for GP and LCSs as they produce a tree or a population of rules, which are easier to understand and interpret.
F. Multi-Objective Feature Selection
- Most of the existing evolutionary multi-objective (EMO) algorithms are designed for continuous problems [244] , but feature selection is a discrete problem.
- Furthermore, the two main objectives (minimising both the number of features and the classification error rate) do not always conflict with each other, i.e. in some subspaces, decreasing the number of features can also decrease the classification error rate as unnecessary features are removed [29], [154], [158], [171], [173], [194] (a Pareto-dominance sketch follows this list).
- In addition, developing new evaluation metrics and further selection methods to choose a single solution from a set of trade-off solutions is also a challenging topic.
- Finally, besides the two main objectives, other objectives, such as the complexity, the computational time, and the solution size (e.g. tree size in GP and number of rules in LCSs), could also be considered in multi-objective feature selection.
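Underlying these EMO approaches is the Pareto-dominance relation over the two minimised objectives. A minimal sketch, with purely illustrative objective values:

```python
def dominates(a, b):
    """a, b: (error_rate, n_features); True if a Pareto-dominates b."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

# e.g. 10 features at 8% error dominates 15 features at 9% error,
# but neither dominates 5 features at 10% error (a genuine trade-off).
assert dominates((0.08, 10), (0.09, 15))
assert not dominates((0.08, 10), (0.10, 5))
assert not dominates((0.10, 5), (0.08, 10))
```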
G. Feature Construction
- Feature selection does not create new features, as it only selects original features.
- If the original features are not informative enough to achieve promising performance, feature selection may not work well, yet feature construction may work well [3] , [247] .
- One of the challenges for feature construction is to decide when feature construction is needed.
- Meanwhile, feature selection and feature construction can be used together to improve the classification performance and reduce the dimensionality.
- This can be achieved in three different ways: performing feature selection before feature construction, performing feature construction before feature selection, and simultaneously performing both feature selection and construction [3] .
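To make the contrast concrete: where selection only picks from existing features, construction builds new ones, e.g. an arithmetic combination of the kind a GP tree might evolve. A toy sketch; the particular expression is purely illustrative:

```python
def constructed_feature(x):
    """A new feature built from original features f0, f3, f7 of instance x,
    corresponding to a GP tree such as (f0 * f3) - f7."""
    return x[0] * x[3] - x[7]
```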
H. Number of Instances
- The number of instances in a dataset significantly influences the performance and design of experiments [236] .
- Problems arise when the number of instances is either too large or too small.
- The larger the training set, the longer each evaluation takes.
- Meanwhile, for "big data" problems, one needs to reduce not only the number of features but also the number of instances [251].
VII. CONCLUSIONS
- This paper provided a comprehensive survey of EC techniques in solving feature selection problems, which covered all the commonly used EC algorithms and focused on the key factors, such as representation, search mechanisms, and the performance measures as well as the applications.
- Important issues and challenges were also discussed.
- This survey shows that a variety of EC algorithms have recently attracted much attention for addressing feature selection tasks.
- A popular approach in GAs, GP and PSO is to improve the representation to simultaneously select features and optimise the classifiers, e.g. SVMs.