Abstract: Feature selection is an important task in data mining and machine learning to reduce the dimensionality of the data and increase the performance of an algorithm, such as a classification algorithm. However, feature selection is a challenging task due mainly to the large search space. A variety of methods have been applied to solve feature selection problems, where evolutionary computation (EC) techniques have recently gained much attention and shown some success. However, there are no comprehensive guidelines on the strengths and weaknesses of alternative approaches. This leads to a disjointed and fragmented field with ultimately lost opportunities for improving performance and successful applications. This paper presents a comprehensive survey of the state-of-the-art work on EC for feature selection, which identifies the contributions of these different algorithms. In addition, current issues and challenges are also discussed to identify promising areas for future research.

  • In data mining and machine learning, real-world problems often involve a large number of features.
  • The task is becoming more challenging as n, the number of features, is increasing in many areas with advances in data collection techniques and the increased complexity of those problems.
  • Feature selection has been used to improve the quality of the feature set in many machine learning tasks, such as classification, clustering, regression, and time series prediction [1].
  • Section V presents the applications of EC based feature selection approaches.


  • Feature selection is a process that selects a subset of relevant features from the original large set of features [9].
  • Based on the evaluation criteria, feature selection algorithms are generally classified into two categories: filter approaches and wrapper approaches [1], [2].
  • Filters ignore the performance of the selected features on a classification algorithm, while wrappers evaluate the feature subsets based on the classification performance, which usually results in better performance being achieved by wrappers (Fig. 2).
  • The removal or selection of such features may miss the optimal feature subset(s).
  • Feature selection involves two main objectives, which are to maximise the classification accuracy and minimise the number of features.

1) Search techniques:

  • Both floating search methods are claimed to be better than the static sequential methods.
  • Feature interaction leads to individually relevant features becoming redundant or individually weakly relevant features becoming highly relevant when combined with other features.
  • Many studies show that filter methods do not scale well to problems with more than tens of thousands of features [13].
  • Most of the existing feature selection methods aim to maximise the classification performance only during the search process or aggregate the classification performance and the number of features into a single objective function.
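The aggregation mentioned in the last point can be sketched as a weighted sum of the classification error and the fraction of features kept. This is only an illustration: the function name `aggregated_fitness`, the weight `alpha`, and the linear form are assumptions, and the surveyed methods use a variety of aggregation schemes.

```python
def aggregated_fitness(error_rate, n_selected, n_total, alpha=0.9):
    """Single objective to minimise: weighted error plus weighted subset size.

    alpha is a hypothetical trade-off weight (closer to 1 favours accuracy).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if n_total <= 0 or not 0 <= n_selected <= n_total:
        raise ValueError("need n_total > 0 and 0 <= n_selected <= n_total")
    return alpha * error_rate + (1.0 - alpha) * (n_selected / n_total)
```

With equal error rates, the smaller subset receives the lower (better) value, which is how a single objective function can still reward compact subsets.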

B. Detailed Coverage of This Paper

  • As shown in Fig. 3, according to three different criteria, which are the EC paradigms, the evaluation, and the number of objectives, EC based feature selection approaches are classified into different categories.
  • Based on the evaluation criteria, the authors review both filter and wrapper approaches, and also include another group of approaches named "Combined".
  • Wrapper approaches are not further categorised according to their measures because the classification algorithm in wrappers is used as a "black box" during the feature selection process such that it can often be easily replaced by another classification algorithm.
  • The reviewed literature is organised as follows.
  • In addition, Section IV discusses the research on EC based filter approaches for feature selection.

A. GAs for Feature Selection

  • GAs are most likely the first EC technique widely applied to feature selection problems.
  • To address this limitation, Yahya et al. [112] developed a variable length representation, where each chromosome showed the selected features only and different chromosomes may have different lengths.
  • Winkler et al. [81] proposed a new representation that included both feature selection and parameter optimisation of a certain classification algorithm, e.g. an SVM.
  • Winkler et al. [81] proposed several fitness functions, which considered the number of features, the overall classification performance, the class specific accuracy, and the classification accuracy using all the original features.
  • In summary, GAs have been applied to feature selection for around 25 years and have achieved reasonably good performance on problems with hundreds of features.
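A minimal sketch of a GA over fixed-length bit-string chromosomes is given below, assuming tournament selection, one-point crossover, bit-flip mutation, and elitism. These operator choices, the parameter defaults, and the names `ga_feature_selection` and `toy_fitness` are illustrative assumptions, not the design of any particular surveyed paper. In a wrapper, `fitness` would train and score a classifier on the selected features; here a toy function stands in for it.

```python
import random

def ga_feature_selection(n_features, fitness, pop_size=20, generations=30,
                         crossover_rate=0.9, mutation_rate=None, seed=0):
    """Return the best bit mask found (1 = feature selected); fitness is maximised."""
    rng = random.Random(seed)
    if mutation_rate is None:
        mutation_rate = 1.0 / n_features  # about one flipped bit per child
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    best = max(pop, key=fitness)[:]

    def tournament():
        # Binary tournament: the fitter of two random individuals wins.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best[:]]  # elitism: carry the best mask over unchanged
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            if n_features > 1 and rng.random() < crossover_rate:
                cut = rng.randint(1, n_features - 1)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation, applied independently to each gene.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = children
        best = max(pop, key=fitness)[:]
    return best

# Toy stand-in for a wrapper evaluation: features 0, 3 and 4 are "relevant",
# and every selected feature carries a small cost.
RELEVANT = {0, 3, 4}

def toy_fitness(mask):
    return sum(mask[i] for i in RELEVANT) - 0.1 * sum(mask)
```

Because the elite individual is copied into each new generation, the best fitness never decreases from one generation to the next.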


  • Compared with GAs and PSO, there are a much smaller number of works on GP for feature selection.
  • GP is used more often in feature construction than feature selection because of its flexible representation.
  • It may suffer from the problem of high computational cost.
  • Two-stage approaches have been investigated in GP for feature selection.
  • Venkatraman et al. [124] proposed using a mutual information measure to rank individual features and remove weakly relevant or irrelevant features in the first stage; GP was then applied to select a subset of the remaining features.


  • The representation of each particle in PSO for feature selection is typically a bit-string, where the dimensionality is equal to the total number of features in the dataset.
  • The dimensionality of the new representation is much smaller than the typical representation; however, it is not easy to determine the desired number of features.
  • Tran et al. [156] used the gbest resetting mechanism in [140] to reduce the number of features and performed a local search process on pbest to increase the classification performance.
  • The fitness function plays an important role in PSO for feature selection.
  • Research on PSO for multi-objective feature selection started only in the last two years, where Xue et al. [29], [161] conducted the first work to optimise the classification performance and the number of features as two separate objectives.
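The bit-string particle representation described above can be sketched with a binary PSO in which each velocity component is squashed by a sigmoid and used as the probability of setting the corresponding bit. The sigmoid rule, the parameter values (`w`, `c1`, `c2`, `v_max`), and the names `binary_pso` and `toy_pso_fitness` are assumptions made for illustration; the surveyed papers use several different update and thresholding schemes.

```python
import math
import random

def binary_pso(n_features, fitness, n_particles=15, iterations=30,
               w=0.7, c1=1.5, c2=1.5, v_max=4.0, seed=0):
    """Return (gbest_mask, gbest_fitness); fitness is maximised."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(iterations):
        for i in range(n_particles):
            for j in range(n_features):
                v = (w * vel[i][j]
                     + c1 * rng.random() * (pbest[i][j] - pos[i][j])
                     + c2 * rng.random() * (gbest[j] - pos[i][j]))
                vel[i][j] = max(-v_max, min(v_max, v))  # clamp the velocity
                prob = 1.0 / (1.0 + math.exp(-vel[i][j]))  # sigmoid transfer
                pos[i][j] = 1 if rng.random() < prob else 0
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit

# Toy stand-in for a classifier-based fitness: features 0 and 2 help,
# and each selected feature carries a small cost.
def toy_pso_fitness(mask):
    return mask[0] + mask[2] - 0.1 * sum(mask)
```

The dimensionality of each particle equals the total number of features, which is why this representation becomes expensive on datasets with very many features.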

D. ACO for Feature Selection

  • Table IV shows typical works on ACO for feature selection, where the earliest work was proposed around 2003 [183].
  • Khushaba et al. [47] combined ACO and DE for feature selection, where DE was used to search for the optimal feature subset based on the solutions obtained by ACO.
  • In most ACO based algorithms [188], [16], features/nodes are fully connected to each other in the graph, but in [189], each feature was connected only to two features.
  • At the end of a tour, each ant had a binary vector with the length as the total number of features, where "1" indicated selecting and "0" indicated removing the corresponding feature.
  • The fitness functions in [187], [16] included both the classification performance and the number of features.

E. Other EC Techniques for Feature Selection

  • DE was introduced to solve feature selection problems in recent years, mainly since 2008.
  • Khushaba et al. [47] combined DE with ACO for feature selection, where DE was used to search for the optimal feature subset based on the solutions obtained by ACO.
  • Experiments showed that the proposed algorithm achieved better performance than other traditional feature selection algorithms on EEG brain-computer-interface tasks.
  • Therefore, in most memetic based feature selection approaches, an EC technique was used for wrapper feature selection and a local search algorithm was used for filter feature selection.
  • Almost all of them are wrapper based methods.


  • Feature selection measures have previously been classified into five categories [1]: information measures, consistency measures, dependency (or correlation) measures, distance measures, and precision measures (i.e. wrapper approaches).
  • Rough set theory has attracted much attention in ACO for feature selection [183], [196], [204], [206], which has been discussed in Section III-D. Tallón-Ballesteros and Riquelme [203] tested a correlation measure, a consistency measure, and their combination with information gain in ACO for feature selection.
  • In summary, different types of filter measures have been adopted in EC for feature selection.
  • Among these measures, information measures, correlation measures, and distance measures are computationally relatively cheap while consistency, rough set, and fuzzy set theories based measures may handle noisy data better.
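As an example of an information measure, the sketch below estimates the mutual information between one discrete feature and the class labels from simple frequency counts. The function name `mutual_information` and the use of base-2 logarithms are choices made here; continuous features would first need discretising, which is not shown.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Estimate I(X; Y) in bits from paired samples of two discrete variables."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi
```

A feature that duplicates a balanced binary label scores 1 bit, whereas a feature that is independent of the label scores 0, so ranking by this value only needs one pass over the data per feature.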


  • Table VII shows the applications of EC for feature selection.
  • Generally, the major applications can be grouped into the following five categories: (1) Image and signal processing including image analysis, face recognition, human action recognition, EEG brain-computer-interface, speaker recognition, handwritten digit recognition, personal identification, and music instrument recognition.
  • (2) Biological and biomedical tasks including gene analysis, biomarker detection, and disease diagnosis, where selecting the key features and reducing the dimensionality can significantly reduce the cost of clinic validation, disease diagnosis and other related procedures.
  • (3) Business and financial problems including financial crisis, credit card issuing in bank systems, and customer churn prediction.
  • All of the above areas are important to society or daily life.

A. Scalability

  • The most pressing issue is that, owing to the trend in "big data" [13], the size of the data is becoming increasingly large.
  • Nowadays the number of features in many areas, such as gene analysis, can easily reach thousands or even millions.
  • This increases computational cost and requires advanced search mechanisms, but both of these aspects also have their own issues so the problem cannot be solved by only increasing computational power.
  • Other computational intelligence based techniques have been introduced to feature selection tasks in the ranges of millions [13], [36].
  • The first stage removes lowly-ranked features without considering their interaction with other features.

B. Computational Cost

  • Most feature selection methods suffer from the problem of being computationally expensive, which is a particularly serious issue in EC for feature selection since they often involve a large number of evaluations.
  • Filter approaches are generally more efficient than wrapper approaches, but experiments have shown that this is not always true [234].
  • To reduce the computational cost, two main factors, an efficient search technique and a fast evaluation measure, need to be considered [1].
  • A fast evaluation criterion may produce a greater influence than the search technique, since in current approaches the evaluation procedure takes the majority of the computational cost.

C. Search Mechanisms

  • Feature selection is an NP-hard problem and has a large complex solution space [239].
  • A related issue is that the new search mechanisms should be stable on feature selection tasks.
  • EC algorithms are stochastic approaches, which may produce different solutions when using different starting points.
  • Even when the fitness values of the solutions are the same, they may select different individual features.
  • Therefore, proposing new search algorithms with high stability is also an important task.
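One simple way to quantify the stability discussed above is the average pairwise Jaccard index between the feature subsets returned by independent runs. The choice of the Jaccard index and the names `jaccard` and `selection_stability` are assumptions for illustration; the text does not prescribe a particular stability measure.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index of two feature-index collections (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def selection_stability(subsets):
    """Mean pairwise Jaccard index over the subsets from repeated runs."""
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two subsets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A value of 1 means every run selected exactly the same features, while values near 0 indicate that runs with similar fitness are picking largely different features.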

D. Measures

  • The evaluation measure, which forms the fitness function, is one of the key factors in EC for feature selection.
  • Ignoring interactions between features results in subsets with redundancy and a lack of complementary features [2], [242], which in turn cannot achieve optimal classification performance in most domains of interest.
  • For feature selection problems, multiple different solutions may have the same fitness values.
  • This makes the problem even more challenging.

E. Representation

  • A good representation scheme can help to reduce the search space size.
  • It in turn helps to design new search mechanisms to improve the search ability.
  • Another issue is that the current representations usually reflect only whether a feature is selected or not, but the feature interaction information is not shown.
  • Furthermore, the interpretation of the solution is also an important issue closely related to the representation.
  • Most EC methods are not good at this task except for GP and LCSs as they produce a tree or a population of rules, which are easier to understand and interpret.

F. Multi-Objective Feature Selection

  • Most of the existing evolutionary multi-objective (EMO) algorithms are designed for continuous problems [244], but feature selection is a discrete problem.
  • Furthermore, the two main objectives (minimising both the number of features and the classification error rate) are not always conflicting with each other, i.e. in some subspaces, decreasing the number of features can also decrease the classification error rate as unnecessary features are removed [29], [154], [158], [171], [173], [194].
  • Furthermore, developing new evaluation metrics and further selection methods to choose a single solution from a set of trade-off solutions is also a challenging topic.
  • Finally, besides the two main objectives, other objectives, such as the complexity, the computational time, and the solution size (e.g. tree size in GP and number of rules in LCSs), could also be considered in multi-objective feature selection.
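The notion of a set of trade-off solutions can be made concrete with a small non-dominated filter over (number of features, classification error rate) pairs, both to be minimised. The names `dominates` and `pareto_front` and the sample points are hypothetical; a full EMO algorithm would add ranking, diversity preservation, and variation operators on top of this check.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(points):
    """Keep the points that no other point dominates (input order preserved)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical candidate subsets: (number of features, error rate).
candidates = [(2, 0.30), (5, 0.10), (5, 0.20), (8, 0.10), (3, 0.25)]
front = pareto_front(candidates)
```

Here (5, 0.20) and (8, 0.10) are both dominated by (5, 0.10), which illustrates the observation above that the two objectives do not always conflict: dropping from 8 to 5 features costs nothing in error rate.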

G. Feature Construction

  • Feature selection does not create new features, as it only selects original features.
  • If the original features are not informative enough to achieve promising performance, feature selection may not work well, yet feature construction may work well [3], [247].
  • One of the challenges for feature construction is to decide when feature construction is needed.
  • Meanwhile, feature selection and feature construction can be used together to improve the classification performance and reduce the dimensionality.
  • This can be achieved in three different ways: performing feature selection before feature construction, performing feature construction before feature selection, and simultaneously performing both feature selection and construction [3].

H. Number of Instances

  • The number of instances in a dataset significantly influences the performance and design of experiments [236].
  • It causes problems when the number is too big or too small.
  • The larger the data/training size, the longer each evaluation.
  • Meanwhile, "big data" problems require reducing not only the number of features but also the number of instances [251].


  • This paper provided a comprehensive survey of EC techniques in solving feature selection problems, which covered all the commonly used EC algorithms and focused on the key factors, such as representation, search mechanisms, and the performance measures as well as the applications.
  • Important issues and challenges were also discussed.
  • This survey shows that a variety of EC algorithms have recently attracted much attention to address feature selection tasks.
  • A popular approach in GAs, GP and PSO is to improve the representation to simultaneously select features and optimise the classifiers, e.g. SVMs.

CART, are relatively simple and can achieve good performance [35]. Sparse approaches have recently become popular, such as sparse logistic regression for feature selection [36], which has been used for feature selection tasks with millions of features. For example, the sparse logistic regression method [36] automatically assigns a weight to each feature showing its relevance. Irrelevant features are assigned low weights close to zero, which has the effect of filtering out these features. Sparse learning based methods tend to learn simple models due to their bias to features with high weights. These statistical algorithms usually produce good performance with high efficiency, but they often have assumptions about the probability distribution of the data. Furthermore, the cutting plane search method used in [36] works well when the search space is unimodal, but EC approaches can deal well with both unimodal and multimodal search spaces, and the population based search can find a Pareto front of non-dominated (trade-off) solutions. Min et al. [28] developed a rough set theory based algorithm to address feature selection problems under the constraints of having limited resources (e.g. money and time). However, many studies show that filter methods do not scale well to problems with more than tens of thousands of features [13].
3) Number of objectives: Most of the existing feature selection methods aim to maximise the classification performance only during the search process or aggregate the classification performance and the number of features into a single objective function. To the best of our knowledge, all the multi-objective feature selection algorithms to date are based on EC techniques since their population based mechanism producing multiple solutions in a single run is particularly suitable for multi-objective optimisation.
B. Detailed Coverage of This Paper

As shown in Fig. 3, according to three different criteria, which are the EC paradigms, the evaluation, and the number of objectives, EC based feature selection approaches are classified into different categories. These three criteria are the key components in a feature selection method. EC approaches are mainly used as the search techniques in feature selection. Almost all the major EC paradigms have been applied to feature selection and the most popular ones are discussed in this paper, i.e. GAs [37], [38], [39] and GP [19], [40], [41] as typical examples in evolutionary algorithms, PSO [10], [29], [42] and ACO [43], [44], [45], [46] as typical examples in swarm intelligence, and other algorithms recently applied to feature selection, including differential evolution (DE) [47] (some researchers classify DE as a swarm intelligence algorithm), memetic algorithms [49], [50], LCSs [51], [52], evolutionary strategy (ES) [53], artificial bee colony (ABC) [54], [55], and artificial immune systems (AISs) [56], [57]. Based on the evaluation criteria, we review both filter and wrapper approaches, and also include another group of approaches named “Combined”. “Combined” means that the evaluation procedure includes both filter and wrapper measures, which are also called hybrid approaches by some researchers [9], [14]. The use here of “Combined” instead of “hybrid” is to avoid confusion with the concept of hybrid algorithms in the EC field, which hybridise multiple EC search techniques. According to the number of objectives, EC based feature selection approaches are classified into single objective and multi-objective approaches, where the multi-objective approaches correspond to methods aiming to find a Pareto front of trade-off solutions. The approaches that aggregate the number of features and the classification performance into a single fitness function are treated as single objective algorithms in this paper.

Similar to many earlier survey papers on traditional (non-EC) feature selection [1], [7], [8], [9], this paper further reviews different evolutionary filter methods according to measures that are derived from different disciplines. Fig. 4 shows the main categories of measures used in EC based filter approaches. Wrapper approaches are not further categorised according to their measures because the classification algorithm in wrappers is used as a “black box” during the feature selection process such that it can often be easily replaced by another classification algorithm.

Fig. 4. Different measures in EC based filter approaches.

The reviewed literature is organised as follows. Typical approaches are reviewed in Section III, where each subsection discusses a particular EC technique for feature selection (e.g. Section III-A: GAs for feature selection, as shown by the left branch in Fig. 3). Within each subsection, the research using an EC technique is further detailed and discussed according to the evaluation criterion and the number of objectives. In addition, Section IV discusses the research on EC based filter approaches for feature selection. The applications of EC for feature selection are described in Section V.
Table I. Typical works on GAs for feature selection (columns: Single Objective, Multi-Objective).
[3], [37], [58], [38], [39], [44], [59], [60], [61], [62], [63], [64], [65], [66], [67], [68], [69], [70], [71], [72], [73], [74], [75], [76], [77], [78], [79], [80], [81], [82], [83], [84], [85], [86], [87]
[88], [89], [90], [91], [92], [93], [94], [95], [96], [97]
[75], [98], [99], [100], [101], [102], [103], [104], [105], [106]
[107], [108], [109]
A. GAs for Feature Selection

GAs are most likely the first EC technique widely applied to feature selection problems. One of the earliest works was published in 1989 [37]. GAs have a natural representation of a binary string, where “1” shows the corresponding feature is selected and “0” means not selected. Table I shows the typical works on GAs for feature selection. It can be seen that there are more works on wrappers than filters, and more on single objective than multi-objective approaches.
This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication.
The final version of record is available at
Copyright (c) 2016 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing

