
A Survey on Evolutionary Computation Approaches to Feature Selection

TL;DR: This paper presents a comprehensive survey of the state-of-the-art work on EC for feature selection, which identifies the contributions of these different algorithms.
Abstract: Feature selection is an important task in data mining and machine learning to reduce the dimensionality of the data and increase the performance of an algorithm, such as a classification algorithm. However, feature selection is a challenging task due mainly to the large search space. A variety of methods have been applied to solve feature selection problems, where evolutionary computation (EC) techniques have recently gained much attention and shown some success. However, there are no comprehensive guidelines on the strengths and weaknesses of alternative approaches. This leads to a disjointed and fragmented field with ultimately lost opportunities for improving performance and successful applications. This paper presents a comprehensive survey of the state-of-the-art work on EC for feature selection, which identifies the contributions of these different algorithms. In addition, current issues and challenges are also discussed to identify promising areas for future research.

Summary (5 min read)

I. INTRODUCTION

  • In data mining and machine learning, real-world problems often involve a large number of features.
  • The task is becoming more challenging as n, the number of features, increases in many areas with advances in data collection techniques and the increased complexity of those problems.
  • Feature selection has been used to improve the quality of the feature set in many machine learning tasks, such as classification, clustering, regression, and time series prediction [1] .
  • Section V presents the applications of EC based feature selection approaches.

II. BACKGROUND

  • Feature selection is a process that selects a subset of relevant features from the original large set of features [9] .
  • Based on the evaluation criteria, feature selection algorithms are generally classified into two categories: filter approaches and wrapper approaches [1] , [2] .
  • Filters ignore the performance of the selected features on a classification algorithm, while wrappers evaluate feature subsets based on classification performance, which usually results in better performance being achieved by wrappers for a particular classification algorithm (Fig. 2 shows the general feature selection process).
  • Because of feature interactions, the removal or selection of individual features may miss the optimal feature subset(s).
  • Feature selection involves two main objectives: to maximise the classification accuracy and to minimise the number of features (a minimal wrapper-style fitness illustrating this trade-off is sketched after this list).
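The trade-off between these two objectives is often handled through a wrapper-style fitness function. The sketch below is an illustration only, not an algorithm from the survey: it assumes scikit-learn, a KNN classifier, 5-fold cross-validation, and an arbitrary 0.9/0.1 weighting between accuracy and subset size.

    # Minimal wrapper-style fitness sketch: reward cross-validated accuracy,
    # penalise the fraction of features kept. All choices here are illustrative.
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def wrapper_fitness(mask, X, y, alpha=0.9):
        """mask: boolean vector over features, True = selected."""
        mask = np.asarray(mask, dtype=bool)
        if not mask.any():                      # an empty subset is invalid
            return 0.0
        acc = cross_val_score(KNeighborsClassifier(), X[:, mask], y, cv=5).mean()
        size_penalty = mask.sum() / mask.size   # fraction of features selected
        return alpha * acc - (1 - alpha) * size_penalty

A filter fitness would replace the cross-validation call with a classifier-independent measure, for example an information or correlation measure.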

1) Search techniques:

  • Both floating search methods (SFFS and SBFS) are claimed to be better than the static sequential methods.
  • Feature interaction leads to individually relevant features becoming redundant or individually weakly relevant features becoming highly relevant when combined with other features.
  • Many studies show that filter methods do not scale well to problems with more than tens of thousands of features [13] .
  • Most of the existing feature selection methods aim to maximise the classification performance only during the search process or aggregate the classification performance and the number of features into a single objective function.

B. Detailed Coverage of This Paper

  • As shown in Fig. 3 , according to three different criteria, which are the EC paradigms, the evaluation, and the number of objectives, EC based feature selection approaches are classified into different categories.
  • Based on the evaluation criteria, the authors review both filter and wrapper approaches, and also include another group of approaches named "Combined".
  • Wrapper approaches are not further categorised according to their measures because the classification algorithm in wrappers is used as a "black box" during the feature selection process such that it can often be easily replaced by another classification algorithm.
  • The reviewed literature is organised as follows.
  • In addition, Section IV discusses the research on EC based filter approaches for feature selection.

A. GAs for Feature Selection

  • GAs are most likely the first EC technique widely applied to feature selection problems.
  • To address this limitation, Yahya et al. [112] developed a variable length representation, where each chromosome represented only the selected features and different chromosomes could have different lengths.
  • Winkler et al. [81] proposed a new representation that included both feature selection and parameter optimisation of a certain classification algorithm, e.g. an SVM.
  • Winkler et al. [81] proposed several fitness functions, which considered the number of features, the overall classification performance, the class specific accuracy, and the classification accuracy using all the original features.
  • In summary, GAs have been applied to feature selection for around 25 years and have achieved reasonably good performance on problems with hundreds of features.

B. GP for Feature Selection (Table II: Categorisation of GP Approaches)

  • Compared with GAs and PSO, there are far fewer works on GP for feature selection.
  • GP is used more often in feature construction than feature selection because of its flexible representation.
  • GP may suffer from the problem of high computational cost.
  • Two-stage approaches have been investigated in GP for feature selection.
  • Venkatraman et al. [124] proposed to use a mutual information measure to rank individual features and remove weakly relevant or irrelevant features in the first stage, and GP was then applied to select a subset of the remaining features.

C. PSO for Feature Selection (Table III: Categorisation of PSO Approaches)

  • The representation of each particle in PSO for feature selection is typically a bit-string whose dimensionality equals the total number of features in the dataset (a minimal binary PSO sketch follows this list).
  • The dimensionality of the new representation is much smaller than that of the typical representation; however, it is not easy to determine the desired number of features.
  • Tran et al. [156] used the gbest resetting mechanism in [140] to reduce the number of features and performed a local search process on pbest to increase the classification performance.
  • The fitness function plays an important role in PSO for feature selection.
  • Research on PSO for multi-objective feature selection started only in the last two years, where Xue et al. [29] , [161] conducted the first work to optimise the classification performance and the number of features as two separate objectives.
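As a concrete illustration of the bit-string representation mentioned above, the following is a minimal sketch of one common binary PSO variant, in which velocities are mapped to bit probabilities through a sigmoid. It is not taken from any specific surveyed paper; the parameter values are illustrative and fitness is any subset evaluator (e.g. a wrapper's classification accuracy).

    # Minimal binary PSO sketch for feature selection: positions are bit-strings
    # (1 = feature selected), velocities are real-valued and decoded with a sigmoid.
    import numpy as np

    def binary_pso(fitness, n_features, n_particles=30, iters=100,
                   w=0.7, c1=1.5, c2=1.5, seed=0):
        rng = np.random.default_rng(seed)
        X = rng.integers(0, 2, (n_particles, n_features))     # bit-string positions
        V = rng.uniform(-1, 1, (n_particles, n_features))     # real-valued velocities
        pbest, pbest_fit = X.copy(), np.array([fitness(x) for x in X])
        gbest = pbest[pbest_fit.argmax()].copy()
        for _ in range(iters):
            r1, r2 = rng.random(X.shape), rng.random(X.shape)
            V = w * V + c1 * r1 * (pbest - X) + c2 * r2 * (gbest - X)
            V = np.clip(V, -6.0, 6.0)                          # keep probabilities away from 0/1
            X = (rng.random(X.shape) < 1.0 / (1.0 + np.exp(-V))).astype(int)
            fit = np.array([fitness(x) for x in X])
            improved = fit > pbest_fit
            pbest[improved], pbest_fit[improved] = X[improved], fit[improved]
            gbest = pbest[pbest_fit.argmax()].copy()
        return gbest, pbest_fit.max()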

D. ACO for Feature Selection

  • Table IV shows typical works on ACO for feature selection, where the earliest work was proposed around 2003 [183] .
  • Khushaba et al. [47] combined ACO and DE for feature selection, where DE was used to search for the optimal feature subset based on the solutions obtained by ACO.
  • In most ACO based algorithms [188], [16], features/nodes are fully connected to each other in the graph, but in [189], each feature was connected only to two features.
  • At the end of a tour, each ant had a binary vector whose length equals the total number of features, where "1" indicated selecting and "0" indicated removing the corresponding feature (a simplified ACO-style construction step is sketched after this list).
  • The fitness functions in [187] , [16] included both the classification performance and the number of features.
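To make the tour construction above concrete, here is a heavily simplified, ACO-style sketch: each ant includes or excludes every feature with a probability given by a per-feature pheromone value, and the pheromone is then reinforced on the best ant's selections. The graph-based details of [183] and [189] (heuristic desirability, node connectivity) are omitted, so this is an assumption-laden illustration rather than any surveyed algorithm.

    # Simplified ACO-style construction for feature selection: per-feature pheromone
    # drives the probability of selecting each feature; evaporation plus
    # reinforcement on the best binary vector found so far.
    import numpy as np

    def aco_style_feature_selection(fitness, n_features, n_ants=20, iters=50,
                                    rho=0.1, seed=0):
        rng = np.random.default_rng(seed)
        pheromone = np.full(n_features, 0.5)
        best_mask, best_fit = None, float("-inf")
        for _ in range(iters):
            # each ant ends its tour with a binary vector: 1 = select, 0 = remove
            ants = (rng.random((n_ants, n_features)) < pheromone).astype(int)
            fits = np.array([fitness(a) for a in ants])
            if fits.max() > best_fit:
                best_fit, best_mask = fits.max(), ants[fits.argmax()].copy()
            pheromone = (1 - rho) * pheromone + rho * best_mask   # evaporate + reinforce
            pheromone = pheromone.clip(0.05, 0.95)                # keep some exploration
        return best_mask, best_fit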

E. Other EC Techniques for Feature Selection

  • DE was introduced to solve feature selection problems in recent years, mainly since 2008.
  • Khushaba et al. [47] combined DE with ACO for feature selection, where DE was used to search for the optimal feature subset based on the solutions obtained by ACO.
  • Experiments showed that the proposed algorithm achieved better performance than other traditional feature selection algorithms on EEG brain-computer interface tasks.
  • Therefore, in most memetic based feature selection approaches, an EC technique was used for wrapper feature selection and a local search algorithm was used for filter feature selection.
  • Almost all of these approaches are wrapper based methods.

IV. MEASURES IN FILTER APPROACHES

  • Feature selection measures have previously been classified into five categories [1] : information measures, consistency measures, dependency (or correlation) measures, distance measures, and precision measures (i.e. wrapper approaches).
  • Rough set theory has attracted much attention in ACO for feature selection [183] , [196] , [204] , [206] , which has been discussed in Section III-D. Tallón-Ballesteros and Riquelme [203] tested a correlation measure, a consistency measure, and their combination with information gain in ACO for feature selection.
  • In summary, different types of filter measures have been adopted in EC for feature selection (a small information-measure example is sketched after this list).
  • Among these measures, information, correlation, and distance measures are computationally relatively cheap, while consistency measures and rough set and fuzzy set theory based measures may handle noisy data better.
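As a small, concrete example of an information measure used in a filter manner, the sketch below ranks features by their estimated mutual information with the class label and keeps the top k. It relies on scikit-learn's mutual_info_classif and is a single-feature ranking filter, so, as discussed earlier, it ignores feature interactions; it is an illustration, not a method advocated by the survey.

    # Minimal information-measure filter: rank features by mutual information with
    # the class label and keep the k highest-ranked ones.
    import numpy as np
    from sklearn.feature_selection import mutual_info_classif

    def mi_rank_filter(X, y, k=20):
        mi = mutual_info_classif(X, y, random_state=0)   # per-feature relevance estimate
        top_k = np.argsort(mi)[::-1][:k]                 # indices of the k most relevant
        return np.sort(top_k)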

V. APPLICATIONS

  • Table VII shows the applications of EC for feature selection.
  • Generally, the major applications can be grouped into the following five categories: (1) Image and signal processing including image analysis, face recognition, human action recognition, EEG brain-computer-interface, speaker recognition, handwritten digit recognition, personal identification, and music instrument recognition.
  • (2) Biological and biomedical tasks including gene analysis, biomarker detection, and disease diagnosis, where selecting the key features and reducing the dimensionality can significantly reduce the cost of clinic validation, disease diagnosis and other related procedures.
  • (3) Business and financial problems including financial crisis, credit card issuing in bank systems, and customer churn prediction.
  • All of the above areas are important to society and daily life.

A. Scalability

  • The most pressing issue is that, with the trend towards "big data" [13], the size of the data is becoming increasingly large.
  • Nowadays the number of features in many areas, such as gene analysis, can easily reach thousands or even millions.
  • This increases the computational cost and requires advanced search mechanisms, but both aspects have their own issues, so the problem cannot be solved simply by increasing computational power.
  • Other computational intelligence based techniques have been introduced to feature selection tasks with features numbering in the millions [13], [36].
  • The first stage removes lowly-ranked features without considering their interaction with other features.

B. Computational Cost

  • Most feature selection methods suffer from the problem of being computationally expensive, which is a particularly serious issue in EC for feature selection since they often involve a large number of evaluations.
  • Filter approaches are generally more efficient than wrapper approaches, but experiments have shown that this is not always true [234] .
  • To reduce the computational cost, two main factors, an efficient search technique and a fast evaluation measure, need to be considered [1] .
  • A fast evaluation criterion may have a greater influence than the search technique, since in current approaches the evaluation procedure accounts for the majority of the computational cost.

C. Search Mechanisms

  • Feature selection is an NP-hard problem and has a large complex solution space [239] .
  • A related issue is that the new search mechanisms should be stable on feature selection tasks.
  • EC algorithms are stochastic approaches, which may produce different solutions when using different starting points.
  • Even when the fitness values of the solutions are the same, they may select different individual features.
  • Therefore, proposing new search algorithms with high stability is also an important task.

D. Measures

  • The evaluation measure, which forms the fitness function, is one of the key factors in EC for feature selection.
  • Ignoring interactions between features results in subsets with redundancy and a lack of complementary features [2], [242], which in turn cannot achieve optimal classification performance in most domains of interest.
  • For feature selection problems, multiple different solutions may have the same fitness values.
  • This makes the problem even more challenging.

E. Representation

  • A good representation scheme can help to reduce the search space size.
  • This in turn helps to design new search mechanisms to improve the search ability.
  • Another issue is that the current representations usually reflect only whether a feature is selected or not, but the feature interaction information is not shown.
  • Furthermore, the interpretation of the solution is also an important issue closely related to the representation.
  • Most EC methods are not good at this task except for GP and LCSs as they produce a tree or a population of rules, which are easier to understand and interpret.

F. Multi-Objective Feature Selection

  • Most of the existing evolutionary multi-objective (EMO) algorithms are designed for continuous problems [244] , but feature selection is a discrete problem.
  • Furthermore, the two main objectives (minimising both the number of features and the classification error rate) do not always conflict with each other, i.e. in some subspaces, decreasing the number of features can also decrease the classification error rate as unnecessary features are removed [29], [154], [158], [171], [173], [194]; a minimal dominance check over these two objectives is sketched after this list.
  • In addition, developing new evaluation metrics and further selection methods to choose a single solution from a set of trade-off solutions is also a challenging topic.
  • Finally, besides the two main objectives, other objectives, such as the complexity, the computational time, and the solution size (e.g. tree size in GP and number of rules in LCSs), could also be considered in multi-objective feature selection.
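For readers unfamiliar with the Pareto-based view used above, the following minimal sketch shows a dominance check and a non-dominated filter for the two objectives discussed (classification error rate and number of selected features, both minimised). It is a generic illustration, not the selection scheme of any particular surveyed algorithm.

    # Pareto dominance for (error_rate, n_features), both to be minimised.
    def dominates(a, b):
        """a, b: (error_rate, n_features) tuples; a dominates b if it is no worse
        in both objectives and strictly better in at least one."""
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

    def non_dominated(solutions):
        """Keep the solutions not dominated by any other solution."""
        return [s for s in solutions
                if not any(dominates(o, s) for o in solutions if o is not s)]

    # Example: (0.12, 30) dominates (0.15, 40), but neither (0.12, 30) nor
    # (0.10, 55) dominates the other, so both would sit on the Pareto front.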

G. Feature Construction

  • Feature selection does not create new features, as it only selects original features, whereas feature construction builds new ones (a tiny illustration of this contrast follows this list).
  • If the original features are not informative enough to achieve promising performance, feature selection may not work well, whereas feature construction may [3], [247].
  • One of the challenges for feature construction is to decide when feature construction is needed.
  • Meanwhile, feature selection and feature construction can be used together to improve the classification performance and reduce the dimensionality.
  • This can be achieved in three different ways: performing feature selection before feature construction, performing feature construction before feature selection, and simultaneously performing both feature selection and construction [3] .
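The difference between selecting and constructing features can be made concrete in a few lines. The sketch below is purely illustrative (arbitrary data, an arbitrary constructed feature): selection keeps a subset of the original columns, construction derives a new column from them, and the two can be combined.

    # Feature selection keeps original columns; feature construction builds new ones.
    import numpy as np

    X = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])
    selected = X[:, [0, 2]]                           # selection: a subset of original features
    constructed = (X[:, 0] * X[:, 1])[:, None]        # construction: a new derived feature
    X_combined = np.hstack([selected, constructed])   # selection and construction together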

H. Number of Instances

  • The number of instances in a dataset significantly influences the performance and design of experiments [236] .
  • Problems arise when the number of instances is either too large or too small.
  • The larger the data/training set, the longer each evaluation takes.
  • Meanwhile, for "big data" problems, one needs to reduce not only the number of features but also the number of instances [251].

VII. CONCLUSIONS

  • This paper provided a comprehensive survey of EC techniques in solving feature selection problems, which covered all the commonly used EC algorithms and focused on the key factors, such as representation, search mechanisms, and the performance measures as well as the applications.
  • Important issues and challenges were also discussed.
  • This survey shows that a variety of EC algorithms have recently attracted much attention for addressing feature selection tasks.
  • A popular approach in GAs, GP and PSO is to improve the representation to simultaneously select features and optimise the classifiers, e.g. SVMs.


University of Birmingham research portal. Citation (Harvard): Xue, B, Zhang, M, Browne, W & Yao, X 2015, 'A Survey on Evolutionary Computation Approaches to Feature Selection', IEEE Transactions on Evolutionary Computation. DOI: 10.1109/TEVC.2015.2504420. Licence: Creative Commons Attribution (CC BY).

A Survey on Evolutionary Computation Approaches
to Feature Selection
Bing Xue, Member, IEEE, Mengjie Zhang, Senior Member, IEEE, Will N. Browne, Member, IEEE
and Xin Yao, Fellow, IEEE
Abstract—Feature selection is an important task in data mining
and machine learning to reduce the dimensionality of the data
and increase the performance of an algorithm, such as a clas-
sification algorithm. However, feature selection is a challenging
task due mainly to the large search space. A variety of methods
have been applied to solve feature selection problems, where
evolutionary computation techniques have recently gained much
attention and shown some success. However, there are no compre-
hensive guidelines on the strengths and weaknesses of alternative
approaches. This leads to a disjointed and fragmented field
with ultimately lost opportunities for improving performance
and successful applications. This paper presents a comprehensive
survey of the state-of-the-art work on evolutionary computation
for feature selection, which identifies the contributions of these
different algorithms. In addition, current issues and challenges
are also discussed to identify promising areas for future research.
Index Terms—Evolutionary computation, feature selection,
classification, data mining, machine learning.
I. INTRODUCTION
In data mining and machine learning, real-world problems
often involve a large number of features. However, not all
features are essential since many of them are redundant or even
irrelevant, which may reduce the performance of an algorithm,
e.g. a classification algorithm. Feature selection aims to solve
this problem by selecting only a small subset of relevant
features from the original large set of features. By removing
irrelevant and redundant features, feature selection can reduce
the dimensionality of the data, speed up the learning process,
simplify the learnt model, and/or increase the performance [1],
[2]. Feature construction (or feature extraction) [3], [4], [5],
which can also reduce the dimensionality, is closely related to
feature selection. The major difference is that feature selection
selects a subset of original features while feature construction
creates novel features from the original features. This paper
focuses mainly on feature selection.
Feature selection is a difficult task due mainly to a large
search space, where the total number of possible solutions is
2^n for a dataset with n features [1], [2]. The task is becoming
more challenging as n is increasing in many areas with the
advances in the data collection techniques and the increased
complexity of those problems. An exhaustive search for the
Bing Xue, Mengjie Zhang, and Will N. Browne are with the Evolutionary
Computation Research Group at Victoria University of Wellington, PO Box
600, Wellington, New Zealand (E-mail: bing.xue@ecs.vuw.ac.nz).
Xin Yao is with the Natural Computation Group, School of Computer Science
at The University of Birmingham, Edgbaston, Birmingham B15 2TT, U.K.
best feature subset of a given dataset is practically impossible
in most situations. A variety of search techniques have been
applied to feature selection, such as complete search, greedy
search, heuristic search, and random search [1], [6], [7], [8],
[9]. However, most existing feature selection methods still suf-
fer from stagnation in local optima and/or high computational
cost [10], [11]. Therefore, an efficient global search technique
is needed to better solve feature selection problems. Evolution-
ary computation (EC) techniques have recently received much
attention from the feature selection community as they are
well-known for their global search ability/potential. However,
there are no comprehensive guidelines on the strengths and
weaknesses of alternative approaches along with their most
suitable application areas. This leads to progress in the field
being disjointed, shared best practice becoming fragmented
and ultimately, opportunities for improving performance and
successful applications being missed. This paper presents a
comprehensive survey of the literature on EC for feature
selection with the goal of providing interested researchers with
the state-of-the-art research.
Feature selection has been used to improve the quality
of the feature set in many machine learning tasks, such as
classification, clustering, regression, and time series prediction
[1]. This paper focuses mainly on feature selection for clas-
sification since there is much more work on feature selection
for classification than for other tasks [1]. Recent reviews on
feature selection can be seen from [7], [8], [12], [13], which
focus mainly on non-EC based methods. De La Iglesia [14]
presents a summary of works using EC for feature selection
in classification, which is suitable for a non-EC audience
since it focuses on basic EC concepts and genetic algorithms
(GAs) for feature selection. The paper [14] reviewed only
14 papers published since 2010 and in total 21 papers since
2007. No papers published in the most recent two years were
reviewed [14], but there have been over 500 papers published
in the last five years. Research on EC for feature selection
started around 1990, but it has become popular since 2007
when the number of features in many areas became relatively
large. Fig. 1 shows the number of papers on the two most
popular EC methods in feature selection, i.e. GAs and particle
swarm optimisation (PSO), which shows that the number of
papers, especially on PSO, has significantly increased since
2007 (Note that the numbers were obtained from Google
Scholar on September 2015. These numbers might not be
complete, but they show the general trend of the field. The
papers used to form this survey were collected from all the
major databases, such as Web of Science, Scopus, and Google
Scholar).
Fig. 1. Number of Papers on GAs and PSO for Feature Selection (from Google Scholar, September 2015).
We aim to provide a comprehensive survey of the
state-of-the-art work and a discussion of the open issues and
challenges for future work. We expect this survey to attract
attention from researchers working on different EC paradigms
to further investigate effective and efficient approaches to
addressing new challenges in feature selection. This paper
is also expected to encourage researchers from the machine
learning community, especially classification, to pay much
attention to the use of EC techniques to address feature
selection problems.
The remainder of this paper is organised as follows. Section
II describes the background of feature selection. Section III
reviews typical EC algorithms for feature selection. Section IV
discusses different measures used in EC for feature selection.
Section V presents the applications of EC based feature
selection approaches. Section VI discusses current issues and
challenges, and conclusions are given in Section VII.
II. BACKGROUND
Feature selection is a process that selects a subset of
relevant features from the original large set of features [9]. For
example, feature selection is to find key genes (i.e. biomark-
ers) from a large number of candidate genes in biological
and biomedical problems [15], to discover core indicators
(features) to describe the dynamic business environment [9],
to select key terms (features, e.g. words or phrases) in text
mining [16], and to choose/construct important visual contents
(features, e.g. pixel, color, texture, shape) in image analysis
[17]. Fig. 2 shows a general feature selection process and all
the five key steps, where “Subset Evaluation” is achieved by
using an evaluation function to measure the goodness/quality
of the selected features. Detailed discussions about Fig. 2 can
be seen in [1] and a typical iterative evolutionary workflow of
feature selection can be seen in [18].
Based on the evaluation criteria, feature selection algorithms
are generally classified into two categories: filter approaches
and wrapper approaches [1], [2]. Their main difference is that
wrapper approaches include a classification/learning algorithm
in the feature subset evaluation step. The classification algo-
rithm is used as a “black box” by a wrapper to evaluate the
goodness (i.e. the classification performance) of the selected
features. A filter feature selection process is independent
of any classification algorithm. Filter algorithms are often
computationally less expensive and more general than wrapper
algorithms. However, filters ignore the performance of the
selected features on a classification algorithm while wrappers
evaluate the feature subsets based on the classification perfor-
mance, which usually results in better performance achieved by wrappers than filters for a particular classification algorithm [1], [7], [8].
Fig. 2. General Feature Selection Process [1].
Note that some researchers categorise feature
selection methods into three groups: wrapper, embedded and
filter approaches [7], [8]. The methods integrating feature
selection and classifier learning into a single process are called
embedded approaches. Among current EC techniques, only
genetic programming (GP) and learning classifier systems
(LCSs) are able to perform embedded feature selection [19],
[20]. Thus, to simplify the structure of the paper, we follow
the convention of classifying feature selection algorithms
into wrappers and filters only [1], [2], [21] with embedded
algorithms belonging to the wrapper category.
Feature selection is a difficult problem not only because of
the large search space, but also feature interaction problems.
Feature interaction (or epistasis [22]) happens frequently in
many areas [2]. There can be two-way, three-way or complex
multi-way interactions among features. A feature, which is
weakly relevant to the target concept by itself, could sig-
nificantly improve the classification accuracy if it is used
together with some complementary features. In contrast, an
individually relevant feature may become redundant when
used together with other features. The removal or selection of
such features may miss the optimal feature subset(s). Many
traditional measures evaluating features individually cannot
work well and a subset of features needs to be evaluated as
a whole. Therefore, the two key factors in a feature selection
approach are the search techniques, which explore the search
space to find the optimal feature subset(s), and the evaluation
criteria, which measure the quality of feature subsets to guide
the search.
Feature selection involves two main objectives, which are to
maximise the classification accuracy and minimise the number
of features. They are often conflicting objectives. Therefore,
feature selection can be treated as a multi-objective problem to
find a set of trade-off solutions between these two objectives.
The research on this direction has gained much attention only
in recent years, where EC techniques contribute the most
since EC techniques using a population based approach are
particularly suitable for multi-objective optimisation.
A. Existing Work on Feature Selection
This section briefly summarises existing feature selection methods from three aspects,
which are the search techniques, the evaluation criteria, and
the number of objectives.
1) Search techniques: There are very few feature selection
methods that use an exhaustive search [1], [7], [8]. This is
because even when the number of features is relatively small
(e.g. 50), in many situations such methods are computationally too expensive to perform.
Fig. 3. Overall categories of EC for feature selection: EC paradigms (evolutionary algorithms: GAs, GP; swarm intelligence: PSO, ACO; others: DE, memetic algorithms, LCSs, ES, ABC, et al.), evaluation (wrapper, filter, and combined approaches), and number of objectives (single and multi-objective).
Therefore, different heuristic search
techniques have been applied to feature selection, such as
greedy search algorithms, where typical examples are se-
quential forward selection (SFS) [23], sequential backward
selection (SBS) [24]. However, both methods suffer from the
so-called “nesting effect” because a feature that is selected
or removed cannot be removed or selected in later stages.
The “plus-l-take-away-r” method [25] compromises these two approaches
by applying SFS l times and then SBS r times. This strategy
can avoid the nesting effect in principle, but it is hard to
determine appropriate values for l and r in practice. To avoid
this problem, two methods called sequential backward floating
selection (SBFS) and sequential forward floating selection
(SFFS) were proposed in [26]. Both floating search methods
are claimed to be better than the static sequential methods.
Recently, Mao and Tsang [27] proposed a two-layer cutting
plane algorithm to search for the optimal feature subsets. Min
et al. [28] proposed a heuristic search and a backtracking
algorithm, which performs exhaustive search, to solve feature
selection problems using rough set theory. The results show
that heuristic search techniques achieved similar performance
to the backtracking algorithm, but used a much shorter time.
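To make the greedy sequential methods above concrete, the following is a minimal sketch of sequential forward selection: starting from the empty set, the single feature whose addition most improves the score is added at each step. The evaluate function is an assumed user-supplied subset-quality measure (for example, a wrapper's cross-validated accuracy); the sketch also makes the "nesting effect" visible, since a chosen feature is never removed.

    # Minimal sequential forward selection (SFS) sketch. `evaluate(subset)` is an
    # assumed function returning a quality score for a list of feature indices.
    def sequential_forward_selection(n_features, evaluate, k):
        selected = []
        overall_score = None
        while len(selected) < k:
            best_f, best_score = None, float("-inf")
            for f in range(n_features):
                if f in selected:
                    continue
                score = evaluate(selected + [f])
                if score > best_score:
                    best_f, best_score = f, score
            selected.append(best_f)   # greedy: once added, a feature is never removed
            overall_score = best_score
        return selected, overall_score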
In recent years, EC techniques, being effective global search methods, have been applied to solve feature selection problems. Such
methods include GAs, GP, particle swarm optimisation (PSO),
and ant colony optimisation (ACO). Details will be described
in the next section.
Feature selection problems have a large search space, which
is often very complex due to feature interaction. Feature
interaction leads to individually relevant features becoming
redundant or individually weakly relevant features becoming
highly relevant when combined with other features. Compared
with traditional search methods, EC techniques do not need
domain knowledge and do not make any assumptions about
the search space, such as whether it is linearly or non-linearly
separable, and differentiable. Another significant advantage of
EC techniques is that their population based mechanism can
produce multiple solutions in a single run. This is particularly
suitable for multi-objective feature selection in order to find
a set of non-dominated solutions with the trade-off between
the number of features and the classification performance.
However, EC techniques have a major limitation of requiring
a high computational cost since they usually involve a large
number of evaluations. Another issue with EC techniques
is their stability since the algorithms often select different
features from different runs, which may require a further
selection process for real-world users. Further research to
address these issues is of great importance as the increasingly
large number of features increases the computational cost and
lowers the stability of the algorithms in many real-world tasks.
2) Evaluation criteria: For wrapper feature selection ap-
proaches, the classification performance of the selected fea-
tures is used as the evaluation criterion. Most of the popular
classification algorithms, such as decision tree (DT), support
vector machines (SVMs), Naïve Bayes (NB), K-nearest neigh-
bour (KNN), artificial neural networks (ANNs), and linear
discriminant analysis (LDA), have been applied to wrappers
for feature selection [7], [8], [29]. For filter approaches,
measures from different disciplines have been applied, includ-
ing information theory based measures, correlation measures,
distance measures, and consistency measures [1].
Single feature ranking based on a certain criterion is a
simple filter approach, where feature selection is achieved by
choosing only the top-ranked features [7]. Relief [30] is a
typical example, where a distance measure is used to measure
the relevance of each feature and all the relevant features are
selected. Single feature ranking methods are computationally
cheap, but do not consider feature interactions, which often
leads to redundant feature subsets (or local optima) when
applied to complex problems, e.g. microarray gene data, where
genes possess intrinsic linkages [1], [2]. To overcome such
issues, filter measures that can evaluate the feature subset
as a whole have become popular. Recently, Wang et al.
[31] developed a distance measure evaluating the difference
between the selected feature space and the full feature space to
find a feature subset, which approximates all features. Peng
et al. [32] proposed the minimum Redundancy Maximum
Relevance method based on mutual information, where the
proposed measures have been introduced to EC for feature
selection due to their powerful search abilities [33], [34].
Mao and Tsang [27] proposed a novel feature selection
approach by optimizing multivariate performance measures
(which can also be viewed as an embedded method since the
proposed feature selection framework was to optimise the gen-
eral loss function and was achieved based on SVMs). However,
the proposed method resulted a huge search space for high-
dimensional data, which required a powerful heuristic search
method to find the optimal solutions. Statistical approaches,
such as T-test, logistic regression, hierarchical clustering, and
This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication.
The final version of record is available at http://dx.doi.org/10.1109/TEVC.2015.2504420
Copyright (c) 2016 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing pubs-permissions@ieee.org.

This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TEVC.2015.2504420, IEEE
Transactions on Evolutionary Computation
JOURNAL OF L
A
T
E
X CLASS FILES, VOL. , NO. , 4
CART, are relatively simple and can achieve good performance
[35]. Sparse approaches have recently become popular, such
as sparse logistic regression for feature selection [36], which
has been used for feature selection tasks with millions of
features. For example, the sparse logistic regression method
[36] automatically assigns a weight to each feature showing
its relevance. Irrelevant features are assigned with low weights
close to zero, which has the effect of filtering out these
features. Sparse learning based methods tend to learn simple
models due to their bias to features with high weights. These
statistical algorithms usually produce good performance with
high efficiency, but they often have assumptions about the
probability distribution of the data. Furthermore, the cutting plane search method used in [36] works well when the search space is unimodal, but EC approaches can deal well with both unimodal and multimodal search spaces and the population
based search can find a Pareto front of non-dominated (trade-
off) solutions. Min et al. [28] developed a rough set theory
based algorithm to address feature selection problems under
the constraints of having limited resources (e.g. money and
time). However, many studies show that filter methods do not
scale well to problems with more than tens of thousands of
features [13].
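As a concrete (non-EC) point of comparison for the sparse approaches mentioned above, the sketch below uses L1-regularised logistic regression from scikit-learn: features whose learned weights are zero are effectively filtered out. The regularisation strength C is an illustrative value, and this is a generic sketch rather than the specific method of [36].

    # Sparse (L1-regularised) logistic regression as an embedded/sparse selector:
    # features with zero coefficients are dropped.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def l1_logistic_selection(X, y, C=0.1):
        model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        model.fit(X, y)
        weights = np.abs(model.coef_).max(axis=0)   # max |weight| per feature across classes
        return np.where(weights > 0)[0]             # indices of features with non-zero weight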
3) Number of objectives: Most of the existing feature selec-
tion methods aim to maximise the classification performance
only during the search process or aggregate the classification
performance and the number of features into a single objective
function. To the best of our knowledge, all the multi-objective
feature selection algorithms to date are based on EC techniques
since their population based mechanism producing multiple
solutions in a single run is particularly suitable for multi-
objective optimisation.
B. Detailed Coverage of This Paper
As shown in Fig. 3, according to three different criteria,
which are the EC paradigms, the evaluation, and the num-
ber of objectives, EC based feature selection approaches are
classified into different categories. These three criteria are the
key components in a feature selection method. EC approaches
are mainly used as the search techniques in feature selection.
Almost all the major EC paradigms have been applied to
feature selection and the most popular ones are discussed in
this paper, i.e. GAs [37], [38], [39] and GP [19], [40], [41] as
typical examples in evolutionary algorithms, PSO [10], [29],
[42] and ACO [43], [44], [45], [46] as typical examples in
swarm intelligence, and other algorithms recently applied to
feature selection, including differential evolution (DE) [47],
[48] (some researchers classify DE as a swarm intelligence algorithm), memetic algorithms [49], [50], LCSs [51], [52], evolu-
tionary strategy (ES) [53], artificial bee colony (ABC) [54],
[55], and artificial immune systems (AISs) [56], [57]. Based
on the evaluation criteria, we review both filter and wrapper
approaches, and also include another group of approaches
named “Combined”. “Combined” means that the evaluation
procedure includes both filter and wrapper measures, which
are also called hybrid approaches by some researchers [9],
[14]. The use here of “Combined” instead of “hybrid” is
to avoid confusion with the concept of hybrid algorithms in the EC field, which hybridise multiple EC search techniques.
Fig. 4. Different measures in EC based filter approaches: information, correlation, distance, and consistency measures, and fuzzy set and rough set theories.
According to the number of objectives, EC based feature selec-
tion approaches are classified into single objective and multi-
objective approaches, where the multi-objective approaches
correspond to methods aiming to find a Pareto front of trade-
off solutions. The approaches that aggregate the number of
features and the classification performance into a single fitness
function are treated as single objective algorithms in this paper.
Similar to many earlier survey papers on traditional (non-
EC) feature selection [1], [7], [8], [9], this paper further
reviews different evolutionary filter methods according to
measures that are drawn from different disciplines. Fig. 4
shows the main categories of measures used in EC based filter
approaches. Wrapper approaches are not further categorised
according to their measures because the classification algo-
rithm in wrappers is used as a “black box” during the feature
selection process such that it can often be easily replaced by
another classification algorithm.
The reviewed literature is organised as follows. Typical
approaches are reviewed in Section III, where each subsection
discusses a particular EC technique for feature selection (e.g.
Section III-A: GAs for feature selection, as shown by the left
branch in Fig. 3). Within each subsection, the research using
an EC technique is further detailed and discussed according
to the evaluation criterion and the number of objectives. In
addition, Section IV discusses the research on EC based filter
approaches for feature selection. The applications of EC for
feature selection are described in Section V.
TABLE I
CATEGORISATION OF GA APPROACHES
Wrapper (single objective): [3], [37], [58], [38], [39], [44], [59], [60], [61], [62], [63], [64], [65], [66], [67], [68], [69], [70], [71], [72], [73], [74], [75], [76], [77], [78], [79], [80], [81], [82], [83], [84], [85], [86], [87]
Wrapper (multi-objective): [88], [89], [90], [91], [92], [93], [94], [95], [96], [97]
Filter (single objective): [75], [98], [99], [100], [101], [102]
Filter (multi-objective): [102], [103], [104], [105], [106]
Combined: [107], [108], [109]
III. EC FOR FEATURE SELECTION
A. GAs for Feature Selection
GAs are most likely the first EC technique widely applied
to feature selection problems. One of the earliest works was
published in 1989 [37]. GAs have a natural representation of
a binary string, where “1” shows the corresponding feature is
selected and “0” means not selected. Table I shows the typical
works on GAs for feature selection. It can be seen that there
are more works on wrappers than filters, and more on single
objective than multi-objective approaches.
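To ground the binary-string representation just described, the following is a compact, assumption-laden GA sketch (tournament selection, one-point crossover, bit-flip mutation); none of the operator choices or parameter values are taken from a specific surveyed paper, and fitness is any subset evaluator, such as a wrapper's cross-validated classification accuracy.

    # Minimal GA sketch for feature selection with a binary-string representation:
    # "1" = feature selected, "0" = not selected. `fitness(mask)` is an assumed
    # subset evaluator; operators and parameters are illustrative only.
    import numpy as np

    def ga_feature_selection(fitness, n_features, pop_size=50, gens=100,
                             cx_prob=0.9, mut_prob=None, seed=0):
        rng = np.random.default_rng(seed)
        mut_prob = mut_prob if mut_prob is not None else 1.0 / n_features
        pop = rng.integers(0, 2, (pop_size, n_features))
        fits = np.array([fitness(ind) for ind in pop])
        for _ in range(gens):
            children = []
            while len(children) < pop_size:
                # binary tournament selection of two parents
                idx = [max(rng.integers(0, pop_size, 2), key=lambda i: fits[i])
                       for _ in range(2)]
                c1, c2 = pop[idx[0]].copy(), pop[idx[1]].copy()
                if rng.random() < cx_prob:                 # one-point crossover
                    point = rng.integers(1, n_features)
                    c1[point:] = pop[idx[1]][point:]
                    c2[point:] = pop[idx[0]][point:]
                for child in (c1, c2):                     # bit-flip mutation
                    flip = rng.random(n_features) < mut_prob
                    child[flip] = 1 - child[flip]
                    children.append(child)
            pop = np.array(children[:pop_size])
            fits = np.array([fitness(ind) for ind in pop])
        return pop[fits.argmax()], fits.max()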
