
Journal of Information Security, 2016, 7, 129-140
Published Online April 2016 in SciRes. http://www.scirp.org/journal/jis
http://dx.doi.org/10.4236/jis.2016.73009
How to cite this paper: Hasan, M.A.M., Nasser, M., Ahmad, S. and Molla, K.I. (2016) Feature Selection for Intrusion Detection Using Random Forest. Journal of Information Security, 7, 129-140. http://dx.doi.org/10.4236/jis.2016.73009
Feature Selection for Intrusion Detection Using Random Forest
Md. Al Mehedi Hasan, Mohammed Nasser, Shamim Ahmad, Khademul Islam Molla
Department of Computer Science & Engineering, University of Rajshahi, Rajshahi, Bangladesh
Received 11 September 2015; accepted 4 April 2016; published 7 April 2016
Copyright © 2016 by authors and Scientific Research Publishing Inc.
This work is licensed under the Creative Commons Attribution International License (CC BY).
http://creativecommons.org/licenses/by/4.0/
Abstract
An intrusion detection system collects and analyzes information from different areas within a computer or a network to identify possible security threats, including threats from both outside and inside the organization. It deals with a large amount of data containing various irrelevant and redundant features, which results in increased processing time and a low detection rate. Therefore, feature selection should be treated as an indispensable pre-processing step to improve overall system performance significantly when mining huge datasets. In this context, this paper focuses on a two-step approach to feature selection based on Random Forest. The first step selects the features with higher variable importance scores and guides the initialization of the search process for the second step, which outputs the final feature subset for classification and interpretation. The effectiveness of this algorithm is demonstrated on the KDD'99 intrusion detection dataset, which is based on the DARPA 98 dataset and provides labeled data for researchers working in the field of intrusion detection. The most important deficiency in the KDD'99 dataset is the huge number of redundant records, as observed earlier. Therefore, we have derived a dataset, RRE-KDD, by eliminating redundant records from the KDD'99 train and test datasets, so that the classifiers and the feature selection method will not be biased towards the more frequent records. RRE-KDD consists of the KDD99Train+ and KDD99Test+ datasets, for training and testing purposes, respectively. The experimental results show that the proposed Random Forest based approach can select the most important and relevant features useful for classification, which, in turn, not only reduces the number of input features and the processing time but also increases the classification accuracy.
Keywords
Feature Selection, KDD'99 Dataset, RRE-KDD Dataset, Random Forest, Permuted Importance Measure

1. Introduction
The Internet and local area networks have grown rapidly in recent years. As a great variety of people all over the world connect to the Internet, they unknowingly encounter a growing number of security threats such as viruses, worms and attacks from hackers [1]. Firewalls, anti-virus software, message encryption, secured network protocols, password protection and so on are no longer sufficient to assure security in computer networks, because some intrusions exploit weaknesses in computer systems. Therefore, Intrusion Detection Systems (IDSs) have become a necessary addition to the security infrastructure of most organizations [2].
Deploying highly effective IDSs is extremely challenging and has emerged as a significant field of research, because it is not theoretically possible to set up a system with no vulnerabilities [3]. Several machine learning (ML) algorithms, for instance Neural Networks [4], Genetic Algorithms [5], Support Vector Machines [2] [6] and Clustering Algorithms [7], have been extensively employed to detect intrusion activities from large quantities of complex and dynamic data.
Current Intrusion Detection Systems examine all data features to detect intrusion or misuse patterns [8]. Since the amount of audit data that an IDS needs to examine is very large even for a small network, its analysis is difficult even with computer assistance, because extraneous features make it harder to detect suspicious behavior patterns [8]-[10]. As a result, an IDS must reduce the amount of data to be processed. This is especially important if real-time detection is desired. Reduction can be performed by data filtering, data clustering or feature selection. In our work, we investigate feature selection to reduce the amount of data directly handled by the IDS.
A literature survey showed that most researchers used randomly generated records or a portion of the records from the KDD'99 dataset to develop feature selection methods and to build intrusion detection systems [1] [8] [10] [11], without using the whole train and test datasets. Yuehui Chen et al. [8] and Srilatha et al. [10] [11] present a reduced number of features derived from a randomly generated dataset containing only 11,982 records [8] [10] [11]; the resulting reductions to 12 or 17 features [10] [11] are therefore questionable when the properties of the whole dataset are considered, so those findings do not indicate the features that are actually relevant for classification. Some researchers use the whole dataset but do not remove redundant records, which means the same redundant records may be used during feature selection and, because of that, classification methods may be biased toward the classes that have redundant records [12]. These limitations have motivated us to find the actually relevant features for classification based on the whole KDD'99 train and test datasets after removing redundant records.
Feature selection, also known as variable selection, feature reduction, attribute selection or variable subset selection, is a widely used dimensionality reduction technique. It has been the focus of much research in machine learning and data mining and has found applications in text classification, web mining, and so on [1]. It allows faster model building by reducing the number of features, and it also helps remove irrelevant, redundant and noisy features. This yields simpler and more comprehensible classification models with competitive classification performance. Hence, selecting relevant attributes is a critical issue for competitive classifiers and for data reduction. Feature selection methods fall into two approaches: filter and wrapper [13]. The difference between the filter model and the wrapper model is whether feature selection relies on a learning algorithm. The filter model is independent of any learning algorithm, and its advantages lie in better generality and low computational cost [13]. It ranks the features by a metric and eliminates all features that do not achieve an adequate score (selecting only important features). The wrapper model relies on a learning algorithm and can achieve high classification performance, but it is computationally expensive, especially when dealing with large-scale datasets [14] like KDDCUP99, because it searches the set of possible features for the optimal subset (a minimal sketch contrasting the two models is given below). In this paper, we adopt Random Forest to rank the features and to select a feature subset that can bring intrusion detection to a successful conclusion.
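To make the filter/wrapper distinction above concrete, here is a minimal sketch using scikit-learn on synthetic data; the library, the mutual-information filter and the recursive-elimination wrapper are illustrative choices of ours, not tools used in the paper.

```python
# Hypothetical sketch contrasting the filter and wrapper models on a
# generic (X, y) dataset; scikit-learn is assumed, not used by the paper.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE

X, y = make_classification(n_samples=1000, n_features=41, n_informative=10,
                           random_state=0)

# Filter model: rank features by a metric (here mutual information) and
# keep the top k -- no learning algorithm is involved in the ranking.
filter_selector = SelectKBest(score_func=mutual_info_classif, k=12)
X_filtered = filter_selector.fit_transform(X, y)

# Wrapper model: search feature subsets using a learner's performance.
# Recursive feature elimination repeatedly drops the weakest feature
# according to the fitted model -- more accurate, but far costlier.
wrapper_selector = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
                       n_features_to_select=12)
X_wrapped = wrapper_selector.fit_transform(X, y)
print(X_filtered.shape, X_wrapped.shape)  # (1000, 12) (1000, 12)
```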
Random Forest performs feature selection directly while a classification rule is built. The two commonly used variable importance measures in RF are the Gini importance index and the permutation importance index (PIM) [15]. In this paper, we use a two-step approach to feature selection. In the first step, the permutation importance index is used to rank the features; in the second step, Random Forest is used to select the best subset of features for classification. This reduced feature set is then employed to implement an Intrusion Detection System. Our approach results in more accurate detection as well as faster training and testing. A simplified sketch of the two steps is given below.
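The sketch assumes scikit-learn and synthetic data: step 1 ranks features by permutation importance; step 2 grows candidate subsets along that ranking with a Random Forest and keeps the most accurate one. The subset search is a simplification of the paper's procedure, and all names are ours.

```python
# Minimal sketch of the two-step selection idea (simplified; the paper's
# exact search procedure may differ). scikit-learn names are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=41, n_informative=12,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Step 1: rank features by permutation importance (PIM).
pim = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
ranking = np.argsort(pim.importances_mean)[::-1]

# Step 2: grow the candidate subset along the ranking and keep the
# smallest subset whose held-out accuracy is the best seen so far.
best_k, best_acc = 0, 0.0
for k in range(1, len(ranking) + 1):
    cols = ranking[:k]
    acc = RandomForestClassifier(n_estimators=200, random_state=0) \
        .fit(X_tr[:, cols], y_tr).score(X_te[:, cols], y_te)
    if acc > best_acc:
        best_k, best_acc = k, acc
print(f"selected {best_k} features, accuracy {best_acc:.3f}")
```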
The remainder of the paper is organized as follows. Section 2 provides a description of the KDD'99 dataset. We outline a mathematical overview of RF and the calculation procedure for variable importance in Section 3. The experimental setup is presented in Section 4, and RF model selection is described in Section 5. Measurement of variable importance and variable selection are discussed in Section 6. Finally, Section 7 reports the experimental results, followed by the conclusion in Section 8.
2. KDDCUP’99 Dataset
Under the sponsorship of the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL), MIT Lincoln Laboratory collected and distributed the datasets for the evaluation of research in computer network intrusion detection systems [16]. The KDD'99 dataset is a subset of the DARPA benchmark dataset prepared by Sal Stolfo and Wenke Lee [17]. The KDD dataset was acquired from raw tcpdump data collected over a period of nine weeks. It is made up of a large number of network traffic activities that include both normal and malicious connections.
2.1. Attack and Feature Description of KDD’99 Dataset
The KDD'99 dataset includes three independent sets: "Whole KDD", "10% KDD" and "Corrected KDD". Most researchers have used "10% KDD" and "Corrected KDD" as the training and testing sets, respectively [18]. The training set contains a total of 22 attack types plus the normal type. The "Corrected KDD" testing set includes an additional 17 attack types and excludes 2 attack types (spy, warezclient) present in the training set; therefore, 37 attack types are included in the testing set, as shown in Table 1 and Table 2. The simulated attacks fall into one of four categories [2] [18]: 1) Denial of Service (DoS), 2) User to Root (U2R), 3) Remote to Local (R2L) and 4) Probing.
A connection in the KDD'99 dataset is represented by 41 features, each of which is in continuous, discrete or symbolic form, with significantly varying ranges [19]. The various features are described in Table 3, where C denotes continuous data and D denotes discrete and symbolic data in the Data Type field.
2.2. Inherent Problems and Criticisms against the KDD’99
Statistical analysis of the KDD'99 dataset found important issues that highly affect the performance of evaluated systems and result in a very poor evaluation of anomaly detection approaches [20]. The most important deficiency in the KDD dataset is the huge number of redundant records. Analyzing the KDD train and test sets, Mahbod Tavallaee found that about 78% and 75% of the records are duplicated in the train and test sets, respectively [21]. This large number of redundant records in the train set causes learning algorithms to be biased towards the more frequent records.
Table 1. Attacks in KDD'99 training dataset.

Probing: Port-sweep, IP-sweep, Nmap, Satan
DoS: Neptune, Smurf, Pod, Teardrop, Land, Back
U2R: Buffer-overflow, Load-module, Perl, Rootkit
R2L: Guess-password, Ftp-write, Imap, Phf, Multihop, Spy, Warezclient, Warezmaster

Table 2. Attacks in KDD'99 testing dataset.

Probing: Port-sweep, IP-sweep, Nmap, Satan, Saint, Mscan
DoS: Neptune, Smurf, Pod, Teardrop, Land, Back, Apache2, Udpstorm, Processtable, Mail-bomb
U2R: Buffer-overflow, Load-module, Perl, Rootkit, Xterm, Ps, Sqlattack
R2L: Guess-password, Ftp-write, Imap, Phf, Multihop, Warezmaster, Snmpget attack, Named, Xlock, Xsnoop, Send-mail, Http-tunnel, Worm, Snmp-guess
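For the four-category evaluation implied by Tables 1 and 2, each raw label must be mapped to its category. A small helper is sketched below; note that in the raw KDD'99 files the labels are lowercase, mostly unhyphenated and suffixed with a period (e.g. "portsweep."), so the exact spellings should be checked against the files in use.

```python
# Mapping from attack label to category for the four-way evaluation.
# Spellings follow the raw KDD'99 label strings (an assumption to verify).
ATTACK_CATEGORY = {
    "portsweep": "Probe", "ipsweep": "Probe", "nmap": "Probe", "satan": "Probe",
    "saint": "Probe", "mscan": "Probe",
    "neptune": "DoS", "smurf": "DoS", "pod": "DoS", "teardrop": "DoS",
    "land": "DoS", "back": "DoS", "apache2": "DoS", "udpstorm": "DoS",
    "processtable": "DoS", "mailbomb": "DoS",
    "buffer_overflow": "U2R", "loadmodule": "U2R", "perl": "U2R",
    "rootkit": "U2R", "xterm": "U2R", "ps": "U2R", "sqlattack": "U2R",
    "guess_passwd": "R2L", "ftp_write": "R2L", "imap": "R2L", "phf": "R2L",
    "multihop": "R2L", "spy": "R2L", "warezclient": "R2L",
    "warezmaster": "R2L", "snmpgetattack": "R2L", "named": "R2L",
    "xlock": "R2L", "xsnoop": "R2L", "sendmail": "R2L", "httptunnel": "R2L",
    "worm": "R2L", "snmpguess": "R2L",
}

def category(raw_label: str) -> str:
    """Return 'Normal' or one of the four attack categories."""
    name = raw_label.rstrip(".")  # raw labels end with a period
    return "Normal" if name == "normal" else ATTACK_CATEGORY[name]
```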

Table 3. List of features with their descriptions and data types (C = continuous, D = discrete/symbolic).

1. Duration — Duration of the connection. (C)
2. Protocol type — Connection protocol (e.g., tcp, udp). (D)
3. Service — Destination service. (D)
4. Flag — Status flag of the connection. (D)
5. Source bytes — Bytes sent from source to destination. (C)
6. Destination bytes — Bytes sent from destination to source. (C)
7. Land — 1 if connection is from/to the same host/port; 0 otherwise. (D)
8. Wrong fragment — Number of wrong fragments. (C)
9. Urgent — Number of urgent packets. (C)
10. Hot — Number of "hot" indicators. (C)
11. Failed logins — Number of failed logins. (C)
12. Logged in — 1 if successfully logged in; 0 otherwise. (D)
13. Compromised — Number of "compromised" conditions. (C)
14. Root shell — 1 if root shell is obtained; 0 otherwise. (C)
15. Su attempted — 1 if "su root" command attempted; 0 otherwise. (C)
16. Root — Number of root accesses. (C)
17. File creations — Number of file creation operations. (C)
18. Shells — Number of shell prompts. (C)
19. Access files — Number of operations on access control files. (C)
20. Outbound cmds — Number of outbound commands in an ftp session. (C)
21. Is hot login — 1 if the login belongs to the "hot" list; 0 otherwise. (D)
22. Is guest login — 1 if the login is a "guest" login; 0 otherwise. (D)
23. Count — Number of connections to the same host as the current connection in the past two seconds. (C)
24. Srv count — Number of connections to the same service as the current connection in the past two seconds. (C)
25. Serror rate — % of connections that have SYN errors. (C)
26. Srv serror rate — % of connections to the same service that have SYN errors. (C)
27. Rerror rate — % of connections that have REJ errors. (C)
28. Srv rerror rate — % of connections that have REJ errors. (C)
29. Same srv rate — % of connections to the same service. (C)
30. Diff srv rate — % of connections to different services. (C)
31. Srv diff host rate — % of connections to different hosts. (C)
32. Dst host count — Count of connections having the same destination host. (C)
33. Dst host srv count — Count of connections having the same destination host and using the same service. (C)
34. Dst host same srv rate — % of connections having the same destination host and using the same service. (C)
35. Dst host diff srv rate — % of different services on the current host. (C)
36. Dst host same src port rate — % of connections to the current host having the same source port. (C)
37. Dst host srv diff host rate — % of connections to the same service coming from different hosts. (C)
38. Dst host serror rate — % of connections to the current host that have an S0 error. (C)
39. Dst host srv serror rate — % of connections to the current host and specified service that have an S0 error. (C)
40. Dst host rerror rate — % of connections to the current host that have an RST error. (C)
41. Dst host srv rerror rate — % of connections to the current host and specified service that have an RST error. (C)
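As a practical aside, the 41 features of Table 3 (plus the label) can be attached as column names when loading the raw data. The sketch below assumes pandas and the conventional file name of the "10% KDD" training file; the snake_case names follow the dataset's published task description, not the paper.

```python
# Sketch of loading the 10% KDD'99 training file with the 41 features of
# Table 3 plus the label column; file name and pandas are our assumptions.
import pandas as pd

COLUMNS = [
    "duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes",
    "land", "wrong_fragment", "urgent", "hot", "num_failed_logins",
    "logged_in", "num_compromised", "root_shell", "su_attempted", "num_root",
    "num_file_creations", "num_shells", "num_access_files",
    "num_outbound_cmds", "is_host_login", "is_guest_login", "count",
    "srv_count", "serror_rate", "srv_serror_rate", "rerror_rate",
    "srv_rerror_rate", "same_srv_rate", "diff_srv_rate", "srv_diff_host_rate",
    "dst_host_count", "dst_host_srv_count", "dst_host_same_srv_rate",
    "dst_host_diff_srv_rate", "dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate", "dst_host_serror_rate",
    "dst_host_srv_serror_rate", "dst_host_rerror_rate",
    "dst_host_srv_rerror_rate", "label",
]

df = pd.read_csv("kddcup.data_10_percent", header=None, names=COLUMNS)

# The three symbolic features must be encoded before tree induction in
# most libraries; a simple integer encoding is enough for forests.
for col in ("protocol_type", "service", "flag"):
    df[col] = df[col].astype("category").cat.codes
```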

As a result, this biasing prevents the system from learning the infrequent records, which are usually more harmful to networks, such as U2R attacks. The existence of these repeated records in the test set, on the other hand, will cause the evaluation results to be biased towards the methods that have better detection rates on the frequent records.
To solve these issues, we have derived a new dataset, RRE-KDD, by eliminating redundant records from the KDD'99 train and test datasets ("10% KDD" and "Corrected KDD"), so that the classifiers will not be biased towards the more frequent records. The RRE-KDD dataset consists of the KDD99Train+ and KDD99Test+ datasets, for training and testing purposes, respectively. The numbers of records in the train and test sets are now reasonable, which makes it affordable to run experiments on the complete sets without the need to randomly select a small portion. A sketch of the deduplication step follows.
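A minimal sketch of this deduplication, assuming pandas and the conventional raw file names ("kddcup.data_10_percent" for the "10% KDD" train set and "corrected" for the test set); exact duplicate rows are removed from each set independently.

```python
# Hypothetical sketch: derive RRE-KDD by dropping exact duplicate rows
# from the "10% KDD" train file and the "Corrected KDD" test file.
import pandas as pd

train = pd.read_csv("kddcup.data_10_percent", header=None)
test = pd.read_csv("corrected", header=None)

kdd99_train_plus = train.drop_duplicates()  # KDD99Train+
kdd99_test_plus = test.drop_duplicates()    # KDD99Test+

print(len(train), "->", len(kdd99_train_plus))  # duplicates removed
print(len(test), "->", len(kdd99_test_plus))
```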
3. Variable Selection and Classification
Consider the problem of separating a set of training vectors belonging to two separate classes, $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where $x_i \in \mathbb{R}^p$ and $y_i \in \{-1, +1\}$ is the corresponding class label, $1 \le i \le n$. The main task is to find a classifier with a decision function $f(x, \theta)$ such that $y = f(x, \theta)$, where $y$ is the class label for $x$ and $\theta$ is a vector of unknown parameters in the function.
3.1. Random Forest
The random forest is an ensemble of unpruned classification or regression trees [15]. Random forest generates many classification trees, and each tree is constructed from a different bootstrap sample of the original data using a tree classification algorithm. After the forest is formed, a new object that needs to be classified is put down each of the trees in the forest for classification. Each tree gives a vote that indicates the tree's decision about the class of the object. The forest chooses the class with the most votes for the object. The random forests algorithm (for both classification and regression) is as follows [22] [23] (a sketch of these steps is given after the list):
1) From the training set of n samples, draw n_tree bootstrap samples.
2) For each of the bootstrap samples, grow a classification or regression tree with the following modification: at each node, rather than choosing the best split among all predictors, randomly sample m_try of the predictors and choose the best split among those variables. The tree is grown to the maximum size and not pruned back. Bagging can be thought of as the special case of random forests obtained when m_try = p, the number of predictors.
3) Predict new data by aggregating the predictions of the n_tree trees (i.e., majority vote for classification, average for regression).
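The three steps can be sketched with scikit-learn (an illustrative choice of ours, not the paper's tooling), mapping n_tree to n_estimators and m_try to max_features:

```python
# Hedged sketch of the algorithm above; data and parameters illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=41, random_state=0)

forest = RandomForestClassifier(
    n_estimators=500,     # n_tree: number of bootstrap samples / trees
    max_features="sqrt",  # m_try: predictors sampled at each node
    bootstrap=True,       # step 1: draw bootstrap samples
    random_state=0,
).fit(X, y)

# Step 3: aggregate the trees' votes (majority vote for classification).
print(forest.predict(X[:5]))
```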
There are two ways to evaluate the error rate. One is to split the dataset into a training part and a test part: we can employ the training part to build the forest and then use the test part to calculate the error rate. The other is to use the Out-of-Bag (OOB) error estimate: because the random forests algorithm calculates the OOB error during the training phase, we do not need to split the training data to obtain it. In our work, we have used both ways to evaluate the error rate; both are sketched below.
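A minimal sketch of both estimates, with illustrative data and parameters only:

```python
# Both error estimates from the paragraph above, using scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=41, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X_tr, y_tr)

print("OOB error:", 1 - rf.oob_score_)          # no extra split needed
print("Test error:", 1 - rf.score(X_te, y_te))  # held-out estimate
```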
There are three tuning parameters in Random Forest: the number of trees (n_tree), the number of descriptors randomly sampled as candidates for splitting at each node (m_try) and the minimum node size [23]. When the forest is grown, m_try features are selected at random out of all the features in the training data at each node. The number of features employed in splitting each node of each tree is the primary tuning parameter (m_try); to improve the performance of random forests, this parameter should be optimized. The number of trees needs only to be chosen sufficiently large that the OOB error has stabilized. In many cases, 500 trees are sufficient (more are needed if descriptor importance or intrinsic proximity is desired). In contrast to other algorithms that have a stopping rule, in RF there is no penalty for having "too many" trees, other than wasted computational resources. The third parameter, the minimum node size, determines the minimum size of nodes below which no split will be attempted; it has some effect on the size of the trees grown. In Random Forest, the default value of the minimum node size is 1 for classification, ensuring that trees are grown to their maximum size, and 5 for regression [23]. A tuning sketch follows.
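A sketch of this tuning under assumed settings: n_tree fixed at 500, the minimum node size left at its classification default of 1 (min_samples_leaf in scikit-learn), and m_try varied while watching the OOB error; the grid itself is our assumption, not the paper's.

```python
# Tune m_try via the OOB error, with n_tree large enough to stabilize.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=41, random_state=0)

for m_try in (2, 4, 6, 8, 12, 20, 41):  # m_try = 41 reduces RF to bagging
    rf = RandomForestClassifier(n_estimators=500, max_features=m_try,
                                min_samples_leaf=1, oob_score=True,
                                random_state=0).fit(X, y)
    print(f"m_try={m_try:2d}  OOB error={1 - rf.oob_score_:.4f}")
```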
3.2. Variable Importance Measure and Selection Using Random Forest
The high-dimensional nature of many tasks in pattern recognition has created an urgent need for feature selection techniques. The goal of feature selection in this field is manifold, where the two most important are: 1) to

References
Breiman, L. (2001) Random Forests. Machine Learning, 45, 5-32.
Liaw, A. and Wiener, M. (2002) Classification and Regression by randomForest. R News, 2, 18-22.
Tavallaee, M., Bagheri, E., Lu, W. and Ghorbani, A.A. (2009) A Detailed Analysis of the KDD CUP 99 Data Set. Proceedings of the 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA 2009).
Svetnik, V., Liaw, A., Tong, C., Culberson, J.C., Sheridan, R.P. and Feuston, B.P. (2003) Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling. Journal of Chemical Information and Computer Sciences, 43, 1947-1958.
Lee, W., Stolfo, S.J. and Mok, K.W. (1999) A Data Mining Framework for Building Intrusion Detection Models. Proceedings of the 1999 IEEE Symposium on Security and Privacy.