
Journal of Information Security, 2016, 7, 129-140
Published Online April 2016 in SciRes. http://www.scirp.org/journal/jis
http://dx.doi.org/10.4236/jis.2016.73009
How to cite this paper: Hasan, M.A.M., Nasser, M., Ahmad, S. and Molla, K.I. (2016) Feature Selection for Intrusion Detection Using Random Forest. Journal of Information Security, 7, 129-140. http://dx.doi.org/10.4236/jis.2016.73009
Feature Selection for Intrusion Detection Using Random Forest
Md. Al Mehedi Hasan, Mohammed Nasser, Shamim Ahmad, Khademul Islam Molla
Department of Computer Science & Engineering, University of Rajshahi, Rajshahi, Bangladesh
Received 11 September 2015; accepted 4 April 2016; published 7 April 2016
Copyright © 2016 by authors and Scientific Research Publishing Inc.
This work is licensed under the Creative Commons Attribution International License (CC BY).
http://creativecommons.org/licenses/by/4.0/
Abstract
An intrusion detection system collects and analyzes information from different areas within a computer or a network to identify possible security threats, including threats from both outside and inside the organization. It deals with a large amount of data containing various irrelevant and redundant features, which results in increased processing time and a low detection rate. Therefore, feature selection should be treated as an indispensable pre-processing step to improve overall system performance significantly when mining huge datasets. In this context, this paper focuses on a two-step approach to feature selection based on Random Forest. The first step selects the features with higher variable importance scores and guides the initialization of the search process for the second step, which outputs the final feature subset for classification and interpretation. The effectiveness of this algorithm is demonstrated on the KDD'99 intrusion detection dataset, which is based on the DARPA 98 dataset and provides labeled data for researchers working in the field of intrusion detection. The most important deficiency in the KDD'99 dataset is the huge number of redundant records, as observed earlier. Therefore, we have derived a dataset, RRE-KDD, by eliminating redundant records from the KDD'99 train and test datasets, so that the classifiers and the feature selection method will not be biased towards the more frequent records. RRE-KDD consists of the KDD99Train+ and KDD99Test+ datasets, for training and testing purposes, respectively. The experimental results show that the proposed Random Forest based approach can select the most important and relevant features useful for classification, which, in turn, not only reduces the number of input features and the processing time but also increases the classification accuracy.
Keywords
Feature Selection, KDD'99 Dataset, RRE-KDD Dataset, Random Forest, Permuted Importance Measure

1. Introduction
The Internet and local area networks have grown rapidly in recent years. As a great variety of people all over the world connect to the Internet, they unknowingly encounter a growing number of security threats such as viruses, worms and attacks from hackers [1]. Firewalls, anti-virus software, message encryption, secured network protocols, password protection and so on are no longer sufficient to assure security in computer networks, because some intrusions exploit weaknesses in computer systems. Therefore, Intrusion Detection Systems (IDSs) have become a necessary addition to the security infrastructure of most organizations [2].
Deploying highly effective IDSs is extremely challenging and has emerged as a significant field of research, because it is not theoretically possible to set up a system with no vulnerabilities [3]. Several machine learning (ML) algorithms, for instance Neural Networks [4], Genetic Algorithms [5], Support Vector Machines [2] [6] and Clustering Algorithms [7], have been extensively employed to detect intrusion activities from large quantities of complex and dynamic data.
Current Intrusion Detection Systems examine all data features to detect intrusion or misuse patterns [8]. Since the amount of audit data that an IDS needs to examine is very large even for a small network, its analysis is difficult even with computer assistance, because extraneous features make it harder to detect suspicious behavior patterns [8]-[10]. As a result, an IDS must reduce the amount of data to be processed. This is especially important if real-time detection is desired. Reduction can be performed by data filtering, data clustering or feature selection. In our work, we investigate feature selection to reduce the amount of data directly handled by the IDS.
A literature survey showed that most researchers used randomly generated records or a portion of the records from the KDD'99 dataset to develop feature selection methods and to build intrusion detection systems [1] [8] [10] [11], without using the whole train and test datasets. Yuehui Chen et al. [8] and Srilatha et al. [10] [11] present a reduced number of features derived from a randomly generated dataset containing only 11,982 records [8] [10] [11]; the resulting reductions to 12 or 17 features [10] [11] are therefore questionable when the properties of the whole dataset are considered, so those findings do not indicate the features that are actually relevant for classification. Some researchers use the whole dataset but do not remove redundant records, which means the same redundant records may be used during feature selection and, because of that, classification methods may be biased toward the classes that have redundant records [12]. These limitations have motivated us to find the actually relevant features for classification based on the whole KDD'99 train and test datasets after removing redundant records.
Feature selection, also known as variable selection, feature reduction, attribute selection or variable subset selection, is a widely used dimensionality reduction technique. It has been the focus of much research in machine learning and data mining and has found applications in text classification, web mining, and so on [1]. It allows faster model building by reducing the number of features, and it also helps remove irrelevant, redundant and noisy features. This yields simpler and more comprehensible classification models with competitive classification performance. Hence, selecting relevant attributes is a critical issue for competitive classifiers and for data reduction. Feature selection methods fall into two approaches: filter and wrapper [13]. The difference between the filter model and the wrapper model is whether feature selection relies on a learning algorithm. The filter model is independent of any learning algorithm, and its advantages lie in better generality and low computational cost [13]. It ranks the features by a metric and eliminates all features that do not achieve an adequate score (selecting only important features). The wrapper model relies on a learning algorithm and can achieve high classification performance, but it is computationally expensive, especially when dealing with large-scale datasets [14] like KDDCUP99, because it searches the set of possible features for the optimal subset (a minimal sketch contrasting the two models is given below). In this paper, we adopt Random Forest to rank the features and to select a feature subset that can bring intrusion detection to a successful conclusion.
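To make the filter/wrapper distinction above concrete, here is a minimal sketch using scikit-learn on synthetic data; the library, the mutual-information filter and the recursive-elimination wrapper are illustrative choices of ours, not tools used in the paper.

```python
# Hypothetical sketch contrasting the filter and wrapper models on a
# generic (X, y) dataset; scikit-learn is assumed, not used by the paper.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE

X, y = make_classification(n_samples=1000, n_features=41, n_informative=10,
                           random_state=0)

# Filter model: rank features by a metric (here mutual information) and
# keep the top k -- no learning algorithm is involved in the ranking.
filter_selector = SelectKBest(score_func=mutual_info_classif, k=12)
X_filtered = filter_selector.fit_transform(X, y)

# Wrapper model: search feature subsets using a learner's performance.
# Recursive feature elimination repeatedly drops the weakest feature
# according to the fitted model -- more accurate, but far costlier.
wrapper_selector = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
                       n_features_to_select=12)
X_wrapped = wrapper_selector.fit_transform(X, y)
print(X_filtered.shape, X_wrapped.shape)  # (1000, 12) (1000, 12)
```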
Random Forest performs feature selection directly while a classification rule is built. The two commonly used variable importance measures in RF are the Gini importance index and the permutation importance index (PIM) [15]. In this paper, we use a two-step approach to feature selection. In the first step, the permutation importance index is used to rank the features; in the second step, Random Forest is used to select the best subset of features for classification. This reduced feature set is then employed to implement an Intrusion Detection System. Our approach results in more accurate detection as well as faster training and testing. A simplified sketch of the two steps is given below.
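The sketch assumes scikit-learn and synthetic data: step 1 ranks features by permutation importance; step 2 grows candidate subsets along that ranking with a Random Forest and keeps the most accurate one. The subset search is a simplification of the paper's procedure, and all names are ours.

```python
# Minimal sketch of the two-step selection idea (simplified; the paper's
# exact search procedure may differ). scikit-learn names are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=41, n_informative=12,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Step 1: rank features by permutation importance (PIM).
pim = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
ranking = np.argsort(pim.importances_mean)[::-1]

# Step 2: grow the candidate subset along the ranking and keep the
# smallest subset whose held-out accuracy is the best seen so far.
best_k, best_acc = 0, 0.0
for k in range(1, len(ranking) + 1):
    cols = ranking[:k]
    acc = RandomForestClassifier(n_estimators=200, random_state=0) \
        .fit(X_tr[:, cols], y_tr).score(X_te[:, cols], y_te)
    if acc > best_acc:
        best_k, best_acc = k, acc
print(f"selected {best_k} features, accuracy {best_acc:.3f}")
```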
The remainder of the paper is organized as follows. Section 2 provides a description of the KDD'99 dataset. We outline a mathematical overview of RF and the calculation procedure for variable importance in Section 3. The experimental setup is presented in Section 4, and RF model selection is described in Section 5. Measurement of variable importance and variable selection are discussed in Section 6. Finally, Section 7 reports the experimental results, followed by the conclusion in Section 8.
2. KDDCUP’99 Dataset
Under the sponsorship of the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL), MIT Lincoln Laboratory collected and distributed the datasets for the evaluation of research in computer network intrusion detection systems [16]. The KDD'99 dataset is a subset of the DARPA benchmark dataset prepared by Sal Stolfo and Wenke Lee [17]. The KDD dataset was acquired from raw tcpdump data collected over a period of nine weeks. It is made up of a large number of network traffic activities that include both normal and malicious connections.
2.1. Attack and Feature Description of KDD’99 Dataset
The KDD'99 dataset includes three independent sets: "Whole KDD", "10% KDD" and "Corrected KDD". Most researchers have used "10% KDD" and "Corrected KDD" as the training and testing sets, respectively [18]. The training set contains a total of 22 attack types plus the normal type. The "Corrected KDD" testing set includes an additional 17 attack types and excludes 2 attack types (spy, warezclient) present in the training set; therefore, 37 attack types are included in the testing set, as shown in Table 1 and Table 2. The simulated attacks fall into one of four categories [2] [18]: 1) Denial of Service (DoS), 2) User to Root (U2R), 3) Remote to Local (R2L) and 4) Probing.
A connection in the KDD'99 dataset is represented by 41 features, each of which is in continuous, discrete or symbolic form, with significantly varying ranges [19]. The various features are described in Table 3, where C denotes continuous data and D denotes discrete and symbolic data in the Data Type field.
2.2. Inherent Problems and Criticisms against the KDD’99
Statistical analysis of the KDD'99 dataset found important issues that highly affect the performance of evaluated systems and result in a very poor evaluation of anomaly detection approaches [20]. The most important deficiency in the KDD dataset is the huge number of redundant records. Analyzing the KDD train and test sets, Mahbod Tavallaee found that about 78% and 75% of the records are duplicated in the train and test sets, respectively [21]. This large number of redundant records in the train set causes learning algorithms to be biased towards the more frequent records.
Table 1. Attacks in KDD'99 training dataset.

Probing: Port-sweep, IP-sweep, Nmap, Satan
DoS: Neptune, Smurf, Pod, Teardrop, Land, Back
U2R: Buffer-overflow, Load-module, Perl, Rootkit
R2L: Guess-password, Ftp-write, Imap, Phf, Multihop, Spy, Warezclient, Warezmaster

Table 2. Attacks in KDD'99 testing dataset.

Probing: Port-sweep, IP-sweep, Nmap, Satan, Saint, Mscan
DoS: Neptune, Smurf, Pod, Teardrop, Land, Back, Apache2, Udpstorm, Processtable, Mail-bomb
U2R: Buffer-overflow, Load-module, Perl, Rootkit, Xterm, Ps, Sqlattack
R2L: Guess-password, Ftp-write, Imap, Phf, Multihop, Warezmaster, Snmpget attack, Named, Xlock, Xsnoop, Send-mail, Http-tunnel, Worm, Snmp-guess
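For the four-category evaluation implied by Tables 1 and 2, each raw label must be mapped to its category. A small helper is sketched below; note that in the raw KDD'99 files the labels are lowercase, mostly unhyphenated and suffixed with a period (e.g. "portsweep."), so the exact spellings should be checked against the files in use.

```python
# Mapping from attack label to category for the four-way evaluation.
# Spellings follow the raw KDD'99 label strings (an assumption to verify).
ATTACK_CATEGORY = {
    "portsweep": "Probe", "ipsweep": "Probe", "nmap": "Probe", "satan": "Probe",
    "saint": "Probe", "mscan": "Probe",
    "neptune": "DoS", "smurf": "DoS", "pod": "DoS", "teardrop": "DoS",
    "land": "DoS", "back": "DoS", "apache2": "DoS", "udpstorm": "DoS",
    "processtable": "DoS", "mailbomb": "DoS",
    "buffer_overflow": "U2R", "loadmodule": "U2R", "perl": "U2R",
    "rootkit": "U2R", "xterm": "U2R", "ps": "U2R", "sqlattack": "U2R",
    "guess_passwd": "R2L", "ftp_write": "R2L", "imap": "R2L", "phf": "R2L",
    "multihop": "R2L", "spy": "R2L", "warezclient": "R2L",
    "warezmaster": "R2L", "snmpgetattack": "R2L", "named": "R2L",
    "xlock": "R2L", "xsnoop": "R2L", "sendmail": "R2L", "httptunnel": "R2L",
    "worm": "R2L", "snmpguess": "R2L",
}

def category(raw_label: str) -> str:
    """Return 'Normal' or one of the four attack categories."""
    name = raw_label.rstrip(".")  # raw labels end with a period
    return "Normal" if name == "normal" else ATTACK_CATEGORY[name]
```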

Table 3. List of features with their descriptions and data types (C = continuous, D = discrete/symbolic).

1. Duration — Duration of the connection. (C)
2. Protocol type — Connection protocol (e.g., tcp, udp). (D)
3. Service — Destination service. (D)
4. Flag — Status flag of the connection. (D)
5. Source bytes — Bytes sent from source to destination. (C)
6. Destination bytes — Bytes sent from destination to source. (C)
7. Land — 1 if connection is from/to the same host/port; 0 otherwise. (D)
8. Wrong fragment — Number of wrong fragments. (C)
9. Urgent — Number of urgent packets. (C)
10. Hot — Number of "hot" indicators. (C)
11. Failed logins — Number of failed logins. (C)
12. Logged in — 1 if successfully logged in; 0 otherwise. (D)
13. Compromised — Number of "compromised" conditions. (C)
14. Root shell — 1 if root shell is obtained; 0 otherwise. (C)
15. Su attempted — 1 if "su root" command attempted; 0 otherwise. (C)
16. Root — Number of root accesses. (C)
17. File creations — Number of file creation operations. (C)
18. Shells — Number of shell prompts. (C)
19. Access files — Number of operations on access control files. (C)
20. Outbound cmds — Number of outbound commands in an ftp session. (C)
21. Is hot login — 1 if the login belongs to the "hot" list; 0 otherwise. (D)
22. Is guest login — 1 if the login is a "guest" login; 0 otherwise. (D)
23. Count — Number of connections to the same host as the current connection in the past two seconds. (C)
24. Srv count — Number of connections to the same service as the current connection in the past two seconds. (C)
25. Serror rate — % of connections that have SYN errors. (C)
26. Srv serror rate — % of connections to the same service that have SYN errors. (C)
27. Rerror rate — % of connections that have REJ errors. (C)
28. Srv rerror rate — % of connections that have REJ errors. (C)
29. Same srv rate — % of connections to the same service. (C)
30. Diff srv rate — % of connections to different services. (C)
31. Srv diff host rate — % of connections to different hosts. (C)
32. Dst host count — Count of connections having the same destination host. (C)
33. Dst host srv count — Count of connections having the same destination host and using the same service. (C)
34. Dst host same srv rate — % of connections having the same destination host and using the same service. (C)
35. Dst host diff srv rate — % of different services on the current host. (C)
36. Dst host same src port rate — % of connections to the current host having the same source port. (C)
37. Dst host srv diff host rate — % of connections to the same service coming from different hosts. (C)
38. Dst host serror rate — % of connections to the current host that have an S0 error. (C)
39. Dst host srv serror rate — % of connections to the current host and specified service that have an S0 error. (C)
40. Dst host rerror rate — % of connections to the current host that have an RST error. (C)
41. Dst host srv rerror rate — % of connections to the current host and specified service that have an RST error. (C)
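As a practical aside, the 41 features of Table 3 (plus the label) can be attached as column names when loading the raw data. The sketch below assumes pandas and the conventional file name of the "10% KDD" training file; the snake_case names follow the dataset's published task description, not the paper.

```python
# Sketch of loading the 10% KDD'99 training file with the 41 features of
# Table 3 plus the label column; file name and pandas are our assumptions.
import pandas as pd

COLUMNS = [
    "duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes",
    "land", "wrong_fragment", "urgent", "hot", "num_failed_logins",
    "logged_in", "num_compromised", "root_shell", "su_attempted", "num_root",
    "num_file_creations", "num_shells", "num_access_files",
    "num_outbound_cmds", "is_host_login", "is_guest_login", "count",
    "srv_count", "serror_rate", "srv_serror_rate", "rerror_rate",
    "srv_rerror_rate", "same_srv_rate", "diff_srv_rate", "srv_diff_host_rate",
    "dst_host_count", "dst_host_srv_count", "dst_host_same_srv_rate",
    "dst_host_diff_srv_rate", "dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate", "dst_host_serror_rate",
    "dst_host_srv_serror_rate", "dst_host_rerror_rate",
    "dst_host_srv_rerror_rate", "label",
]

df = pd.read_csv("kddcup.data_10_percent", header=None, names=COLUMNS)

# The three symbolic features must be encoded before tree induction in
# most libraries; a simple integer encoding is enough for forests.
for col in ("protocol_type", "service", "flag"):
    df[col] = df[col].astype("category").cat.codes
```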

As a result, this biasing prevents the system from learning the infrequent records, which are usually more harmful to networks, such as U2R attacks. The existence of these repeated records in the test set, on the other hand, will cause the evaluation results to be biased towards the methods that have better detection rates on the frequent records.
To solve these issues, we have derived a new dataset, RRE-KDD, by eliminating redundant records from the KDD'99 train and test datasets ("10% KDD" and "Corrected KDD"), so that the classifiers will not be biased towards the more frequent records. The RRE-KDD dataset consists of the KDD99Train+ and KDD99Test+ datasets, for training and testing purposes, respectively. The numbers of records in the train and test sets are now reasonable, which makes it affordable to run experiments on the complete sets without the need to randomly select a small portion. A sketch of the deduplication step follows.
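A minimal sketch of this deduplication, assuming pandas and the conventional raw file names ("kddcup.data_10_percent" for the "10% KDD" train set and "corrected" for the test set); exact duplicate rows are removed from each set independently.

```python
# Hypothetical sketch: derive RRE-KDD by dropping exact duplicate rows
# from the "10% KDD" train file and the "Corrected KDD" test file.
import pandas as pd

train = pd.read_csv("kddcup.data_10_percent", header=None)
test = pd.read_csv("corrected", header=None)

kdd99_train_plus = train.drop_duplicates()  # KDD99Train+
kdd99_test_plus = test.drop_duplicates()    # KDD99Test+

print(len(train), "->", len(kdd99_train_plus))  # duplicates removed
print(len(test), "->", len(kdd99_test_plus))
```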
3. Variable Selection and Classification
Consider the problem of separating a set of training vectors belonging to two separate classes, $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where $x_i \in \mathbb{R}^p$ and $y_i \in \{-1, +1\}$ is the corresponding class label, $1 \le i \le n$. The main task is to find a classifier with a decision function $f(x, \theta)$ such that $y = f(x, \theta)$, where $y$ is the class label for $x$ and $\theta$ is a vector of unknown parameters in the function.
3.1. Random Forest
The random forest is an ensemble of unpruned classification or regression trees [15]. Random forest generates many classification trees, and each tree is constructed from a different bootstrap sample of the original data using a tree classification algorithm. After the forest is formed, a new object that needs to be classified is put down each of the trees in the forest for classification. Each tree gives a vote that indicates the tree's decision about the class of the object. The forest chooses the class with the most votes for the object. The random forests algorithm (for both classification and regression) is as follows [22] [23] (a sketch of these steps is given after the list):
1) From the training set of n samples, draw n_tree bootstrap samples.
2) For each of the bootstrap samples, grow a classification or regression tree with the following modification: at each node, rather than choosing the best split among all predictors, randomly sample m_try of the predictors and choose the best split among those variables. The tree is grown to the maximum size and not pruned back. Bagging can be thought of as the special case of random forests obtained when m_try = p, the number of predictors.
3) Predict new data by aggregating the predictions of the n_tree trees (i.e., majority vote for classification, average for regression).
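The three steps can be sketched with scikit-learn (an illustrative choice of ours, not the paper's tooling), mapping n_tree to n_estimators and m_try to max_features:

```python
# Hedged sketch of the algorithm above; data and parameters illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=41, random_state=0)

forest = RandomForestClassifier(
    n_estimators=500,     # n_tree: number of bootstrap samples / trees
    max_features="sqrt",  # m_try: predictors sampled at each node
    bootstrap=True,       # step 1: draw bootstrap samples
    random_state=0,
).fit(X, y)

# Step 3: aggregate the trees' votes (majority vote for classification).
print(forest.predict(X[:5]))
```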
There are two ways to evaluate the error rate. One is to split the dataset into a training part and a test part: we can employ the training part to build the forest and then use the test part to calculate the error rate. The other is to use the Out-of-Bag (OOB) error estimate: because the random forests algorithm calculates the OOB error during the training phase, we do not need to split the training data to obtain it. In our work, we have used both ways to evaluate the error rate; both are sketched below.
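A minimal sketch of both estimates, with illustrative data and parameters only:

```python
# Both error estimates from the paragraph above, using scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=41, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X_tr, y_tr)

print("OOB error:", 1 - rf.oob_score_)          # no extra split needed
print("Test error:", 1 - rf.score(X_te, y_te))  # held-out estimate
```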
There are three tuning parameters in Random Forest: the number of trees (n_tree), the number of descriptors randomly sampled as candidates for splitting at each node (m_try) and the minimum node size [23]. When the forest is grown, m_try features are selected at random out of all the features in the training data at each node. The number of features employed in splitting each node of each tree is the primary tuning parameter (m_try); to improve the performance of random forests, this parameter should be optimized. The number of trees needs only to be chosen sufficiently large that the OOB error has stabilized. In many cases, 500 trees are sufficient (more are needed if descriptor importance or intrinsic proximity is desired). In contrast to other algorithms that have a stopping rule, in RF there is no penalty for having "too many" trees, other than wasted computational resources. The third parameter, the minimum node size, determines the minimum size of nodes below which no split will be attempted; it has some effect on the size of the trees grown. In Random Forest, the default value of the minimum node size is 1 for classification, ensuring that trees are grown to their maximum size, and 5 for regression [23]. A tuning sketch follows.
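A sketch of this tuning under assumed settings: n_tree fixed at 500, the minimum node size left at its classification default of 1 (min_samples_leaf in scikit-learn), and m_try varied while watching the OOB error; the grid itself is our assumption, not the paper's.

```python
# Tune m_try via the OOB error, with n_tree large enough to stabilize.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=41, random_state=0)

for m_try in (2, 4, 6, 8, 12, 20, 41):  # m_try = 41 reduces RF to bagging
    rf = RandomForestClassifier(n_estimators=500, max_features=m_try,
                                min_samples_leaf=1, oob_score=True,
                                random_state=0).fit(X, y)
    print(f"m_try={m_try:2d}  OOB error={1 - rf.oob_score_:.4f}")
```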
3.2. Variable Importance Measure and Selection Using Random Forest
The high-dimensional nature of many tasks in pattern recognition has created an urgent need for feature selection techniques. The goal of feature selection in this field is manifold, where the two most important are: 1) to

References
Breiman, L. (2001) Random Forests. Machine Learning, 45, 5-32.
Liaw, A. and Wiener, M. (2002) Classification and Regression by randomForest. R News, 2, 18-22.
Tavallaee, M., Bagheri, E., Lu, W. and Ghorbani, A.A. (2009) A Detailed Analysis of the KDD CUP 99 Data Set. Proceedings of the 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA 2009).
Svetnik, V., Liaw, A., Tong, C., Culberson, J.C., Sheridan, R.P. and Feuston, B.P. (2003) Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling. Journal of Chemical Information and Computer Sciences, 43, 1947-1958.
Lee, W., Stolfo, S.J. and Mok, K.W. (1999) A Data Mining Framework for Building Intrusion Detection Models. Proceedings of the 1999 IEEE Symposium on Security and Privacy.