Proceedings ArticleDOI

A detailed analysis of the KDD CUP 99 data set

TL;DR: A new data set, NSL-KDD, is proposed, which consists of selected records of the complete KDD data set and does not suffer from any of the mentioned shortcomings.
Abstract: During the last decade, anomaly detection has attracted the attention of many researchers seeking to overcome the weakness of signature-based IDSs in detecting novel attacks, and KDDCUP'99 is the most widely used data set for the evaluation of these systems. Having conducted a statistical analysis of this data set, we found two important issues that highly affect the performance of evaluated systems and result in a very poor evaluation of anomaly detection approaches. To solve these issues, we have proposed a new data set, NSL-KDD, which consists of selected records of the complete KDD data set and does not suffer from any of the mentioned shortcomings.
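The paper's central fix — removing the redundant records that bias learners toward frequent record types — amounts to deduplicating the feature vectors. A minimal sketch with made-up field values (the actual NSL-KDD construction additionally samples records by classification difficulty):

```python
from collections import Counter

def deduplicate(records):
    """Keep one copy of each distinct record, NSL-KDD style:
    redundant duplicates bias learners toward frequent records."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(rec)  # the full feature vector identifies a record
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

# Toy traffic: one 'neptune' record repeated heavily, as in KDD'99.
raw = [("tcp", "http", 0, "normal")] * 2 + [("tcp", "private", 1, "neptune")] * 8
print(Counter(r[-1] for r in raw))    # skewed label counts before cleaning
clean = deduplicate(raw)
print(Counter(r[-1] for r in clean))  # one record per distinct feature vector
```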
Citations
Proceedings ArticleDOI
01 Jan 2018
TL;DR: A reliable dataset is produced that contains benign traffic and seven common attack network flows, meets real-world criteria, and is publicly available; the paper also evaluates a comprehensive set of network traffic features and machine learning algorithms to indicate the best feature set for detecting each attack category.
Abstract: With the exponential growth in the size of computer networks and developed applications, the significant increase in the potential damage that can be caused by launching attacks is becoming obvious. Meanwhile, Intrusion Detection Systems (IDSs) and Intrusion Prevention Systems (IPSs) are among the most important defense tools against sophisticated and ever-growing network attacks. Due to the lack of adequate datasets, anomaly-based approaches in intrusion detection systems suffer from inaccurate deployment, analysis, and evaluation. A number of datasets, such as DARPA98, KDD99, ISCX2012, and ADFA13, have been used by researchers to evaluate the performance of their proposed intrusion detection and intrusion prevention approaches. Based on our study of eleven datasets available since 1998, many of them are out of date and unreliable to use. Some of these datasets suffer from a lack of traffic diversity and volume, some do not cover the variety of attacks, while others anonymize packet information and payload, which cannot reflect current trends, or lack feature sets and metadata. This paper produces a reliable dataset that contains benign and seven common attack network flows, meets real-world criteria, and is publicly available. The paper then evaluates the performance of a comprehensive set of network traffic features and machine learning algorithms to indicate the best set of features for detecting each attack category.

1,931 citations


Cites background or methods from "A detailed analysis of the KDD CUP ..."

  • ...KDD’99 (University of California, Irvine 1998-99): This dataset is an updated version of the DARPA98, by processing the tcpdump portion....

  • ...This dataset has a large number of redundant records and is studded by data corruptions that led to skewed testing results (Tavallaee et al., 2009)....

  • ...NSL-KDD was created using KDD (Tavallaee et al., 2009) to address some of the KDD’s shortcomings (McHugh, 2000)....

Proceedings ArticleDOI
10 Dec 2015
TL;DR: To counter the unavailability of suitable network benchmark data sets, this paper describes the creation of the UNSW-NB15 data set, which contains a hybrid of real modern normal traffic and contemporary synthesized attack activities.
Abstract: One of the major research challenges in this field is the unavailability of a comprehensive network-based data set that reflects modern network traffic scenarios, a vast variety of low-footprint intrusions, and deep structured information about the network traffic. The KDD98, KDDCUP99, and NSLKDD benchmark data sets used to evaluate network intrusion detection systems were generated a decade ago. However, numerous current studies have shown that these data sets do not inclusively reflect modern network traffic and low-footprint attacks in the current network threat environment. To counter the unavailability of suitable network benchmark data sets, this paper describes the creation of the UNSW-NB15 data set, which contains a hybrid of real modern normal traffic and contemporary synthesized attack activities. Existing and novel methods are utilised to generate the features of the UNSW-NB15 data set. This data set is available for research purposes and can be accessed from the link.

1,745 citations


Cites background or methods from "A detailed analysis of the KDD CUP ..."

  • ...Further, the signature based NIDSs cannot detect unknown attacks, and for these anomaly NIDS are recommended in many studies [4] [5]....

  • ...Finally, the output files of the two different tools, Argus and Bro-IDS are stored in the SQL Server 2008 database to match the Argus and Bro-IDS generated features by using the flow features as reflected in Table II....

  • ...Countering the unavailability of network benchmark data set challenges, this paper examines a UNSW-NB15 data set creation....

  • ...Keywords: UNSW-NB15 data set; NIDS; low footprint attacks; pcap files; testbed. I. INTRODUCTION: Currently, due to the massive growth in computer networks and applications, many challenges arise for cyber security research....
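The quoted pipeline step — matching Argus and Bro-IDS outputs on shared flow features — is essentially a key join. A sketch assuming a 5-tuple flow key and hypothetical field names (the real pipeline performs this match inside a SQL Server database):

```python
def flow_key(rec):
    # A flow is identified by its 5-tuple (hypothetical field names).
    return (rec["src_ip"], rec["src_port"],
            rec["dst_ip"], rec["dst_port"], rec["proto"])

def join_features(argus_rows, bro_rows):
    """Merge per-flow feature dicts from two tools on the flow key,
    keeping Argus values where both tools report the same field."""
    bro_index = {flow_key(r): r for r in bro_rows}
    merged = []
    for a in argus_rows:
        b = bro_index.get(flow_key(a), {})
        row = dict(a)
        row.update({k: v for k, v in b.items() if k not in row})
        merged.append(row)
    return merged

argus = [{"src_ip": "10.0.0.1", "src_port": 5050, "dst_ip": "10.0.0.2",
          "dst_port": 80, "proto": "tcp", "dur": 0.2}]
bro = [{"src_ip": "10.0.0.1", "src_port": 5050, "dst_ip": "10.0.0.2",
        "dst_port": 80, "proto": "tcp", "http_method": "GET"}]
merged = join_features(argus, bro)
print(merged[0]["dur"], merged[0]["http_method"])  # 0.2 GET
```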

Journal ArticleDOI
TL;DR: The complexity of ML/DM algorithms is addressed, discussion of challenges for using ML/DM for cyber security is presented, and some recommendations on when to use a given method are provided.
Abstract: This survey paper describes a focused literature survey of machine learning (ML) and data mining (DM) methods for cyber analytics in support of intrusion detection. Short tutorial descriptions of each ML/DM method are provided. Based on the number of citations or the relevance of an emerging method, papers representing each method were identified, read, and summarized. Because data are so important in ML/DM approaches, some well-known cyber data sets used in ML/DM are described. The complexity of ML/DM algorithms is addressed, discussion of challenges for using ML/DM for cyber security is presented, and some recommendations on when to use a given method are provided.

1,704 citations


Cites background from "A detailed analysis of the KDD CUP ..."

  • ...[21] and found to have some serious limitations....

Journal ArticleDOI
TL;DR: The experimental results show that RNN-IDS is very suitable for modeling a classification model with high accuracy and that its performance is superior to that of traditional machine learning classification methods in both binary and multiclass classification.
Abstract: Intrusion detection plays an important role in ensuring information security, and the key technology is to accurately identify various attacks in the network. In this paper, we explore how to model an intrusion detection system based on deep learning, and we propose a deep learning approach for intrusion detection using recurrent neural networks (RNN-IDS). Moreover, we study the performance of the model in binary and multiclass classification, and how the number of neurons and the learning rate impact the performance of the proposed model. We compare it with J48, artificial neural network, random forest, support vector machine, and other machine learning methods proposed by previous researchers on the benchmark data set. The experimental results show that RNN-IDS is very suitable for modeling a classification model with high accuracy and that its performance is superior to that of traditional machine learning classification methods in both binary and multiclass classification. The RNN-IDS model improves the accuracy of intrusion detection and provides a new research method for intrusion detection.

1,123 citations


Cites methods from "A detailed analysis of the KDD CUP ..."

  • ...In the binary classification experiments, we have compared the performance with an ANN, naive Bayesian, random forest, multi-layer perceptron, support vector machine and other machine learning methods, as mentioned in [13] and [21]....

  • ...In [21], the authors have shown the results obtained by J48, Naive Bayesian, Random Forest, Multi-layer Perceptron, Support Vector Machine and the other classification algorithms, and the artificial neural network algorithm also gives 81....

  • ...The NSL-KDD dataset [21], [22] generated in 2009 is widely used in intrusion detection experiments....
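The RNN-IDS idea — passing a sequence of feature vectors through a recurrent layer and classifying from the final hidden state — can be sketched as a minimal Elman-style forward pass. This is an illustrative toy with random untrained weights and made-up features, not the paper's trained model:

```python
import math
import random

random.seed(0)

def rnn_forward(xs, Wxh, Whh, Why, bh, by):
    """Elman-style recurrence h_t = tanh(Wxh x_t + Whh h_{t-1} + bh),
    then y = softmax(Why h_T + by) over the final hidden state."""
    h = [0.0] * len(bh)
    for x in xs:
        h = [math.tanh(sum(Wxh[i][j] * x[j] for j in range(len(x)))
                       + sum(Whh[i][j] * h[j] for j in range(len(h)))
                       + bh[i])
             for i in range(len(bh))]
    logits = [sum(Why[k][i] * h[i] for i in range(len(h))) + by[k]
              for k in range(len(by))]
    m = max(logits)                       # stable softmax
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]          # class probabilities (e.g. normal vs. attack)

n_in, n_hid, n_out = 3, 4, 2
rand = lambda r, c: [[random.uniform(-0.5, 0.5) for _ in range(c)] for _ in range(r)]
Wxh, Whh, Why = rand(n_hid, n_in), rand(n_hid, n_hid), rand(n_out, n_hid)
bh, by = [0.0] * n_hid, [0.0] * n_out

seq = [[0.1, 0.8, 0.3], [0.9, 0.2, 0.5]]  # two time steps of made-up features
probs = rnn_forward(seq, Wxh, Whh, Why, bh, by)
print(probs)
```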

Journal ArticleDOI
23 Jan 2018
TL;DR: This paper presents a novel deep learning technique for intrusion detection that addresses concerns regarding the feasibility and sustainability of current approaches when faced with the demands of modern networks, and details the proposed nonsymmetric deep autoencoder (NDAE) for unsupervised feature learning.
Abstract: Network intrusion detection systems (NIDSs) play a crucial role in defending computer networks. However, there are concerns regarding the feasibility and sustainability of current approaches when faced with the demands of modern networks. More specifically, these concerns relate to the increasing levels of required human interaction and the decreasing levels of detection accuracy. This paper presents a novel deep learning technique for intrusion detection, which addresses these concerns. We detail our proposed nonsymmetric deep autoencoder (NDAE) for unsupervised feature learning. Furthermore, we also propose our novel deep learning classification model constructed using stacked NDAEs. Our proposed classifier has been implemented in graphics processing unit (GPU)-enabled TensorFlow and evaluated using the benchmark KDD Cup ’99 and NSL-KDD datasets. Promising results have been obtained from our model thus far, demonstrating improvements over existing approaches and the strong potential for use in modern NIDSs.

979 citations


Cites background from "A detailed analysis of the KDD CUP ..."

  • ...to overcome the inherent problems of the KDD ’99 data set, which are discussed in [35]....
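The "nonsymmetric" idea — an encoder stack that is not mirrored by the decoder — can be sketched as a forward pass pairing a two-layer encoder (8 → 6 → 4) with a single-layer decoder (4 → 8). A minimal illustration with random untrained weights and made-up sizes, not the paper's implementation:

```python
import math
import random

random.seed(1)

def dense(x, W, b):
    # Fully connected layer with sigmoid activation.
    return [1.0 / (1.0 + math.exp(-(sum(W[i][j] * x[j] for j in range(len(x))) + b[i])))
            for i in range(len(b))]

def make_layer(n_out, n_in):
    W = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

# Nonsymmetric autoencoder: multi-layer encoder, single-layer decoder,
# instead of the usual mirrored encoder/decoder stacks.
W1, b1 = make_layer(6, 8)
W2, b2 = make_layer(4, 6)
Wd, bd = make_layer(8, 4)

x = [random.random() for _ in range(8)]   # one made-up feature vector
code = dense(dense(x, W1, b1), W2, b2)    # learned low-dimensional features
recon = dense(code, Wd, bd)               # direct reconstruction back to input size
print(len(code), len(recon))
```

In the paper's pipeline the low-dimensional `code` (not the reconstruction) is what feeds the downstream classifier.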

References
Journal ArticleDOI
01 Oct 2001
TL;DR: Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the forest, and are also applicable to regression.
Abstract: Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to AdaBoost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International Conference, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation, and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.

79,257 citations


"A detailed analysis of the KDD CUP ..." refers methods in this paper

  • ...In a similar approach, we have selected seven widely used machine learning techniques, namely J48 decision tree learning [16], Naive Bayes [17], NBTree [18], Random Forest [19], Random Tree [20], Multilayer Perceptron [21], and Support Vector Machine (SVM) [22] from the Weka [23] collection to learn the overall behavior of the KDD’99 data set....
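The abstract above summarizes the random-forest recipe: bootstrap sampling plus a random feature subset at each split, with majority voting. A toy sketch that uses one-level trees (decision stumps) in place of full trees, on made-up data:

```python
import random
from collections import Counter

random.seed(2)

def train_stump(data, feat_ids):
    """Pick the (feature, threshold, class-if-above) among the given
    feature subset that minimises error on this sample."""
    best, best_err = None, len(data) + 1
    for f in feat_ids:
        for t in sorted({x[f] for x, _ in data}):
            for hi in (0, 1):  # class predicted when x[f] > t
                err = sum(y != (hi if x[f] > t else 1 - hi) for x, y in data)
                if err < best_err:
                    best_err, best = err, (f, t, hi)
    return best

def train_forest(data, n_trees, n_feats):
    forest = []
    for _ in range(n_trees):
        boot = [random.choice(data) for _ in data]               # bootstrap sample
        feats = random.sample(range(len(data[0][0])), n_feats)   # random subspace
        forest.append(train_stump(boot, feats))
    return forest

def forest_predict(forest, x):
    votes = Counter(hi if x[f] > t else 1 - hi for f, t, hi in forest)
    return votes.most_common(1)[0][0]                            # majority vote

# Toy 2-feature data: class 1 iff the first feature is large.
data = [([i / 10, random.random()], int(i >= 5)) for i in range(10)]
forest = train_forest(data, n_trees=7, n_feats=1)
preds = [forest_predict(forest, x) for x, _ in data]
print(preds)
```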

Journal ArticleDOI
TL;DR: Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
Abstract: LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.

40,826 citations


"A detailed analysis of the KDD CUP ..." refers methods in this paper

  • ...However, SVM is the only learning technique whose performance is improved on KDDTest+. Analyzing both test sets, we found that SVM wrongly detects one of the most frequent records in KDDTest, which highly affects its detection performance....

  • ...As an example, classification of SVM on KDDTest is 65.01% which is quite poor compared to other learning approaches....

  • ...In a similar approach, we have selected seven widely used machine learning techniques, namely J48 decision tree learning [16], Naive Bayes [17], NBTree [18], Random Forest [19], Random Tree [20], Multilayer Perceptron [21], and Support Vector Machine (SVM) [22] from the Weka [23] collection to learn the overall behavior of the KDD’99 data set....

  • ...In contrast, in KDDTest+ since this record occurs only once, it does not have any effect on the classification rate of SVM, and provides a better evaluation of learning methods....
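As a rough sketch of the model class LIBSVM trains, a linear soft-margin SVM can be fit by Pegasos-style sub-gradient descent on the primal hinge loss. Note this is not LIBSVM's algorithm (LIBSVM solves the dual problem with an SMO-type method, and supports kernels); it only illustrates the classifier being learned, on made-up separable data:

```python
def train_linear_svm(data, epochs=200, lam=0.01):
    """Pegasos-style sub-gradient descent on the regularised hinge loss
    for a linear SVM (no bias term; data is separable through the origin)."""
    dim = len(data[0][0])
    w = [0.0] * dim
    t = 0
    for _ in range(epochs):
        for x, y in data:                        # labels y are -1 or +1
            t += 1
            eta = 1.0 / (lam * t)                # decreasing step size
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            w = [wi * (1 - eta * lam) for wi in w]       # regularisation shrink
            if margin < 1:                               # hinge-loss sub-gradient
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

# Linearly separable toy data.
data = [([1.0, 1.0], 1), ([1.5, 0.8], 1), ([-1.0, -1.2], -1), ([-0.8, -1.0], -1)]
w = train_linear_svm(data)
print([predict(w, x) for x, _ in data])
```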

Book
15 Oct 1992
TL;DR: A complete guide to the C4.5 system as implemented in C for the UNIX environment, which starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting.
Abstract: From the Publisher: Classifier systems play a major role in machine learning and knowledge-based systems, and Ross Quinlan's work on ID3 and C4.5 is widely acknowledged to have made some of the most significant contributions to their development. This book is a complete guide to the C4.5 system as implemented in C for the UNIX environment. It contains a comprehensive guide to the system's use, the source code (about 8,800 lines), and implementation notes. The source code and sample datasets are also available on a 3.5-inch floppy diskette for a Sun workstation. C4.5 starts with large sets of cases belonging to known classes. The cases, described by any mixture of nominal and numeric properties, are scrutinized for patterns that allow the classes to be reliably discriminated. These patterns are then expressed as models, in the form of decision trees or sets of if-then rules, that can be used to classify new cases, with emphasis on making the models understandable as well as accurate. The system has been applied successfully to tasks involving tens of thousands of cases described by hundreds of properties. The book starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting. Advantages and disadvantages of the C4.5 approach are discussed and illustrated with several case studies. This book and software should be of interest to developers of classification-based intelligent systems and to students in machine learning and expert systems courses.

21,674 citations
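C4.5 grows its trees by repeatedly choosing the attribute whose split most reduces class entropy. A sketch of the underlying information-gain computation (ID3's criterion; C4.5 refines it with gain ratio), on made-up nominal attributes:

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a class-label list, in bits.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Information gain of splitting on a nominal attribute:
    class entropy minus the weighted entropy of the subsets."""
    total = entropy(labels)
    n = len(rows)
    for value in set(r[attr] for r in rows):
        subset = [y for r, y in zip(rows, labels) if r[attr] == value]
        total -= (len(subset) / n) * entropy(subset)
    return total

# Toy cases: 'service' predicts the label perfectly, 'flag' does not.
rows = [{"service": "http", "flag": "SF"}, {"service": "http", "flag": "S0"},
        {"service": "ftp",  "flag": "SF"}, {"service": "ftp",  "flag": "S0"}]
labels = ["normal", "normal", "attack", "attack"]
print(info_gain(rows, labels, "service"), info_gain(rows, labels, "flag"))
```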

01 Jan 1994
TL;DR: In his new book, C4.5: Programs for Machine Learning, Quinlan has put together a definitive, much needed description of his complete system, including the latest developments, which will be a welcome addition to the library of many researchers and students.
Abstract: Algorithms for constructing decision trees are among the most well known and widely used of all machine learning methods. Among decision tree algorithms, J. Ross Quinlan's ID3 and its successor, C4.5, are probably the most popular in the machine learning community. These algorithms and variations on them have been the subject of numerous research papers since Quinlan introduced ID3. Until recently, most researchers looking for an introduction to decision trees turned to Quinlan's seminal 1986 Machine Learning journal article [Quinlan, 1986]. In his new book, C4.5: Programs for Machine Learning, Quinlan has put together a definitive, much needed description of his complete system, including the latest developments. As such, this book will be a welcome addition to the library of many researchers and students.

8,046 citations


"A detailed analysis of the KDD CUP ..." refers methods in this paper

  • ...In a similar approach, we have selected seven widely used machine learning techniques, namely J48 decision tree learning [16], Naive Bayes [17], NBTree [18], Random Forest [19], Random Tree [20], Multilayer Perceptron [21], and Support Vector Machine (SVM) [22] from the Weka [23] collection to learn the overall behavior of the KDD’99 data set....

Posted Content
TL;DR: This paper abandons the normality assumption and instead uses statistical methods for nonparametric kernel density estimation; the experimental results suggest that kernel estimation is a useful tool for learning Bayesian models.
Abstract: When modeling a probability distribution with a Bayesian network, we are faced with the problem of how to handle continuous variables. Most previous work has either solved the problem by discretizing, or assumed that the data are generated by a single Gaussian. In this paper we abandon the normality assumption and instead use statistical methods for nonparametric density estimation. For a naive Bayesian classifier, we present experimental results on a variety of natural and artificial domains, comparing two methods of density estimation: assuming normality and modeling each conditional distribution with a single Gaussian; and using nonparametric kernel density estimation. We observe large reductions in error on several natural and artificial data sets, which suggests that kernel estimation is a useful tool for learning Bayesian models.

3,071 citations


"A detailed analysis of the KDD CUP ..." refers methods in this paper

  • ...In a similar approach, we have selected seven widely used machine learning techniques, namely J48 decision tree learning [16], Naive Bayes [17], NBTree [18], Random Forest [19], Random Tree [20], Multilayer Perceptron [21], and Support Vector Machine (SVM) [22] from the Weka [23] collection to learn the overall behavior of the KDD’99 data set....
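The comparison described in this last reference — a single-Gaussian class model versus nonparametric kernel density estimation inside naive Bayes — can be sketched for one continuous feature. The bandwidth and the data are made up; a bimodal class is exactly the case where the single-Gaussian assumption fails:

```python
import math

def gaussian_kernel_density(x, samples, h=0.5):
    """Nonparametric density estimate: the average of Gaussian bumps
    centred on the training samples (bandwidth h is a free parameter)."""
    norm = 1.0 / (len(samples) * h * math.sqrt(2 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)

def nb_kde_predict(x, class_samples):
    """Naive Bayes with a per-class kernel density estimate instead of a
    single Gaussian (one continuous feature, for brevity)."""
    n_total = sum(len(s) for s in class_samples.values())
    scores = {c: (len(s) / n_total) * gaussian_kernel_density(x, s)
              for c, s in class_samples.items()}
    return max(scores, key=scores.get)

# Bimodal 'attack' feature values that a single Gaussian would model badly.
samples = {"normal": [0.9, 1.0, 1.1, 1.2],
           "attack": [-2.1, -2.0, 3.9, 4.0]}
print(nb_kde_predict(1.0, samples), nb_kde_predict(3.95, samples))
```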