Journal ArticleDOI

The Proposition and Evaluation of the RoEduNet-SIMARGL2021 Network Intrusion Detection Dataset.

24 Jun 2021-Sensors (Multidisciplinary Digital Publishing Institute)-Vol. 21, Iss: 13, pp 4319
TL;DR: In this paper, the authors evaluate machine-learning-based intrusion detection methods on network traffic coming from a real-life architecture. The work is part of an effort, completed within the SIMARGL project, to provide security against novel cyberthreats.
Abstract: Cybersecurity is an arms race, with both the defenders and the adversaries attempting to outsmart one another, coming up with new attacks, new ways to defend against those attacks, and again new ways to circumvent those defences. This situation creates a constant need for novel, realistic cybersecurity datasets. This paper presents the effects of applying machine-learning-based intrusion detection methods to network traffic coming from a real-life architecture. The main contribution of this work is a dataset coming from a real-world, academic network. Real-life traffic was collected and, after a series of attacks was performed, a dataset was assembled. The dataset contains 44 network features and an unbalanced distribution of classes. In this work, the capability of the dataset for formulating machine-learning-based models was experimentally evaluated. To investigate the stability of the obtained models, cross-validation was performed, and an array of detection metrics was reported. The gathered dataset is part of an effort to provide security against novel cyberthreats and was completed within the SIMARGL project.
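As an illustration only: the evaluation described above (cross-validated training of a classifier on an imbalanced, flow-based dataset, with several detection metrics reported) roughly corresponds to a scikit-learn pipeline of the following shape. The file name and the "Label" column are placeholders, not the dataset's actual schema.

```python
# Minimal sketch of the evaluation described above: stratified k-fold
# cross-validation of a classifier on an imbalanced flow dataset.
# The file name and the "Label" column are assumptions, not the
# dataset's actual schema.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

df = pd.read_csv("simargl_flows.csv")          # hypothetical export
X = df.drop(columns=["Label"])                 # the 44 network features
y = df["Label"]                                # unbalanced attack classes

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)

# Stratified folds preserve the skewed class distribution in each split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(
    clf, X, y, cv=cv,
    scoring=["accuracy", "f1_macro", "balanced_accuracy"],
)
for metric, values in scores.items():
    if metric.startswith("test_"):
        print(f"{metric}: {values.mean():.3f} ± {values.std():.3f}")
```

Balanced accuracy and macro-averaged F1 are reported alongside plain accuracy because, on an unbalanced class distribution, accuracy alone can look deceptively high.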
Citations
Journal ArticleDOI
TL;DR: An innovative approach is proposed which adapts sketchy data structures to extract generic and universal features and leverages the principles of domain adaptation to improve classification quality in zero- and few-shot scenarios.
Abstract: Network flow-based cyber anomaly detection is a difficult and complex task. Although several approaches to tackling this problem have been suggested, many research topics remain open. One of these concerns the problem of model transferability. There is a limited number of papers which tackle transfer learning in the context of flow-based network anomaly detection, and the proposed approaches are mostly evaluated on outdated datasets. The majority of solutions employ various sophisticated approaches, where different architectures of shallow and deep machine learning are leveraged. Analysis and experimentation show that different solutions achieve remarkable performance in a single domain, but transferring the performance to another domain is tedious and results in serious deterioration in prediction quality. In this paper, an innovative approach is proposed which adapts sketchy data structures to extract generic and universal features and leverages the principles of domain adaptation to improve classification quality in zero- and few-shot scenarios. The proposed approach achieves an F1 score of 0.99 compared to an F1 score of 0.97 achieved by the best-performing related methods.

6 citations
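For context on the "sketchy data structures" mentioned above: sketches such as the count-min sketch summarise traffic in a fixed-size table, which is one way to obtain generic, dataset-independent features. The snippet below is a generic count-min sketch, not the specific construction used in the cited paper.

```python
# Illustrative count-min sketch: a compact summary of how often flow
# keys (e.g. source IPs) occur. A generic sketch structure, not the
# cited paper's construction.
import numpy as np

class CountMinSketch:
    def __init__(self, width=1024, depth=4, seed=0):
        rng = np.random.default_rng(seed)
        self.width = width
        self.table = np.zeros((depth, width), dtype=np.int64)
        self.seeds = rng.integers(0, 2**31 - 1, size=depth)  # one hash per row

    def _rows(self, key):
        for row, s in enumerate(self.seeds):
            yield row, hash((s.item(), key)) % self.width

    def add(self, key, count=1):
        for row, col in self._rows(key):
            self.table[row, col] += count

    def estimate(self, key):
        # The sketch only overestimates; the minimum across rows is tightest.
        return min(self.table[row, col] for row, col in self._rows(key))

cms = CountMinSketch()
for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.1"]:
    cms.add(ip)
print(cms.estimate("10.0.0.1"))   # ~2
```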

Journal ArticleDOI
18 Nov 2021-Entropy
TL;DR: In this article, several feature selection techniques were applied to five flow-based network intrusion detection datasets, establishing an informative flow-based feature set; the results show that a set of 10 features and a small amount of data is enough for the final model to perform very well.
Abstract: The number of security breaches in cyberspace is on the rise. This threat is met with intensive work in the intrusion detection research community. To keep the defensive mechanisms up to date and relevant, realistic network traffic datasets are needed. The use of flow-based data for machine-learning-based network intrusion detection is a promising direction for intrusion detection systems. However, many contemporary benchmark datasets do not contain features that are usable in the wild. The main contribution of this work is to cover the research gap related to identifying and investigating valuable features in the NetFlow schema that allow for effective, machine-learning-based network intrusion detection in the real world. To achieve this goal, several feature selection techniques were applied to five flow-based network intrusion detection datasets, establishing an informative flow-based feature set. The authors’ experience with the deployment of this kind of system shows that, to close the research-to-market gap and to perform actual real-world application of machine-learning-based intrusion detection, a set of labeled data from the end-user has to be collected. This research aims at establishing the appropriate, minimal amount of data that is sufficient to effectively train machine learning algorithms in intrusion detection. The results show that a set of 10 features and a small amount of data is enough for the final model to perform very well.

5 citations
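A hedged sketch of the feature-ranking step the abstract describes: scoring NetFlow-style features and keeping the 10 most informative ones. The file name and "Label" column are assumptions, not the actual datasets used in the paper.

```python
# Sketch of the feature-ranking step: score flow features against the
# class label and keep the 10 most informative ones.
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

df = pd.read_csv("netflow_dataset.csv")        # hypothetical flow export
X = df.drop(columns=["Label"])
y = df["Label"]

selector = SelectKBest(score_func=mutual_info_classif, k=10)
selector.fit(X, y)
top10 = X.columns[selector.get_support()]
print("Selected flow features:", list(top10))
```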

Proceedings ArticleDOI
15 Jun 2022
TL;DR: It is claimed that the data in the VHS-22 dataset are more demanding, and therefore that the dataset can better stimulate further progress in detecting network threats.
Abstract: Researching new methods of detecting network threats, e.g., malware-related ones, requires large and diverse sets of data. In recent years, a variety of network traffic datasets have been proposed and intensively used by the research community. However, most of them are quite homogeneous, which means that detecting threats using these data has become relatively easy, allowing for detection accuracy close to 100%; they are therefore no longer a challenge. As a remedy, in this article we propose VHS-22 – a Very Heterogeneous Set of network traffic data. We prepared it using a software network probe and a set of existing datasets. We describe the process of dataset creation, as well as its basic statistics. We also present initial experiments on attack detection, which yielded lower results than for other datasets. We claim that the data in the VHS-22 dataset are more demanding, and therefore that our dataset can better stimulate further progress in detecting network threats.

2 citations

Journal ArticleDOI
03 Sep 2021-Sensors
TL;DR: Wang et al. proposed a one-dimensional convolution-based fusion model of packet capture files and business feature data for malicious network behavior detection, which improves detection results compared with single-source models on several available network traffic and IoT datasets.
Abstract: Information and communication technologies have an essential impact on people’s lives. The real-time convenience of the internet greatly facilitates the information transmission and knowledge exchange of users. However, network intruders exploit communication holes to carry out malicious attacks. Traditional machine learning (ML) methods based on business features, and deep learning (DL) methods that extract features automatically, are used to identify these malicious behaviors. However, these approaches tend to use only one type of data source, which can result in the loss of features that cannot be mined from the data. To address this problem and to improve the precision of malicious behavior detection, this paper proposes a one-dimensional (1D) convolution-based fusion model of packet capture files and business feature data for malicious network behavior detection. The fusion model improves malicious behavior detection results compared with single-source models on several available network traffic and Internet of Things (IoT) datasets. The experiments also indicate that early data fusion, feature fusion and decision fusion are all effective in the model. Moreover, this paper also discusses the suitability of one-dimensional and two-dimensional (2D) convolution for network traffic data.

2 citations
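The two-branch architecture described above can be sketched as follows: a 1D-convolutional branch over raw packet bytes and a dense branch over tabular "business" features, fused by concatenation. Layer sizes and the fusion point are illustrative assumptions, not the paper's exact model.

```python
# Minimal two-branch fusion sketch: 1D convolutions over raw packet
# bytes, an MLP over tabular features, feature-level fusion by concat.
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    def __init__(self, packet_len=1024, n_features=44, n_classes=2):
        super().__init__()
        self.conv = nn.Sequential(                 # raw-bytes branch
            nn.Conv1d(1, 16, kernel_size=7, stride=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        self.mlp = nn.Sequential(                  # tabular branch
            nn.Linear(n_features, 32), nn.ReLU(),
        )
        self.head = nn.Linear(32 + 32, n_classes)  # feature-level fusion

    def forward(self, packets, features):
        z = torch.cat([self.conv(packets), self.mlp(features)], dim=1)
        return self.head(z)

model = FusionNet()
logits = model(torch.randn(8, 1, 1024), torch.randn(8, 44))
print(logits.shape)   # torch.Size([8, 2])
```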

Journal ArticleDOI
TL;DR: This paper provides a number of practical recommendations for policymakers, as well as cybersecurity managers, on how to make cybersecurity more human-centred; it also inspires further research directions.
Abstract: Purpose The purpose of this paper is to challenge the prevailing, stereotypical approach to the human aspect of cybersecurity, i.e. treating people as a weakness or threat. Instead, several reflections are presented, pertaining to ways of making cybersecurity human-centred. Design/methodology/approach This paper is based on the authors’ own experiences, gathered whilst working on cybersecurity projects; the resulting comments and reflections have been enriched and backed up by the results of a targeted literature study. Findings The findings show that the way the human aspects of cybersecurity are understood is changing, and deviates from the stereotypical approach. Practical implications This paper provides a number of practical recommendations for policymakers, as well as cybersecurity managers, on how to make cybersecurity more human-centred; it also inspires further research directions. Originality/value This paper presents a fresh, positive approach to humans in cybersecurity and opens the doors to further discourse about new paradigms in the field.

2 citations

References
Journal ArticleDOI
01 Oct 2001
TL;DR: Internal estimates monitor error, strength, and correlation; these are used to show the response to increasing the number of features used in the splitting, and the ideas are also applicable to regression.
Abstract: Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International Conference, 1996, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation, and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.

79,257 citations
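To make the mechanism described in the abstract concrete, the following toy reconstruction fits each tree on a bootstrap sample, considers a random feature subset at each split, and combines the trees by majority vote. A didactic sketch, not Breiman's reference implementation.

```python
# Toy random forest: bootstrap samples + random per-split feature
# subsets + majority vote across trees.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))          # bootstrap sample
    t = DecisionTreeClassifier(max_features="sqrt",     # random split features
                               random_state=int(rng.integers(1 << 30)))
    trees.append(t.fit(X[idx], y[idx]))

votes = np.stack([t.predict(X) for t in trees])          # (25, 500)
majority = (votes.mean(axis=0) > 0.5).astype(int)        # ensemble vote
print("training accuracy:", (majority == y).mean())
```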

Journal ArticleDOI
28 May 2015-Nature
TL;DR: Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years, and will have many more successes in the near future because it requires very little engineering by hand and can easily take advantage of increases in the amount of available computation and data.
Abstract: Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.

46,982 citations
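A minimal sketch of the training loop the abstract alludes to: a small multi-layer network whose internal parameters are adjusted by backpropagation of a loss gradient. The data and layer sizes here are arbitrary toy choices.

```python
# Multiple processing layers trained by backpropagation (toy example).
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 10)
y = (X.sum(dim=1, keepdim=True) > 0).float()   # toy binary target

net = nn.Sequential(                            # two levels of representation
    nn.Linear(10, 32), nn.ReLU(),
    nn.Linear(32, 1),
)
opt = torch.optim.SGD(net.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(200):
    opt.zero_grad()
    loss = loss_fn(net(X), y)
    loss.backward()                             # backpropagation
    opt.step()                                  # parameter update
print(f"final loss: {loss.item():.4f}")
```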

Journal ArticleDOI
TL;DR: A general gradient descent boosting paradigm is developed for additive expansions based on any fitting criterion, and specific algorithms are presented for least-squares, least absolute deviation, and Huber-M loss functions for regression, and multiclass logistic likelihood for classification.
Abstract: Function estimation/approximation is viewed from the perspective of numerical optimization in function space, rather than parameter space. A connection is made between stagewise additive expansions and steepest-descent minimization. A general gradient descent “boosting” paradigm is developed for additive expansions based on any fitting criterion. Specific algorithms are presented for least-squares, least absolute deviation, and Huber-M loss functions for regression, and multiclass logistic likelihood for classification. Special enhancements are derived for the particular case where the individual additive components are regression trees, and tools for interpreting such “TreeBoost” models are presented. Gradient boosting of regression trees produces competitive, highly robust, interpretable procedures for both regression and classification, especially appropriate for mining less than clean data. Connections between this approach and the boosting methods of Freund and Schapire and of Friedman, Hastie and Tibshirani are discussed.

17,764 citations
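For the least-squares case the paradigm reduces to repeatedly fitting regression trees to the current residuals (the negative gradient of the squared loss). The toy loop below illustrates this, with a hypothetical learning rate and tree depth.

```python
# Minimal least-squares gradient boosting: each tree fits the residuals
# and is added to the ensemble with a small learning rate.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

pred = np.zeros_like(y)
lr, trees = 0.1, []
for _ in range(100):
    residual = y - pred                       # negative gradient of L2 loss
    t = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(t)
    pred += lr * t.predict(X)

print("train MSE:", np.mean((y - pred) ** 2))
```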

Journal ArticleDOI
TL;DR: A method of over-sampling the minority class by creating synthetic minority class examples is proposed and evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
Abstract: An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominately composed of “normal” examples with only a small percentage of “abnormal” or “interesting” examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.

17,313 citations
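The synthetic-example construction can be sketched in a few lines: each new point interpolates between a minority sample and one of its k nearest minority neighbours. This is a didactic re-implementation of the idea, not the authors' original code.

```python
# Minimal SMOTE-style over-sampling: synthesise points on the segment
# between a minority sample and one of its k nearest minority neighbours.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, seed=0):
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1 skips self
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))                     # a minority point
        j = rng.choice(idx[i][1:])                       # one of its neighbours
        gap = rng.random()                               # interpolation factor
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_min = np.random.default_rng(1).normal(size=(20, 4))    # toy minority class
print(smote(X_min, n_new=10).shape)                      # (10, 4)
```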

Journal ArticleDOI
TL;DR: In this article, it is shown that many particular choices among possible neurophysiological assumptions are equivalent, in the sense that for every net behaving under one assumption, there exists another net which behaves under another and gives the same results, although perhaps not in the same time.

14,937 citations
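For illustration only, a McCulloch-Pitts threshold unit: the output fires iff the weighted sum of binary inputs reaches the threshold. Different weight/threshold settings realise different logical functions, which is the sense in which distinct nets can produce the same behaviour.

```python
# Toy McCulloch-Pitts neuron: fires iff the weighted input sum reaches
# the threshold. AND and OR differ only in the threshold chosen.
def mp_neuron(inputs, weights, threshold):
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)

AND = lambda a, b: mp_neuron((a, b), (1, 1), 2)
OR  = lambda a, b: mp_neuron((a, b), (1, 1), 1)
print([AND(a, b) for a in (0, 1) for b in (0, 1)])  # [0, 0, 0, 1]
print([OR(a, b) for a in (0, 1) for b in (0, 1)])   # [0, 1, 1, 1]
```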