Enhancing data quality in real-time threat intelligence systems using machine learning

doi:10.1007/S13278-020-00707-X

Journal Article

Ariel Rodriguez, +1 more

01 Dec 2020 · Social Network Analysis and Mining · Vol. 10, Iss. 1, p. 91

TLDR

This research proposes a method to classify open-source intelligence data into a cybersecurity-related information stream and subsequently increase the quality of that stream using an unsupervised clustering method.

Abstract:

In this research, we aim to expand the utility of keyword filtering on text-based data in the domain of cyber threat intelligence. Existing cyber threat intelligence systems, both research prototypes and production systems, often use keyword filtering either to obtain training data for a classification model or as a classifier in itself. This approach is known to produce false positives, which degrade data quality and can cause downstream problems for the security analysts who rely on these systems. We propose a method that classifies open-source intelligence data into a cybersecurity-related information stream and then increases the quality of that stream using unsupervised clustering. Our method extends keyword filtering with a word2vec-generated list of associated words, which helps classify ambiguous posts and reduces false positives while still retrieving data of large scope. We then apply k-means clustering to the positively classified entries to identify and remove clusters that are not relevant to threats. We further investigate whether segmenting the data by its characteristics improves classification. Together, these methods produce a higher-quality cyber threat-related data stream that can be applied to existing text-based threat intelligence systems that use keyword filtering.
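
The two stages described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The word lists are hypothetical and hard-coded (in the paper the associated words are generated with word2vec). The decision rule "keyword hit plus at least `min_support` associated words" is an assumption, since the abstract does not state the rule. Bag-of-words vectors and a plain Lloyd's k-means stand in for whatever features and clustering setup the authors used, and the choice of which clusters to drop is left to an analyst.

```python
import re
from collections import Counter

# Hypothetical word lists for illustration only. In the paper the
# associated-words list is generated with word2vec, not written by hand.
KEYWORDS = {"virus", "trojan", "exploit", "breach"}
ASSOCIATED = {"malware", "infected", "patch", "server", "attack", "payload"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def is_security_related(text, min_support=1):
    """Assumed rule: a keyword hit counts only if at least
    min_support associated words also appear in the post."""
    tokens = set(tokenize(text))
    if not tokens & KEYWORDS:
        return False
    return len(tokens & ASSOCIATED) >= min_support

def vectorize(texts):
    """Bag-of-words count vectors over a shared, sorted vocabulary."""
    vocab = sorted({t for text in texts for t in tokenize(text)})
    vectors = []
    for text in texts:
        counts = Counter(tokenize(text))
        vectors.append([float(counts[w]) for w in vocab])
    return vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's k-means, seeded with the first k vectors so the
    result is deterministic. Assumes k <= len(vectors)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every vector.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def drop_clusters(texts, labels, irrelevant):
    """Keep only the texts whose cluster id is not in `irrelevant`."""
    return [t for t, lab in zip(texts, labels) if lab not in irrelevant]

if __name__ == "__main__":
    posts = [
        "New trojan payload infected the mail server",
        "I caught a virus and stayed home from school",
        "Exploit released before the vendor patch",
        "Trojan horse statue exhibit opens downtown",
    ]
    # Stage 1: keyword filter refined by associated words.
    positives = [p for p in posts if is_security_related(p)]
    # Stage 2: cluster the positives so an analyst can inspect each group.
    labels = kmeans(vectorize(positives), k=2)
    for post, label in zip(positives, labels):
        print(label, post)
```

In this sketch the ambiguous posts (the illness and the statue) match a keyword but no associated word, so the first stage rejects them. After inspecting the printed clusters, an analyst would pass the ids of off-topic clusters to `drop_clusters` to remove them from the stream.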
