Journal Article

Enhancing data quality in real-time threat intelligence systems using machine learning

Ariel Rodriguez, +1 more
01 Dec 2020, Vol. 10, Iss. 1, pp. 91
TLDR
This research proposes a method to classify open-source intelligence data into a cybersecurity-related information stream and subsequently increase the quality of that stream using an unsupervised clustering method.
Abstract
In this research, we aim to expand the utility of keyword filtering on text-based data in the domain of cyber threat intelligence. Existing research-based cyber threat intelligence systems and production systems often use keyword filtering either to obtain training data for a classification model or as a classifier in itself. This method is known to produce false positives, which degrade data quality and can cause downstream issues for the security analysts who use these systems. We propose a method to classify open-source intelligence data into a cybersecurity-related information stream and subsequently increase the quality of that stream using an unsupervised clustering method. Our method expands on keyword filtering techniques by introducing a word2vec-generated list of associated words, which assists in classifying ambiguous posts to reduce false positives while still retrieving a broad scope of data. We then apply k-means clustering to positively classified entries to identify and remove clusters that are not relevant to threats. We further explore this method by investigating whether segmenting the data by its characteristics yields better classification. Together, these methods create a higher-quality cyber threat-related data stream that can be applied to existing text-based threat intelligence systems that use keyword filtering.
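The two stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the keyword and associated-word lists are hypothetical and hard-coded (in the paper the associated words are generated with word2vec), the `min_support` threshold and the function names are assumptions, and the k-means routine is a plain Lloyd's iteration with a deterministic start rather than any particular library's version.

```python
import math

# Hypothetical lists for illustration only. In the paper, the associated
# words are produced by a word2vec model rather than written by hand.
AMBIGUOUS_KEYWORDS = {"virus", "attack", "exploit"}
ASSOCIATED_WORDS = {"malware", "ransomware", "payload", "cve", "patch", "phishing"}


def tokenize(text):
    """Lowercase the text and strip surrounding punctuation from each word."""
    return [w.strip(".,!?:;()\"'").lower() for w in text.split()]


def is_cyber_related(text, min_support=1):
    """Stage 1: keyword filter backed by an associated-words check.

    A post matches only if it contains a keyword AND at least `min_support`
    associated words, which is meant to screen out posts such as
    "heart attack" that match a keyword in a non-security sense.
    """
    tokens = set(tokenize(text))
    if not tokens & AMBIGUOUS_KEYWORDS:
        return False
    return len(tokens & ASSOCIATED_WORDS) >= min_support


def kmeans(points, k, iterations=10):
    """Stage 2: basic k-means (Lloyd's algorithm) over numeric vectors.

    The first k points seed the centroids so the result is deterministic.
    Returns one cluster label per point.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def drop_clusters(items, labels, irrelevant):
    """Keep only the items whose cluster label is not marked irrelevant."""
    return [item for item, label in zip(items, labels) if label not in irrelevant]
```

In use, posts that pass `is_cyber_related` would be converted to vectors (for example, document embeddings), clustered with `kmeans`, and filtered with `drop_clusters` once the clusters unrelated to threats have been identified. How the posts are vectorized and how irrelevant clusters are chosen are left open here.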


Citations
Journal Article

Big data analytics of social network marketing and personalized recommendations

TL;DR: Wang et al. surveyed Taiwanese fan page users through a market survey, collecting 1032 valid questionnaires. In the first research stage, the questionnaire was divided into five sections with 33 items and organized as a big data structure on a relational database.
Proceedings Article

Data divide in digital trade, and its impacts on the digital economy: A literature review

TL;DR: In this paper, the authors identify problems regarding the data divide in digital trade and offer proposals to stakeholders for tackling it through data policies and holistic trade regulatory frameworks.
Journal Article

CAVeCTIR: Matching Cyber Threat Intelligence Reports on Connected and Autonomous Vehicles Using Machine Learning

TL;DR: In this paper, the authors present CAVeCTIR, a novel approach that finds similarities between CTI reports describing malicious activities detected on connected and automated vehicles (CAVs), and that provides a quick, automated, and effective solution for clustering similar malicious activities.