Behavioral Malware Detection using Deep Graph Convolutional Neural Networks
Summary
1 Introduction
- According to a report published by AV-TEST [1], 9.74 million new malware specimens were released just in September of 2019, totaling 948 million known specimens in the wild.
- In order to collect dynamic analysis data, it is often necessary to run the program in a sandbox environment [5].
- The authors propose a novel behavioral malware detection method that exploits yet another structure of the dynamic analysis data, the graph structure of the API call sequences.
- To accomplish this task, their method is based on a state-of-the-art Deep Learning architecture designed for graph classification; more specifically, the Deep Graph Convolutional Neural Network [15].
- The rest of the paper is organized as follows.
3 Background on Deep Graph Convolutional Neural Networks
- DGCNN is a state-of-the-art neural network architecture that can directly accept graphs of arbitrary structures to learn a graph classification function [15].
- Let Ã = A + I denote the adjacency matrix of G augmented with self-loops, and let D̃ be its diagonal degree matrix, D̃i,i = ∑j Ãi,j, used for row-wise normalization.
- The graph convolution operation can then be written as follows [15]: Z = f(D̃⁻¹ÃXW) (Equation 1). This operation aggregates local substructure information by considering each node's immediate neighborhood.
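The graph convolution of Equation 1 can be sketched in a few lines of numpy. This is our own illustration (with f = ReLU and a toy 3-node graph), not the authors' implementation:

```python
import numpy as np

def graph_conv(A, X, W):
    """One graph convolution step: Z = f(D̃^{-1} Ã X W), with f = ReLU."""
    A_tilde = A + np.eye(A.shape[0])             # add self-loops: Ã = A + I
    D_inv = np.diag(1.0 / A_tilde.sum(axis=1))   # D̃^{-1}, row-wise normalization
    Z = D_inv @ A_tilde @ X @ W                  # average each node's neighborhood, then project
    return np.maximum(Z, 0.0)                    # non-linearity f (ReLU)

# 3-node path graph with 2-dimensional node features and W = identity.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])
Z = graph_conv(A, X, np.eye(2))   # each row of Z mixes a node's immediate neighborhood
```

Node 0's output row, for example, is the average of the features of node 0 and its neighbor node 1, which is exactly the "immediate neighborhood" aggregation described above.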
- 4) The ordered graph data is flattened and passed to a standard 1-dimensional CNN layer followed by a fully connected layer to learn a classification function.
- For a more comprehensive review, please refer to [15].
4 Proposed Method
- As illustrated in Figure 1, their method has eight sequential steps from data gathering to detection.
- At this point, the authors have tracked the temporal behavioral information from the PE files and the ordered set of all possible API calls.
- If multiple graph convolutional layers are stacked together to form a deep network, it is necessary to concatenate their results in order to consider multi-scale substructure features.
- Finally, the learned representations are passed to a fully connected layer (7), followed by a sigmoid layer (8) for binary classification.
- In the next sections, a more in-depth description of the method is presented.
4.1 Data Collection and Post-Processing
- The authors introduced a new public domain dataset of 42,797 malware API call sequences and 1,079 goodware API call sequences [30].
- On the other hand, the authors were motivated by the desire to provide an open dataset that the research community could further utilize and extend.
- 3) The authors built the list of unique API calls, considering all the samples, and then converted each API call name into a unique integer identifier equal to the index of that name in the list.
- The last column contains the label of the sample, 0 for goodware, and 1 for malware.
- The authors' Cuckoo sandbox environment was based on an Intel Xeon D-1540 (8 cores, 16 threads, 2.6 GHz) with 64 GB RAM and a 2 TB SSD, running Ubuntu Server 16.04 as the Cuckoo host and 8 32-bit Windows 7 Ultimate VirtualBox virtual machines in parallel as Cuckoo analysis guests.
4.2 API Call Sequences and Behavioral Graphs Generation
- On the one hand, API call sequences represent the most important part of the program behavior through time [13].
- On the other hand, graph structures encode spatial relations, such as adjacency and connectivity, between API calls.
- The authors' method leverages both temporal and spatial information for malware detection.
- In order to accomplish that, it is necessary to extract the graph structure from the API call sequences to generate their associated behavioral graphs.
- Figure 2 step I shows the behavioral graph G resulting from the adjacency matrix generated by Equation 3 applied to the API call sequence x = (0, 1, 2, 0, 2, 3).
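The behavioral-graph generation of step I can be sketched as follows. This is a hedged reconstruction: we assume Equation 3 places a directed edge from each API call to its immediate successor in the sequence, which reproduces the example x = (0, 1, 2, 0, 2, 3):

```python
import numpy as np

def behavioral_graph(x, n_apis):
    """Adjacency matrix of the behavioral graph extracted from an API call sequence.

    Assumption (our reading of Equation 3): A[src, dst] = 1 whenever the call
    dst immediately follows the call src somewhere in the trace x.
    """
    A = np.zeros((n_apis, n_apis))
    for src, dst in zip(x, x[1:]):   # consecutive pairs in the trace
        A[src, dst] = 1.0
    return A

A = behavioral_graph([0, 1, 2, 0, 2, 3], n_apis=4)
# edges: 0->1, 1->2, 2->0, 0->2, 2->3
```

Note that repeated transitions collapse into a single edge, so the graph encodes adjacency and connectivity between API calls while the sequence itself retains the temporal order.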
4.3 Deep Graph Convolutional Neural Networks and Graph Convolutional Layers
- In order to take advantage of the DGCNN architecture, let us define the node feature matrix X ∈ {0, 1}|N |×L of G as the result of one-hot encoding each xi in the API call sequence x.
- For the sake of clarity, let us examine the product AX (the reader may forgive a little abuse of notation here).
- Also, notice that the rows of AX represent ordered nodes, and the columns of X represent the behavior of the program in time given by the API call sequence x.
- Moreover, since the nodes of G are already sorted by their natural order, their model does not require the SortPooling layer introduced in [15], thus reducing its execution time.
- Finally, the term D̃−1ÃX is multiplied by the weight matrix W , allowing the model to learn higher-level representations.
4.5 The Method
- In summary, without considering the data collection and post-processing steps, their method can be implemented using Algorithm 2.
- According to the principles of Deep Learning [7], Algorithm 2 can be extended by stacking graph convolutional layers or fully connected layers, followed by a sigmoid layer for binary classification or a softmax layer for multi-class classification.
- Furthermore, the authors included a Dropout [34] layer after each graph convolutional layer to prevent overfitting, and used ReLU [35] as the activation function to perform non-linear transformations while mitigating the vanishing gradient problem. Algorithm 2 ("The Model") takes an API call sequence x as input.
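The pipeline of Algorithm 2 can be sketched as a single forward pass. This is our own illustration with random weights, not the authors' implementation (the 1-D CNN layer is omitted for brevity): one-hot node features, stacked graph convolutions whose outputs are concatenated, then a fully connected layer with a sigmoid output:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, n_apis, conv_dims=(8, 4)):
    """Illustrative forward pass: sequence -> behavioral graph -> stacked
    graph convolutions -> concatenated features -> sigmoid probability."""
    x = np.asarray(x)
    L = len(x)
    X = np.zeros((n_apis, L))
    X[x, np.arange(L)] = 1.0                    # node features: X[i, t] = 1 iff x_t is API i
    A = np.zeros((n_apis, n_apis))
    for s, d in zip(x[:-1], x[1:]):             # assumed successor-edge construction
        A[s, d] = 1.0
    A_t = A + np.eye(n_apis)                    # Ã = A + I (self-loops)
    D_inv = np.diag(1.0 / A_t.sum(axis=1))      # D̃^{-1}
    feats, H, d_in = [], X, L
    for d_out in conv_dims:                     # stacked graph convolutional layers
        W = rng.standard_normal((d_in, d_out)) * 0.1
        H = np.maximum(D_inv @ A_t @ H @ W, 0)  # Equation 1 with ReLU
        feats.append(H)                         # kept for multi-scale concatenation
        d_in = d_out
    Z = np.concatenate(feats, axis=1).ravel()   # concatenate, then flatten
    w = rng.standard_normal(Z.size) * 0.1       # stand-in fully connected weights
    return 1.0 / (1.0 + np.exp(-(Z @ w)))       # sigmoid: probability of malware

p = forward([0, 1, 2, 0, 2, 3], n_apis=4)       # a value in (0, 1)
```

Since the nodes are already sorted by their integer identifiers, no SortPooling step is needed before flattening, mirroring the simplification described in Section 4.3.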
5 Performance Evaluation
- The evaluation had two goals: first, to measure the performance of their method in detecting malware on both a balanced dataset and the original imbalanced dataset of API call sequences.
- Second, to establish a fair performance comparison between their models and LSTM networks on the same task.
- Two experiments were performed for model selection, training, and evaluation (Sections 5.1 and 5.2).
- In total, 1,296 models were defined, trained, and evaluated, resulting in 6 optimized models for malware detection using API call sequences.
5.1 Experiment 1
- In an exhaustive grid search, the model is trained and evaluated with every hyperparameter combination.
- The stratified k-fold cross-validation ensures that each training set split contains a similar proportion of positive and negative samples.
- Then, the model is trained with k − 1 folds, and then its performance is evaluated using the fold that was left out of the training process.
- The average of the evaluation performances is an estimate of the model’s performance on unseen data.
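The selection procedure above can be sketched generically with scikit-learn. Note the stand-ins: the classifier, hyperparameter grid, and synthetic data below are our own illustrations, not the authors' DGCNN or dataset:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data standing in for the API-call feature vectors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 4)),   # class 0 ("goodware")
               rng.normal(3.0, 1.0, (50, 4))])  # class 1 ("malware")
y = np.array([0] * 50 + [1] * 50)

# Stratified k-fold keeps a similar class proportion in every split;
# GridSearchCV trains on k-1 folds and evaluates on the held-out fold
# for every hyperparameter combination.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.01, 0.1, 1.0]},
                      cv=cv, scoring="roc_auc")
search.fit(X, y)
# search.best_score_ averages the per-fold scores: an estimate of
# performance on unseen data for the best hyperparameter combination.
```

With 3 candidate values and 5 folds, 15 fits are performed, which is exactly the exhaustive train-and-evaluate loop described above on a small scale.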
5.2 Experiment 2
- In the second experiment, the original imbalanced dataset of 42,797 malware API call sequences and 1,079 goodware API call sequences was considered without undersampling.
- Then, the same procedure as in Experiment 1 was followed.
6.1 Balanced Dataset
- As Table 2 shows, their models achieve the highest AUC-ROC, F1-score, precision, recall, and accuracy.
- A particularly important performance metric when evaluating malware detectors is the recall.
- High precision implies a low number of false positives, which is less critical but is desired for malware detectors.
- Ideally, both recall and precision should be high, implying a high F1-score.
- Finally, high accuracy implies a high number of correct overall predictions.
6.2 Imbalanced Dataset
- As Table 3 shows, LSTM networks achieve the best results, followed by Model-2 and Model-1, respectively; however, their models are capable of learning a classification function using considerably fewer parameters and epochs.
- AUC-ROC is the most reliable metric in this scenario [45] since even the Dummy detector achieves a relatively high F1-score and, consequently, high recall and precision.
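The point about the Dummy detector can be made concrete with synthetic numbers (our own illustration, not the paper's figures): on a set that is ~97% malware, a detector that flags everything as malware already scores a very high F1, while its ranking quality measured by AUC-ROC stays at chance level:

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.array([1] * 970 + [0] * 30)   # imbalance similar to the dataset
y_dummy = np.ones_like(y_true)            # "Dummy" detector: everything is malware
scores = rng.random(len(y_true))          # uninformative scores for ranking

f1 = f1_score(y_true, y_dummy)            # very high (precision 0.97, recall 1.0)
auc = roc_auc_score(y_true, scores)       # near 0.5: chance level
```

This is why a high F1 alone is not convincing on the imbalanced dataset, whereas AUC-ROC penalizes a detector that cannot rank goodware below malware.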
6.3 General Considerations
- In general, their models achieved similar performances to LSTM networks on the proposed task.
- As Tables 4 and 5 show, Model-1 and Model-2 have the highest dropout rates, in contrast to their comparatively small numbers of parameters.
- In fact, their models overfitted the training set just after ten epochs on average, indicating that additional dropout layers or L2 regularization [47], as well as the addition of more examples, could further improve their performance.
- In addition, notice that their work only took into account one kind of execution trace, the API call sequences.
6.4 Visualization
- In an attempt to visualize the inner workings of the models, the authors applied Principal Component Analysis (PCA) [46] to the sets of activations in the hidden layer preceding the fully connected layer during the evaluation phases.
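The visualization step amounts to a standard PCA projection of the hidden activations. A minimal sketch with synthetic stand-in activations for the two classes (the real inputs would be the activations recorded during evaluation):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in activations from the layer preceding the fully connected layer:
# 100 "goodware" samples and 100 "malware" samples, 32 hidden units each.
acts = np.vstack([rng.normal(0.0, 1.0, (100, 32)),
                  rng.normal(4.0, 1.0, (100, 32))])

# Project onto the first two principal components for a 2-D scatter plot.
coords = PCA(n_components=2).fit_transform(acts)
```

If the hidden layer has learned class-separable high-level features, the two classes form visibly distinct clusters in the resulting 2-D scatter, which is what Figure 4 inspects.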
- Deeper layers should contain high-level features that are able to be separated into classes by the fully connected layer.
- Figure 4 (a) shows the result of PCA applied to the test set, Figure 4 (b) shows the result of PCA applied to LSTM networks, and Figures 4 (c) and (d) show the PCA visualization for Model-1 and Model-2, respectively.
- Taking that into account, it is interesting to consider how a DGCNN-based behavioral malware classification method would behave in a multiclass classification problem.
7 Conclusion
- The authors propose a novel behavioral malware detection method based on DGCNNs to learn directly from API call sequences.
- In order to train, evaluate, and test the models, the authors introduced a new public domain dynamic analysis dataset of more than 40k API call sequences of malware and goodware.
- Even though DGCNNs are memory-less networks, as opposed to LSTM networks, their results show that the graph structure of the API call sequences plays an essential role in the problem of detecting whether a program is malware.
Frequently Asked Questions (13)
Q2. What are the future works in "Behavioral malware detection using deep graph convolutional neural networks" ?
Future work will explore deeper architectures as well as the problem of multiclass malware classification using API call sequences and their associated behavioral graphs.
Q3. What is the process of training a model?
The model is trained with k − 1 folds, and then its performance is evaluated using the fold that was left out of the training process.
Q4. What is the step to concatenate the results of multiple graphs?
If multiple graph convolutional layers are stacked together to form a deep network, it is necessary to concatenate their results in order to consider multi-scale substructure features.
Q5. What is the reliable metric in this scenario?
AUC-ROC is the most reliable metric in this scenario [45] since even the Dummy detector achieves a relatively high F1-score and, consequently, high recall and precision.
Q6. How many new malware specimens were released in September of 2019?
According to a report published by AV-TEST [1], 9.74 million new malware specimens were released just in September of 2019, totaling 948 million known specimens in the wild.
Q7. How many models were trained and evaluated?
In total, 1,296 models were defined, trained, and evaluated, resulting in 6 optimized models for malware detection using API call sequences.
Q8. Why can a graph network be used to learn from non-Euclidean data?
Due to their capability of learning from non-Euclidean data such as graphs, Graph Neural Networks (GNNs) [16, 17] can be applied to problems in a vast range of domains from protein classification [18] to Materials science [19].
Q9. How many API call sequences of malware are there?
In order to train, evaluate, and test the models, the authors introduced a new public domain dynamic analysis dataset of more than 40k API call sequences of malware and goodware.
Q10. How many hours did it take to collect the data?
The total running time to collect the data was about 3000 hours, resulting in approximately 50,000 Cuckoo JSON report files and 1.5 TB of raw data.
Q11. What was the first experiment considered without undersampling?
In the second experiment, the original imbalanced dataset of 42,797 malware API call sequences and 1,079 goodware API call sequences was considered without undersampling.
Q12. How do the authors extract behavioral graphs from the API call sequences?
The authors use the standpoint of dynamic analysis by extracting behavioral graphs from the API call sequences and using both the API call sequences and the behavioral graphs as inputs to a modified version of the DGCNN.
Q13. What is the description of the proposed method?
Experimental results show that the proposed method achieves similar AUC-ROC [20] and F1-Score to specialized Deep Learning architectures for sequence learning such as LSTM networks [21], widely used as the base architecture for behavioral malware detection methods [22].