Behavioral Malware Detection using Deep Graph Convolutional Neural Networks
Summary
1 Introduction
- According to a report published by AV-TEST [1], 9.74 million new malware specimens were released just in September of 2019, totaling 948 million known specimens in the wild.
- In order to collect dynamic analysis data, it is often necessary to run the program in a sandbox environment [5].
- The authors propose a novel behavioral malware detection method that exploits yet another structure of the dynamic analysis data, the graph structure of the API call sequences.
- To accomplish this task, their method is based on a state-of-the-art Deep Learning architecture designed for graph classification; more specifically, the Deep Graph Convolutional Neural Network [15].
- The rest of the paper is organized as follows.
3 Background on Deep Graph Convolutional Neural Networks
- DGCNN is a state-of-the-art neural network architecture that can directly accept graphs of arbitrary structures to learn a graph classification function [15].
- Let Ã = A + I denote the adjacency matrix of G augmented with self-loops, and let D̃ be its diagonal degree matrix, D̃i,i = ∑j Ãi,j, used for row-wise normalization.
- The graph convolution operation can then be written as follows [15]: Z = f(D̃⁻¹ÃXW) (Equation 1). This operation aggregates local substructure information by considering each node's immediate neighborhood.
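The graph convolution of Equation 1 can be sketched in a few lines of numpy. This is our own illustration (with f = ReLU and a toy 3-node graph), not the authors' implementation:

```python
import numpy as np

def graph_conv(A, X, W):
    """One graph convolution step: Z = f(D̃^{-1} Ã X W), with f = ReLU."""
    A_tilde = A + np.eye(A.shape[0])             # add self-loops: Ã = A + I
    D_inv = np.diag(1.0 / A_tilde.sum(axis=1))   # D̃^{-1}, row-wise normalization
    Z = D_inv @ A_tilde @ X @ W                  # average each node's neighborhood, then project
    return np.maximum(Z, 0.0)                    # non-linearity f (ReLU)

# 3-node path graph with 2-dimensional node features and W = identity.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])
Z = graph_conv(A, X, np.eye(2))   # each row of Z mixes a node's immediate neighborhood
```

Node 0's output row, for example, is the average of the features of node 0 and its neighbor node 1, which is exactly the "immediate neighborhood" aggregation described above.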
- 4) The ordered graph data is flattened and passed to a standard 1-dimensional CNN layer followed by a fully connected layer to learn a classification function.
- For a more comprehensive review, please refer to [15].
4 Proposed Method
- As illustrated in Figure 1, their method has eight sequential steps from data gathering to detection.
- At this point, the authors have tracked the temporal behavioral information from the PE files and the ordered set of all possible API calls.
- If multiple graph convolutional layers are stacked together to form a deep network, it is necessary to concatenate their results in order to consider multi-scale substructure features.
- Finally, the learned representations are passed to a fully connected layer (7), followed by a sigmoid layer (8) for binary classification.
- In the next sections, a more in-depth description of the method is presented.
4.1 Data Collection and Post-Processing
- The authors introduced a new public domain dataset of 42,797 malware API call sequences and 1,079 goodware API call sequences [30].
- On the other hand, the authors were motivated by the desire to provide an open dataset that the research community could further utilize and extend.
- 3) The authors built the list of unique API calls, considering all the samples, and then converted each API call name into a unique integer identifier equal to the index of that name in the list.
- The last column contains the label of the sample, 0 for goodware, and 1 for malware.
- The authors' Cuckoo sandbox environment was based on an Intel Xeon D-1540 (8 cores, 16 threads, 2.6 GHz) with 64 GB RAM and a 2 TB SSD, running Ubuntu Server 16.04 as the Cuckoo host and 8 32-bit Windows 7 Ultimate VirtualBox virtual machines in parallel as Cuckoo analysis guests.
4.2 API Call Sequences and Behavioral Graphs Generation
- On the one hand, API call sequences represent the most important part of the program behavior through time [13].
- On the other hand, graph structures encode spatial relations, such as adjacency and connectivity, between API calls.
- The authors' method leverages both temporal and spatial information for malware detection.
- In order to accomplish that, it is necessary to extract the graph structure from the API call sequences to generate their associated behavioral graphs.
- Figure 2 step I shows the behavioral graph G resulting from the adjacency matrix generated by Equation 3 applied to the API call sequence x = (0, 1, 2, 0, 2, 3).
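The behavioral-graph generation of step I can be sketched as follows. This is a hedged reconstruction: we assume Equation 3 places a directed edge from each API call to its immediate successor in the sequence, which reproduces the example x = (0, 1, 2, 0, 2, 3):

```python
import numpy as np

def behavioral_graph(x, n_apis):
    """Adjacency matrix of the behavioral graph extracted from an API call sequence.

    Assumption (our reading of Equation 3): A[src, dst] = 1 whenever the call
    dst immediately follows the call src somewhere in the trace x.
    """
    A = np.zeros((n_apis, n_apis))
    for src, dst in zip(x, x[1:]):   # consecutive pairs in the trace
        A[src, dst] = 1.0
    return A

A = behavioral_graph([0, 1, 2, 0, 2, 3], n_apis=4)
# edges: 0->1, 1->2, 2->0, 0->2, 2->3
```

Note that repeated transitions collapse into a single edge, so the graph encodes adjacency and connectivity between API calls while the sequence itself retains the temporal order.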
4.3 Deep Graph Convolutional Neural Networks and Graph Convolutional Layers
- In order to take advantage of the DGCNN architecture, let us define the node feature matrix X ∈ {0, 1}|N |×L of G as the result of one-hot encoding each xi in the API call sequence x.
- For the sake of clarity, let us examine the product AX (the reader may forgive a little abuse of notation here).
- Also, notice that the rows of AX represent ordered nodes, and the columns of X represent the behavior of the program in time given by the API call sequence x.
- Moreover, since the nodes of G are already sorted by their natural order, their model does not require the SortPooling layer introduced in [15], thus reducing its execution time.
- Finally, the term D̃−1ÃX is multiplied by the weight matrix W , allowing the model to learn higher-level representations.
4.5 The Method
- In summary, without considering the data collection and post-processing steps, their method can be implemented using Algorithm 2.
- According to the principles of Deep Learning [7], Algorithm 2 can be extended by stacking graph convolutional layers or fully connected layers, followed by a sigmoid layer for binary classification or a softmax layer for multi-class classification.
- Furthermore, the authors included a Dropout [34] layer after each graph convolutional layer to prevent overfitting, and used ReLU [35] as the activation function to perform non-linear transformations while mitigating the vanishing gradient problem. Algorithm 2 ("The Model") takes an API call sequence x as input.
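The pipeline of Algorithm 2 can be sketched as a single forward pass. This is our own illustration with random weights, not the authors' implementation (the 1-D CNN layer is omitted for brevity): one-hot node features, stacked graph convolutions whose outputs are concatenated, then a fully connected layer with a sigmoid output:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, n_apis, conv_dims=(8, 4)):
    """Illustrative forward pass: sequence -> behavioral graph -> stacked
    graph convolutions -> concatenated features -> sigmoid probability."""
    x = np.asarray(x)
    L = len(x)
    X = np.zeros((n_apis, L))
    X[x, np.arange(L)] = 1.0                    # node features: X[i, t] = 1 iff x_t is API i
    A = np.zeros((n_apis, n_apis))
    for s, d in zip(x[:-1], x[1:]):             # assumed successor-edge construction
        A[s, d] = 1.0
    A_t = A + np.eye(n_apis)                    # Ã = A + I (self-loops)
    D_inv = np.diag(1.0 / A_t.sum(axis=1))      # D̃^{-1}
    feats, H, d_in = [], X, L
    for d_out in conv_dims:                     # stacked graph convolutional layers
        W = rng.standard_normal((d_in, d_out)) * 0.1
        H = np.maximum(D_inv @ A_t @ H @ W, 0)  # Equation 1 with ReLU
        feats.append(H)                         # kept for multi-scale concatenation
        d_in = d_out
    Z = np.concatenate(feats, axis=1).ravel()   # concatenate, then flatten
    w = rng.standard_normal(Z.size) * 0.1       # stand-in fully connected weights
    return 1.0 / (1.0 + np.exp(-(Z @ w)))       # sigmoid: probability of malware

p = forward([0, 1, 2, 0, 2, 3], n_apis=4)       # a value in (0, 1)
```

Since the nodes are already sorted by their integer identifiers, no SortPooling step is needed before flattening, mirroring the simplification described in Section 4.3.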
5 Performance Evaluation
- The evaluation had two goals: first, to measure the performance of their method in detecting malware on both a balanced dataset and the original imbalanced dataset of API call sequences.
- Second, to establish a fair performance comparison between their models and LSTM networks on the same task.
- Two experiments were performed for model selection, training, and evaluation (Sections 5.1 and 5.2).
- In total, 1,296 models were defined, trained, and evaluated, resulting in 6 optimized models for malware detection using API call sequences.
5.1 Experiment 1
- In an exhaustive grid search, the model is trained and evaluated with every hyperparameter combination.
- The stratified k-fold cross-validation ensures that each training set split contains a similar proportion of positive and negative samples.
- Then, the model is trained with k − 1 folds, and then its performance is evaluated using the fold that was left out of the training process.
- The average of the evaluation performances is an estimate of the model’s performance on unseen data.
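The selection procedure above can be sketched generically with scikit-learn. Note the stand-ins: the classifier, hyperparameter grid, and synthetic data below are our own illustrations, not the authors' DGCNN or dataset:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data standing in for the API-call feature vectors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 4)),   # class 0 ("goodware")
               rng.normal(3.0, 1.0, (50, 4))])  # class 1 ("malware")
y = np.array([0] * 50 + [1] * 50)

# Stratified k-fold keeps a similar class proportion in every split;
# GridSearchCV trains on k-1 folds and evaluates on the held-out fold
# for every hyperparameter combination.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.01, 0.1, 1.0]},
                      cv=cv, scoring="roc_auc")
search.fit(X, y)
# search.best_score_ averages the per-fold scores: an estimate of
# performance on unseen data for the best hyperparameter combination.
```

With 3 candidate values and 5 folds, 15 fits are performed, which is exactly the exhaustive train-and-evaluate loop described above on a small scale.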
5.2 Experiment 2
- In the second experiment, the original imbalanced dataset of 42,797 malware API call sequences and 1,079 goodware API call sequences was considered without undersampling.
- Then, the same procedure as in Experiment 1 was followed.
6.1 Balanced Dataset
- As Table 2 shows, their models achieve the highest AUC-ROC, F1-score, precision, recall, and accuracy.
- A particularly important performance metric when evaluating malware detectors is the recall.
- High precision implies a low number of false positives, which is less critical but is desired for malware detectors.
- Ideally, both recall and precision should be high, implying a high F1-score.
- Finally, high accuracy implies a high number of correct overall predictions.
6.2 Imbalanced Dataset
- As Table 3 shows, LSTM networks achieve the best results, followed by Model-2 and Model-1, respectively; however, their models are capable of learning a classification function using considerably fewer parameters and epochs.
- AUC-ROC is the most reliable metric in this scenario [45] since even the Dummy detector achieves a relatively high F1-score and, consequently, high recall and precision.
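The point about the Dummy detector can be made concrete with synthetic numbers (our own illustration, not the paper's figures): on a set that is ~97% malware, a detector that flags everything as malware already scores a very high F1, while its ranking quality measured by AUC-ROC stays at chance level:

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.array([1] * 970 + [0] * 30)   # imbalance similar to the dataset
y_dummy = np.ones_like(y_true)            # "Dummy" detector: everything is malware
scores = rng.random(len(y_true))          # uninformative scores for ranking

f1 = f1_score(y_true, y_dummy)            # very high (precision 0.97, recall 1.0)
auc = roc_auc_score(y_true, scores)       # near 0.5: chance level
```

This is why a high F1 alone is not convincing on the imbalanced dataset, whereas AUC-ROC penalizes a detector that cannot rank goodware below malware.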
6.3 General Considerations
- In general, their models achieved similar performances to LSTM networks on the proposed task.
- As Tables 4 and 5 show, Model-1 and Model-2 have the highest dropout rates, in contrast to their comparatively small numbers of parameters.
- In fact, their models overfitted the training set just after ten epochs on average, indicating that additional dropout layers or L2 regularization [47], as well as the addition of more examples, could further improve their performance.
- In addition, notice that their work only took into account one kind of execution trace, the API call sequences.
6.4 Visualization
- In an attempt to visualize the inner workings of the models, the authors applied Principal Component Analysis (PCA) [46] to the sets of activations in the hidden layer preceding the fully connected layer during the evaluation phases.
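The visualization step amounts to a standard PCA projection of the hidden activations. A minimal sketch with synthetic stand-in activations for the two classes (the real inputs would be the activations recorded during evaluation):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in activations from the layer preceding the fully connected layer:
# 100 "goodware" samples and 100 "malware" samples, 32 hidden units each.
acts = np.vstack([rng.normal(0.0, 1.0, (100, 32)),
                  rng.normal(4.0, 1.0, (100, 32))])

# Project onto the first two principal components for a 2-D scatter plot.
coords = PCA(n_components=2).fit_transform(acts)
```

If the hidden layer has learned class-separable high-level features, the two classes form visibly distinct clusters in the resulting 2-D scatter, which is what Figure 4 inspects.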
- Deeper layers should contain high-level features that are able to be separated into classes by the fully connected layer.
- Figure 4 (a) shows the result of PCA applied to the test set, Figure 4 (b) shows the result of PCA applied to LSTM networks, and Figures 4 (c) and (d) show the PCA visualization for Model-1 and Model-2, respectively.
- Taking that into account, it is interesting to consider how a DGCNN-based behavioral malware classification method would behave in a multiclass classification problem.
7 Conclusion
- The authors propose a novel behavioral malware detection method based on DGCNNs to learn directly from API call sequences.
- In order to train, evaluate, and test the models, the authors introduced a new public domain dynamic analysis dataset of more than 40k API call sequences of malware and goodware.
- Even though DGCNNs are memory-less networks, as opposed to LSTM networks, their results show that the graph structure of the API call sequences plays an essential role in the problem of detecting whether a program is malware.
Frequently Asked Questions (13)
Q2. What are the future works in "Behavioral malware detection using deep graph convolutional neural networks" ?
Future work will explore deeper architectures as well as the problem of multiclass malware classification using API call sequences and their associated behavioral graphs.
Q3. What is the process of training a model?
The model is trained with k − 1 folds, and then its performance is evaluated using the fold that was left out of the training process.
Q4. What is the step to concatenate the results of multiple graphs?
If multiple graph convolutional layers are stacked together to form a deep network, it is necessary to concatenate their results in order to consider multi-scale substructure features.
Q5. What is the reliable metric in this scenario?
AUC-ROC is the most reliable metric in this scenario [45] since even the Dummy detector achieves a relatively high F1-score and, consequently, high recall and precision.
Q6. How many new malware specimens were released in September of 2019?
According to a report published by AV-TEST [1], 9.74 million new malware specimens were released just in September of 2019, totaling 948 million known specimens in the wild.
Q7. How many models were trained and evaluated?
In total, 1,296 models were defined, trained, and evaluated, resulting in 6 optimized models for malware detection using API call sequences.
Q8. Why can a graph network be used to learn from non-Euclidean data?
Due to their capability of learning from non-Euclidean data such as graphs, Graph Neural Networks (GNNs) [16, 17] can be applied to problems in a vast range of domains from protein classification [18] to Materials science [19].
Q9. How many API call sequences of malware are there?
In order to train, evaluate, and test the models, the authors introduced a new public domain dynamic analysis dataset of more than 40k API call sequences of malware and goodware.
Q10. How many hours did it take to collect the data?
The total running time to collect the data was about 3000 hours, resulting in approximately 50,000 Cuckoo JSON report files and 1.5 TB of raw data.
Q11. What was the first experiment considered without undersampling?
In the second experiment, the original imbalanced dataset of 42,797 malware API call sequences and 1,079 goodware API call sequences was considered without undersampling.
Q12. How do the authors extract behavioral graphs from the API call sequences?
The authors use the standpoint of dynamic analysis by extracting behavioral graphs from the API call sequences and using both the API call sequences and the behavioral graphs as inputs to a modified version of the DGCNN.
Q13. What is the description of the proposed method?
Experimental results show that the proposed method achieves similar AUC-ROC [20] and F1-Score to specialized Deep Learning architectures for sequence learning such as LSTM networks [21], widely used as the base architecture for behavioral malware detection methods [22].