A Multimodal Deep Learning Method for Android Malware Detection Using Various Features

Abstract
With the widespread use of smartphones, the amount of malware has been increasing exponentially. Among smart devices, Android devices are the most targeted by malware because of their high popularity. This paper proposes a novel framework for Android malware detection. Our framework uses various kinds of features to reflect the properties of Android applications from various aspects, and the features are refined using our existence-based or similarity-based feature extraction method for effective feature representation in malware detection. In addition, a multimodal deep learning method is proposed as the malware detection model. This paper is the first study to apply multimodal deep learning to Android malware detection. With our detection model, it was possible to maximize the benefits of encompassing multiple feature types. To evaluate the performance, we carried out various experiments with a total of 41,260 samples. We compared the accuracy of our model with that of other deep neural network models. Furthermore, we evaluated our framework in various aspects, including the efficiency of model updates, the usefulness of diverse features, and our feature representation method. In addition, we compared the performance of our framework with that of other existing methods, including deep learning-based methods.


Kim, T., Kang, B., Rho, M., Sezer, S., & Im, E. G. (2018). A Multimodal Deep Learning Method for Android Malware Detection using Various Features. IEEE Transactions on Information Forensics and Security, 14(3), 773-788. https://doi.org/10.1109/TIFS.2018.2866319
Published in:
IEEE Transactions on Information Forensics and Security
Document Version:
Peer reviewed version
Queen's University Belfast - Research Portal:
Link to publication record in Queen's University Belfast Research Portal
Publisher rights
© 2018 IEEE.
This work is made available online in accordance with the publisher’s policies. Please refer to any applicable terms of use of the publisher.

T-IFS-07942-2017
Index Terms: Android malware, malware detection, intrusion detection, machine learning, neural network.
TaeGuen Kim, BooJoong Kang, Mina Rho, Sakir Sezer and Eul Gyu Im

This paper was first submitted on Oct. 18th, 2017. This research was supported by the MSIT (Ministry of Science, ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2018-2013-1-00881) supervised by the IITP (Institute for Information & communication Technology Promotion). This work was supported by Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No.2017-0-00388, Development of Defense Technologies against Ransomware). This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. NRF-2016R1A2B4015254).
TaeGuen Kim is with the Department of Computer and Software, Hanyang University, Seoul, 04763 Korea (e-mail: cloudio17@hanyang.ac.kr).
Boojoong Kang is with the Centre for Secure Information Technologies (CSIT), Queen’s University of Belfast, Belfast, UK (e-mail: B.Kang@qub.ac.uk).
Mina Rho is with the Department of Computer Science and Engineering, Hanyang University, Seoul, 04763 Korea (e-mail: minarho@hanyang.ac.kr).
Sakir Sezer is with the Centre for Secure Information Technologies (CSIT), Queen’s University of Belfast, Belfast, UK (e-mail: s.sezer@qub.ac.uk).
Eul Gyu Im is with the Department of Computer Science and Engineering, Hanyang University, Seoul, 04763 Korea (e-mail: imeg@hanyang.ac.kr).

I. INTRODUCTION
With the growing popularity of mobile devices such as smartphones and tablets, attacks on mobile devices have been increasing. Mobile malware is one of the most dangerous threats, causing various security incidents as well as financial damage. According to the G DATA report [1], security experts discovered about 750,000 new Android malware applications during the first quarter of 2017. A large amount of mobile malware is expected to continue to be developed and spread in order to commit various cybercrimes on mobile devices.
Android is the mobile operating system most targeted by mobile malware because of the popularity of Android devices. In addition to the number of Android devices, there is another reason that leads malware authors to develop Android malware: the Android operating system allows users to install applications downloaded from third-party markets, and attackers can lure or mislead Android users into downloading malicious or suspicious applications from attackers’ servers.
To mitigate attacks by Android malware, various research approaches have been proposed so far. The malware detection approaches can be classified into two categories: static analysis based detection [2-19] and dynamic analysis based detection [20-24]. The static analysis based methods use syntactic features that can be extracted without executing an application, whereas the dynamic analysis based methods use semantic features that can be monitored when an application is executed in a controlled environment. Static analysis has the advantage that no execution environment needs to be set up, and its computational overhead is relatively low. Dynamic analysis has the advantage that it can handle malicious applications that use obfuscation techniques such as code encryption or packing.
In this paper, we assume that obfuscated malware is processed by dynamic analysis based methods, and we focus on the development of a static analysis based method to distinguish between malware and benign applications. This paper proposes a novel malware detection framework based on various static features. Our framework makes it easy to add new feature types, so dynamic features could also be utilized in the future.
There are many previous works related to Android malware detection, but most of them use only limited types of features to detect malware. Each type of feature can represent only a few properties of applications. In contrast, we propose a framework that detects malware using many kinds of feature information to reflect the characteristics of applications from various aspects. Our proposed framework first extracts and processes multiple feature types, and refines them using our feature vector generation methods. Our feature vector generation method consists of an existence-based method and a similarity-based method, and these are very effective in distinguishing between malware and benign applications even though malware shares many properties with benign applications. In addition, our framework uses a classification model that reflects each feature in the classification according to its importance. Among many useful classification algorithms, we concluded that a deep learning algorithm is a suitable classification algorithm for our framework, which uses various types of features.
We propose a multimodal deep neural network model to fit features with different properties. Multimodal deep learning is generally used to make a neural network reflect the properties of different kinds of features. For example, it has been used to recognize human speech using both voice information and mouth shape information [48]. The different feature types are input to and processed by separate initial neural networks, and each initial network is connected to a final neural network that produces the classification results. According to our survey, our research is the first application of multimodal deep learning to Android malware detection.
We conducted many experiments using our framework with a large dataset from VirusShare [38] and the well-known small dataset from the Malgenome project [37]. We measured the performance of our model and compared it with that of a deep neural network model. In addition, we evaluated our framework in various aspects, including the efficiency of model updates, the usefulness of diverse features, and the effects of our feature representation method. Based on the results of the comparison with other deep learning based methods, we argue that our framework performs well in malware detection.
Our contributions can be summarized as follows:
• We proposed a novel Android malware detection framework using diverse features that can reflect the characteristics of Android applications.
• We suggested feature vector generation methods that can represent malware characteristics effectively even when malware shares many common properties with benign applications.
• We showed how a multimodal neural network can be applied to a malware detection system. Model learning strategies and an online update method for malware detection are proposed. To the best of our knowledge, this research is the first application of multimodal deep learning to Android malware detection.
• We provided various experimental results of our framework to evaluate its performance in various aspects. A total of seven experiments were conducted in this paper.
The rest of the paper is organized as follows: Section II
explains the overall architecture of our Android malware
detection framework and describes how the framework works
in detail, Section III presents the feature types that are used in
our framework, and the multimodal neural network algorithm
is explained in Section IV. Section V shows the experimental
results to show the performance of our framework, and Section
VI discusses related work, followed by Section VII that
summarizes our research and provides future work of this
ongoing research.
Fig. 1. The overall architecture of the proposed framework

II. PROPOSED FRAMEWORK
Fig. 1 shows the overall architecture of our framework. The framework uses seven kinds of features: the string feature, method opcode feature, method API feature, shared library function opcode feature, permission feature, component feature, and environmental feature. Using those features, the seven corresponding feature vectors are generated first, and then the permission, component, and environmental feature vectors are merged into one feature vector. Finally, the resulting five feature vectors are fed to the classification model for malware detection. The framework conducts four major processes for the detection: the raw data extraction process, the feature extraction process, the feature vector generation process, and the detection process. These processes are explained in the following subsections.
A. Raw Data Extraction Process
The raw data extraction process is performed to make
Android APK (Android Package Kit) files interpretable. To
extract the raw data, an APK file is unzipped, and a manifest
file, a dex file, and shared library files are extracted first. The
manifest file and the dex file are decoded or disassembled by
APKtool [32], and the shared library files (i.e. .so files) in the
package can be disassembled by IDA Pro [33].
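The unzip step can be illustrated with a minimal sketch. It assumes the APK is available as bytes and only separates the archive entries; the function name `extract_raw_entries` is hypothetical, and decoding the binary manifest or disassembling the dex and .so files (done with APKtool and IDA Pro in the paper) is not reproduced here.

```python
import io
import zipfile


def extract_raw_entries(apk_bytes):
    """Collect the manifest, dex files, and shared libraries from an APK.

    An APK is a zip archive. This only pulls out the raw entries; they
    still need to be decoded or disassembled by external tools.
    """
    raw = {"manifest": None, "dex": {}, "shared_libs": {}}
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        for name in apk.namelist():
            if name == "AndroidManifest.xml":
                raw["manifest"] = apk.read(name)
            elif name.endswith(".dex"):
                raw["dex"][name] = apk.read(name)
            elif name.startswith("lib/") and name.endswith(".so"):
                raw["shared_libs"][name] = apk.read(name)
    return raw
```

Entries that match none of the three groups, such as resources, are ignored because the framework described here does not use them.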
B. Feature Extraction Process
The feature extraction process is conducted to obtain the
essential feature data from the raw data. The detailed definition
of feature types is explained in Section III.
First, method opcode features and method API features are extracted from smali files, which are the disassembled results of a dex file. Each smali file is separated into method blocks, and the Dalvik opcode frequency of each method is calculated by scanning its Dalvik bytecode. In addition, during the bytecode scanning, each method is checked for invocations of dangerous APIs, and the dangerous API invocation frequency of each method is calculated. In the case of string features, strings are simply collected from all smali files without considering the method separation.
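As a rough illustration of the per-method opcode counting, the sketch below splits smali text on `.method` / `.end method` and counts the first token of each instruction line. The function name is hypothetical, and the parsing is deliberately simplified: it skips directives, comments, and labels, but does not treat array-data payloads specially.

```python
from collections import Counter


def method_opcode_frequencies(smali_text):
    """Map each method signature to a Counter of its Dalvik opcodes."""
    frequencies = {}
    current = None  # signature of the method block being scanned
    for raw_line in smali_text.splitlines():
        line = raw_line.strip()
        if line.startswith(".method"):
            current = line.split()[-1]  # e.g. "onCreate(Landroid/os/Bundle;)V"
            frequencies[current] = Counter()
        elif line.startswith(".end method"):
            current = None
        elif current and line and not line.startswith((".", "#", ":")):
            # The first token of an instruction line is the opcode mnemonic.
            frequencies[current][line.split()[0]] += 1
    return frequencies
```

Each resulting counter can then be laid out as a fixed-length frequency list over the Dalvik opcode set before normalization.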
Shared library function opcode features are extracted from
the instruction sequences of the disassembled code of .so files.
The instruction sequence of each function is scanned to extract
the information of the assembly opcode frequency.
The permission features, the component features, and
environmental features are extracted from the manifest XML
file. While visiting the XML tree nodes, each node’s tag is
checked to confirm whether the node contains the information
about permissions, application components, and so on.
C. Feature Vector Generation Process
The features extracted in the previous process are used to compose feature vectors. Seven kinds of feature vectors are generated from the extracted features. The seven feature vectors are divided into two types according to their feature representations: existence-based feature vectors and similarity-based feature vectors. An existence-based feature vector is a feature vector whose elements indicate only whether each feature in the malicious feature database exists in the application; the string, permission, component, and environmental feature vectors are of this type. In contrast, a similarity-based feature vector is a feature vector whose elements represent the similarity to the malware representatives in the malicious feature database; the method opcode, method API, and shared library function opcode feature vectors are of this type.
The malicious feature database herein is a repository that
contains features and malware representatives of known
malicious applications. The structure of the database is
described in Fig. 5 in APPENDIX B, and each feature is
explained in Section III. In addition, the malware
representatives mean the centroids of the clusters which are
calculated using the K-means clustering algorithm [44].
Algorithms I and II in APPENDIX A explain the processing flows of the feature vector generation. First, as explained in Algorithm I, the existence-based feature vector generation process is simple. The feature values in the malicious feature database correspond to the elements of the feature vector, and every feature value is searched for in the features extracted from the input application. If a feature value is not found in the extracted features, its absence is represented as zero. Otherwise, the existence of the feature value is represented as one in the vector.
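The existence-based generation described above can be sketched in a few lines. The names are hypothetical, and the database is assumed to be an ordered list of feature values so that element positions are stable across applications.

```python
def existence_vector(db_features, app_features):
    """Return a 0/1 vector with one element per database feature value.

    db_features: ordered feature values from the malicious feature database.
    app_features: feature values extracted from the input application.
    """
    extracted = set(app_features)  # set lookup keeps the scan linear
    return [1 if value in extracted else 0 for value in db_features]
```

For example, with a database of three permissions and an application that requests only the second one, the vector is `[0, 1, 0]`. Features of the application that are not in the database do not appear in the vector at all.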
Second, the similarity-based feature vectors are generated as explained in Algorithm II. The method opcode feature, the method API feature, and the shared library function opcode feature used in this feature vector generation process are in the form of lists of frequencies. The frequency values can vary considerably, so the features of an input application are first normalized to the range [0, 1] using the min-max scaling method [45]. Then, each malware representative (the centroid of a cluster) in the malicious feature database is compared with the features of the input application using the Euclidean distance measure. Among the distances computed for each malware representative, the minimum distance is selected and converted to a similarity value, which is recorded in the corresponding element of the feature vector. By recording the highest similarity value for each of the multiple malware representatives, the feature vector contains similarities to multiple cluster centroids, which are computed from known malware applications. Therefore, the similarity-based feature vector can indicate whether the input application’s features belong to the clusters.
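A minimal sketch of this step follows, under several assumptions that the text does not fix: the per-dimension bounds used for min-max scaling are passed in (`lows`, `highs`), each application contributes one frequency vector per method or function, and a distance d is converted to a similarity as 1 / (1 + d). The conversion formula and all names are illustrative choices, not taken from the paper.

```python
import math


def min_max_scale(vector, lows, highs):
    """Scale each frequency into [0, 1] with per-dimension bounds."""
    scaled = []
    for value, low, high in zip(vector, lows, highs):
        if high == low:
            scaled.append(0.0)  # constant dimension carries no information
        else:
            ratio = (value - low) / (high - low)
            scaled.append(min(1.0, max(0.0, ratio)))  # clip unseen extremes
    return scaled


def similarity_vector(method_vectors, centroids, lows, highs):
    """Return one similarity per malware representative (cluster centroid)."""
    scaled = [min_max_scale(v, lows, highs) for v in method_vectors]
    if not scaled:
        return [0.0] * len(centroids)
    similarities = []
    for centroid in centroids:
        # Minimum Euclidean distance between this centroid and any method.
        nearest = min(math.dist(v, centroid) for v in scaled)
        similarities.append(1.0 / (1.0 + nearest))  # assumed conversion
    return similarities
```

With this conversion, a method that coincides with a centroid yields a similarity of exactly 1, and the similarity decreases toward 0 as the nearest method moves away from the centroid.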
To improve the performance of our framework, we refined the feature vector with a predefined threshold value. Similarity values that exceed the predefined similarity threshold are set to one, and all other values are set to zero. This refinement removes the features that have small similarity values because they are not close enough to any malware representative, and it also simplifies the computation in the deep learning process.
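The threshold refinement amounts to binarizing the similarity vector. In the sketch below the function name is hypothetical and the threshold is a parameter, since its value is not given in this section.

```python
def binarize(similarities, threshold):
    """Set similarities above the threshold to 1 and all others to 0."""
    return [1 if s > threshold else 0 for s in similarities]
```

A value exactly equal to the threshold maps to zero here, following the wording that only values exceeding the threshold become one.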
D. Detection Process
After all seven feature vectors are generated in the previous process, the detection process is conducted to determine whether the given application is malicious. Before the feature vectors are examined by the detection model, the permission feature vector, the component feature vector, and the environmental feature vector are merged into a single feature vector. Therefore, our model receives five feature vectors and performs mathematical operations at each layer. Once all operations are complete, the model produces the estimated label for the given input application.
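To make the data flow concrete, the sketch below runs a forward pass through a toy multimodal network: one small initial network per input vector, whose outputs are joined and passed to a final network. It is only an illustration under assumptions. The weights are random and untrained, the layer sizes, activations, and single-layer depth are hypothetical, and merging by concatenation is assumed; the actual model is defined in Section IV.

```python
import math
import random


def dense(inputs, weights, biases, activation):
    """Fully connected layer: activation(W x + b)."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def random_layer(n_in, n_out, rng):
    """Untrained layer with small random weights and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_model(input_sizes, hidden=4, seed=0):
    """One initial network per feature vector plus one final network."""
    rng = random.Random(seed)
    initial = [random_layer(n, hidden, rng) for n in input_sizes]
    final = random_layer(hidden * len(input_sizes), 1, rng)
    return initial, final


def predict(model, feature_vectors):
    """Return a score in (0, 1); higher means more likely malicious."""
    initial, final = model
    merged = []
    for (weights, biases), vector in zip(initial, feature_vectors):
        merged.extend(dense(vector, weights, biases, relu))
    return dense(merged, *final, sigmoid)[0]
```

In use, the permission, component, and environmental vectors would first be joined into one list (for example `permission + component + environmental`), giving five inputs in total, and the score would be compared with a decision threshold to produce the label.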
III. THE DEFINITION OF FEATURES
Diverse features could be helpful to reflect the characteristics
of an application. Even though some features such as
environmental information are not directly related to malicious
activities, these features may contribute to defining the
application characteristics.
Our proposed framework uses the following features:
String feature
Method opcode feature
Method API feature
Shared library function opcode feature
Permission feature
Component feature
Environmental feature
In our framework, the deep learning algorithm is used to
classify the unknown samples into the malware class or the
benign class. The deep learning algorithm generates a neural
network model that can derive the best classification accuracy
by updating the weight of each neuron input. The degree of
influence of the feature on classification is determined
according to the weight of the neurons affected by the feature.
If a feature is insignificant for the classification, the weights of the relevant neurons are reduced. Therefore, each feature is used according to its contribution.
The next subsections explain each feature type that is used in
our framework. It is noted that the features are converted to the
feature vectors to apply them to the neural network.
A. String Feature
The string feature is extracted from the set of string values in smali files. The feature extraction module collects all operand values of the const-string and const-string/jumbo instructions, which are the Dalvik opcodes that move a reference to a string into a specific register. The number of strings in an application spans a wide range, and as the number of applications increases, the number of distinct strings from those applications grows explosively. Therefore, strings are hashed, and a modular operation is applied to the hashed values. The hash function used in the framework is the SHA512 hash function.
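The hashing step can be sketched as follows. The modulus, which fixes the size of the string feature space, is a parameter here because its value is not stated in this section; the function name and the UTF-8 encoding are likewise assumptions.

```python
import hashlib


def string_feature_indices(strings, modulus):
    """Map each string to a bucket index via SHA-512 and a modulo."""
    indices = set()
    for text in strings:
        digest = hashlib.sha512(text.encode("utf-8")).digest()
        indices.add(int.from_bytes(digest, "big") % modulus)
    return indices
```

The resulting indices bound the string feature space to `modulus` values regardless of how many distinct strings appear, at the cost of occasional collisions between unrelated strings.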
B. Method opcode and API Feature
Dalvik opcode frequency and API invocation frequency of
methods may imply application behaviors and coding habits of
the developer. For this reason, Dalvik opcode frequency and
API invocation frequency of methods are used to define the
method features. The method opcode frequency can be
calculated by scanning the bytecode in each method. In the case
of the API invocation frequency, the bytecodes for API
invocation are checked to count the API invocations in each
method. To capture malicious behaviors, invocations of only
selected APIs are counted. The APIs that might be used in
malicious activities are investigated manually using the
Android Developer reference pages [50]. Additionally, the
APIs that were introduced in [35] are also added to the selected
API list. According to [35], those selected APIs are useful for distinguishing between malware and benign applications.
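Counting invocations of selected APIs in one method body might look like the sketch below. The regular expression targets the usual smali `invoke-*` syntax, and the contents of `selected_apis` are supplied by the caller; the paper's actual API list is not reproduced here.

```python
import re
from collections import Counter

# Captures "Lpackage/Class;->method" from a smali invoke-* instruction.
INVOKE = re.compile(r"^\s*invoke-[\w/]+\s+\{[^}]*\},\s*(L[^;]+;->[^\s(]+)")


def api_invocation_counts(method_body, selected_apis):
    """Count calls to the selected APIs inside one smali method body."""
    counts = Counter()
    for line in method_body.splitlines():
        match = INVOKE.match(line)
        if match and match.group(1) in selected_apis:
            counts[match.group(1)] += 1
    return counts
```

Both the plain and the `/range` forms of the invoke instructions are matched, and calls to APIs outside the selected list are ignored.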
C. Shared Library Function Opcode Feature
Android provides the Java Native Interface (JNI) and allows
applications to incorporate native libraries. It is well known that
native code defeats Android security mechanisms because
native code is not covered by the security model. For example,
shared library files can be used to hide malicious behaviors or to evade countermeasures against attacks. For this reason, many malicious applications use native code to attack the Android system.
To prevent malware with native code from hiding its
behaviors, our framework defines and uses the shared library
function features in the detection. Similar to the method feature
extraction, ARM opcode frequency and system call invocation
frequency are extracted from native code. While scanning the
disassembled code of each function, the opcodes and system
call invocations in each function are counted.
D. Permission Feature
Android is a privilege-separated operating system, and an
application runs with a unique system identifier. Android
provides a permission-based access control mechanism to
restrict the operations that a process can perform. In addition,
per-URI permissions are used to grant access to specific data.
To perform a certain behavior, an application must request the necessary permissions from Android, which means that the permissions defined in an application can indicate its behaviors.
The manifest file in the application includes various
information related to permissions. First, the permissions to be
requested when the application is installed are defined in the
manifest file. Second, security permission that can be used to
limit accesses to specific components is also defined to protect
the application. The permission-related information can be
collected by parsing the <uses-permission> tag and the
<permission> tag in the manifest file. The names of the requested permissions are collected from the <uses-permission> tag, and the names, permission groups, and protection levels of the security permissions are collected from the <permission> tag. The extracted requested permissions and security permissions (tuples of name, permission group, and protection level) are used as permission features.
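The tag parsing described above can be sketched with the standard XML parser, assuming the manifest has already been decoded to text XML (for example by APKtool). The function name is hypothetical; the attribute names are the standard `android:name`, `android:permissionGroup`, and `android:protectionLevel`.

```python
import xml.etree.ElementTree as ET

ANDROID_NS = "{http://schemas.android.com/apk/res/android}"


def permission_features(manifest_xml):
    """Return (requested permission names, declared permission tuples)."""
    root = ET.fromstring(manifest_xml)
    requested = [node.get(ANDROID_NS + "name")
                 for node in root.iter("uses-permission")]
    declared = [(node.get(ANDROID_NS + "name"),
                 node.get(ANDROID_NS + "permissionGroup"),
                 node.get(ANDROID_NS + "protectionLevel"))
                for node in root.iter("permission")]
    return requested, declared
```

Attributes that a `<permission>` tag omits come back as `None`, so the tuples keep a fixed shape of name, permission group, and protection level.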
E. Component Feature
Application components are the essential building blocks of an Android application. There are four types of components in an Android application: activity, service, broadcast receiver, and
