Book Chapter

PE-Miner: Mining Structural Information to Detect Malicious Executables in Realtime

TL;DR: The results show that the extracted features are robust to different packing techniques and PE-Miner is also resilient to majority of crafty evasion strategies.
Abstract: In this paper, we present an accurate and realtime PE-Miner framework that automatically extracts distinguishing features from portable executables (PE) to detect zero-day (i.e. previously unknown) malware. The distinguishing features are extracted using the structural information standardized by the Microsoft Windows operating system for executables, DLLs and object files. We follow a threefold research methodology: (1) identify a set of structural features for PE files which is computable in realtime, (2) use an efficient preprocessor for removing redundancy in the features' set, and (3) select an efficient data mining algorithm for final classification between benign and malicious executables. We have evaluated PE-Miner on two malware collections, VX Heavens and Malfease datasets which contain about 11 and 5 thousand malicious PE files respectively. The results of our experiments show that PE-Miner achieves more than 99% detection rate with less than 0.5% false alarm rate for distinguishing between benign and malicious executables. PE-Miner has low processing overheads and takes only 0.244 seconds on the average to scan a given PE file. Finally, we evaluate the robustness and reliability of PE-Miner under several regression tests. Our results show that the extracted features are robust to different packing techniques and PE-Miner is also resilient to majority of crafty evasion strategies.

Summary

1 Introduction

  • A number of non-signature based malware detection techniques have been proposed recently.
  • We, therefore, believe that the domain of realtime deployable non-signature based malware detection techniques is still open to novel research.
  • The authors follow a threefold research methodology in their static analysis: (1) identify a set of structural features for PE files which is computable in realtime, (2) use an efficient preprocessor for removing redundancy in the features’ set, and (3) select an efficient data mining algorithm for final classification.
  • The second malware dataset is the Malfease dataset, which contains more than five thousand malicious PE files [21].
  • The authors' experiments also demonstrate that the detection mechanism of PE-Miner does not show any significant bias towards packed/non-packed PE files.

2 PE-Miner Framework

  • The authors discuss their proposed PE-Miner framework.
  • The authors set the following strict requirements on their PE-Miner framework to ensure that their research is enacted with a product development cycle that has a short time-to-market: – It must be a pure non-signature based framework with an ability to detect zero-day malicious PE files.
  • Throughout this text, the terms detection accuracy and Area Under ROC Curve (AUC) are used interchangeably.
  • In their research, the authors systematically raised following relevant questions, analyzed their potential solutions, and finally selected the best one through extensive empirical studies.
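Since the summary treats detection accuracy and AUC as the same quantity, a small sketch may help show how an AUC value can be computed from classifier scores. It sweeps a threshold over the scores to obtain ROC points and integrates them with the trapezoidal rule. The function names `roc_points` and `auc` and the toy scores are illustrative assumptions, not part of PE-Miner.

```python
def roc_points(scores, labels):
    """Return (fp rate, tp rate) points from sweeping a threshold over scores.

    labels use 1 for malicious and 0 for benign; both classes must be present.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    # Go from the strictest threshold to the most lenient one.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    # Toy example: malicious files receive the two highest scores.
    pts = roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
    print(auc(pts))
```

A classifier that ranks every malicious file above every benign one reaches an AUC of 1, which corresponds to tp rate = 1 at fp rate = 0.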

2.1 Feature Extraction

  • Let us revisit the PE file format [12] before the authors start discussing the structural features used in their features’ set.
  • These sections contain the actual data such as code, initialized data, exports, imports and resources [12], [15].
  • Their pilot experimental studies have revealed that using them as individual binary features can reveal more information, and hence can be more helpful in detecting malicious PE files.
  • The size of the initialized data in benign executables is usually significantly higher compared to those of the malicious executables.
  • The Windows specific fields of the optional header include information about the operating system version, the image version, the checksum, the size of the stack and the heap.
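Table 1 of the paper lists 73 binary "DLLs referred" features. The sketch below shows one plausible way to turn the list of DLLs imported by an executable into such a binary vector. The four-entry vocabulary and the function name `dll_binary_features` are assumptions chosen for illustration; the actual 73 DLL names are not given in this excerpt.

```python
# Hypothetical, shortened vocabulary; the paper's feature set uses 73 DLLs.
DLL_VOCABULARY = ["KERNEL32.DLL", "USER32.DLL", "WSOCK32.DLL", "WININET.DLL"]


def dll_binary_features(imported_dlls, vocabulary=DLL_VOCABULARY):
    """Return one 0/1 feature per vocabulary entry (1 = the DLL is referred)."""
    # DLL names are compared case-insensitively.
    imported = {name.upper() for name in imported_dlls}
    return [1 if dll in imported else 0 for dll in vocabulary]


if __name__ == "__main__":
    print(dll_binary_features(["kernel32.dll", "wsock32.dll"]))
```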

2.2 Feature Selection/Preprocessing

  • It is possible that some of the features might not convey useful information in a particular scenario.
  • The authors have used three well-known features’ selection/preprocessing filters.
  • This dimensionality reduction can possibly improve the quality of an analysis on a given data if the dataset consists of highly correlated or redundant features.
  • The principle of this technique is that the most relevant information is stored with the highest coefficients at each order of a transform.
  • The wavelet transform has also been used for dimensionality reduction.
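To make the idea of keeping the highest transform coefficients concrete, here is a minimal sketch of one level of a Haar wavelet transform followed by retention of the largest coefficients. The choice of the Haar wavelet, the orthonormal scaling, and the helper names `haar_step` and `keep_largest` are assumptions for illustration; the summary does not specify how the authors configured their filter.

```python
import math


def haar_step(values):
    """One level of the orthonormal Haar transform on an even-length sequence."""
    if len(values) % 2 != 0:
        raise ValueError("expected an even number of values")
    scale = math.sqrt(2.0)
    pairs = list(zip(values[0::2], values[1::2]))
    approx = [(a + b) / scale for a, b in pairs]  # smoothed signal
    detail = [(a - b) / scale for a, b in pairs]  # local differences
    return approx, detail


def keep_largest(coeffs, k):
    """Zero out all but the k largest-magnitude coefficients."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


if __name__ == "__main__":
    approx, detail = haar_step([4.0, 4.0, 2.0, 0.0])
    print(approx, detail)
    print(keep_largest(approx + detail, 2))
```

Because the transform is orthonormal, the sum of squared coefficients equals the sum of squared inputs, so discarding small coefficients discards little of the signal's energy.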

2.3 Classification

  • Once the dimensionality of the input features’ set is reduced by applying one of the above-mentioned preprocessing filters, it is given as an input to the well-known data mining algorithms for classification.
  • (2) decision tree (J48), (3) Naïve Bayes (NB), (4) inductive rule learner, and (5) support vector machines using sequential minimal optimization (SMO).
  • An interested reader can find their details in the accompanying technical report [23].

3 Datasets

  • The authors have collected 1,447 benign PE files from the local network of their virology lab.
  • Moreover, the authors have combined some categories that have similar functionality.
  • The Malfease collection contains 46.6% packed and 27.2% non-packed malicious PE files.
  • In their collection of benign files, 43.1% are packed and 27.0% are non-packed PE files respectively.
  • The authors speculate that a significant portion of the packed executables are not classified as packed because the signatures of their respective packers are not present in the database of PEiD or Protection ID.

5 Experimental Results

  • The authors have compared their PE-Miner framework with recently proposed promising techniques by Perdisci et al. [18], Schultz et al. [22], and Kolter et al. [11].
  • The authors have used the standard 10 fold cross-validation process in their experiments, i.e., the dataset is randomly divided into 10 smaller subsets, where 9 subsets are used for training and 1 subset is used for testing.
  • The process is repeated 10 times for every combination.
  • This methodology helps in systematically evaluating the effectiveness of their approach to detect previously unknown (i.e. zero-day) malicious PE files.
  • The ROC curves are generated by varying the threshold on output class probability [5], [28].
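The 10-fold cross-validation procedure described above can be sketched as follows. The generator `k_fold_splits` is a hypothetical helper written for this illustration, and the fixed random seed is an assumption made so that the split is reproducible.

```python
import random


def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random division into subsets
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # The remaining k - 1 folds form the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    for train, test in k_fold_splits(25, k=10):
        print(len(train), len(test))
```

Each sample lands in the test fold exactly once, so every file is classified by a model that never saw it during training.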

5.1 Malicious PE File Detection

  • In their first experimental study, the authors attempt to distinguish between benign and malicious PE files.
  • The authors' experiments show that the feature selection process in KM still takes more than 31 seconds per file even with their optimized implementation.
  • The processing overheads of training RIPPER are the highest among all classifiers.
  • The average AUC values of the compared techniques for worms and trojans are approximately 0.95.
  • The results in Table 5 are averaged over 100 runs.

5.2 Miscellaneous Discussions

  • The authors tabulate the AUC and the scan time of the best techniques in Table 7.
  • Moreover, the authors also show the scan time of two well-known COTS AV products for doing the realtime deployable analysis of different non-signature based techniques.
  • One might argue that PE-Miner framework provides only a small improvement in detection accuracy over the KM approach.
  • A significant proportion of malicious PE files have anomalous structure which can crash a naïve PE file parser.

6 Robustness and Reliability Analysis of PE-Miner

  • The authors have now established the fact that PE-Miner is a realtime deployable scheme for zero-day malware detection.
  • A careful reader might ask whether the statement still holds if the “ground truth” is now changed as: (1) the authors cannot trust the classification of signature based packer detectors PEiD and Protection ID, and (2) a “crafty” attacker can forge the features of malicious files with those of benign files to evade detection.

6.1 Robustness Analysis of Extracted Features

  • It is a well-known fact that signature based packer detector PEiD, which the authors are using to distinguish between packed and non-packed executables, has approximately 30% false negative rate [17].
  • The confusion about “ground truth”, however, stems from the fact that a reasonable proportion of packed PE files could be misclassified as non-packed because of the false negative rate of PEiD.
  • The results of their experiments for this scenario are tabulated in Table 10.
  • In the second experiment, the authors train PE-Miner on non-packed benign and malicious PE files and test it on packed benign and malicious PE files.
  • The authors conclude that the detection accuracy of PE-Miner, even in these unrealistic stress testing scenarios, gracefully degrades.

6.2 Reliability of PE-Miner

  • The authors particularly focus their attention on the false negative rate (or miss detection rate)¹⁰ of PE-Miner when they replace features in malicious files with those of benign files.
  • The examples of such strategies could be especially crafted packing techniques, insertion of dummy resources, obfuscation of address pointers, and other information present in headers etc.
  • To this end, the authors have “crafted” malware files in the datasets to contain benign-like features.
  • The authors now analyze the false negative rate of PE-Miner (RFR-J48) across these “crafty” datasets.
  • The results tabulated in Table 11 highlight the robustness of PE-Miner to such crafty attacks.

10 The false negative rate is defined by the fraction of malicious files wrongly classified

  • For both datasets, the average false negative rate is approximately 5% even when 100 out of 189 features are forged.
  • If an attacker tries to randomly forge, using brute-force, the structural features of a PE malware file with those of a benign PE file then he/she will inevitably end up corrupting the executable image.
  • The file will not load successfully into memory.
  • The authors have manually executed the “crafted” malicious executables.
  • This figure proves their hypothesis that the probability of having valid PE files decreases exponentially with an increase in the number of simultaneously forged features.

7 Conclusion

  • PE-Miner leverages the structural information of PE files and the data mining algorithms to provide high detection accuracy with low processing overheads.
  • The authors believe that their PE-Miner framework can be ported to Unix and other non-Windows operating systems.
  • To this end, the authors have identified similar structural features for the ELF file format in Unix and Unix-like operating systems.
  • This dimension of their work will be the subject of forthcoming publications.
  • Finally, the authors are also doing research to develop techniques to fully remove the bias of PE-Miner in detecting packed/non-packed executables [24].


PE-Miner: Mining Structural Information to
Detect Malicious Executables in Realtime
M. Zubair Shafiq¹, S. Momina Tabish¹,², Fauzan Mirza²,¹, Muddassar Farooq¹
¹ Next Generation Intelligent Networks Research Center (nexGIN RC)
National University of Computer & Emerging Sciences (FAST-NUCES)
Islamabad, 44000, Pakistan
{zubair.shafiq,momina.tabish,muddassar.farooq}@nexginrc.org
² School of Electrical Engineering & Computer Science (SEECS)
National University of Sciences & Technology (NUST)
Islamabad, 44000, Pakistan
fauzan.mirza@seecs.edu.pk
Abstract. In this paper, we present an accurate and realtime PE-Miner
framework that automatically extracts distinguishing features from portable
executables (PE) to detect zero-day (i.e. previously unknown) malware.
The distinguishing features are extracted using the structural information
standardized by the Microsoft Windows operating system for executables,
DLLs and object files. We follow a threefold research methodology:
(1) identify a set of structural features for PE files which is computable
in realtime, (2) use an efficient preprocessor for removing redundancy
in the features’ set, and (3) select an efficient data mining
algorithm for final classification between benign and malicious executables.
We have evaluated PE-Miner on two malware collections, VX Heavens
and Malfease datasets which contain about 11 and 5 thousand malicious
PE files respectively. The results of our experiments show that PE-Miner
achieves more than 99% detection rate with less than 0.5% false alarm
rate for distinguishing between benign and malicious executables. PE-
Miner has low processing overheads and takes only 0.244 seconds on the
average to scan a given PE file. Finally, we evaluate the robustness and
reliability of PE-Miner under several regression tests. Our results show
that the extracted features are robust to different packing techniques and
PE-Miner is also resilient to majority of crafty evasion strategies.
Key words: Data Mining, Malicious Executable Detection, Malware
Detection, Portable Executables, Structural Information
1 Introduction
A number of non-signature based malware detection techniques have been pro-
posed recently. These techniques mostly use heuristic analysis, behavior analysis,
or a combination of both to detect malware. Such techniques are being actively
investigated because of their ability to detect zero-day malware without any a
priori knowledge about them. Some of them have been integrated into the existing
Commercial Off the Shelf Anti Virus (COTS AV) products, but have achieved
only limited success [26], [13]. The most important shortcoming of these techniques
is that they are not realtime deployable¹. We, therefore, believe that the
domain of realtime deployable non-signature based malware detection techniques
is still open to novel research.
Non-signature based malware detection techniques are primarily criticized
because of two inherent problems: (1) high fp rate, and (2) large processing over-
heads. Consequently, COTS AV products mostly utilize signature based detec-
tion schemes that provide low fp rate and have acceptable processing overheads.
But it is a well-known fact that signature based malware detection schemes are
unable to detect zero-day malware. We cite two reports to highlight the alarming
rate at which new malware is proliferating. The first report is by Symantec that
shows an increase of 468% in the number of malware from 2006 to 2007 [25].
The second report shows that the number of malware produced in 2007 alone
was more than the total number of malware produced in the last 20 years [6].
These surveys suggest that signature based techniques cannot keep abreast with
the security challenges of the new millennium because not only the size of the
signatures’ database will exponentially increase but also the time of matching
signatures. These bottlenecks are even more relevant on resource constrained
smart phones and mobile devices [3]. We, therefore, envision that in the near future
signature based malware detection schemes will not be able to meet the criterion
of being realtime deployable either.
We argue that a malware detection scheme which is realtime deployable
should use an intelligent yet simple static analysis technique. In this paper we
propose a framework, called PE-Miner, which uses novel structural features to
efficiently detect malicious PE files. PE is a file format which is standardized by
the Microsoft Windows operating systems for executables, dynamically linked
libraries (DLL), and object files. We follow a threefold research methodology in
our static analysis: (1) identify a set of structural features for PE files which
is computable in realtime, (2) use an efficient preprocessor for removing redundancy
in the features’ set, and (3) select an efficient data mining algorithm for
final classification. Consequently, our proposed framework consists of three mod-
ules: the feature extraction module, the feature selection/preprocessing module,
and the detection module.
We have evaluated our proposed detection framework on two independently
collected malware datasets with different statistics. The first malware dataset
is the VX Heavens Virus collection consisting of more than ten thousand mali-
cious PE files [27]. The second malware dataset is the Malfease dataset, which
contains more than five thousand malicious PE files [21]. We also collected more
than one thousand benign PE files from our virology lab, which we use in conjunction
with both malware datasets in our study. The results of our experiments
show that our PE-Miner framework achieves more than 99% detection rate with
less than 0.5% false alarm rate for distinguishing between benign and malicious
executables. Further, our framework takes on average only 0.244 seconds
to scan a given PE file. Therefore, we can conclude that PE-Miner is realtime
deployable, and consequently it can be easily integrated into existing COTS AV
products. PE-Miner framework can also categorize the malicious executables as
a function of their payload. This analysis is of great value for system administrators
and malware forensic experts. An interested reader can find details in
the accompanying technical report [23].
¹ We define a technique as realtime deployable if it has three properties: (1) a tp rate
(or true positive rate) of approximately 1, (2) an fp rate (or false positive rate) of
approximately 0, and (3) the file scanning time is comparable to existing COTS AV.
We have also compared PE-Miner with other promising malware detection
schemes proposed by Perdisci et al. [18], Schultz et al. [22], and Kolter et al.
[11]. These techniques use some variation of n-gram analysis for malware detection.
PE-Miner provides better detection accuracy² with significantly smaller
processing overheads compared with these approaches. We believe that the su-
perior performance of PE-Miner is attributable to a rich set of novel PE format
specific structural features, which provides relevant information for better de-
tection accuracy [10]. In comparison, n-gram based techniques are more suitable
for classification of loosely structured data; therefore, they fail to exploit format
specific structural information of a PE file. As a result, they provide lower de-
tection rates and have higher processing overheads as compared to PE-Miner.
Our experiments also demonstrate that the detection mechanism of PE-Miner
does not show any significant bias towards packed/non-packed PE files. Finally,
we investigate the robustness of PE-Miner against “crafty” attacks which are
specifically designed to evade detection mechanism of PE-Miner. Our results
show that PE-Miner is resilient to majority of such evasion attacks.
2 PE-Miner Framework
In this section, we discuss our proposed PE-Miner framework. We set the follow-
ing strict requirements on our PE-Miner framework to ensure that our research
is enacted with a product development cycle that has a short time-to-market:
– It must be a pure non-signature based framework with an ability to detect
zero-day malicious PE files.
– It must be realtime deployable. To this end, we say that it should have more
than 99% tp rate and less than 1% fp rate. We argue that it is still a challenge
for non-signature based techniques to achieve these true and false positive
rates. Moreover, its time to scan a PE file must be comparable to those of
existing COTS AV products.
– Its design must be modular that allows for the plug-n-play design philosophy.
This feature will be useful in customizing the detection framework to specific
requirements, such as porting it to the file formats used by other operating
systems.
² Throughout this text, the terms detection accuracy and Area Under ROC Curve
(AUC) are used interchangeably. ROC curves are extensively used in machine learning
and data mining to depict the tradeoff between the true positive rate and false
positive rate of a classifier. The AUC (0 ≤ AUC ≤ 1) is used as a yardstick to
determine the detection accuracy from ROC curve. Higher values of AUC mean high
tp rate and low fp rate [28]. At AUC = 1, tp rate = 1 and fp rate = 0.
Fig. 1. The architecture of our PE-Miner framework
Fig. 2. The PE file format
We have evolved the final modular architecture of our PE-Miner framework
in a question oriented engineering fashion. In our research, we systematically
raised the following relevant questions, analyzed their potential solutions, and finally
selected the best one through extensive empirical studies.
1. Which PE format specific features can be statically extracted from PE files
to distinguish between benign and malicious files? Moreover, are the format
specific features better than the existing n-grams or string-based features in
terms of detection accuracy and efficiency?
2. Do we need to deploy preprocessors on the features’ set? If yes then which
preprocessors are best suited for the raw features’ set?
3. Which are the best back-end classification algorithms in terms of detection
accuracy and processing overheads?
Our PE-Miner framework consists of three main modules in line with the
above-mentioned vision: (1) feature extraction, (2) feature preprocessing, and
(3) classification (see Figure 1). We now discuss each module separately.
2.1 Feature Extraction
Let us revisit the PE file format [12] before we start discussing the structural
features used in our features’ set. A PE file consists of a PE file header, a section
table (section headers) followed by the sections’ data. The PE file header consists
of an MS DOS stub, a PE file signature, a COFF (Common Object File Format)
header, and an optional header. It contains important information about a file
such as the number of sections, the size of the stack and the heap, etc. The
section table contains important information about the sections that follow it,
such as their name, offset and size. These sections contain the actual data such
as code, initialized data, exports, imports and resources [12], [15].

Table 1. List of the features extracted from PE files
Feature Description                       Type     Quantity
DLLs referred                             binary   73
COFF file header                          integer  7
Optional header standard fields           integer  9
Optional header Windows specific fields   integer  22
Optional header data directories          integer  30
.text section header fields               integer  9
.data section header fields               integer  9
.rsrc section header fields               integer  9
Resource directory table & resources      integer  21
Total                                              189

Table 2. Mean values of the extracted features. The bold values in every row highlight interesting outliers.
Dataset                         VX Heavens                                                                             Malfease
Name of Feature       Benign    Backdoor    Constructor  DoS +     Flooder   Exploit +   Worm      Trojan    Virus     -
                                + Sniffer   + Virtool    Nuker               Hacktool
WSOCK32.DLL           0.037     0.503       0.038        0.188     0.353     0.261       0.562     0.242     0.053     0.065
WININET.DLL           0.073     0.132       0.009        0.013     0.04      0.141       0.004     0.103     0.019     0.086
# Symbols             430.2     2.0E6       14.7         59.4      25.8      3.5E6       38.8      4.1E6     1.0E6     2.7E7
Maj Linker Ver        4.7       14.4        11.2         14.1      12.1      12.3        18.7      12.2      19.3      6.5
Init Data Size (E5)   4.4       1.1         0.5          0.4       0.8       0.7         0.4       0.4       0.1       0.6
Maj Img Ver           163.1     1.6         6.3          0.4       0.6       11.2        0.3       6.0       53.6      0.2
DLL Char              5.8x10^3  0.0         0.0          0.0       0.0       24.9        0.0       3.1       230.8     18.7
Exp Tbl Size (E2)     13.7      2.4         1.7          14.1      5.0       0.3         1.2       2.1       0.9       0.05
Imp Tbl Size (E2)     5.8       19.2        6.1          7.9       20.8      7.1         23.4      10.3      6.2       4.7
Rsrc Tbl Size (E4)    32.6      5.5         1.5          1.4       6.2       1.0         2.6       2.2       0.5       5.9
Except Tbl Size       12.0      0.0         0.0          0.0       0.0       0.0         0.0       0.0       0.0       3.5
.data Raw Size (E3)   25.2      8.4         5.6          6.3       6.0       7.9         6.1       5.5       6.7       22.1
# Cursors             14.5      6.4         6.7          7.4       6.1       5.9         5.8       6.0       3.0       6.8
# Bitmaps             12.6      1.2         0.0          1.0       0.6       0.7         1.2       1.4       2.4       0.5
# Icons               17.6      2.5         1.9          2.7       2.0       2.1         1.8       1.9       4.5       2.2
# Dialogs             10.9      3.2         1.5          3.2       1.5       2.0         1.9       1.7       2.2       2.3
# Group Cursors       11.6      6.0         6.6          7.2       5.8       5.8         5.4       5.7       2.7       6.7
# Group Icons         4.1       1.0         0.7          1.0       0.8       0.7         0.5       0.7       1.5       0.6
Figure 2 shows an overview of the PE file format [12], [15]. It is important to
note that the section table contains Relative Virtual Addresses (RVAs) and the
pointers to the start of every section. On the other hand, the data directories in
an optional header contain references to various tables (such as import, export,
resource, etc.) present in different sections. These references, if appropriately
analyzed, can provide useful information.
We believe that this structural information about a PE file should be leveraged
to extract features that have the potential to achieve high detection accuracy.
Using this principle, we statically extract a large set of features
from a given PE file³. These features are summarized in Table 1. In the discussion
below, we first intuitively argue about the features that have the potential
to distinguish between benign and malicious files. We then show interesting ob-
servations derived from the executable datasets used in our empirical studies.
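As a rough illustration of static extraction, the sketch below reads the seven COFF file header fields from the raw bytes of a PE image using only the Python standard library. The field names follow Microsoft's PE/COFF layout; the function names `parse_coff_header` and `make_minimal_image`, and the synthetic header used for the demo, are assumptions introduced here and are not the authors' extractor (which, per the footnote, can rely on tools such as dumpbin or pedump).

```python
import struct

# The seven fields of the COFF file header, in on-disk order.
COFF_FIELDS = ("Machine", "NumberOfSections", "TimeDateStamp",
               "PointerToSymbolTable", "NumberOfSymbols",
               "SizeOfOptionalHeader", "Characteristics")


def parse_coff_header(data):
    """Return the COFF header fields of a PE image as a dict of integers."""
    if data[:2] != b"MZ":
        raise ValueError("missing MS-DOS 'MZ' magic")
    # Offset 0x3C of the DOS header stores the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    values = struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    return dict(zip(COFF_FIELDS, values))


def make_minimal_image(num_sections=3, num_symbols=0):
    """Build synthetic header bytes for the demo (not a loadable executable)."""
    dos = bytearray(0x40)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x40)  # PE signature follows the DOS header
    coff = struct.pack("<HHIIIHH", 0x014C, num_sections, 0, 0,
                       num_symbols, 0xE0, 0x0102)
    return bytes(dos) + b"PE\0\0" + coff


if __name__ == "__main__":
    print(parse_coff_header(make_minimal_image(num_sections=4, num_symbols=7)))
```

A production parser would also need to bounds-check every offset, since malformed headers are common in malicious files.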
DLLs referred. The list of DLLs referred in an executable effectively pro-
vides an overview of its functionality. For example, if an executable calls WINSOCK.DLL
³ A well-known Microsoft Visual C++ utility, called dumpbin, dumps the relevant
information which is present inside a given PE file [4]. Another freely available utility,
called pedump, also does the required task [20].

Citations
Posted Content
TL;DR: The authors hope that the dataset, code and baseline model provided by EMBER will help invigorate machine learning research for malware detection, in much the same way that benchmark datasets have advanced computer vision research.
Abstract: This paper describes EMBER: a labeled benchmark dataset for training machine learning models to statically detect malicious Windows portable executable files. The dataset includes features extracted from 1.1M binary files: 900K training samples (300K malicious, 300K benign, 300K unlabeled) and 200K test samples (100K malicious, 100K benign). To accompany the dataset, we also release open source code for extracting features from additional binaries so that additional sample features can be appended to the dataset. This dataset fills a void in the information security machine learning community: a benign/malicious dataset that is large, open and general enough to cover several interesting use cases. We enumerate several use cases that we considered when structuring the dataset. Additionally, we demonstrate one use case wherein we compare a baseline gradient boosted decision tree model trained using LightGBM with default settings to MalConv, a recently published end-to-end (featureless) deep learning model for malware detection. Results show that even without hyper-parameter optimization, the baseline EMBER model outperforms MalConv. The authors hope that the dataset, code and baseline model provided by EMBER will help invigorate machine learning research for malware detection, in much the same way that benchmark datasets have advanced computer vision research.

264 citations


Cites background or methods from "PE-Miner: Mining Structural Informa..."

  • ...PE-Miner aimed to produce a machine-learning based malware detector that exceeded 99% true positive rate (TPR) at less than a 1% false positive rate (FPR), with a runtime comparable to signature-based scanners of the day [30]....

    [...]

  • ...As such, it provides a useful summary of the contents of an executable [30]....

    [...]

Proceedings Article
09 Mar 2016
TL;DR: This paradigm is presented and discussed in the present paper, where emphasis has been given to the phases related to the extraction, and selection of a set of novel features for the effective representation of malware samples.
Abstract: Modern malware is designed with mutation characteristics, namely polymorphism and metamorphism, which causes an enormous growth in the number of variants of malware samples. Categorization of malware samples on the basis of their behaviors is essential for the computer security community, because they receive huge number of malware everyday, and the signature extraction process is usually based on malicious parts characterizing malware families. Microsoft released a malware classification challenge in 2015 with a huge dataset of near 0.5 terabytes of data, containing more than 20K malware samples. The analysis of this dataset inspired the development of a novel paradigm that is effective in categorizing malware variants into their actual family groups. This paradigm is presented and discussed in the present paper, where emphasis has been given to the phases related to the extraction, and selection of a set of novel features for the effective representation of malware samples. Features can be grouped according to different characteristics of malware behavior, and their fusion is performed according to a per-class weighting paradigm. The proposed method achieved a very high accuracy ($\approx$ 0.998) on the Microsoft Malware Challenge dataset.

243 citations

Posted Content
TL;DR: EldeRan, a machine learning approach for dynamically analysing and classifying ransomware, is presented, suggesting that dynamic analysis can support ransomware detection, since ransomware samples exhibit a set of characteristic features at run-time that are common across families, and that helps the early detection of new variants.
Abstract: Recent statistics show that in 2015 more than 140 millions new malware samples have been found. Among these, a large portion is due to ransomware, the class of malware whose specific goal is to render the victim's system unusable, in particular by encrypting important files, and then ask the user to pay a ransom to revert the damage. Several ransomware include sophisticated packing techniques, and are hence difficult to statically analyse. We present EldeRan, a machine learning approach for dynamically analysing and classifying ransomware. EldeRan monitors a set of actions performed by applications in their first phases of installation checking for characteristics signs of ransomware. Our tests over a dataset of 582 ransomware belonging to 11 families, and with 942 goodware applications, show that EldeRan achieves an area under the ROC curve of 0.995. Furthermore, EldeRan works without requiring that an entire ransomware family is available beforehand. These results suggest that dynamic analysis can support ransomware detection, since ransomware samples exhibit a set of characteristic features at run-time that are common across families, and that helps the early detection of new variants. We also outline some limitations of dynamic analysis for ransomware and propose possible solutions.

199 citations

Proceedings Article
11 Aug 2013
TL;DR: This work explores techniques that can automatically classify malware variants into their corresponding families and adopts an ensemble of classifiers for automated malware classification.
Abstract: The voluminous malware variants that appear in the Internet have posed severe threats to its security. In this work, we explore techniques that can automatically classify malware variants into their corresponding families. We present a generic framework that extracts structural information from malware programs as attributed function call graphs, in which rich malware features are encoded as attributes at the function level. Our framework further learns discriminant malware distance metrics that evaluate the similarity between the attributed function call graphs of two malware programs. To combine various types of malware attributes, our method adaptively learns the confidence level associated with the classification capability of each attribute type and then adopts an ensemble of classifiers for automated malware classification. We evaluate our approach with a number of Windows-based malware instances belonging to 11 families, and experimental results show that our automated malware classification method is able to achieve high classification accuracy.

144 citations

Proceedings ArticleDOI
20 May 2018
TL;DR: The design and implementation details of the first malware analysis pipeline specifically tailored for Linux malware are presented and the first large-scale measurement study conducted on 10,548 malware samples is presented documenting detailed statistics and insights that can help direct future work in the area.
Abstract: For the past two decades, the security community has been fighting malicious programs for Windows-based operating systems. However, the recent surge in adoption of embedded devices and the IoT revolution are rapidly changing the malware landscape. Embedded devices are profoundly different than traditional personal computers. In fact, while personal computers run predominantly on x86-flavored architectures, embedded systems rely on a variety of different architectures. In turn, this aspect causes a large number of these systems to run some variants of the Linux operating system, pushing malicious actors to give birth to "Linux malware." To the best of our knowledge, there is currently no comprehensive study attempting to characterize, analyze, and understand Linux malware. The majority of resources on the topic are available as sparse reports often published as blog posts, while the few systematic studies focused on the analysis of specific families of malware (e.g., the Mirai botnet) mainly by looking at their network-level behavior, thus leaving the main challenges of analyzing Linux malware unaddressed. This work constitutes the first step towards filling this gap. After a systematic exploration of the challenges involved in the process, we present the design and implementation details of the first malware analysis pipeline specifically tailored for Linux malware. We then present the results of the first large-scale measurement study conducted on 10,548 malware samples (collected over a time frame of one year) documenting detailed statistics and insights that can help direct future work in the area.

137 citations

References
Book
15 Oct 1992
TL;DR: A complete guide to the C4.5 system as implemented in C for the UNIX environment, which starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting.
Abstract: From the Publisher: Classifier systems play a major role in machine learning and knowledge-based systems, and Ross Quinlan's work on ID3 and C4.5 is widely acknowledged to have made some of the most significant contributions to their development. This book is a complete guide to the C4.5 system as implemented in C for the UNIX environment. It contains a comprehensive guide to the system's use, the source code (about 8,800 lines), and implementation notes. The source code and sample datasets are also available on a 3.5-inch floppy diskette for a Sun workstation. C4.5 starts with large sets of cases belonging to known classes. The cases, described by any mixture of nominal and numeric properties, are scrutinized for patterns that allow the classes to be reliably discriminated. These patterns are then expressed as models, in the form of decision trees or sets of if-then rules, that can be used to classify new cases, with emphasis on making the models understandable as well as accurate. The system has been applied successfully to tasks involving tens of thousands of cases described by hundreds of properties. The book starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting. Advantages and disadvantages of the C4.5 approach are discussed and illustrated with several case studies. This book and software should be of interest to developers of classification-based intelligent systems and to students in machine learning and expert systems courses.

21,674 citations

Book
25 Oct 1999
TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.
Abstract: Data Mining: Practical Machine Learning Tools and Techniques offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining. Thorough updates reflect the technical changes and modernizations that have taken place in the field since the last edition, including new material on Data Transformations, Ensemble Learning, Massive Data Sets, Multi-instance Learning, plus a new version of the popular Weka machine learning software developed by the authors. Witten, Frank, and Hall include both tried-and-true techniques of today as well as methods at the leading edge of contemporary research.
* Provides a thorough grounding in machine learning concepts as well as practical advice on applying the tools and techniques to your data mining projects
* Offers concrete tips and techniques for performance improvement that work by transforming the input or output in machine learning methods
* Includes downloadable Weka software toolkit, a collection of machine learning algorithms for data mining tasks, in an updated, interactive interface. Algorithms in toolkit cover: data pre-processing, classification, regression, clustering, association rules, visualization

20,196 citations

Book
01 Jan 2008
TL;DR: In this paper, generalized estimating equations (GEE) with computing using PROC GENMOD in SAS and multilevel analysis of clustered binary data using generalized linear mixed-effects models with PROC LOGISTIC are discussed.
Abstract: tic regression, and it concerns studying the effect of covariates on the risk of disease. The chapter includes generalized estimating equations (GEE’s) with computing using PROC GENMOD in SAS and multilevel analysis of clustered binary data using generalized linear mixed-effects models with PROC LOGISTIC. As a prelude to the following chapter on repeated-measures data, Chapter 5 presents time series analysis. The material on repeated-measures analysis uses linear additive models with GEE’s and PROC MIXED in SAS for linear mixed-effects models. Chapter 7 is about survival data analysis. All computing throughout the book is done using SAS procedures.

9,995 citations

Proceedings Article
10 Apr 2005
TL;DR: QEMU supports full system emulation in which a complete and unmodified operating system is run in a virtual machine and Linux user mode emulation where a Linux process compiled for one target CPU can be run on another CPU.
Abstract: We present the internals of QEMU, a fast machine emulator using an original portable dynamic translator. It emulates several CPUs (x86, PowerPC, ARM and Sparc) on several hosts (x86, PowerPC, ARM, Sparc, Alpha and MIPS). QEMU supports full system emulation in which a complete and unmodified operating system is run in a virtual machine and Linux user mode emulation where a Linux process compiled for one target CPU can be run on another CPU.

2,420 citations

Journal ArticleDOI
TL;DR: This article serves both as a tutorial introduction to ROC graphs and as a practical guide for using them in research.
Abstract: Receiver Operating Characteristics (ROC) graphs are a useful technique for organizing classifiers and visualizing their performance. ROC graphs are commonly used in medical decision making, and in recent years have been increasingly adopted in the machine learning and data mining research communities. Although ROC graphs are apparently simple, there are some common misconceptions and pitfalls when using them in practice. This article serves both as a tutorial introduction to ROC graphs and as a practical guide for using them in research.

2,046 citations


"PE-Miner: Mining Structural Informa..." refers background in this paper

  • ...The ROC curves are generated by varying the threshold on output class probability [5], [28]....

    [...]
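The threshold sweep mentioned in the excerpt above can be illustrated with a minimal sketch. It assumes a classifier that outputs a probability for the malicious class; the function names `roc_points` and `auc` and the toy scores are hypothetical and are not taken from the PE-Miner paper. Each distinct score acts as one threshold, and the area is computed with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs.

    scores: predicted probability of the positive (malicious) class
    labels: 1 for positive, 0 for negative
    Hypothetical helper for illustration only.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative example")

    # Lowering the threshold past each score, highest first,
    # admits one more example as "predicted positive".
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    prev_score = None
    for score, label in ranked:
        # Emit a point only when the score changes, so tied
        # scores are admitted together as a single threshold step.
        if prev_score is not None and score != prev_score:
            points.append((fp / neg, tp / pos))
        if label == 1:
            tp += 1
        else:
            fp += 1
        prev_score = score
    points.append((fp / neg, tp / pos))  # threshold below every score
    return points


def auc(points):
    """Area under the ROC curve via the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    # Toy data, made up for illustration.
    toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
    toy_labels = [1, 1, 0, 1, 0, 0]
    curve = roc_points(toy_scores, toy_labels)
    print(curve)
    print(auc(curve))
```

Grouping tied scores into one step keeps the curve independent of the order in which equal-scored examples happen to be sorted.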

Frequently Asked Questions (1)
Q1. What are the contributions mentioned in the paper "PE-Miner: Mining Structural Information to Detect Malicious Executables in Realtime"?

In this paper, the authors present an accurate and realtime PE-Miner framework that automatically extracts distinguishing features from portable executables (PE) to detect zero-day (i.e. previously unknown) malware. The authors follow a threefold research methodology: (1) identify a set of structural features for PE files which is computable in realtime, (2) use an efficient preprocessor for removing redundancy in the features' set, and (3) select an efficient data mining algorithm for final classification between benign and malicious executa-