Book Chapter

Abstraction-Based Malware Analysis Using Rewriting and Model Checking

Philippe Beaucamps¹, Isabelle Gnaedig², Jean-Yves Marion¹

University of Lorraine¹, French Institute for Research in Computer Science and Automation²

10 Sep 2012-Vol. 7459, pp 806-823

TL;DR: This work uses a rewriting-based abstraction mechanism, producing abstracted forms of program traces, independent of the program implementation, which allows it to handle similar behaviors in a generic way and thus to be robust with respect to variants.


Abstract: We propose a formal approach for the detection of high-level malware behaviors. Our technique uses a rewriting-based abstraction mechanism, producing abstracted forms of program traces, independent of the program implementation. It then allows us to handle similar behaviors in a generic way and thus to be robust with respect to variants. These behaviors, defined as combinations of patterns given in a signature, are detected by model-checking on the high-level representation of the program. We work on unbounded sets of traces, which makes our technique useful not only for dynamic analysis, considering one trace at a time, but also for static analysis, considering a set of traces inferred from a control flow graph. Abstracting traces with rewriting systems on first order terms with variables allows us in particular to model dataflow and to detect information leak.


Summary


1 Introduction

These dynamic abstraction-based approaches, though they can detect unknown viruses whose execution traces exhibit known malicious behaviors, only deal with a single execution trace.
Static behavior analysis by abstraction is more challenging than its dynamic counterpart because, precisely, this approach needs to abstract a program behavior potentially representing an infinite set of execution traces.
An interesting application of static behavior analysis is the audit of programs in high-level technologies, like mobile applications, browser extensions, web page scripts, .NET or Java programs.

Previous work.

In [4], the authors already proposed to abstract program sets of traces with respect to behavior patterns, for detection and analysis.
These samples belonged to known malware families, like Allaple, Virut, Agent, Rbot, Afcore and Mimail.
But patterns were defined by string rewriting systems, which did not allow the actions composing a trace to have parameters, precluding dataflow analysis.
The formalism proposed in this paper addresses both issues: first, the authors handle interleaved patterns by keeping the identified patterns when abstracting them.
Second, the authors extend the rewriting framework to express data constraints on action parameters by using term rewriting systems.

2 Background

The elements of $T_{Trace}(F)$ are called traces, and the elements of $T_{Action}(F)$ are called actions.
The authors distinguish the sort Action from the sort Trace but, for the sake of readability, they may denote by $a$ the trace $\cdot(a, \epsilon)$, for some action $a$.
Similarly, the authors use the $\cdot$ symbol with infix notation and right associativity, and $\epsilon$ is understood when the context is unambiguous.
$\Sigma$ therefore represents the finite set of library calls, while terms built on $F_d$ identify the arguments and the return values of these calls.
Using FOLTL on finite traces strikes a suitable balance between behavior expressivity and decidability.

3 Behavior Patterns

The problem under study can be formalized in the following way.
The authors' goal is then to find an effective and efficient method for solving this problem.
The authors describe a functionality by an FOLTL formula, such that traces satisfying this formula are traces carrying out the functionality.
One way of realizing it consists in calling the socket function with the parameter IPPROTO_ICMP describing the network protocol and then calling the sendto function with the parameter ICMP_ECHOREQ describing the data to be sent.
Between these two calls, the socket should not be freed.
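
The ping example above can be sketched as a check over a single finite trace. This is only an illustration under stated assumptions, not the authors' FOLTL formula: the `Action` record, the function `realizes_ping` and the use of a `closesocket` call to model "freeing" the socket are all hypothetical choices made for this example.

```python
from typing import NamedTuple


class Action(NamedTuple):
    """One library call in a trace (hypothetical encoding for this sketch)."""
    name: str           # called function, e.g. "socket"
    args: tuple         # call arguments
    ret: object = None  # return value, if any


def realizes_ping(trace):
    """Return True if a socket opened with IPPROTO_ICMP is later used by
    sendto to send ICMP_ECHOREQ without having been freed in between."""
    open_icmp = set()  # ICMP sockets that are currently open
    for act in trace:
        if act.name == "socket" and "IPPROTO_ICMP" in act.args:
            open_icmp.add(act.ret)
        elif act.name == "closesocket" and act.args and act.args[0] in open_icmp:
            # Assumption: closesocket(s) is the call that frees socket s.
            open_icmp.discard(act.args[0])
        elif (act.name == "sendto" and len(act.args) >= 2
              and act.args[0] in open_icmp
              and act.args[1] == "ICMP_ECHOREQ"):
            return True
    return False


good = [Action("socket", ("IPPROTO_ICMP",), "s1"),
        Action("sendto", ("s1", "ICMP_ECHOREQ"))]
freed = [Action("socket", ("IPPROTO_ICMP",), "s1"),
         Action("closesocket", ("s1",)),
         Action("sendto", ("s1", "ICMP_ECHOREQ"))]
```

Here `good` realizes the functionality while `freed` does not, because the socket is released before the send. The shared socket value plays the role of the variable that links the two calls in a first-order formula.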

5 Detection Problem

Then the detection problem can be formalized as follows.
The authors want to exclude traces unreliably realizing the abstract behavior in R ≤n (L), while not having to reach normal forms.
The following propositions show that the (m, n)-completeness property is realistic for abstract behaviors considered in practice.
The first step computes the abstract forms of the program traces while the second step applies usual verification techniques in order to decide whether one of the computed traces verifies the FOLTL formula defining the abstract behavior.
The authors therefore show that, in the previous proposition, (m, n)-completeness allows us to nonetheless preserve that decomposition, so that the abstraction step now becomes decidable.

6 Detection Complexity

The detection problem, like the more general problem of program analysis, requires computing a partial abstraction of the set of analyzed traces.
4, the abstraction relation is rational, which entails the decidability of detection.
Using the set of traces n-reliably realizing $M$, when $T_{Action}(F)$ is finite, the authors get the following detection complexity, which is linear in the size of the automaton recognizing the program set of traces, a major improvement on the exponential complexity bound of [17].
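
To give an intuition for a bound that is linear in the size of the program automaton, the sketch below uses the standard product construction between a program automaton and a fixed behavior automaton. It is not the authors' algorithm, which works on abstracted traces and FOLTL formulas; the dictionary encodings and the names `exhibits`, `PROGRAM` and `BEHAVIOR` are assumptions made for this example.

```python
from collections import deque


def exhibits(prog, prog_init, prog_final, beh, beh_init, beh_final):
    """Breadth-first search of the product automaton.

    prog maps a state to a list of (action, next_state) pairs (an NFA).
    beh maps (state, action) to the next state (a DFA).
    Returns True if some accepted program trace is accepted by beh.
    """
    start = (prog_init, beh_init)
    seen, queue = {start}, deque([start])
    while queue:
        p, b = queue.popleft()
        if p in prog_final and b in beh_final:
            return True
        for action, p2 in prog.get(p, ()):
            b2 = beh.get((b, action))
            if b2 is not None and (p2, b2) not in seen:
                seen.add((p2, b2))
                queue.append((p2, b2))
    return False


# Behavior: a "capture" eventually followed by a "send" (3 states).
ALPHABET = ["capture", "noop", "send"]
BEHAVIOR = {}
for a in ALPHABET:
    BEHAVIOR[(0, a)] = 1 if a == "capture" else 0
    BEHAVIOR[(1, a)] = 2 if a == "send" else 1
    BEHAVIOR[(2, a)] = 2

# Program with a loop, hence an unbounded set of traces.
PROGRAM = {0: [("capture", 1)], 1: [("noop", 1), ("send", 2)]}
```

Each product state is visited at most once, so for a behavior automaton of fixed size the work grows linearly with the number of states and transitions of the program automaton, even though the loop in `PROGRAM` yields infinitely many traces.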

7 Information Leak Behaviors

Such a leak can be decomposed into two steps: capturing sensitive information and sending this information to an exogenous location.
The captured data can be keystrokes, passwords or data read from a sensitive network location, while the exogenous location can be the network, a removable device, etc.
Moreover, since the captured data must not be invalidated before being leaked, the authors define a behavior pattern λ inval (x), which represents such an invalidation.
After looking at several malware samples, such as keyloggers, SMS-leaking applications and mobile applications that steal personal information, the authors define the four behavior patterns involved, starting with the keystroke capture functionality.
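
The two-step decomposition of a leak, with the constraint that the data is not invalidated in between, can be illustrated on an already abstracted trace. This is a simplified sketch: the symbol names `"capture"`, `"send"` and `"inval"` and the function `leaks` are hypothetical stand-ins for the paper's abstract functionalities, and data dependencies between values are ignored.

```python
def leaks(abstract_trace):
    """Return True if some datum is captured and later sent while still valid.

    abstract_trace is a list of (symbol, datum) pairs, where symbol is one of
    "capture", "send" or "inval" (names assumed for this sketch).
    """
    live = set()  # captured data that has not been invalidated
    for symbol, datum in abstract_trace:
        if symbol == "capture":
            live.add(datum)
        elif symbol == "inval":
            live.discard(datum)
        elif symbol == "send" and datum in live:
            return True
    return False


leaky = [("capture", "x"), ("send", "x")]
safe = [("capture", "x"), ("inval", "x"), ("send", "x")]
```

Tracking the datum itself, rather than only the order of the symbols, is what requires actions with parameters, which is why the summary stresses term rewriting over string rewriting.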

8 Experiments

The authors' goal is to detect the information leak behavior M defined in the previous section.
In order to perform behavior pattern abstraction and behavior detection in the presence of data, the authors use the CADP toolbox [14], which allows them to manipulate and model-check communicating processes written in the LOTOS language.
First, approximation of conditional branches by nondeterministic branches may result in false positives, especially when the program code is obfuscated.
The first one comes from a study on the detection rate of keylogger programs by existing antivirus products [13], which shows a high failure rate.
Through its file metadata, it then asks the Android system to execute OnReceive on each SMS received or sent.

9 Conclusion

The authors presented an original approach for detecting high-level behaviors in programs, describing combinations of functionalities and defined by first-order temporal logic formulas.
Behavior patterns, expressing concrete realizations of functionalities, are also defined by first-order temporal logic formulas.
Validation of the abstracted traces with respect to some high-level behavior is performed via usual model checking techniques.
Moreover, high-level behaviors and behavior patterns are easy to update since they are expressed in terms of basic blocks.
Applicability of their detection technique could be further enhanced by automating construction of reference behavior patterns, for example using mining techniques as in [9].


HAL Id: hal-00762252

https://hal.inria.fr/hal-00762252

Submitted on 10 Dec 2012



To cite this version:

Philippe Beaucamps, Isabelle Gnaedig, Jean-Yves Marion. Abstraction-based Malware Analysis Using Rewriting and Model Checking. ESORICS - 17th European Symposium on Research in Computer Security - 2012, Sep 2012, Pisa, Italy. pp.806-823, 10.1007/978-3-642-33167-1. hal-00762252.

Philippe Beaucamps, Isabelle Gnaedig, Jean-Yves Marion

Université de Lorraine, LORIA, UMR 7503, Vandoeuvre-lès-Nancy, F-54506, France

Inria, Villers-lès-Nancy, F-54600, France

{Philippe.Beaucamps,Isabelle.Gnaedig,Jean-Yves.Marion}@loria.fr


Keywords: Malware, behavioral detection, behavior abstraction, trace, term rewriting, model checking, first order temporal logic, finite state automaton, formal language.

1 Introduction

Behavior analysis was introduced by Cohen's seminal work [11] to detect malware and in particular unknown malware. In general, a behavior is described by a sequence of system calls and recognition uses the formalism of finite state automata [22, 26, 24, 6]. New approaches have been proposed recently. In [18, 27], malicious behaviors are specified by temporal logic formulas with parameters and detection is carried out by model-checking. However, these approaches are tightly dependent on the way malicious actions are realized: using any other system facility to realize an action allows a malware to go undetected. This has motivated yet another approach where a malicious behavior is specified as a combination of high-level actions, in order to be independent from the way these actions are realized and to only consider their effect on a system. In [23] and in [3], a captured execution trace is transformed into a higher-level representation capturing its semantic meaning, i.e., the trace is first abstracted before being compared to a malicious behavior. In [17], the authors propose to use attribute automata, at the price of an exponential time complexity detection. These dynamic abstraction-based approaches, though they can detect unknown viruses whose execution traces exhibit known malicious behaviors, only deal with a single execution trace.

In this paper, we propose a formal approach for high-level behavior analysis, with the following features. Underpinned by language theory, term rewriting and first-order temporal logic, it allows us to determine whether a program exhibits a high-level behavior. Detection is achieved in two steps. First, traces of the program are abstracted in order to reveal the sequences of high-level functionalities they realize. Then, abstracted traces are compared with the behavior formula, using usual model-checking techniques. Functionalities have parameters representing the manipulated data, so our formalism is adapted to the protection against generic threats like the leak of sensitive information.

Our goal here is not to provide ready-made software to detect behaviors, but to propose a formal framework emphasizing fundamental detection mechanisms, which are independent of implementation-based solutions.

Our approach has two main characteristics. First, we work on an unbounded set of traces representing the behavior of a program, in order to consider a more complete representation of the program than with a single trace. To deal with the infinity of the set of traces, we restrict to regular sets and safely approximate the set of abstract traces, so that we detect in linear time whether a program exhibits a given behavior. Second, we work on abstract forms of traces, in order to only keep the essence of the functions performed by the program, to be independent of their possible implementations and to be generic with respect to behavior mutations. Behavior components are abstracted in program traces, by identifying known functionalities and marking them by inserting abstract functionality symbols.
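
The idea of marking a functionality by inserting an abstract symbol, rather than replacing the matched actions, can be sketched as follows. This is a deliberately simplified illustration: actions are plain strings without parameters, a pattern is a non-empty list of action names matched as a possibly non-contiguous subsequence, and the names `abstract`, `LAMBDA_PING` and `LAMBDA_FILE` are hypothetical.

```python
def abstract(trace, pattern, symbol):
    """Insert `symbol` right after each action that completes `pattern`.

    The original actions are kept, so other patterns interleaved with
    this one can still be recognized by a later pass.
    `pattern` must be a non-empty list of action names.
    """
    out = []
    i = 0  # index of the next pattern element to match
    for action in trace:
        out.append(action)
        if action == pattern[i]:
            i += 1
            if i == len(pattern):
                out.append(symbol)
                i = 0
    return out


trace = ["socket", "open", "sendto", "write"]
step1 = abstract(trace, ["socket", "sendto"], "LAMBDA_PING")
step2 = abstract(step1, ["open", "write"], "LAMBDA_FILE")
```

After the first pass the `open` and `write` actions are still present in `step1`, so the second, interleaved pattern is still found. A scheme that replaced the matched actions would handle this case less directly.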

By working on sets of traces, which may consist of a single trace as well as of an unbounded number of traces, our approach may be used not only for classical, dynamic behavior analysis, but also for static behavior analysis, i.e., behavior analysis in a static analysis setting.

Static behavior analysis by abstraction is more challenging than its dynamic counterpart because, precisely, this approach needs to abstract a program behavior potentially representing an infinite set of execution traces. The construction of an exhaustive representation of a program behavior is an intractable problem in general: in particular, a program flow may not be easily followed due to indirect jumps, and a program may use complex code protection, for instance by dynamically modifying its code or by using obfuscation. Self modification is usually tackled by emulating the program long enough to deactivate most code protections. Indirect jumps and obfuscation are usually handled by abstract interpretation [25, 19] or symbolic execution [7].

Static behavior analysis has many advantages and applications. First, it allows us to analyze the behavior of a program in a more exhaustive way, as it analyzes the unbounded set of the program execution traces, or an approximation of it. Second, static behavior analysis can complement classical, dynamic, behavior analysis with an analysis of the future behavior, to prevent damages when some critical point is reached in an execution.

An interesting application of static behavior analysis is the audit of programs in high-level technologies, like mobile applications, browser extensions, web page scripts, .NET or Java programs. Auditing these programs is complex and mostly manual, resulting in highly publicized infections [2, 1]. In this context, static analysis can provide an appropriate help, because it is usually easier than for usual programs, especially when additionally enforcing a security policy (e.g. prohibiting self-modification [28]) or when enforcing strict development guidelines (e.g. for iPhone applications).

To our knowledge, the use of behavior abstraction on top of static behavior analysis has not been investigated so far. As our detection mechanism relies on satisfaction of temporal logic formulas, it is akin to model checking [21], for which there already exist numerous frameworks and tools [16, 14, 8]. The specificity of our approach, however, is that, rather than being applied on the set of program traces, verification is applied on the set of abstract forms of these traces, which is not computable in general. Accordingly, we identify a property of practical high-level behaviors allowing us to approximate this set, in a sound and complete way with respect to detection, and then to apply classical verification techniques.

Our abstraction framework can be used in two scenarios:

– Detection of given behaviors: signatures of given high-level behaviors are expressed in terms of abstract functionalities. Given some program, we then assess whether one of its execution traces exhibits a sequence of known functionalities, in a way specific to one of the given behaviors. This can be applied to detection of suspicious behaviors. Although detection of such suspicious behaviors may not suffice to label a program as malicious, it can be used to supplement existing detection techniques with additional decision criteria.

– Analysis of programs: abstraction provides a simple and high-level representation of a program behavior, which is more suitable than the original traces for manual analysis, or for analysis of behavior similarity with known behaviors, etc. For instance, it could be used to detect not necessarily harmful behaviors, in order to get a basic understanding of the program and to further investigate if deemed necessary. It could also be used to automatically discover sequences of high-level functionalities and their dataflow dependencies, exhibited by a program.

Previous work. In [4], we already proposed to abstract program sets of traces with respect to behavior patterns, for detection and analysis. We tested our approach on samples of malicious programs collected using a honeypot (the honeypot of Loria's High Security Lab: http://lhs.loria.fr) and identified using Kaspersky Antivirus. These samples belonged to known malware families, like Allaple, Virut, Agent, Rbot, Afcore and Mimail. Most of them were successfully matched to our malware database.

But patterns were defined by string rewriting systems, which did not allow the actions composing a trace to have parameters, precluding dataflow analysis. Moreover, abstraction rules replaced identified patterns by abstraction symbols in the original trace, precluding a further detection of patterns interleaved with the rewritten ones.

The formalism proposed in this paper addresses both issues: first, we handle interleaved patterns by keeping the identified patterns when abstracting them. Second, we extend the rewriting framework to express data constraints on action parameters by using term rewriting systems. An important consequence is that, unlike in [4], using the dataflow, we can detect information leaks in order to prevent unauthorized disclosure or modifications of information.

2 Background

Term Algebras. Let $S = \{Trace, Action, Data\}$ be a set of sorts, and let $F = F_t \cup F_a \cup F_d$ be a finite $S$-sorted signature, where $F_t$, $F_a$, $F_d$ are mutually distinct and:

– $F_t = \{\epsilon, \cdot\}$ is the set of the trace constructors, where $\epsilon : \to Trace$ denotes the empty trace and $\cdot$ has profile $Action \times Trace \to Trace$;

– $F_a$ is a set of function symbols or constants, with profile $Data^n \to Action$, $n \in \mathbb{N}$, describing actions;

– $F_d$ is a set of data constructors, with profile $\to Data$ or $Data^n \to Data$, $n \in \mathbb{N}$.

Let $\mathbb{N}^*$ be the set of finite strings of positive natural numbers, called positions. The empty string is denoted by $\lambda$, and $u \le v$ means that $u$ is a prefix of $v$. Let $X$ be a set of $S$-sorted variables. An $S$-sorted term over $(F, X)$ is a partial function $t : \mathbb{N}^* \to F \cup X$, such that the domain of definition of $t$, denoted by $Pos(t)$, is finite and satisfies, for $w \in \mathbb{N}^*$ and $i \in \mathbb{N}$: (1) $wi \in Pos(t) \Rightarrow w \in Pos(t)$, (2) $w \in Pos(t) \Rightarrow t(w) \in F \cup X$. $Pos(t)$ is called the set of positions of $t$. We denote by $T(F, X)$ (resp. $T(F)$) the set of $S$-sorted terms over $(F, X)$ (resp. the set of finite ground terms over $F$). For any sort $s \in S$, and any of the above sets of terms $T$, we denote by $T_s$ the restriction of $T$ to terms of sort $s$ and by $X_s$ the subset of variables of $X$ of sort $s$. For a term $t$ with $p \in Pos(t)$, we denote by $t|_p$ the subterm of $t$ at position $p$. We denote by $t[t']_p$ the term obtained by replacing by $t'$ the subterm at position $p$ in $t$. We use the abbreviated notation $\overline{x}$ for variables $x_1, \ldots, x_n$. So $\overline{x} \in X$ stands for $x_1, \ldots, x_n \in X$, and if $f \in F$ is a symbol of arity $n \in \mathbb{N}$, we denote by $f(\overline{x})$ the term $f(x_1, \ldots, x_n)$.
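
As a rough illustration of positions, subterms and replacement, the sketch below encodes a term as a nested Python tuple `(symbol, child_1, ..., child_n)` and a position as a tuple of positive integers, with `()` standing for the empty string. The encoding and the function names `positions`, `subterm` and `replace` are assumptions made for this example, not notation from the paper.

```python
def positions(t):
    """Yield every position of term t, starting with the root position ()."""
    yield ()
    for i, child in enumerate(t[1:], start=1):
        for p in positions(child):
            yield (i,) + p


def subterm(t, p):
    """Return the subterm of t at position p."""
    for i in p:
        t = t[i]  # index 0 holds the symbol, so child i is t[i]
    return t


def replace(t, p, new):
    """Return a copy of t whose subterm at position p is replaced by new."""
    if not p:
        return new
    i = p[0]
    return t[:i] + (replace(t[i], p[1:], new),) + t[i + 1:]


# The term f(a, g(b)), with constants encoded as one-element tuples.
term = ("f", ("a",), ("g", ("b",)))
```

With this encoding, the prefix-closure condition (1) holds by construction, since a child position can only be produced by first visiting its parent.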

The elements of $T_{Trace}(F)$ are called traces, the elements of $T_{Action}(F)$ are called actions. We distinguish the sort Action from the sort Trace but, for the sake of readability, we may denote by $a$ the trace $\cdot(a, \epsilon)$, for some action $a$. Similarly, we use the $\cdot$ symbol with infix notation and right associativity, and $\epsilon$ is understood when the context is unambiguous. For instance, if $a, b, c$ are actions, $a \cdot b \cdot c$ denotes the trace $\cdot(a, \cdot(b, \cdot(c, \epsilon)))$.
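
The right-associative reading of the trace constructor can be made concrete with a small sketch. The tuple encoding and the names `EPS`, `cons`, `trace_of` and `actions_of` are hypothetical and only serve to show how the infix notation unfolds into nested terms.

```python
from functools import reduce

EPS = ("eps",)  # the empty trace


def cons(action, trace):
    """The trace constructor: prepend one action to a trace."""
    return (".", action, trace)


def trace_of(*actions):
    """Build the right-nested trace for a sequence of actions."""
    return reduce(lambda tail, act: cons(act, tail), reversed(actions), EPS)


def actions_of(trace):
    """Flatten a trace term back into the list of its actions."""
    out = []
    while trace != EPS:
        _, head, trace = trace
        out.append(head)
    return out


a, b, c = ("a",), ("b",), ("c",)
abc = trace_of(a, b, c)
```

Folding from the right is what makes `trace_of(a, b, c)` nest as a head action followed by the rest of the trace, matching the convention that the empty trace is left implicit at the end.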

We partition $F_a$ into a set $\Sigma$ of symbols, denoting concrete program-level actions, and a set $\Gamma$, denoting abstract actions identifying abstracted functionalities. To construct purely concrete (resp. abstract) terms, we use $F_\Sigma = F \setminus \Gamma$


Frequently Asked Questions (12)

Q1. What contributions have the authors mentioned in the paper "Abstraction-based malware analysis using rewriting and model checking" ?

The authors propose a formal approach for the detection of high-level malware behaviors. The authors work on unbounded sets of traces, which makes their technique useful not only for dynamic analysis, considering one trace at a time, but also for static analysis, considering a set of traces inferred from a control flow graph.

Q2. What is the function that allows on-the-fly model checking of formulas?

CADP features a verification tool that allows on-the-fly model checking of formulas expressed in the MCL language, a fragment of the modal mu-calculus extended with data variables, of which the FOLTL logic used in this paper is a subset.

Q3. What is the main idea behind the behavior analysis?

Underpinned by language theory, term rewriting and first-order temporal logic, it allows us to determine whether a program exhibits a high-level behavior.

Q4. What is the interesting application of static behavior analysis?

An interesting application of static behavior analysis is the audit of programs in high-level technologies, like mobile applications, browser extensions, web page scripts, .NET or Java programs.

Q5. What is the behavior pattern that is used to represent the data?

Since the captured data must not be invalidated before being leaked, the authors define a behavior pattern $\lambda_{inval}(x)$, which represents such an invalidation.

Q6. How do the authors describe a function by an FOLTL formula?

The authors describe a functionality by an FOLTL formula, such that traces satisfying this formula are traces carrying out the functionality.

Q7. What is the general definition of a behavior?

In general, a behavior is described by a sequence of system calls and recognition uses the formalism of finite state automata [22, 26, 24, 6].

Q8. What is the key to the problem of constructing the normal form trace set?

In order to address the general intractability of the problem of constructing the normal form trace set for a given program, the authors have identified a property of practical high-level behaviors allowing us to avoid computing normal forms and yielding a linear time detection algorithm.

Q9. How is the ping behavior pattern in Example 1 defined?

The ping behavior pattern in Example 1 is abstracted in traces by inserting the λping symbol after the send action or after the IcmpSendEcho action.

Q10. What is the purpose of the abstract behavior analysis framework?

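Their abstraction framework can be used in two scenarios: detection of given behaviors, where signatures of given high-level behaviors are expressed in terms of abstract functionalities, and analysis of programs, where abstraction provides a simple and high-level representation of a program behavior.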

Q11. What is the simplest way to prove that a tree transducer is a rational?

The authors show that this is sufficient, with termination of the set of rules, to ensure that the abstraction relation is realizable by a tree transducer, in other words that it is a rational tree transduction.

Q12. What is the motivation behind the behavior analysis approach?

This has motivated yet another approach where a malicious behavior is specified as a combination of high-level actions, in order to be independent from the way these actions are realized and to only consider their effect on a system.
