All You Ever Wanted to Know about Dynamic Taint Analysis and Forward Symbolic Execution (but Might Have Been Afraid to Ask)

Abstract
Dynamic taint analysis and forward symbolic execution are quickly becoming staple techniques in security analyses. Example applications of dynamic taint analysis and forward symbolic execution include malware analysis, input filter generation, test case generation, and vulnerability discovery. Despite the widespread usage of these two techniques, there has been little effort to formally define the algorithms and summarize the critical issues that arise when these techniques are used in typical security contexts. The contributions of this paper are two-fold. First, we precisely describe the algorithms for dynamic taint analysis and forward symbolic execution as extensions to the run-time semantics of a general language. Second, we highlight important implementation choices, common pitfalls, and considerations when using these techniques in a security context.


Edward J. Schwartz, Thanassis Avgerinos, David Brumley
Carnegie Mellon University
Pittsburgh, PA
{edmcman, thanassis, dbrumley}@cmu.edu
Keywords: taint analysis, symbolic execution, dynamic analysis
I. INTRODUCTION
Dynamic analysis, the ability to monitor code as it
executes, has become a fundamental tool in computer
security research. Dynamic analysis is attractive because
it allows us to reason about actual executions, and thus
can perform precise security analysis based upon run-time
information. Further, dynamic analysis is simple: we need
only consider facts about a single execution at a time.
Two of the most commonly employed dynamic analysis
techniques in security research are dynamic taint analysis
and forward symbolic execution. Dynamic taint analysis runs
a program and observes which computations are affected
by predefined taint sources such as user input. Dynamic
forward symbolic execution automatically builds a logical
formula describing a program execution path, which reduces
the problem of reasoning about the execution to the domain
of logic. The two analyses can be used in conjunction to
build formulas representing only the parts of an execution
that depend upon tainted values.
The number of security applications utilizing these two
techniques is enormous. Example security research areas
employing either dynamic taint analysis, forward symbolic
execution, or a mix of the two, are:
1) Unknown Vulnerability Detection. Dynamic taint
analysis can look for misuses of user input during an
execution. For example, dynamic taint analysis can be
used to prevent code injection attacks by monitoring
whether user input is executed [23-25, 50, 59].
2) Automatic Input Filter Generation. Forward sym-
bolic execution can be used to automatically generate
input filters that detect and remove exploits from the
input stream [14, 22, 23]. Filters generated in response
to actual executions are attractive because they provide
strong accuracy guarantees [14].
3) Malware Analysis. Taint analysis and forward sym-
bolic execution are used to analyze how information
flows through a malware binary [7, 8, 65], explore
trigger-based behavior [12, 45], and detect emula-
tors [58].
4) Test Case Generation. Taint analysis and forward
symbolic execution are used to automatically generate
inputs to test programs [17, 19, 36, 57], and can
generate inputs that cause two implementations of the
same protocol to behave differently [10, 17].
Given the large number and variety of application do-
mains, one would imagine that implementing dynamic taint
analysis and forward symbolic execution would be a text-
book problem. Unfortunately this is not the case. Previous
work has focused on how these techniques can be applied
to solve security problems, but has left it as out of scope to
give exact algorithms, implementation choices and pitfalls.
As a result, researchers seeking to use these techniques often
rediscover the same limitations, implementation tricks, and
trade-offs.
The goals and contributions of this paper are two-fold.
First, we formalize dynamic taint analysis and forward
symbolic execution as found in the security domain. Our
formalization rests on the intuition that run-time analyses
can precisely and naturally be described in terms of the
formal run-time semantics of the language. This formal-
ization provides a concise and precise way to define each
analysis, and suggests a straightforward implementation. We
then show how our formalization can be used to tease out
and describe common implementation details, caveats, and
choices as found in various security applications.

program ::= stmt*
stmt s  ::= var := exp | store(exp, exp)
          | goto exp | assert exp
          | if exp then goto exp else goto exp
exp e   ::= load(exp) | exp ◇_b exp | ◇_u exp
          | var | get_input(src) | v
◇_b     ::= typical binary operators
◇_u     ::= typical unary operators
value v ::= 32-bit unsigned integer
Table I: A simple intermediate language (SIMPIL).
II. FIRST STEPS: A GENERAL LANGUAGE
A. Overview
A precise definition of dynamic taint analysis or forward
symbolic execution must target a specific language. For
the purposes of this paper, we use SIMPIL: a Simple
Intermediate Language. The grammar of SIMPIL is pre-
sented in Table I. Although the language is simple, it is
powerful enough to express typical languages as varied as
Java [31] and assembly code [1, 2]. Indeed, the language is
representative of internal representations used by compilers
for a variety of programming languages [3].
A program in our language consists of a sequence of
numbered statements. Statements in our language consist
of assignments, assertions, jumps, and conditional jumps.
Expressions in SIMPIL are side-effect free (i.e., they do
not change the program state). We use ◇_b to represent
typical binary operators, e.g., you can fill in the box with
operators such as addition, subtraction, etc. Similarly, ◇_u
represents unary operators such as logical negation. The
statement get_input(src) returns input from source src. We
use a dot (·) to denote an argument that is ignored, e.g.,
we will write get_input(·) when the exact input source is
not relevant. For simplicity, we consider only expressions
(constants, variables, etc.) that evaluate to 32-bit integer
values; extending the language and rules to additional types
is straightforward.
For the sake of simplicity, we omit the type-checking
semantics of our language and assume things are well-typed
in the obvious way, e.g., that binary operands are integers
or variables, not memories, and so on.
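As an illustration only, the grammar of Table I can be sketched as a set of Python data classes, one per production. Python is an assumption of this sketch (the paper defines SIMPIL abstractly), and every class and field name below is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Const:          # value v: a 32-bit unsigned integer
    value: int


@dataclass(frozen=True)
class Var:            # var
    name: str


@dataclass(frozen=True)
class Load:           # load(exp)
    addr: "Exp"


@dataclass(frozen=True)
class UnOp:           # unary operator applied to exp
    op: str
    arg: "Exp"


@dataclass(frozen=True)
class BinOp:          # exp binary-operator exp
    op: str
    left: "Exp"
    right: "Exp"


@dataclass(frozen=True)
class GetInput:       # get_input(src)
    src: str


Exp = Union[Const, Var, Load, UnOp, BinOp, GetInput]


@dataclass(frozen=True)
class Assign:         # var := exp
    var: str
    exp: Exp


@dataclass(frozen=True)
class Store:          # store(exp, exp)
    addr: Exp
    value: Exp


@dataclass(frozen=True)
class Goto:           # goto exp
    target: Exp


@dataclass(frozen=True)
class Assert:         # assert exp
    cond: Exp


@dataclass(frozen=True)
class CondGoto:       # if exp then goto exp else goto exp
    cond: Exp
    if_true: Exp
    if_false: Exp


Stmt = Union[Assign, Store, Goto, Assert, CondGoto]
Program = List[Stmt]

# A one-statement program, x := 2 * get_input(src), written as a tree.
# The source name "stdin" is a placeholder.
program: Program = [Assign("x", BinOp("*", Const(2), GetInput("stdin")))]
```

Frozen data classes keep expressions immutable, which matches the requirement that SIMPIL expressions are side-effect free.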
B. Operational Semantics
The operational semantics of a language specify unam-
biguously how to execute a program written in that language.
Context   Meaning
Σ         Maps a statement number to a statement
µ         Maps a memory address to the current value at that address
∆         Maps a variable name to its value
pc        The program counter
ι         The next instruction
Figure 2: The meta-syntactic variables used in the execution
context.
Because dynamic program analyses are defined in terms
of actual program executions, operational semantics also
provide a natural way to define a dynamic analysis. However,
before we can specify program analyses, we must first define
the base operational semantics.
The complete operational semantics for SIMPIL are
shown in Figure 1. Each statement rule is of the form:
\[
\frac{\text{computation}}
     {\langle \text{current state} \rangle,\ \text{stmt} \leadsto \langle \text{end state} \rangle,\ \text{stmt}'}
\]
Rules are read bottom to top, left to right. Given a statement,
we pattern-match the statement to find the applicable rule,
e.g., given the statement x := e, we match to the ASSIGN
rule. We then apply the computation given in the top of
the rule, and if successful, transition to the end state. If
no rule matches (or the computation in the premise fails),
then the machine halts abnormally. For instance, jumping to
an address not in the domain of Σ would cause abnormal
termination.
The execution context is described by five parameters: the
list of program statements (Σ), the current memory state (µ),
the current value for variables (∆), the program counter (pc),
and the current statement (ι). The Σ, µ, and ∆ contexts are
maps, e.g., ∆[x] denotes the current value of variable x. We
denote updating a context variable x with value v as x ← v,
e.g., ∆[x ← 10] denotes setting the value of variable x to the
value 10 in context ∆. A summary of the five meta-syntactic
variables is shown in Figure 2.
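A minimal sketch of the execution context, assuming Python dictionaries stand in for the maps Σ, µ, and ∆. The names `Context` and `updated` are hypothetical, and statements are kept as plain strings only to keep the sketch short. The update ∆[x ← 10] is modeled as building a new map, so the state before a transition stays intact.

```python
from dataclasses import dataclass, replace
from typing import Dict


def updated(mapping, key, value):
    """Return mapping[key <- value] as a new dict; the original is untouched."""
    new = dict(mapping)
    new[key] = value
    return new


@dataclass(frozen=True)
class Context:
    sigma: Dict[int, str]   # Σ: statement number -> statement
    mu: Dict[int, int]      # µ: memory address -> value
    delta: Dict[str, int]   # ∆: variable name -> value
    pc: int                 # program counter

    @property
    def iota(self):         # ι: the statement selected by pc
        return self.sigma[self.pc]


before = Context(sigma={1: "x := 10", 2: "goto 1"}, mu={}, delta={}, pc=1)

# One transition for "x := 10": ∆' = ∆[x <- 10] and pc advances by one.
after = replace(before, delta=updated(before.delta, "x", 10), pc=before.pc + 1)
```

Here `before.delta` is still empty after the transition, while `after.delta` maps `x` to 10 and `after.iota` is the statement at the new program counter.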
In our evaluation rules, the program context Σ does not
change between transitions. The implication is that our oper-
ational semantics do not allow programs with dynamically
generated code. However, adding support for dynamically
generated code is straightforward. We discuss how SIMPIL
can be augmented to support dynamically generated code
and other higher-level language features in Section II-C.
The evaluation rules for expressions use a similar notation.
We denote by µ, ∆ ⊢ e ⇓ v evaluating an expression e
to a value v in the current state given by µ and ∆. The
expression e is evaluated by matching e to an expression
evaluation rule and performing the attached computation.

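\[
\frac{v \text{ is input from } \mathit{src}}
     {\mu, \Delta \vdash \text{get\_input}(\mathit{src}) \Downarrow v}\ \text{INPUT}
\qquad
\frac{\mu, \Delta \vdash e \Downarrow v_1 \quad v = \mu[v_1]}
     {\mu, \Delta \vdash \text{load } e \Downarrow v}\ \text{LOAD}
\qquad
\frac{}{\mu, \Delta \vdash \mathit{var} \Downarrow \Delta[\mathit{var}]}\ \text{VAR}
\]
\[
\frac{\mu, \Delta \vdash e \Downarrow v \quad v' = \Diamond_u v}
     {\mu, \Delta \vdash \Diamond_u e \Downarrow v'}\ \text{UNOP}
\qquad
\frac{\mu, \Delta \vdash e_1 \Downarrow v_1 \quad \mu, \Delta \vdash e_2 \Downarrow v_2 \quad v' = v_1 \Diamond_b v_2}
     {\mu, \Delta \vdash e_1 \Diamond_b e_2 \Downarrow v'}\ \text{BINOP}
\qquad
\frac{}{\mu, \Delta \vdash v \Downarrow v}\ \text{CONST}
\]
\[
\frac{\mu, \Delta \vdash e \Downarrow v \quad \Delta' = \Delta[\mathit{var} \leftarrow v] \quad \iota = \Sigma[pc + 1]}
     {\Sigma, \mu, \Delta, pc, \mathit{var} := e \leadsto \Sigma, \mu, \Delta', pc + 1, \iota}\ \text{ASSIGN}
\]
\[
\frac{\mu, \Delta \vdash e \Downarrow v_1 \quad \iota = \Sigma[v_1]}
     {\Sigma, \mu, \Delta, pc, \text{goto } e \leadsto \Sigma, \mu, \Delta, v_1, \iota}\ \text{GOTO}
\]
\[
\frac{\mu, \Delta \vdash e \Downarrow 1 \quad \mu, \Delta \vdash e_1 \Downarrow v_1 \quad \iota = \Sigma[v_1]}
     {\Sigma, \mu, \Delta, pc, \text{if } e \text{ then goto } e_1 \text{ else goto } e_2 \leadsto \Sigma, \mu, \Delta, v_1, \iota}\ \text{TCOND}
\]
\[
\frac{\mu, \Delta \vdash e \Downarrow 0 \quad \mu, \Delta \vdash e_2 \Downarrow v_2 \quad \iota = \Sigma[v_2]}
     {\Sigma, \mu, \Delta, pc, \text{if } e \text{ then goto } e_1 \text{ else goto } e_2 \leadsto \Sigma, \mu, \Delta, v_2, \iota}\ \text{FCOND}
\]
\[
\frac{\mu, \Delta \vdash e_1 \Downarrow v_1 \quad \mu, \Delta \vdash e_2 \Downarrow v_2 \quad \iota = \Sigma[pc + 1] \quad \mu' = \mu[v_1 \leftarrow v_2]}
     {\Sigma, \mu, \Delta, pc, \text{store}(e_1, e_2) \leadsto \Sigma, \mu', \Delta, pc + 1, \iota}\ \text{STORE}
\]
\[
\frac{\mu, \Delta \vdash e \Downarrow 1 \quad \iota = \Sigma[pc + 1]}
     {\Sigma, \mu, \Delta, pc, \text{assert}(e) \leadsto \Sigma, \mu, \Delta, pc + 1, \iota}\ \text{ASSERT}
\]
Figure 1: Operational semantics of SIMPIL.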
Most of the evaluation rules break the expression down into
simpler expressions, evaluate the subexpressions, and then
combine the resulting evaluations.
Example 1. Consider evaluating the following program:
1 x := 2 * get_input(·)
The evaluation for this program is shown in Figure 3 for
the input of 20. Notice that since the ASSIGN rule requires
the expression e in var := e to be evaluated, we had to
recurse to other rules (BINOP, INPUT, CONST) to evaluate
the expression 2 * get_input(·) to the value 40.
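The rule-matching procedure can be sketched as a small interpreter. This is a minimal sketch under stated assumptions, not the paper's implementation. Python, the tuple encoding of statements and expressions, and the names `eval_exp`, `step`, `run`, and `AbnormalHalt` are all hypothetical. Unary operators are omitted for brevity, and running past the last statement is treated as the end of the run, a simplification this sketch makes.

```python
MASK32 = 0xFFFFFFFF  # SIMPIL values are 32-bit unsigned integers

BINOPS = {
    "+": lambda a, b: (a + b) & MASK32,
    "-": lambda a, b: (a - b) & MASK32,
    "*": lambda a, b: (a * b) & MASK32,
}


class AbnormalHalt(Exception):
    """Raised when no rule matches or a premise of the matching rule fails."""


def eval_exp(e, mu, delta, get_input):
    """Evaluate expression e to a value under memory mu and variables delta."""
    tag = e[0]
    if tag == "const":                                   # CONST
        return e[1] & MASK32
    if tag == "var" and e[1] in delta:                   # VAR
        return delta[e[1]]
    if tag == "input":                                   # INPUT
        return get_input(e[1]) & MASK32
    if tag == "load":                                    # LOAD
        v1 = eval_exp(e[1], mu, delta, get_input)
        if v1 in mu:
            return mu[v1]
    if tag == "binop" and e[1] in BINOPS:                # BINOP
        v1 = eval_exp(e[2], mu, delta, get_input)
        v2 = eval_exp(e[3], mu, delta, get_input)
        return BINOPS[e[1]](v1, v2)
    raise AbnormalHalt(f"no expression rule applies to {e!r}")


def step(sigma, mu, delta, pc, get_input):
    """Apply the rule matching statement sigma[pc]; return (mu', delta', pc')."""
    s = sigma[pc]
    ev = lambda e: eval_exp(e, mu, delta, get_input)
    if s[0] == "assign":                                 # ASSIGN
        return mu, {**delta, s[1]: ev(s[2])}, pc + 1
    if s[0] == "store":                                  # STORE
        return {**mu, ev(s[1]): ev(s[2])}, delta, pc + 1
    if s[0] == "goto":                                   # GOTO
        target = ev(s[1])
        if target in sigma:                              # premise: ι = Σ[v1]
            return mu, delta, target
    if s[0] == "cond":                                   # TCOND / FCOND
        c = ev(s[1])
        if c in (0, 1):
            target = ev(s[2]) if c == 1 else ev(s[3])
            if target in sigma:
                return mu, delta, target
    if s[0] == "assert" and ev(s[1]) == 1:               # ASSERT
        return mu, delta, pc + 1
    raise AbnormalHalt(f"no statement rule applies to {s!r}")


def run(sigma, get_input, pc=1):
    """Step until pc leaves the program; return the final (mu, delta, pc)."""
    mu, delta = {}, {}
    while pc in sigma:
        mu, delta, pc = step(sigma, mu, delta, pc, get_input)
    return mu, delta, pc


# Example 1 with the input fixed to 20.
example1 = {1: ("assign", "x", ("binop", "*", ("const", 2), ("input", "stdin")))}
final_mu, final_delta, final_pc = run(example1, lambda src: 20)
```

With the input fixed to 20, the run ends with `x` bound to 40, which agrees with the derivation in Figure 3. A `goto` whose target is outside Σ raises `AbnormalHalt`, mirroring the abnormal termination described in the text.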
C. Language Discussion
We have designed our language to demonstrate the critical
aspects of dynamic taint analysis and forward symbolic
execution. We do not include some high-level language
constructs such as functions or scopes for simplicity and
space reasons. This omission does not fundamentally limit
the capability of our language or our results. Adding such
constructs is straightforward. For example, two approaches
are:
1) Compile missing high-level language constructs down
to our language. For instance, functions, buffers and
user-level abstractions can be compiled down to
SIMPIL statements instead of assembly-level instruc-
tions. Tools such as BAP [1] and BitBlaze [2] already
use a variant of SIMPIL to perform analyses. BAP is
freely available at http://bap.ece.cmu.edu.
Example 2. Function calls in high-level code can
be compiled down to SIMPIL by storing the return
address and transferring control flow. The following
code calls and returns from the function at line 9.
1  /* Caller function */
2  esp := esp + 4
3  store(esp, 6) /* retaddr is 6 */
4  goto 9
5  /* The call will return here */
6  halt
7
8  /* Callee function */
9  ...
10 goto load(esp)
We assume this choice throughout the paper since
previous dynamic analysis work has already demon-
strated that such languages can be used to reason about
programs written in any language.
2) Add higher-level constructs to SIMPIL. For instance,
it might be useful for our language to provide di-
rect support for functions or dynamically generated
code. This could slightly enhance our analyses (e.g.,
allowing us to reason about function arguments), while
requiring only small changes to our semantics and
analyses. Figure 4 presents the CALL and RET rules
that need to be added to the semantics of SIMPIL to
provide support for call-by-value function calls. Note
that several new contexts were introduced to support
functions, including a stack context (λ) to store return
addresses, a scope context (ζ) to store function-local
variable contexts and a map from function names to
addresses (φ).
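In a similar manner we can enhance SIMPIL to support
dynamically generated code. We redefine the abstract
machine transition to allow updates to the program
context (Σ ⇝ Σ′) and provide the rules for adding
generated code to Σ. An example GENCODE rule is
shown in Figure 4.

\[
\dfrac{
  \dfrac{
    \dfrac{}{\mu, \Delta \vdash 2 \Downarrow 2}\ \text{CONST}
    \qquad
    \dfrac{20 \text{ is input}}{\mu, \Delta \vdash \text{get\_input}(\cdot) \Downarrow 20}\ \text{INPUT}
    \qquad
    v' = 2 * 20
  }{\mu, \Delta \vdash 2 * \text{get\_input}(\cdot) \Downarrow 40}\ \text{BINOP}
  \qquad
  \Delta' = \Delta[x \leftarrow 40]
  \qquad
  \iota = \Sigma[pc + 1]
}{\Sigma, \mu, \Delta, pc, x := 2 * \text{get\_input}(\cdot) \leadsto \Sigma, \mu, \Delta', pc + 1, \iota}\ \text{ASSIGN}
\]
Figure 3: Evaluation of the program in Listing 1.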
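The bookkeeping behind the CALL and RET rules can be sketched as two small functions over the stack context λ, the scope context ζ, and the function map φ. This is a hedged illustration in Python, not the paper's implementation. The names `call` and `ret`, the list encoding of λ and ζ, and the explicit `params` list of formal parameter names are assumptions of this sketch. Arguments are assumed to be already evaluated to values.

```python
def call(lam, phi, delta, zeta, pc, f, params, args):
    """CALL: returns (lam', delta', zeta', pc').

    Pushes the return address pc + 1 onto lam, saves the caller's variable
    context on zeta, binds the argument values to the parameter names, and
    jumps to the address phi[f].
    """
    new_delta = {**delta, **dict(zip(params, args))}   # ∆' = ∆[x1 <- v1, ..., xi <- vi]
    return [pc + 1] + lam, new_delta, [delta] + zeta, phi[f]


def ret(lam, zeta):
    """RET: returns (lam', delta', zeta', pc').

    Pops the return address from lam and restores the caller's variable
    context from zeta.
    """
    return lam[1:], zeta[0], zeta[1:], lam[0]


phi = {"double": 9}            # hypothetical function "double" starts at statement 9
caller_delta = {"a": 3}

# Statement 4 of the caller executes: call double(a)
lam, delta, zeta, pc = call([], phi, caller_delta, [], 4, "double", ["n"], [3])
in_callee = (lam, dict(delta), list(zeta), pc)

# The callee executes: return
lam, delta, zeta, pc = ret(lam, zeta)
```

After the call, control is at statement 9 with `n` bound to 3 and the return address 5 on the stack. After the return, the caller's variables are restored and execution resumes at statement 5.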
III. DYNAMIC TAINT ANALYSIS
The purpose of dynamic taint analysis is to track in-
formation flow between sources and sinks. Any program
value whose computation depends on data derived from a
taint source is considered tainted (denoted T). Any other
value is considered untainted (denoted F). A taint policy
P determines exactly how taint flows as a program ex-
ecutes, what sorts of operations introduce new taint, and
what checks are performed on tainted values. While the
specifics of the taint policy may differ depending upon the
taint analysis application, e.g., taint tracking policies for
unpacking malware may be different than attack detection,
the fundamental concepts stay the same.
Two types of errors can occur in dynamic taint analysis.
First, dynamic taint analysis can mark a value as tainted
when it is not derived from a taint source. We say that such
a value is overtainted. For example, in an attack detection
application overtainting will typically result in reporting
an attack when no attack occurred. Second, dynamic taint
analysis can miss the information flow from a source to a
sink, which we call undertainting. In the attack detection
scenario, undertainting means the system missed a real
attack. A dynamic taint analysis system is precise if no
undertainting or overtainting occurs.
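The two error types can be made concrete with a small set computation. This is a minimal sketch, assuming that taint is compared at the granularity of named values and that a ground-truth set of input-derived values is available (in practice it usually is not). Python, the function name `taint_errors`, and the variable names are hypothetical.

```python
def taint_errors(reported, truly_derived):
    """Compare the values an analysis marks tainted against the ground truth.

    Returns (overtainted, undertainted):
      overtainted  = marked tainted, but not derived from a taint source
      undertainted = derived from a taint source, but not marked tainted
    """
    return reported - truly_derived, truly_derived - reported


reported = {"x", "y", "z"}          # what the analysis marked as tainted
truly_derived = {"x", "y", "w"}     # what actually depends on a taint source

overtainted, undertainted = taint_errors(reported, truly_derived)
precise = not overtainted and not undertainted
```

In this made-up run `z` is overtainted and `w` is undertainted, so the analysis is not precise.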
In this section we first describe how dynamic taint analysis
is implemented by monitoring the execution of a program.
We then describe various taint analysis policies and trade-
offs. Finally, we describe important issues and caveats that
often result in dynamic taint analysis systems that overtaint,
undertaint, or both.
A. Dynamic Taint Analysis Semantics
Since dynamic taint analysis is performed on code at
runtime, it is natural to express dynamic taint analysis in
terms of the operational semantics of the language. Taint
policy actions, whether it be taint propagation, introduction,
or checking, are added to the operational semantics rules.
To keep track of the taint status of each program value, we
redefine values in our language to be tuples of the form
⟨v, τ⟩, where v is a value in the initial language, and τ is
the taint status of v. A summary of the necessary changes
to SIMPIL is provided in Table II.

taint t ::= T | F
value   ::= ⟨v, t⟩
τ_∆     ::= Maps variables to taint status
τ_µ     ::= Maps addresses to taint status
Table II: Additional changes to SIMPIL to enable dynamic
taint analysis.
Figure 5 shows how a taint analysis policy P is added to
SIMPIL. The semantics show where the taint policy is used;
the semantics are independent of the policy itself. In order
to support taint policies, the semantics introduce two new
contexts: τ_∆ and τ_µ. τ_∆ keeps track of the taint status of
scalar variables. τ_µ keeps track of the taint status of memory
cells. τ_∆ and τ_µ are initialized so that all values are marked
untainted. Together, τ_∆ and τ_µ keep the taint status for all
variables and memory cells, and are used to derive the taint
status for all values during execution.
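One way to picture the tainted semantics is an expression evaluator that returns pairs ⟨v, t⟩ and consults a policy at each rule. The following is a minimal sketch under stated assumptions. Python, the tuple encoding of expressions, the name `eval_taint`, and the dictionary-of-functions encoding of the policy P are hypothetical, and 32-bit wraparound is left out for brevity. The policy entries shown mark input as tainted, constants as untainted, binary results as tainted if either operand is, and loads as tainted if the loaded cell is.

```python
BINOPS = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}

POLICY = {
    "input": lambda src: True,              # all input is tainted
    "const": lambda: False,                 # constants are untainted
    "binop": lambda t1, t2: t1 or t2,       # tainted if either operand is
    "mem": lambda t_a, t_v: t_v,            # taint of the loaded cell only
}


def eval_taint(e, mu, tau_mu, delta, tau_delta, get_input, policy=POLICY):
    """Evaluate e to a pair (v, t): the value and its taint status."""
    tag = e[0]
    if tag == "const":
        return e[1], policy["const"]()
    if tag == "var":
        # Variables missing from tau_delta are untainted by default.
        return delta[e[1]], tau_delta.get(e[1], False)
    if tag == "input":
        return get_input(e[1]), policy["input"](e[1])
    if tag == "load":
        addr, t_a = eval_taint(e[1], mu, tau_mu, delta, tau_delta, get_input, policy)
        return mu[addr], policy["mem"](t_a, tau_mu.get(addr, False))
    if tag == "binop":
        v1, t1 = eval_taint(e[2], mu, tau_mu, delta, tau_delta, get_input, policy)
        v2, t2 = eval_taint(e[3], mu, tau_mu, delta, tau_delta, get_input, policy)
        return BINOPS[e[1]](v1, v2), policy["binop"](t1, t2)
    raise ValueError(f"unsupported expression {e!r}")


tainted = eval_taint(("binop", "*", ("const", 2), ("input", "net")),
                     {}, {}, {}, {}, lambda src: 20)
clean = eval_taint(("binop", "+", ("const", 1), ("var", "k")),
                   {}, {}, {"k": 4}, {}, lambda src: 20)
```

The evaluator never inspects what the policy functions do, which reflects the point that the semantics are independent of the policy itself.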
B. Dynamic Taint Policies
A taint policy specifies three properties: how new taint is
introduced to a program, how taint propagates as instructions
execute, and how taint is checked during execution.
Taint Introduction. Taint introduction rules specify how
taint is introduced into a system. The typical convention is
to initialize all variables, memory cells, etc. as untainted.
In SIMPIL, we only have a single source of user input:
the get_input(·) call. In a real implementation, get_input(·)
represents values returned from a system call, return values
from a library call, etc. A taint policy will also typically
distinguish between different input sources. For example, an
internet-facing network input source may always introduce
taint, while a file descriptor that reads from a trusted
configuration file may not [2, 50, 65]. Further, specific taint
sources can be tracked independently, e.g., τ_∆ can map not
just the bit indicating taint status, but also the source.
Taint Propagation. Taint propagation rules specify the taint
status for data derived from tainted or untainted operands.
Since taint is a bit, propositional logic is usually used to
express the propagation policy, e.g., t_1 ∨ t_2 indicates the
result is tainted if t_1 is tainted or t_2 is tainted.
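When taint carries its source rather than a single bit, one simple encoding is a set of source names, with the empty set meaning untainted. The sketch below assumes this encoding. Python, the function names, and the source names `network`, `keyboard`, and `config_file` are hypothetical. Under this encoding the disjunction t_1 ∨ t_2 becomes set union.

```python
UNTAINTED = frozenset()
TRUSTED_SOURCES = frozenset({"config_file"})   # hypothetical trusted source


def p_input(src):
    """Taint introduction: trusted sources introduce no taint."""
    return UNTAINTED if src in TRUSTED_SOURCES else frozenset({src})


def p_binop(t1, t2):
    """Taint propagation: the result carries the sources of both operands."""
    return t1 | t2


def is_tainted(t):
    """Collapse a set of sources back to the single taint bit."""
    return bool(t)


mixed = p_binop(p_input("network"), p_input("config_file"))
both = p_binop(p_input("network"), p_input("keyboard"))
none = p_binop(p_input("config_file"), UNTAINTED)
```

Collapsing each set with `is_tainted` recovers the one-bit policy, so the set encoding is a refinement of it rather than a different policy.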

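\[
\frac{\mu, \Delta \vdash e_1 \Downarrow v_1 \ \ldots\ \mu, \Delta \vdash e_i \Downarrow v_i \quad \Delta' = \Delta[x_1 \leftarrow v_1, \ldots, x_i \leftarrow v_i] \quad pc' = \phi[f] \quad \iota = \Sigma[pc']}
     {\lambda, \Sigma, \phi, \mu, \Delta, \zeta, pc, \text{call } f(e_1, \ldots, e_i) \leadsto (pc + 1) :: \lambda, \Sigma, \phi, \mu, \Delta', \Delta :: \zeta, pc', \iota}\ \text{CALL}
\]
\[
\frac{\iota = \Sigma[pc']}
     {pc' :: \lambda', \Sigma, \phi, \mu, \Delta, \Delta' :: \zeta', pc, \text{return} \leadsto \lambda', \Sigma, \phi, \mu, \Delta', \zeta', pc', \iota}\ \text{RET}
\]
\[
\frac{\mu, \Delta \vdash e \Downarrow v \quad v \notin \text{dom}(\Sigma) \quad s = \text{disassemble}(\mu[v]) \quad \Sigma' = \Sigma[v \leftarrow s] \quad \iota = \Sigma'[v]}
     {\Sigma, \mu, \Delta, pc, \text{jmp } e \leadsto \Sigma', \mu, \Delta, v, \iota}\ \text{GENCODE}
\]
Figure 4: Example operational semantics for adding support for call-by-value function calls and dynamically generated code.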
Component                                   Policy Check
P_input(·), P_bincheck(·), P_memcheck(·)    T
P_const()                                   F
P_unop(t), P_assign(t)                      t
P_binop(t_1, t_2)                           t_1 ∨ t_2
P_mem(t_a, t_v)                             t_v
P_condcheck(t_e, t_a)                       ¬t_a
P_gotocheck(t_a)                            ¬t_a
Table III: A typical tainted jump target policy for detecting
attacks. A dot (·) denotes an argument that is ignored. A
taint status is converted to a boolean value in the natural
way, e.g., T maps to true, and F maps to false.
Taint Checking. Taint status values are often used to
determine the runtime behavior of a program, e.g., an attack
detector may halt execution if a jump target address is
tainted. In SIMPIL, we perform checking by adding the
policy to the premise of the operational semantics. For
instance, the T-GOTO rule uses the P_gotocheck(t) policy.
P_gotocheck(t) returns T if it is safe to perform a jump
operation when the target address has taint value t, and
returns F otherwise. If F is returned, the premise for the
rule is not met and the machine terminates abnormally
(signifying an exception).
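The effect of placing a policy check in a rule premise can be sketched as a guard that runs before the jump is taken. This is an illustration under stated assumptions: Python, the names `t_goto` and `TaintViolation`, and the pluggable `check` parameter are hypothetical, and taint is modeled as a boolean with True for T.

```python
class TaintViolation(Exception):
    """Models abnormal termination caused by a failed policy check."""


def p_gotocheck(t_a):
    """Safe to jump only if the target address is untainted."""
    return not t_a


def t_goto(sigma, target, t_target, check=p_gotocheck):
    """Perform a jump to `target`, whose taint status is `t_target`.

    The policy check is a premise: if it returns False the rule does not
    apply and the machine halts. Returns the new pc and next statement.
    """
    if not check(t_target):
        raise TaintViolation(f"jump to tainted target {target}")
    return target, sigma[target]


sigma = {7: "halt"}
safe_jump = t_goto(sigma, 7, False)

try:
    t_goto(sigma, 7, True)
    blocked = False
except TaintViolation:
    blocked = True

# A policy that never objects lets the same tainted jump through.
permissive_jump = t_goto(sigma, 7, True, check=lambda t_a: True)
```

Keeping the check as a parameter means the jump logic stays the same when the policy changes.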
C. A Typical Taint Policy
A prototypical application of dynamic taint analysis is
attack detection. Table III shows a typical attack detection
policy which we call the tainted jump policy. In order to be
concrete when discussing the challenges and opportunities in
taint analysis, we often contrast implementation choices with
respect to this policy. We stress that although the policy is
designed to detect attacks, other applications of taint analysis
are typically very similar.
The goal of the tainted jump policy is to protect a
potentially vulnerable program from control flow hijacking
attacks. The main idea in the policy is that an input-derived
value will never overwrite a control-flow value such as a
return address or function pointer. A control flow exploit,
however, will overwrite jump targets (e.g., return addresses)
with input-derived values. The tainted jump policy ensures
safety against such attacks by making sure tainted jump
targets are never used.
The policy introduces taint into the system by marking
all values returned by get_input(·) as tainted. Taint is then
propagated through the program in a straightforward manner,
e.g., the result of a binary operation is tainted if either
operand is tainted, an assigned variable is tainted if the right-
hand side value is tainted, and so on.
Example 3. Table IV shows the taint calculations at each
step of the execution for the following program:
1 x := 2 * get_input(·)
2 y := 5 + x
3 goto y
On line 1, the executing program receives input, assumed
to be 20, and multiplies by 2. Since all input is marked as
tainted, 2 * get_input(·) is also tainted via T-BINOP, and
x is marked in τ_∆ as tainted via T-ASSIGN. On line 2,
x (tainted) is added to the constant 5 (untainted). Since one
operand is tainted, y is marked as tainted in τ_∆. On line 3,
the program jumps to y. Since y is tainted, the T-GOTO
premise for P_gotocheck is not satisfied, and the machine
halts abnormally.
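The taint calculation for Example 3 can be replayed by hand with the policy components of Table III written as small functions. This is a sketch for illustration: Python is an assumption, the statements are unrolled manually instead of being interpreted, and the input is fixed to 20 as in the example.

```python
# Policy components from Table III, with True standing for T.
def p_input():
    return True

def p_const():
    return False

def p_binop(t1, t2):
    return t1 or t2

def p_assign(t):
    return t

def p_gotocheck(t_a):
    return not t_a


delta, tau_delta = {}, {}        # ∆ and τ_∆, initially empty

# Line 1: x := 2 * get_input(·), with input 20
value = 2 * 20
taint = p_binop(p_const(), p_input())                # T-BINOP
delta["x"], tau_delta["x"] = value, p_assign(taint)  # T-ASSIGN

# Line 2: y := 5 + x
value = 5 + delta["x"]
taint = p_binop(p_const(), tau_delta["x"])
delta["y"], tau_delta["y"] = value, p_assign(taint)

# Line 3: goto y. The T-GOTO premise consults P_gotocheck.
jump_allowed = p_gotocheck(tau_delta["y"])
```

Both `x` and `y` end up tainted, with values 40 and 45, and `jump_allowed` is false, which corresponds to the abnormal halt on line 3.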
Different Policies for Different Applications. Different
applications of taint analysis can use different policy de-
cisions. As we will see in the next section, the typical
taint policy described in Table III is not appropriate for
all application domains, since it does not consider whether
memory addresses are tainted. Thus, it may miss some
attacks. We discuss alternatives to this policy in the next
section.
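The gap left by ignoring address taint can be seen with a table lookup whose index comes from input. The sketch below contrasts P_mem from Table III with a hypothetical alternative that also considers the address. Python, the function names, the memory layout, and the alternative policy itself are assumptions made for illustration.

```python
def p_mem_value_only(t_a, t_v):
    """P_mem from Table III: the address taint t_a is ignored."""
    return t_v


def p_mem_with_address(t_a, t_v):
    """Hypothetical alternative: a tainted address taints the loaded value."""
    return t_a or t_v


def tainted_load(mu, tau_mu, addr, t_addr, p_mem):
    """Load mu[addr] and compute the result's taint with the given P_mem."""
    return mu[addr], p_mem(t_addr, tau_mu.get(addr, False))


table = {100: 0x41, 101: 0x42}   # untainted lookup table at addresses 100 and 101
index, t_index = 1, True         # index derived from input, hence tainted
addr = 100 + index               # the address inherits the taint of the index

lenient = tainted_load(table, {}, addr, t_index, p_mem_value_only)
strict = tainted_load(table, {}, addr, t_index, p_mem_with_address)
```

Under the Table III policy the loaded value is reported as untainted even though the input selected it. The alternative marks it tainted, at the cost of tainting more values.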
D. Dynamic Taint Analysis Challenges and Opportunities
There are several challenges to using dynamic taint
analysis correctly, including:
• Tainted Addresses. Distinguishing between memory
  addresses and cells is not always appropriate.
• Undertainting. Dynamic taint analysis does not
  properly handle some types of information flow.
