Why did binary instrumentation techniques become so popular?

Due to the high overhead of binary instrumentation techniques, more efficient compiler-based [42, 64] and hardware-based [25, 26, 59, 60] approaches were later proposed.

What are the advantages of a concolic-based approach?

The central advantages of a concolic-based approach are that it is simple, easy to implement, and sidesteps the problem of reasoning about how a program interacts with its environment.

Why does dynamic analysis not compute control dependencies?

The reason is simple: reasoning about control dependencies requires reasoning about multiple paths, and dynamic analysis executes on a single path at a time.

What is the problem of taint spread?

This leads to the problem of taint spread: as the program executes, more and more values become tainted, often with less and less taint precision.

What are the three standard ways to handle symbolic jumps?

Three standard ways to handle symbolic jumps are: 1) Use concrete and symbolic (concolic) analysis [57] to run the program and observe an indirect jump target.

(Open Access) All You Ever Wanted to Know about Dynamic Taint Analysis and Forward Symbolic Execution (but Might Have Been Afraid to Ask) (2010) | Edward J. Schwartz

Q: What are the contributions mentioned in the paper "All you ever wanted to know about dynamic taint analysis and forward symbolic execution (but might have been afraid to ask)" ?

The contributions of this paper are two-fold. First, the authors precisely describe the algorithms for dynamic taint analysis and forward symbolic execution as extensions to the run-time semantics of a general language. Second, the authors highlight important implementation choices, common pitfalls, and considerations when using these techniques in a security context.

Q: What are the execution contexts of a language?

The execution context is described by five parameters: the list of program statements (Σ), the current memory state (µ), the current value for variables (∆), the program counter (pc), and the current statement (ι).

All You Ever Wanted to Know About

Dynamic Taint Analysis and Forward Symbolic Execution

(but might have been afraid to ask)

Edward J. Schwartz, Thanassis Avgerinos, David Brumley

Carnegie Mellon University

Pittsburgh, PA

{edmcman, thanassis, dbrumley}@cmu.edu

Abstract: Dynamic taint analysis and forward symbolic execution are quickly becoming staple techniques in security analyses. Example applications of dynamic taint analysis and forward symbolic execution include malware analysis, input filter generation, test case generation, and vulnerability discovery. Despite the widespread usage of these two techniques, there has been little effort to formally define the algorithms and summarize the critical issues that arise when these techniques are used in typical security contexts.

The contributions of this paper are two-fold. First, we precisely describe the algorithms for dynamic taint analysis and forward symbolic execution as extensions to the run-time semantics of a general language. Second, we highlight important implementation choices, common pitfalls, and considerations when using these techniques in a security context.

Keywords: taint analysis, symbolic execution, dynamic analysis

I. INTRODUCTION

Dynamic analysis (the ability to monitor code as it executes) has become a fundamental tool in computer security research. Dynamic analysis is attractive because it allows us to reason about actual executions, and thus can perform precise security analysis based upon run-time information. Further, dynamic analysis is simple: we need only consider facts about a single execution at a time.

Two of the most commonly employed dynamic analysis techniques in security research are dynamic taint analysis and forward symbolic execution. Dynamic taint analysis runs a program and observes which computations are affected by predefined taint sources such as user input. Dynamic forward symbolic execution automatically builds a logical formula describing a program execution path, which reduces the problem of reasoning about the execution to the domain of logic. The two analyses can be used in conjunction to build formulas representing only the parts of an execution that depend upon tainted values.

The number of security applications utilizing these two techniques is enormous. Example security research areas employing either dynamic taint analysis, forward symbolic execution, or a mix of the two, are:

1) Unknown Vulnerability Detection. Dynamic taint analysis can look for misuses of user input during an execution. For example, dynamic taint analysis can be used to prevent code injection attacks by monitoring whether user input is executed [23–25, 50, 59].

2) Automatic Input Filter Generation. Forward symbolic execution can be used to automatically generate input filters that detect and remove exploits from the input stream [14, 22, 23]. Filters generated in response to actual executions are attractive because they provide strong accuracy guarantees [14].

3) Malware Analysis. Taint analysis and forward symbolic execution are used to analyze how information flows through a malware binary [7, 8, 65], explore trigger-based behavior [12, 45], and detect emulators [58].

4) Test Case Generation. Taint analysis and forward symbolic execution are used to automatically generate inputs to test programs [17, 19, 36, 57], and can generate inputs that cause two implementations of the same protocol to behave differently [10, 17].

Given the large number and variety of application domains, one would imagine that implementing dynamic taint analysis and forward symbolic execution would be a textbook problem. Unfortunately this is not the case. Previous work has focused on how these techniques can be applied to solve security problems, but has left it as out of scope to give exact algorithms, implementation choices and pitfalls. As a result, researchers seeking to use these techniques often rediscover the same limitations, implementation tricks, and trade-offs.

The goals and contributions of this paper are two-fold. First, we formalize dynamic taint analysis and forward symbolic execution as found in the security domain. Our formalization rests on the intuition that run-time analyses can precisely and naturally be described in terms of the formal run-time semantics of the language. This formalization provides a concise and precise way to define each analysis, and suggests a straightforward implementation. We then show how our formalization can be used to tease out and describe common implementation details, caveats, and choices as found in various security applications.

    program ::= stmt*
    stmt s  ::= var := exp | store(exp, exp) | goto exp | assert exp
              | if exp then goto exp else goto exp
    exp e   ::= load(exp) | exp ♦_b exp | ♦_u exp | var | get_input(src) | v
    ♦_b     ::= typical binary operators
    ♦_u     ::= typical unary operators
    value v ::= 32-bit unsigned integer

Table I: A simple intermediate language (SIMPIL).

II. FIRST STEPS: A GENERAL LANGUAGE

A. Overview

A precise definition of dynamic taint analysis or forward symbolic execution must target a specific language. For the purposes of this paper, we use SIMPIL: a Simple Intermediate Language. The grammar of SIMPIL is presented in Table I. Although the language is simple, it is powerful enough to express typical languages as varied as Java [31] and assembly code [1, 2]. Indeed, the language is representative of internal representations used by compilers for a variety of programming languages [3].

A program in our language consists of a sequence of numbered statements. Statements in our language consist of assignments, assertions, jumps, and conditional jumps. Expressions in SIMPIL are side-effect free (i.e., they do not change the program state). We use “♦_b” to represent typical binary operators, e.g., you can fill in the box with operators such as addition, subtraction, etc. Similarly, ♦_u represents unary operators such as logical negation. The statement get_input(src) returns input from source src. We use a dot (·) to denote an argument that is ignored, e.g., we will write get_input(·) when the exact input source is not relevant. For simplicity, we consider only expressions (constants, variables, etc.) that evaluate to 32-bit integer values; extending the language and rules to additional types is straightforward.

For the sake of simplicity, we omit the type-checking semantics of our language and assume things are well-typed in the obvious way, e.g., that binary operands are integers or variables, not memories, and so on.

B. Operational Semantics

The operational semantics of a language specify unambiguously how to execute a program written in that language. Because dynamic program analyses are defined in terms of actual program executions, operational semantics also provide a natural way to define a dynamic analysis. However, before we can specify program analyses, we must first define the base operational semantics.

    Context  Meaning
    Σ        Maps a statement number to a statement
    µ        Maps a memory address to the current value at that address
    ∆        Maps a variable name to its value
    pc       The program counter
    ι        The next instruction

Figure 2: The meta-syntactic variables used in the execution context.

The complete operational semantics for SIMPIL are shown in Figure 1. Each statement rule is of the form:

\[ \frac{\text{computation}}{\langle \text{current state} \rangle,\ \text{stmt} \leadsto \langle \text{end state} \rangle,\ \text{stmt}'} \]

Rules are read bottom to top, left to right. Given a statement, we pattern-match the statement to find the applicable rule, e.g., given the statement x := e, we match to the ASSIGN rule. We then apply the computation given in the top of the rule, and if successful, transition to the end state. If no rule matches (or the computation in the premise fails), then the machine halts abnormally. For instance, jumping to an address not in the domain of Σ would cause abnormal termination.

The execution context is described by five parameters: the list of program statements (Σ), the current memory state (µ), the current value for variables (∆), the program counter (pc), and the current statement (ι). The Σ, µ, and ∆ contexts are maps, e.g., ∆[x] denotes the current value of variable x. We denote updating a context variable x with value v as x ← v, e.g., ∆[x ← 10] denotes setting the value of variable x to the value 10 in context ∆. A summary of the five meta-syntactic variables is shown in Figure 2.

In our evaluation rules, the program context Σ does not change between transitions. The implication is that our operational semantics do not allow programs with dynamically generated code. However, adding support for dynamically generated code is straightforward. We discuss how SIMPIL can be augmented to support dynamically generated code and other higher-level language features in Section II-C.

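The evaluation rules for expressions use a similar notation. We denote by µ, ∆ ⊢ e ⇓ v evaluating an expression e to a value v in the current state given by µ and ∆. The expression e is evaluated by matching e to an expression evaluation rule and performing the attached computation. Most of the evaluation rules break the expression down into simpler expressions, evaluate the subexpressions, and then combine the resulting evaluations.

\[ \frac{v \text{ is input from } src}{\mu, \Delta \vdash \text{get\_input}(src) \Downarrow v}\ \text{INPUT} \]

\[ \frac{\mu, \Delta \vdash e \Downarrow v_1 \quad v = \mu[v_1]}{\mu, \Delta \vdash \text{load } e \Downarrow v}\ \text{LOAD} \]

\[ \frac{}{\mu, \Delta \vdash var \Downarrow \Delta[var]}\ \text{VAR} \]

\[ \frac{\mu, \Delta \vdash e \Downarrow v \quad v' = \Diamond_u v}{\mu, \Delta \vdash \Diamond_u e \Downarrow v'}\ \text{UNOP} \]

\[ \frac{\mu, \Delta \vdash e_1 \Downarrow v_1 \quad \mu, \Delta \vdash e_2 \Downarrow v_2 \quad v' = v_1 \Diamond_b v_2}{\mu, \Delta \vdash e_1 \Diamond_b e_2 \Downarrow v'}\ \text{BINOP} \]

\[ \frac{}{\mu, \Delta \vdash v \Downarrow v}\ \text{CONST} \]

\[ \frac{\mu, \Delta \vdash e \Downarrow v \quad \Delta' = \Delta[var \leftarrow v] \quad \iota = \Sigma[pc + 1]}{\Sigma, \mu, \Delta, pc, var := e \leadsto \Sigma, \mu, \Delta', pc + 1, \iota}\ \text{ASSIGN} \]

\[ \frac{\mu, \Delta \vdash e \Downarrow v_1 \quad \iota = \Sigma[v_1]}{\Sigma, \mu, \Delta, pc, \text{goto } e \leadsto \Sigma, \mu, \Delta, v_1, \iota}\ \text{GOTO} \]

\[ \frac{\mu, \Delta \vdash e \Downarrow 1 \quad \mu, \Delta \vdash e_1 \Downarrow v_1 \quad \iota = \Sigma[v_1]}{\Sigma, \mu, \Delta, pc, \text{if } e \text{ then goto } e_1 \text{ else goto } e_2 \leadsto \Sigma, \mu, \Delta, v_1, \iota}\ \text{TCOND} \]

\[ \frac{\mu, \Delta \vdash e \Downarrow 0 \quad \mu, \Delta \vdash e_2 \Downarrow v_2 \quad \iota = \Sigma[v_2]}{\Sigma, \mu, \Delta, pc, \text{if } e \text{ then goto } e_1 \text{ else goto } e_2 \leadsto \Sigma, \mu, \Delta, v_2, \iota}\ \text{FCOND} \]

\[ \frac{\mu, \Delta \vdash e_1 \Downarrow v_1 \quad \mu, \Delta \vdash e_2 \Downarrow v_2 \quad \iota = \Sigma[pc + 1] \quad \mu' = \mu[v_1 \leftarrow v_2]}{\Sigma, \mu, \Delta, pc, \text{store}(e_1, e_2) \leadsto \Sigma, \mu', \Delta, pc + 1, \iota}\ \text{STORE} \]

\[ \frac{\mu, \Delta \vdash e \Downarrow 1 \quad \iota = \Sigma[pc + 1]}{\Sigma, \mu, \Delta, pc, \text{assert}(e) \leadsto \Sigma, \mu, \Delta, pc + 1, \iota}\ \text{ASSERT} \]

Figure 1: Operational semantics of SIMPIL.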

Example 1. Consider evaluating the following program:

    1  x := 2 * get_input(·)

The evaluation for this program is shown in Figure 3 for the input of 20. Notice that since the ASSIGN rule requires the expression e in var := e to be evaluated, we had to recurse to other rules (BINOP, INPUT, CONST) to evaluate the expression 2 * get_input(·) to the value 40.
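The recursion through the expression rules can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the class names (Const, Var, GetInput, BinOp), the helper names eval_exp and assign, and the read_input callback are all assumptions made for this example, and the memory context µ is left out because the program performs no load.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

MASK = 0xFFFFFFFF  # SIMPIL values are 32-bit unsigned integers


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class GetInput:
    src: str


@dataclass(frozen=True)
class BinOp:
    op: str
    left: "Exp"
    right: "Exp"


Exp = Union[Const, Var, GetInput, BinOp]

# A few "typical binary operators"; the selection is an assumption.
BINOPS: Dict[str, Callable[[int, int], int]] = {
    "+": lambda a, b: (a + b) & MASK,
    "-": lambda a, b: (a - b) & MASK,
    "*": lambda a, b: (a * b) & MASK,
}


def eval_exp(e, delta, read_input):
    """Evaluate e to a value under the variable context delta."""
    if isinstance(e, Const):       # CONST
        return e.value & MASK
    if isinstance(e, Var):         # VAR
        return delta[e.name]
    if isinstance(e, GetInput):    # INPUT
        return read_input(e.src) & MASK
    if isinstance(e, BinOp):       # BINOP: evaluate both operands, then combine
        v1 = eval_exp(e.left, delta, read_input)
        v2 = eval_exp(e.right, delta, read_input)
        return BINOPS[e.op](v1, v2)
    raise ValueError(f"no rule matches {e!r}")  # abnormal halt


def assign(delta, var, e, read_input):
    """ASSIGN: return the updated context delta[var <- v] without mutating delta."""
    v = eval_exp(e, delta, read_input)
    return {**delta, var: v}


# Example 1: x := 2 * get_input(.) evaluated with the input 20
delta1 = assign({}, "x", BinOp("*", Const(2), GetInput("stdin")), lambda src: 20)
print(delta1)
```

Returning a fresh dictionary from assign mirrors the way the ASSIGN rule produces a new context ∆′ instead of modifying ∆ in place; a practical interpreter would more likely update the map directly.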

C. Language Discussion

We have designed our language to demonstrate the critical aspects of dynamic taint analysis and forward symbolic execution. We do not include some high-level language constructs such as functions or scopes for simplicity and space reasons. This omission does not fundamentally limit the capability of our language or our results. Adding such constructs is straightforward. For example, two approaches are:

1) Compile missing high-level language constructs down to our language. For instance, functions, buffers and user-level abstractions can be compiled down to SIMPIL statements instead of assembly-level instructions. Tools such as BAP [1] and BitBlaze [2] already use a variant of SIMPIL to perform analyses. BAP is freely available at http://bap.ece.cmu.edu.

Example 2. Function calls in high-level code can be compiled down to SIMPIL by storing the return address and transferring control flow. The following code calls and returns from the function at line 9.

    1  /* Caller function */
    2  esp := esp + 4
    3  store(esp, 6)  /* retaddr is 6 */
    4  goto 9
    5  /* The call will return here */
    6  halt
    8  /* Callee function */
    9  ...
    10 goto load(esp)

We assume this choice throughout the paper since previous dynamic analysis work has already demonstrated that such languages can be used to reason about programs written in any language.

2) Add higher-level constructs to SIMPIL. For instance, it might be useful for our language to provide direct support for functions or dynamically generated code. This could slightly enhance our analyses (e.g., allowing us to reason about function arguments), while requiring only small changes to our semantics and analyses. Figure 4 presents the CALL and RET rules that need to be added to the semantics of SIMPIL to provide support for call-by-value function calls. Note that several new contexts were introduced to support functions, including a stack context (λ) to store return addresses, a scope context (ζ) to store function-local variable contexts and a map from function names to addresses (φ).

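In a similar manner we can enhance SIMPIL to support dynamically generated code. We redefine the abstract machine transition to allow updates to the program context (Σ ⇝ Σ′) and provide the rules for adding generated code to Σ. An example GENCODE rule is shown in Figure 4.

\[ \frac{\dfrac{\dfrac{}{\mu, \Delta \vdash 2 \Downarrow 2}\ \text{CONST} \quad \dfrac{20 \text{ is input}}{\mu, \Delta \vdash \text{get\_input}(\cdot) \Downarrow 20}\ \text{INPUT} \quad v' = 2 * 20}{\mu, \Delta \vdash 2 * \text{get\_input}(\cdot) \Downarrow 40}\ \text{BINOP} \quad \Delta' = \Delta[x \leftarrow 40] \quad \iota = \Sigma[pc + 1]}{\Sigma, \mu, \Delta, pc, x := 2 * \text{get\_input}(\cdot) \leadsto \Sigma, \mu, \Delta', pc + 1, \iota}\ \text{ASSIGN} \]

Figure 3: Evaluation of the program in Listing 1.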
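To make the first approach concrete, the call sequence of Example 2 can be traced with a small interpreter loop. The sketch below rests on several assumptions that are not in the paper: statements are encoded as Python tuples, expressions as lambdas over (∆, µ), halt is treated as a statement that stops the machine, and the elided callee body at line 9 is replaced by a made-up assignment to a variable named result.

```python
def run(program, pc, esp=0):
    """Run a tiny SIMPIL-like program given as {line number: statement tuple}."""
    delta = {"esp": esp}   # variable context
    mu = {}                # memory context: address -> value
    trace = []             # line numbers in execution order
    while True:
        if pc not in program:
            raise RuntimeError(f"abnormal halt: {pc} is not in dom(Sigma)")
        stmt = program[pc]
        trace.append(pc)
        kind = stmt[0]
        if kind == "assign":           # var := exp
            _, var, exp = stmt
            delta[var] = exp(delta, mu)
            pc += 1
        elif kind == "store":          # store(address exp, value exp)
            _, addr, val = stmt
            mu[addr(delta, mu)] = val(delta, mu)
            pc += 1
        elif kind == "goto":           # goto exp
            pc = stmt[1](delta, mu)
        elif kind == "halt":           # assumed statement, not in Table I
            return delta, mu, trace
        else:
            raise RuntimeError(f"abnormal halt: no rule for {kind!r}")


program = {
    2: ("assign", "esp", lambda d, m: d["esp"] + 4),
    3: ("store", lambda d, m: d["esp"], lambda d, m: 6),   # return address is 6
    4: ("goto", lambda d, m: 9),
    6: ("halt",),
    9: ("assign", "result", lambda d, m: 42),   # hypothetical stand-in for the callee body
    10: ("goto", lambda d, m: m[d["esp"]]),     # goto load(esp)
}

delta, mu, trace = run(program, pc=2)
print(trace)
```

The trace shows control leaving the caller at line 4, running the callee at lines 9 and 10, and returning to line 6 through the address that line 3 stored in memory.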

III. DYNAMIC TAINT ANALYSIS

The purpose of dynamic taint analysis is to track information flow between sources and sinks. Any program value whose computation depends on data derived from a taint source is considered tainted (denoted T). Any other value is considered untainted (denoted F). A taint policy P determines exactly how taint flows as a program executes, what sorts of operations introduce new taint, and what checks are performed on tainted values. While the specifics of the taint policy may differ depending upon the taint analysis application, e.g., taint tracking policies for unpacking malware may be different than attack detection, the fundamental concepts stay the same.

Two types of errors can occur in dynamic taint analysis. First, dynamic taint analysis can mark a value as tainted when it is not derived from a taint source. We say that such a value is overtainted. For example, in an attack detection application overtainting will typically result in reporting an attack when no attack occurred. Second, dynamic taint analysis can miss the information flow from a source to a sink, which we call undertainting. In the attack detection scenario, undertainting means the system missed a real attack. A dynamic taint analysis system is precise if no undertainting or overtainting occurs.
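The two error types amount to set differences between what an analysis reports and what is actually derived from a taint source. The following sketch assumes, purely for illustration, that the true set of source-derived values is known; in practice that ground truth is usually not available. The function name classify and the variable names are hypothetical.

```python
def classify(reported, truth):
    """Compare the values an analysis marked tainted with the truly tainted ones."""
    reported, truth = set(reported), set(truth)
    overtainted = reported - truth    # marked tainted, but not derived from a source
    undertainted = truth - reported   # derived from a source, but missed
    precise = not overtainted and not undertainted
    return overtainted, undertainted, precise


# Hypothetical run: x and y really depend on input, the analysis flagged x and z.
over, under, precise = classify(reported={"x", "z"}, truth={"x", "y"})
print(over, under, precise)
```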

In this section we first describe how dynamic taint analysis is implemented by monitoring the execution of a program. We then describe various taint analysis policies and trade-offs. Finally, we describe important issues and caveats that often result in dynamic taint analysis systems that overtaint, undertaint, or both.

A. Dynamic Taint Analysis Semantics

Since dynamic taint analysis is performed on code at runtime, it is natural to express dynamic taint analysis in terms of the operational semantics of the language. Taint policy actions, whether it be taint propagation, introduction, or checking, are added to the operational semantics rules. To keep track of the taint status of each program value, we redefine values in our language to be tuples of the form ⟨v, τ⟩, where v is a value in the initial language, and τ is the taint status of v. A summary of the necessary changes to SIMPIL is provided in Table II.

    taint t ::= T | F
    value   ::= ⟨v, t⟩
    τ_∆     ::= Maps variables to taint status
    τ_µ     ::= Maps addresses to taint status

Table II: Additional changes to SIMPIL to enable dynamic taint analysis.

Figure 5 shows how a taint analysis policy P is added to SIMPIL. The semantics show where the taint policy is used; the semantics are independent of the policy itself. In order to support taint policies, the semantics introduce two new contexts: τ_∆ and τ_µ. τ_∆ keeps track of the taint status of scalar variables. τ_µ keeps track of the taint status of memory cells. τ_∆ and τ_µ are initialized so that all values are marked untainted. Together, τ_∆ and τ_µ keep the taint status for all variables and memory cells, and are used to derive the taint status for all values during execution.
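One way to picture the two taint contexts is as shadow maps kept next to ∆ and µ. The sketch below is an assumption about representation, not the paper's implementation: TValue and TaintState are hypothetical names, T and F are encoded as Python booleans, and load returns the taint of the memory cell only, which is one possible policy choice.

```python
from collections import defaultdict
from typing import NamedTuple


class TValue(NamedTuple):
    v: int      # the concrete 32-bit value
    t: bool     # taint status: True stands for T, False for F


class TaintState:
    def __init__(self):
        self.delta = {}                      # variable -> value
        self.mu = {}                         # address -> value
        self.tau_delta = defaultdict(bool)   # variable -> taint, initially F
        self.tau_mu = defaultdict(bool)      # address -> taint, initially F

    def write_var(self, name, val):
        self.delta[name] = val.v
        self.tau_delta[name] = val.t

    def read_var(self, name):
        return TValue(self.delta[name], self.tau_delta[name])

    def store(self, addr, val):
        self.mu[addr] = val.v
        self.tau_mu[addr] = val.t

    def load(self, addr):
        # Only the taint of the cell is returned; the address taint is ignored here.
        return TValue(self.mu[addr], self.tau_mu[addr])


state = TaintState()
state.write_var("x", TValue(40, True))      # x holds an input-derived value
state.store(0x1000, state.read_var("x"))    # taint follows the value into memory
state.store(0x1004, TValue(7, False))       # an untainted constant
print(state.load(0x1000), state.load(0x1004))
```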

B. Dynamic Taint Policies

A taint policy specifies three properties: how new taint is introduced to a program, how taint propagates as instructions execute, and how taint is checked during execution.

Taint Introduction. Taint introduction rules specify how taint is introduced into a system. The typical convention is to initialize all variables, memory cells, etc. as untainted. In SIMPIL, we only have a single source of user input: the get_input(·) call. In a real implementation, get_input(·) represents values returned from a system call, return values from a library call, etc. A taint policy will also typically distinguish between different input sources. For example, an internet-facing network input source may always introduce taint, while a file descriptor that reads from a trusted configuration file may not [2, 50, 65]. Further, specific taint sources can be tracked independently, e.g., τ_∆ can map not just the bit indicating taint status, but also the source.
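Tracking sources independently can be sketched by letting τ_∆ map each variable to a set of source labels, where the empty set means untainted. Everything named below (TRUSTED_SOURCES, introduce, propagate, and the source names) is a hypothetical illustration of the idea, not an interface from any existing tool.

```python
TRUSTED_SOURCES = frozenset({"config_file"})   # assumed to introduce no taint


def introduce(src):
    """Taint introduction: label a value with its source unless the source is trusted."""
    return frozenset() if src in TRUSTED_SOURCES else frozenset({src})


def propagate(*labels):
    """A derived value carries the labels of all operands it was computed from."""
    return frozenset().union(*labels)


def is_tainted(label):
    """Collapse a label set back to the single taint bit."""
    return bool(label)


tau_delta = {
    "a": introduce("network"),
    "b": introduce("config_file"),
    "c": introduce("usb"),
}
tau_delta["d"] = propagate(tau_delta["a"], tau_delta["c"])   # d := a + c
tau_delta["e"] = propagate(tau_delta["b"])                   # e := -b
print(sorted(tau_delta["d"]), is_tainted(tau_delta["e"]))
```

Using a set of labels keeps the single-bit view available through is_tainted while still recording which input each value came from.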

Taint Propagation. Taint propagation rules specify the taint status for data derived from tainted or untainted operands. Since taint is a bit, propositional logic is usually used to express the propagation policy, e.g., t_1 ∨ t_2 indicates the result is tainted if t_1 is tainted or t_2 is tainted.

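\[ \frac{\mu, \Delta \vdash e_1 \Downarrow v_1 \;\ldots\; \mu, \Delta \vdash e_i \Downarrow v_i \quad \Delta' = \Delta[x_1 \leftarrow v_1, \ldots, x_i \leftarrow v_i] \quad pc' = \phi[f] \quad \iota = \Sigma[pc']}{\lambda, \Sigma, \phi, \mu, \Delta, \zeta, pc, \text{call } f(e_1, \ldots, e_i) \leadsto (pc + 1) :: \lambda, \Sigma, \phi, \mu, \Delta', \Delta :: \zeta, pc', \iota}\ \text{CALL} \]

\[ \frac{\iota = \Sigma[pc']}{pc' :: \lambda', \Sigma, \phi, \mu, \Delta, \Delta' :: \zeta', pc, \text{return} \leadsto \lambda', \Sigma, \phi, \mu, \Delta', \zeta', pc', \iota}\ \text{RET} \]

\[ \frac{\mu, \Delta \vdash e \Downarrow v \quad v \notin dom(\Sigma) \quad s = \text{disassemble}(\mu[v]) \quad \Sigma' = \Sigma[v \leftarrow s] \quad \iota = \Sigma'[v]}{\Sigma, \mu, \Delta, pc, \text{jmp } e \leadsto \Sigma', \mu, \Delta, v, \iota}\ \text{GENCODE} \]

Figure 4: Example operational semantics for adding support for call-by-value function calls and dynamically generated code.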

    Component                                  Policy Check
    P_input(·), P_bincheck(·), P_memcheck(·)   T
    P_const()                                  F
    P_unop(t), P_assign(t)                     t
    P_binop(t_1, t_2)                          t_1 ∨ t_2
    P_mem(t_a, t_v)                            t_v
    P_condcheck(t_e, t_a)                      ¬t_a
    P_gotocheck(t_a)                           ¬t_a

Table III: A typical tainted jump target policy for detecting attacks. A dot (·) denotes an argument that is ignored. A taint status is converted to a boolean value in the natural way, e.g., T maps to true, and F maps to false.

Taint Checking. Taint status values are often used to determine the runtime behavior of a program, e.g., an attack detector may halt execution if a jump target address is tainted. In SIMPIL, we perform checking by adding the policy to the premise of the operational semantics. For instance, the T-GOTO rule uses the P_gotocheck(t) policy. P_gotocheck(t) returns T if it is safe to perform a jump operation when the target address has taint value t, and returns F otherwise. If F is returned, the premise for the rule is not met and the machine terminates abnormally (signifying an exception).
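A policy of this shape can be written down as a table of small functions that the interpreter consults, with a failed check modeled as an exception. The sketch below encodes the tainted jump policy of Table III under some assumptions: the dictionary layout, the argument names (t_addr, t_val, t_cond, t_target), and the AbnormalHalt and t_goto names are all hypothetical, and the reading that the memory and conditional rows depend on the value taint and the jump-target taint respectively is my interpretation of the table.

```python
T, F = True, False   # taint statuses, mapped to booleans in the natural way

# Tainted jump policy: one function per policy component.
TAINTED_JUMP_POLICY = {
    "input": lambda src: T,               # every input is tainted
    "const": lambda: F,                   # constants are untainted
    "unop": lambda t: t,
    "assign": lambda t: t,
    "binop": lambda t1, t2: t1 or t2,     # tainted if either operand is tainted
    "mem": lambda t_addr, t_val: t_val,   # the address taint is ignored
    "bincheck": lambda *_: T,             # these checks always pass
    "memcheck": lambda *_: T,
    "condcheck": lambda t_cond, t_target: not t_target,
    "gotocheck": lambda t_target: not t_target,
}


class AbnormalHalt(Exception):
    """The premise of a rule was not met, so the machine stops."""


def t_goto(policy, target, t_target):
    """Return the new pc if the goto check passes, otherwise halt abnormally."""
    if not policy["gotocheck"](t_target):
        raise AbnormalHalt(f"jump to tainted target {target:#x}")
    return target


new_pc = t_goto(TAINTED_JUMP_POLICY, 0x400, F)   # untainted target: allowed
try:
    t_goto(TAINTED_JUMP_POLICY, 0x2D, T)         # tainted target: rejected
    halted = False
except AbnormalHalt:
    halted = True
print(new_pc, halted)
```

Keeping the policy in a table separate from the interpreter reflects the point that the semantics say where a policy is consulted, while the policy itself can be swapped out.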

C. A Typical Taint Policy

A prototypical application of dynamic taint analysis is attack detection. Table III shows a typical attack detection policy which we call the tainted jump policy. In order to be concrete when discussing the challenges and opportunities in taint analysis, we often contrast implementation choices with respect to this policy. We stress that although the policy is designed to detect attacks, other applications of taint analysis are typically very similar.

The goal of the tainted jump policy is to protect a potentially vulnerable program from control flow hijacking attacks. The main idea in the policy is that an input-derived value will never overwrite a control-flow value such as a return address or function pointer. A control flow exploit, however, will overwrite jump targets (e.g., return addresses) with input-derived values. The tainted jump policy ensures safety against such attacks by making sure tainted jump targets are never used.

The policy introduces taint into the system by marking all values returned by get_input(·) as tainted. Taint is then propagated through the program in a straightforward manner, e.g., the result of a binary operation is tainted if either operand is tainted, an assigned variable is tainted if the right-hand side value is tainted, and so on.

Example 3. Table IV shows the taint calculations at each step of the execution for the following program:

    1  x := 2 * get_input(·)
    2  y := 5 + x
    3  goto y

On line 1, the executing program receives input, assumed to be 20, and multiplies by 2. Since all input is marked as tainted, 2 * get_input(·) is also tainted via T-BINOP, and x is marked in τ_∆ as tainted via T-ASSIGN. On line 2, x (tainted) is added to the constant 5 (untainted). Since one operand is tainted, y is marked as tainted in τ_∆. On line 3, the program jumps to y. Since y is tainted, the T-GOTO premise for P_gotocheck is not satisfied, and the machine halts abnormally.
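The three steps of Example 3 can be replayed directly. The sketch below hard-codes the program as straight-line Python in place of a general interpreter, and assumes a policy in which all input is tainted, constants are untainted, binary operations taint their result if either operand is tainted, and a tainted jump target stops the machine. The names TaintedJump, binop, and run_example3 are hypothetical.

```python
MASK = 0xFFFFFFFF   # values are 32-bit unsigned integers


class TaintedJump(Exception):
    """Raised when the goto check fails, i.e. the machine halts abnormally."""


def binop(op, a, b):
    """Combine two (value, taint) pairs; the result is tainted if either operand is."""
    (v1, t1), (v2, t2) = a, b
    return op(v1, v2) & MASK, t1 or t2


def run_example3(user_input):
    delta = {}       # variable -> value
    tau_delta = {}   # variable -> taint status

    # 1  x := 2 * get_input(.)   constant 2 is untainted, the input is tainted
    delta["x"], tau_delta["x"] = binop(
        lambda a, b: a * b, (2, False), (user_input, True))

    # 2  y := 5 + x              constant 5 is untainted, x carries its taint
    delta["y"], tau_delta["y"] = binop(
        lambda a, b: a + b, (5, False), (delta["x"], tau_delta["x"]))

    # 3  goto y                  the jump is allowed only if y is untainted
    if tau_delta["y"]:
        raise TaintedJump(delta, tau_delta)
    return delta["y"]


try:
    run_example3(20)
    halted = False
except TaintedJump as exc:
    halted = True
    final_delta, final_tau = exc.args
    print(final_delta, final_tau)
```

With the input 20, x becomes 40 and y becomes 45, both tainted, so the jump on line 3 is refused.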

Different Policies for Different Applications. Different applications of taint analysis can use different policy decisions. As we will see in the next section, the typical taint policy described in Table III is not appropriate for all application domains, since it does not consider whether memory addresses are tainted. Thus, it may miss some attacks. We discuss alternatives to this policy in the next section.

D. Dynamic Taint Analysis Challenges and Opportunities

There are several challenges to using dynamic taint analysis correctly, including:

• Tainted Addresses. Distinguishing between memory addresses and cells is not always appropriate.

• Undertainting. Dynamic taint analysis does not properly handle some types of information flow.
