Open Access Proceedings Article

An In-Depth Analysis of Disassembly on Full-Scale x86/x64 Binaries

TLDR
This work studies the accuracy of nine state-of-the-art disassemblers on 981 real-world compiler-generated binaries with a wide variety of properties, and reveals a mismatch between expectations in the literature and the actual capabilities of modern disassemblers.
Abstract
It is well-known that static disassembly is an unsolved problem, but how much of a problem is it in real software—for instance, for binary protection schemes? This work studies the accuracy of nine state-of-the-art disassemblers on 981 real-world compiler-generated binaries with a wide variety of properties. In contrast, prior work focuses on isolated corner cases; we show that this has led to a widespread and overly pessimistic view on the prevalence of complex constructs like inline data and overlapping code, leading reviewers and researchers to underestimate the potential of binary-based research. On the other hand, some constructs, such as function boundaries, are much harder to recover accurately than is reflected in the literature, which rarely discusses much needed error handling for these primitives. We study 30 papers recently published in six major security venues, and reveal a mismatch between expectations in the literature, and the actual capabilities of modern disassemblers. Our findings help improve future research by eliminating this mismatch.


This paper is included in the Proceedings of the 25th USENIX Security Symposium, August 10–12, 2016, Austin, TX. ISBN 978-1-931971-32-4. Open access to the Proceedings of the 25th USENIX Security Symposium is sponsored by USENIX.
An In-Depth Analysis of Disassembly
on Full-Scale x86/x64 Binaries
Dennis Andriesse, Xi Chen, and Victor van der Veen, Vrije Universiteit Amsterdam;
Asia Slowinska, Lastline, Inc.; Herbert Bos, Vrije Universiteit Amsterdam
https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/andriesse

An In-Depth Analysis of Disassembly on Full-Scale x86/x64 Binaries

Dennis Andriesse, Xi Chen, Victor van der Veen, and Herbert Bos
Computer Science Institute and Amsterdam Department of Informatics, Vrije Universiteit Amsterdam
{d.a.andriesse,x.chen,v.vander.veen,h.j.bos}@vu.nl

Asia Slowinska
Lastline, Inc.
asia@lastline.com
1 Introduction
The capabilities and limitations of disassembly are not always clearly defined or understood, making it difficult for researchers and reviewers to judge the practical feasibility of techniques based on it. At the same time, disassembly is the backbone of research in static binary instrumentation [5, 19, 32], binary code lifting to LLVM IR (for reoptimization or analysis) [38], binary-level vulnerability search [27], and binary-level anti-exploitation systems [1, 8, 29, 46]. Disassembly is thus crucial for analyzing or securing untrusted or proprietary binaries, where source code is simply not available.
The accuracy of disassembly strongly depends on the type of binary under analysis. In the most general case, the disassembler can make very few assumptions on the structure of a binary—high-level concepts like functions and loops have no real significance at the binary level [3]. Moreover, the binary may contain complex constructs, such as overlapping or self-modifying code, or inline data in executable regions. This is especially true for obfuscated binaries, making disassembly of such binaries extremely challenging. Disassembly in general is undecidable [43]. On the other hand, one might expect that compilers emit code with more predictable properties, containing a limited set of patterns that the disassembler may try to identify.
Whether this is true is not well recognized, leading to a wide range of views on disassembly. These vary from the stance that disassembly of benign binaries is a solved problem [48], to the stance that complex cases are rampant [23]. It is unclear which view is justified in a given situation. The aim of our work is thus to study binary disassembly in a realistic setting, and more clearly delineate the capabilities of modern disassemblers.
It is clear from prior work that obfuscated code may complicate disassembly in a myriad of ways [18, 21]. We therefore limit our study to non-obfuscated binaries compiled on modern x86 and x64 platforms (the most common in binary analysis and security research). Specifically, we focus on binaries generated with the popular gcc, clang and Visual Studio compilers. We explore a wide variety of 981 realistic binaries, including stripped, optimized, statically linked, and link-time optimized binaries, as well as library code that includes handcrafted assembly. We disassemble these binaries using nine state-of-the-art research and industry disassemblers, studying their ability to recover all disassembly primitives commonly used in the literature: instructions, function start addresses, function signatures, Control Flow Graphs (CFGs) and callgraphs. In contrast, prior studies focus strongly on complex corner cases in isolation [23, 25]. Our results show that such cases are exceedingly rare, even in optimized code, and that focusing on them leads to an overly pessimistic view on disassembly.
We show that many disassembly primitives can be recovered with better accuracy than previously thought. For instance, instruction accuracy often approaches 100%, even using linear disassembly. On the other hand, we also identify some primitives which are more difficult to recover—most notably, function start information.
To facilitate a better match between the capabilities of disassemblers and the expectations in the literature, we comprehensively study all binary-based papers published in six major security conferences in the last three years. Ironically, this study shows a focus in the literature on rare complex constructs, while little attention is devoted to error handling for primitives that really are prone to inaccuracies. For instance, only 25% of Windows-targeted papers that rely on function information discuss potential inaccuracies, even though the accuracy of function detection regularly drops to 80% or less. Moreover, less than half of all papers implement mechanisms to deal with inaccuracies, even though in most cases errors can lead to malignant failures like crashes.
Contributions & Outline
The contributions of our work are threefold.
(1) We study disassembly on 981 full-scale compiler-generated binaries, to clearly define the true capabilities of modern disassemblers (Section 3) and the implications on binary-based research (Section 4).
(2) Our results allow researchers and reviewers to accurately judge future binary-based research—a task currently complicated by the myriad of differing opinions on the subject. To this end, we release all our raw results and ground truth for use in future evaluations of binary-based research (https://www.vusec.net/projects/disassembly/).
(3) We analyze the quality of all recent binary-based work published in six major security venues by comparing our results to the requirements and assumptions of this work (Section 5). This shows where disassembler capabilities and the literature are mismatched, and how this mismatch can be resolved moving forward (Section 6).
2 Evaluating Real-World Disassembly
This section outlines our disassembly evaluation approach. We discuss our results, and the implications on binary-based research, in Sections 3–4. Sections 5–6 discuss how closely expectations in the literature match our results.
2.1 Binary Test Suite
We focus our analysis on non-obfuscated x86 and x64 binaries generated with modern compilers. Our experiments are based on Linux (ELF) and Windows (PE) binaries, generated with the popular gcc v5.1.1, clang v3.7.0 and Visual Studio 2015 compilers—the most recent versions at the time of writing. The x86/x64 instruction set is the most common target in binary-based research. Moreover, x86/x64 is a variable-length instruction set, allowing unique constructs such as overlapping and “misaligned” instructions which can be difficult to disassemble. We exclude obfuscated binaries, as there is no doubt that they can wreak havoc on disassembler performance and we hardly need confirm this in our experiments.
We base our disassembly experiments on a test suite composed of the SPEC CPU2006 C and C++ benchmarks, the widely used and highly optimized glibc-2.22 library, and a set of popular server applications consisting of nginx v1.8.0, lighttpd v1.4.39, opensshd v7.1p2, vsftpd v3.0.3 and exim v4.86. This test suite has several properties which make it representative: (1) It contains a wide variety of realistic C and C++ binaries, ranging from very small to large; (2) These correspond to binaries used in evaluations of other work, making it easier to relate our results to the literature; (3) The tests include highly optimized library code, containing handwritten assembly and complex corner cases which regular applications do not; (4) SPEC CPU2006 compiles on both Linux and Windows, allowing a fair comparison of results between gcc, clang, and Visual Studio.
To study the impact of compiler options on disassembly, we compile the SPEC CPU2006 part of our test suite multiple times with a variety of popular configurations. Specifically: (1) Optimization levels O0, O1, O2 and O3 for gcc, clang and Visual Studio; (2) Optimization for size (Os) on gcc and clang; (3) Static linking and link-time optimization (-flto) on 64-bit gcc; (4) Stripped binaries, as well as binaries with symbols. We compile the servers for both x86 and x64 with gcc and clang, leaving all remaining settings at the Makefile defaults. Finally, we compile glibc-2.22 with 64-bit gcc, to which it is specifically tailored. In total, our test suite contains 981 binaries and shared objects.
2.2 Disassembly Primitives
We test all five common disassembly primitives used in
the literature (see Section 5). Some of these go well
beyond basic instruction recovery, and are only supported
by a subset of the disassemblers we test.
(1) Instructions: The pure assembly-level instructions.
(2) Function starts: Start addresses of the functions
originally defined in the source code.
(3) Function signatures: Parameter lists for functions
found by the disassembler.
(4) Control Flow Graph (CFG) accuracy: The soundness and completeness of the CFG digraphs G_cfg = (V_bb, E_cf), which describe how control flow edges E_cf ⊆ V_bb × V_bb connect the basic blocks V_bb. In practice, disassemblers deviate from the traditional CFG; typically by omitting indirect edges, and sometimes by defining a global CFG rather than per-function CFGs. Therefore, we define the Interprocedural CFG (ICFG): the union of all function-level CFGs, connected through interprocedural call and jump edges. This allows us to abstract from the disassemblers’ varying CFG definitions, by focusing our measurement on the coverage of basic blocks in the ICFG. We pay special attention to hard-to-resolve basic blocks, such as the heads of address-taken functions and switch/case blocks reached via jump tables.
(5) Callgraph accuracy: The correctness of the digraph G = (V_cs ∪ V_f, E_call) linking the set V_cs of call sites to the function starts V_f through call edges E_call ⊆ V_cs × V_f. Similarly to the CFG, disassemblers deviate from the traditional callgraph definition by including only direct call edges. In our experiments, we therefore measure the completeness of this direct callgraph, considering indirect calls and tailcalls separately in our complex case analysis.
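To make the block-coverage measurement concrete, here is a minimal Python sketch (not the evaluation code used for this paper) of an ICFG represented as plain sets, together with the completeness measure over its basic blocks; the names and the set-based representation are illustrative assumptions.

from typing import Iterable, Set, Tuple

def make_icfg(blocks: Iterable[int], edges: Iterable[Tuple[int, int]]):
    # The ICFG is the union of all function-level CFGs: basic block start
    # addresses plus intra- and interprocedural control flow edges.
    return set(blocks), set(edges)

def block_coverage(recovered: Set[int], ground_truth: Set[int]) -> float:
    # Fraction of ground-truth basic blocks that the disassembler recovered
    # (true positives over ground truth).
    return len(recovered & ground_truth) / len(ground_truth)

# Toy example: three of four ground-truth blocks recovered -> 0.75 coverage.
truth_blocks, _ = make_icfg([0x1000, 0x1010, 0x1024, 0x1030], [(0x1000, 0x1010)])
found_blocks, _ = make_icfg([0x1000, 0x1010, 0x1030], [(0x1000, 0x1010)])
print(block_coverage(found_blocks, truth_blocks))  # 0.75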
2.3 Complex Constructs
We also study the prevalence in real-world binaries of
complex corner cases which are often cited as particularly
harmful to disassembly [5, 23, 34].
(1) Overlapping/shared basic blocks: Basic blocks may be shared between different functions, hindering disassemblers from properly separating these functions.
(2) Overlapping instructions: Since x86/x64 uses variable-length instructions without any enforced memory alignment, jumps can target any offset within a multi-byte instruction. This allows the same code bytes to be interpreted as multiple overlapping instructions, some of which may be missed by disassemblers (see the sketch after this list).
(3) Inline data and jump tables: Data bytes may be mixed in with instructions in a code section. Examples of potential inline data include jump tables or local constants. Such data can cause false positive instructions, and can desynchronize the instruction stream if the last few data bytes are mistakenly interpreted as the start of a multi-byte instruction. Disassembly then continues parsing this instruction into the actual code bytes, losing track of the instruction stream alignment.
(4) Switches/case blocks: Switches are a challenge for
basic block discovery, because the switch case blocks are
typically indirect jump targets (encoded in jump tables).
(5) Alignment bytes: Some code (i.e., nop) or data bytes may have no semantic meaning, serving only to align other code for optimization of memory accesses. Alignment bytes may cause desynchronization if they do not encode valid instructions.
(6) Multi-entry functions: Functions may have multiple
basic blocks used as entry points, which can complicate
function start recognition.
[Figure 1: Disassembly methods. Arrows show disassembly flow; gray blocks show missed or corrupted code. Two panels, Recursive and Linear, show the same example: a function f0 whose block BB0 (cmp ecx, edx; jl BB2; jmp BB1) branches to BB1 and BB2, each of which loads a pointer from a function pointer table fptr and makes an indirect call (mov eax, [fptr+ecx]; call eax), plus inline data and indirectly reached functions f1 and f2.]
(7) Tail calls: In this common optimization, a function
ends not with a return, but with a jump to another function.
This makes it more difficult for disassemblers to detect
where the optimized function ends.
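To illustrate construct (2) concretely, the following self-contained Python snippet uses the Capstone decoder (the same decoder we use for ground truth in Section 2.5) to decode one hand-picked x86-64 byte sequence from two different offsets; the bytes and addresses are purely illustrative and do not come from our test suite.

from capstone import Cs, CS_ARCH_X86, CS_MODE_64

md = Cs(CS_ARCH_X86, CS_MODE_64)

# b8 00 83 c0 01 c3 decodes as "mov eax, 0x1c08300; ret" from offset 0,
# but as "add eax, 1; ret" if a jump lands at offset 2, inside the immediate.
code = bytes.fromhex("b80083c001c3")

for start in (0, 2):
    print(f"decoding from offset {start}:")
    for insn in md.disasm(code[start:], 0x1000 + start):
        print(f"  {insn.address:#x}: {insn.mnemonic} {insn.op_str}")

A linear sweep starting at offset 0 only ever produces the first stream; the overlapping add eax, 1 stream is visible only to a disassembler that follows a jump into the middle of the mov.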
2.4 Disassembly & Testing Environment
We conducted all disassembly experiments on an Intel Core i5 4300U machine with 8GB of RAM, running Ubuntu 15.04 with kernel 3.19.0-47. We compiled our gcc and clang test cases on this same machine. The Visual Studio binaries were compiled on an Intel Core i7 3770 machine with 8GB of RAM, running Windows 10.
We tested nine popular industry and research disassemblers: IDA Pro v6.7, Hopper v3.11.5, Dyninst v9.1.0 [5], BAP v0.9.9 [7], ByteWeight v0.9.9 [4], Jakstab v0.8.4 [17], angr v4.6.1.4 [36], PSI v1.1 [47] (the successor of BinCFI [48]), and objdump v2.22. ByteWeight yields only function starts, while Dyninst and PSI support only ELF binaries (for Dyninst, this is due to our Linux testing environment). Jakstab supports only x86 PE binaries. We omit angr results for x86, as angr is optimized for x64. PSI is based on objdump, with added error correction. Section 3 shows that PSI (and all linear disassemblers) perform equivalently to objdump; therefore, we group these under the name linear disassembly.
All others are recursive descent disassemblers, illustrated in Figure 1. These follow control flow to avoid desynchronization by inline data, and to discover complex cases like overlapping instructions. In contrast, linear disassemblers like objdump simply decode all code bytes consecutively, and may be confused by inline data, possibly causing garbled code like BB1 in the figure. Recursive disassemblers avoid this problem, but may miss indirect control flow targets, such as f1 and f2 in the figure.
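To make the two strategies concrete, here is a deliberately simplified Python sketch of both approaches on top of the Capstone decoder; the function names, the worklist logic, and the restriction to direct jump/call targets are assumptions for illustration, not the implementation of any disassembler evaluated in this paper.

from capstone import Cs, CS_ARCH_X86, CS_MODE_64, CS_GRP_JUMP, CS_GRP_CALL, CS_GRP_RET
from capstone.x86 import X86_OP_IMM

md = Cs(CS_ARCH_X86, CS_MODE_64)
md.detail = True  # needed for instruction groups and operand types

def linear_sweep(code: bytes, base: int):
    # objdump-style: decode every byte consecutively; inline data can
    # desynchronize the stream and produce garbled instructions.
    return {insn.address: insn for insn in md.disasm(code, base)}

def recursive_descent(code: bytes, base: int, entry_points):
    # Follow control flow from known entry points; never decodes inline data,
    # but misses code reachable only through indirect jumps or calls.
    seen, worklist = {}, list(entry_points)
    while worklist:
        addr = worklist.pop()
        while base <= addr < base + len(code) and addr not in seen:
            insn = next(md.disasm(code[addr - base:], addr), None)
            if insn is None:                   # undecodable bytes: abandon this path
                break
            seen[insn.address] = insn
            if insn.group(CS_GRP_JUMP) or insn.group(CS_GRP_CALL):
                op = insn.operands[0]
                if op.type == X86_OP_IMM:      # direct target: queue it
                    worklist.append(op.imm)
                if insn.mnemonic == "jmp":     # unconditional: no fall-through
                    break
            if insn.group(CS_GRP_RET):
                break
            addr = insn.address + insn.size    # fall through to next instruction
    return seen

On the example of Figure 1, linear_sweep would decode straight through the inline data and may garble BB1, while recursive_descent would recover BB0, BB1 and BB2 but never reach f1 and f2, whose addresses appear only in the fptr table.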
2.5 Ground Truth
Our disassembly experiments require precise ground truth on instructions, basic blocks and function starts, call sites, function signatures and switch/case addresses. This information is normally only available at the source level. Clearly, we cannot obtain our ground truth from any disassembler, as this would bias our experiments.
We base our ELF ground truth on information collected by an LLVM analysis pass, and on DWARF v3 debugging information. Specifically, we use LLVM to collect source-level information, such as the source lines belonging to functions and switch statements. We then compile our test binaries with DWARF information, and link the source-level line numbers to the binary-level addresses using the DWARF line number table. We also use DWARF information on function parameters for our function signature analysis. We strip the DWARF information from the binaries before our disassembly experiments.
The line number table provides a full mapping of source lines to binary, but not all instructions correspond directly to a source line. To find these instructions, we use Capstone v3.0.4 to start a conservative linear disassembly sweep from each known instruction address, stopping at control flow instructions unless we can guarantee the validity of their destination and fall-through addresses. For instance, the target of a direct unconditional jump instruction can be guaranteed, while its fall-through block cannot (as it might contain inline data).
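Purely as an illustration (our real ground-truth generation applies the validity rules described above, which this sketch simplifies by stopping at every control-flow instruction), such a conservative sweep could look as follows; known_addrs stands for the instruction addresses taken from the DWARF line number table.

from capstone import Cs, CS_ARCH_X86, CS_MODE_64, CS_GRP_JUMP, CS_GRP_CALL, CS_GRP_RET

md = Cs(CS_ARCH_X86, CS_MODE_64)
md.detail = True

def conservative_sweep(code: bytes, base: int, known_addrs):
    # Extend the set of known instruction start addresses by sweeping forward
    # from each seed, stopping at any control-flow instruction whose successors
    # are not guaranteed to be code (here: at every control-flow instruction).
    truth = set(known_addrs)
    for seed in sorted(known_addrs):
        for insn in md.disasm(code[seed - base:], seed):
            truth.add(insn.address)
            if insn.group(CS_GRP_JUMP) or insn.group(CS_GRP_CALL) or insn.group(CS_GRP_RET):
                break  # the fall-through or target might be inline data
    return truth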
This approach yields ground truth for over 98% of
code bytes in the tested binaries. We manually analyze
the remaining bytes, which are typically alignment code
unreachable by control flow. The result is a ground truth
file for each binary test case, which specifies the type of
each code byte, as well as instruction and function starts,
switch/case addresses, and function signatures.
We use a similar method for the Windows PE tests,
but based on information from PDB (Program Database)
files produced by Visual Studio instead of DWARF. This
produces files analogous to our ELF ground truth format.
We release all our ground truth files and our test suite,
to aid in future evaluations of binary-based research and
disassembly.
3 Disassembly Results
This section describes the results of our disassembly experiments, using the methodology outlined in Section 2. We first discuss application binaries (SPEC and servers), followed by a separate discussion on highly optimized libraries. Finally, we discuss the impact of static linking and link-time optimization. We release all our raw results, and present aggregated results here for space reasons.
3.1 Application Binaries
This section presents disassembly results for application
code. We discuss accuracy results for all primitives, and
also analyze the prevalence of complex cases.
3.1.1 SPEC CPU2006 Results
Figures 2a–2e show the accuracy for the SPEC CPU2006 C and C++ benchmarks of the recovered instructions, function starts, function signatures, CFGs and callgraphs, respectively. We show the percentage of correctly recovered (true positive) primitives for each tested compiler at optimization levels O0–O3. Note that the legend in Figure 2a applies to Figures 2a–2e. All lines are geometric mean results (simply referred to as “mean” from this point); arithmetic means and standard deviations are discussed in the text where they differ significantly. We show separate results for the C and C++ benchmarks, to expose variations in disassembly accuracy that may result from different code patterns.
Some disassemblers support only a subset of the tested primitives. For instance, linear disassembly provides only instructions, and IDA Pro is the only tested disassembler that provides function signatures. Moreover, some disassemblers only support a subset of the tested binary types, and are therefore only shown in the plots where they are applicable. For clarity, the graphs only show results for stripped binaries; our tests with standard symbols (not DWARF information) are discussed in the text.
3.1.1.1 Instruction boundaries
Figure 2a shows the percentage of correctly recovered instructions. Interestingly, linear disassembly consistently outperforms all other disassemblers, finding 100% of the instructions for gcc and clang binaries (without false positives), and 99.92% in the worst case for Visual Studio.
Linear disassembly. The perfect accuracy for linear disassembly with gcc and clang owes to the fact that these compilers never produce inline data, not even for jump tables. Instead, jump tables and other data are placed in the .rodata section.
Visual Studio does produce inline data, typically jump tables. This leads to some false positives with linear disassembly (data treated as code), amounting to a worst-case mean of 989 false positive instructions (0.56% of the disassembled code) for the x86 C++ tests at O3. The number of missed instructions (false negatives, due to desynchronization) is much lower, at a worst-case mean of 0.09%. This is because x86/x64 disassembly automatically resynchronizes within two or three instructions [21].
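This self-resynchronizing behavior is easy to reproduce with a decoder. In the hypothetical snippet below, a single stray data byte is prepended to a small hand-written instruction sequence; the linear decode starts out misaligned but realigns with the true instruction starts within a couple of instructions (all byte values and addresses are illustrative, not taken from our test binaries).

from capstone import Cs, CS_ARCH_X86, CS_MODE_64

md = Cs(CS_ARCH_X86, CS_MODE_64)

# True code: push rbp; mov rbp, rsp; mov eax, 0; pop rbp; ret
code = bytes.fromhex("554889e5b8000000005dc3")
stray = b"\x01"  # one inline-data byte placed before the real code

true_starts = {0x1001, 0x1002, 0x1005, 0x100a, 0x100b}  # real instruction starts
for insn in md.disasm(stray + code, 0x1000):
    status = "aligned" if insn.address in true_starts else "desynced"
    print(f"{insn.address:#x}: {insn.mnemonic} {insn.op_str}  [{status}]")

The first two decoded instructions are bogus, but from the third one (mov eax, 0 at 0x1005) the decode is back in sync, which is why the false-negative rates above stay so low even when inline data is misclassified as code.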

Citations

Ramblr: Making Reassembly Great Again.

TL;DR: This paper presents a new systematic approach to binary reassembly, implemented in a tool called Ramblr, which successfully reassembles most of the tested binaries and improves over the state-of-the-art approach.

Neural Nets Can Learn Function Type Signatures From Binaries

TL;DR: EKLAVYA is a new system that trains a recurrent neural network to recover function type signatures from disassembled binary code; it generalizes well across the compilers tested, on two different instruction sets and various optimization levels, without any specialized prior knowledge of the instruction set, compiler, or optimization level.

SPAIN: security patch analysis for binaries towards understanding the pain and pills

TL;DR: SPAIN is a scalable binary-level patch analysis framework that automatically identifies security patches, summarizes patch patterns and their corresponding vulnerability patterns, and can be used to search for similar patches or vulnerabilities in binary programs.

RetroWrite: Statically Instrumenting COTS Binaries for Fuzzing and Sanitization

TL;DR: RetroWrite is a binary-rewriting instrumentation framework supporting American Fuzzy Lop and Address Sanitizer; it is shown to achieve compiler-level performance while retaining precision, and to be identical in performance to compiler-instrumented binaries.

Compiler-Agnostic Function Detection in Binaries

TL;DR: Nucleus is a novel function detection algorithm for binaries that is compiler-agnostic, and does not require any learning phase or signature information, making it inherently suitable for difficult cases such as non-contiguous or multi-entry functions.
References

Obfuscation of executable code to improve resistance to static disassembly

TL;DR: Experimental results indicate that significant portions of executables that have been obfuscated using the techniques described are disassembled incorrectly, thereby showing the efficacy of the methods.

Practical Control Flow Integrity and Randomization for Binary Executables

TL;DR: CCFIR (Compact Control Flow Integrity and Randomization) is a new practical and realistic protection method that addresses the main barriers to CFI adoption and can stop control-flow hijacking attacks, including ROP and return-into-libc.

Control flow integrity for COTS binaries

TL;DR: This is the first work to apply CFI to complex shared libraries such as glibc; it is effective against control-flow hijack attacks and eliminates the vast majority of ROP gadgets.

BAP: a binary analysis platform

TL;DR: BAP explicitly represents all side effects of instructions in an intermediate language (IL), making syntax-directed analysis possible; it is routinely used to generate and solve verification conditions that are hundreds of megabytes in size and encompass hundreds of thousands of assembly instructions.

Static disassembly of obfuscated binaries

TL;DR: Novel binary analysis techniques are presented that substantially improve the success of the disassembly process when confronted with obfuscated binaries, and a large fraction of the program's instructions can be correctly identified.