
Finding and understanding bugs in C compilers

04 Jun 2011 · Vol. 46, Iss. 6, pp. 283–294
TL;DR: The authors created Csmith, a randomized test-case generation tool, spent three years using it to find compiler bugs, and present a collection of qualitative and quantitative results about the bugs it found.
Abstract: Compilers should be correct. To improve the quality of C compilers, we created Csmith, a randomized test-case generation tool, and spent three years using it to find compiler bugs. During this period we reported more than 325 previously unknown bugs to compiler developers. Every compiler we tested was found to crash and also to silently generate wrong code when presented with valid input. In this paper we present our compiler-testing tool and the results of our bug-hunting study. Our first contribution is to advance the state of the art in compiler testing. Unlike previous tools, Csmith generates programs that cover a large subset of C while avoiding the undefined and unspecified behaviors that would destroy its ability to automatically find wrong-code bugs. Our second contribution is a collection of qualitative and quantitative results about the bugs we have found in open-source C compilers.

Summary (5 min read)

1. Introduction

  • The theory of compilation is well developed, and there are compiler frameworks in which many optimizations have been proved correct.
  • It should be no surprise that optimizing compilers—like all complex software systems—contain bugs.

  • Figure 1 (reproduced in the full text below) shows a representative wrong-code bug: a heavily patched GCC shipped with Ubuntu compiled a function comparing a signed char holding 1 against an unsigned char holding 255 to return 1, when the correct result is 0.

  • The authors created Csmith, a randomized test-case generator that supports compiler bug-hunting using differential testing.
  • For the past three years, the authors have used Csmith to discover bugs in C compilers.
  • This is a significant problem for complex systems.
  • Large-scale source-code verification efforts such as the seL4 OS kernel [12] and Airbus’s verification of fly-by-wire software [24] can be undermined by an incorrect C compiler.

2. Csmith

  • Csmith began as a fork of Randprog [27], an existing random C program generator about 1,600 lines long.
  • The authors' previous paper showed that in many cases, these bugs could be worked around by turning volatile-object accesses into calls to helper functions.
  • For some test programs generated by Randprog, their rewriting procedure was insufficient to correct a defect that the authors had found in the C compiler.
  • The authors turned Randprog into Csmith, a 40,000-line C++ program for randomly generating C programs.
  • Most of Csmith’s complexity arises from the requirement that it interleave static analysis with code generation in order to produce meaningful test cases, as described below.

2.1 Randomized Differential Testing using Csmith

  • Random testing [9], also called fuzzing [17], is a black-box testing method in which test inputs are generated randomly.
  • Randomized differential testing [16] has the advantage that no oracle for test results is needed.
  • It exploits the idea that if one has multiple, deterministic implementations of the same specification, all implementations must produce the same result from the same valid input.
  • When two implementations produce different outputs, one of them must be faulty.
  • Given three or more implementations, a tester can use voting to heuristically determine which implementations are wrong.
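
As a minimal sketch of the voting step (the checksum values and the three-compiler setup below are invented for illustration, not part of the paper's harness):

  #include <stdio.h>

  /* Given the checksums printed by binaries built from the same
     generated program by three different compilers, report which
     compiler (if any) disagrees with the majority. */
  static int vote(unsigned long c0, unsigned long c1, unsigned long c2)
  {
      if (c0 == c1 && c1 == c2) return -1; /* all agree: no bug observed */
      if (c0 == c1) return 2;              /* compiler 2 is the minority */
      if (c0 == c2) return 1;
      if (c1 == c2) return 0;
      return -2;                           /* three-way split: inspect by hand */
  }

  int main(void)
  {
      int suspect = vote(0xdeadbeefUL, 0xdeadbeefUL, 0x12345678UL);
      if (suspect >= 0)
          printf("compiler %d is probably miscompiling the test\n", suspect);
      return 0;
  }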

2.2 Design Goals

  • First and most important, every generated program must be well formed and have a single meaning according to the C standard.
  • The C99 language [11] has 191 undefined behaviors—e.g., dereferencing a null pointer or overflowing a signed integer—that destroy the meaning of a program.
  • Programs emitted by Csmith must avoid all of these behaviors or, in certain cases such as argument-evaluation order, be independent of the choices that will be made by the compiler (see the example after this list).
  • Section 2.4 describes the hazards that Csmith must avoid and its strategies for avoiding them.
  • Csmith’s second design goal is to maximize expressiveness subject to constraints imposed by the first goal.
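
For example (our illustrative program, not Csmith output), the function below executes undefined behavior when x == INT_MAX, so its meaning is destroyed; as the full text notes, some compilers fold the comparison to 1 while also evaluating INT_MAX+1 to INT_MIN:

  #include <limits.h>
  #include <stdio.h>

  /* Undefined when x == INT_MAX: x + 1 overflows a signed int.
     An optimizer may fold the comparison to 1, while a naive
     evaluation at x == INT_MAX wraps to INT_MIN and yields 0,
     so the program has no single meaning. */
  static int plus_one_greater(int x) { return (x + 1) > x; }

  int main(void)
  {
      printf("%d\n", plus_one_greater(INT_MAX)); /* compiler-dependent */
      return 0;
  }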

2.3 Randomly Generating Programs

  • Assignments are modeled as statements—not expressions—which reflects the most common idiom for assignments in C code.
  • Third, the local environment carries points-to facts about all in-scope pointers.
  • After choosing a production from the table, Csmith executes the filter, which decides if the choice is acceptable in the current context.
  • It calls a function to generate the program fragment that corresponds to the nonterminal production.
  • Thus, when the top-level function has been completely generated, Csmith is finished.

2.4 Safety Mechanisms

  • Table 1 lists the mechanisms that Csmith uses to avoid generating C programs that execute undefined behaviors or depend on unspecified behaviors.
  • Integer safety: more and more, compilers are aggressively exploiting the undefined nature of integer behaviors such as signed overflow and shift-past-bitwidth.
  • This was not difficult, but had a few tricky aspects.
  • The aspect of C’s type system that required the most care was qualifier safety: ensuring that const and volatile qualifiers attached to pointers at various levels of indirection are not removed by implicit casts (see the example after this list).
  • As fragments of code are generated, Csmith tests if the new code has a read/write or write/write conflict with the current effect.
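
As an example of the qualifier-safety hazard (our code, not Csmith output):

  /* Qualifier-dropping that Csmith's static analysis must rule out:
     referring to a volatile-qualified object through a plain int
     lvalue is undefined behavior (C99 6.7.3). The cast makes the
     qualifier loss explicit here; Csmith must also prevent the
     implicit-conversion variants. */
  volatile int v = 0;

  int main(void)
  {
      int *p = (int *)&v; /* volatile silently discarded */
      *p = 1;             /* undefined behavior */
      return 0;
  }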

2.5 Efficient Global Safety

  • Loops and function calls threaten to invalidate previously validated code.
  • Consider the following code, in which Csmith has just added the loop back-edge at line 7:

1 int i;
2 int *p = &i;
3 while (...) {
4 *p = 3;
5 ...
6 p = 0;
7 }

  • The newly added line 7 makes line 4 unsafe, due to the back-edge carrying a null-valued p.
  • The authors therefore restrict the analysis to local scope except when function calls and loops are involved.
  • The global fixpoint analysis is run when a loop is closed by adding its back-edge.
  • If Csmith finds that the program contains unsafe statements, it deletes code starting from the tail of the loop until the program becomes globally safe.
  • This strategy is about three times faster than pessimistically running the global dataflow analysis before adding every piece of code.

2.6 Design Trade-offs

  • An ideally portable test program would be “strictly conforming” to the C language standard.
  • In summary, despite the fact that Knight and Leveson [13] found a substantial number of correlated errors in an experiment on N-version programming, Csmith has yielded no evidence of correlated failures among unrelated C compilers.
  • It is not difficult to generate random programs that always terminate.
  • The authors' goal was to make the generated programs “look right”—to contain a balanced mix of arithmetic and bitwise operations, of references to scalars and aggregates, of loops and straight-line code, of single-level and multi-level indirections, and so on.
  • In summary, many aspects of Csmith’s design and implementation were informed by their understanding of how modern compilers work and how they break.

3. Results

  • The authors conducted five experiments using Csmith, their random program generator.
  • The authors' first experiment was uncontrolled and unstructured: over a three-year period, the authors opportunistically found and reported bugs in a variety of C compilers. (§3.1)
  • In the second experiment, the authors compiled and ran one million random programs using several years’ worth of versions of GCC and LLVM, to understand how their robustness is evolving over time. (§3.2)
  • The authors found that these metrics did not significantly improve when they added randomly generated programs to the compilers’ existing test suites.
  • Nevertheless, as shown by their other results, Csmith-generated programs allowed us to discover bugs that are missed by the compilers’ standard test suites.

3.1 Opportunistic Bug Finding

  • Five of these compilers (GCC, LLVM, CIL, TCC, and Open64) were open source and five were commercial products.
  • Errors that manifest at run time include the computation of a wrong result; a crash or other abnormal termination of the generated code; termination of a program that should have executed forever; and non-termination of a program that should have terminated.
  • Thus, for the most part, the authors simply tested these compilers until they found a few crash errors and a few wrong-code errors, reported them, and moved on.
  • Both the GCC and LLVM teams were responsive to their bug reports.
  • The second reason the authors prefer dealing with open-source compilers is that their development process is transparent: they can watch the mailing lists, participate in discussions, and see fixes as they are committed.

  • The paper shows a small function, int bar (unsigned x), whose miscompilation revealed one of the CompCert bugs (full listing elided in this summary).

  • This bug and five others like it were in CompCert’s unverified front-end code.
  • Here, a large PowerPC stack frame is being allocated.
  • CompCert’s PPC semantics failed to specify a constraint on the width of this immediate value, on the assumption that the assembler would catch out-of-range values.
  • The striking thing about their CompCert results is that the middle-end bugs the authors found in all other compilers are absent.
  • This is not for lack of trying: the authors have devoted about six CPU-years to the task.

3.2 Quantitative Comparison of GCC and LLVM Versions

  • Running these tests took about 1.5 weeks on 20 machines in the Utah Emulab testbed [28].
  • (Note that the y-axes of these graphs are logarithmic.)
  • These graphs also indicate the number of crash bugs that were fixed in response to their bug reports.
  • The middle row of graphs in Figure 3 shows the number of distinct assertion failures in LLVM and the number of distinct internal compiler errors in GCC induced by their tests.

3.3 Bug-Finding Performance as a Function of Test-Case Size

  • There are many ways in which a random test-case generator might be “tuned” for particular goals, e.g., to focus on certain kinds of compiler defects.
  • Other factors being equal, small test cases are preferable because they are closer to being reportable to compiler developers.
  • The authors repeated this for various ranges of test-input sizes.
  • First, throughput is increased because compiler startup costs are better amortized.

3.4 Bug-Finding Performance Compared to Other Tools

  • The generators ran for one week on otherwise-idle machines, using one CPU on each host.
  • Each generator repeatedly produced programs that the authors compiled and tested using the same compilers and optimization options that were used for the experiments in Section 3.2.
  • Figure 5 plots the cumulative number of distinct crash errors found by these program generators during the one-week test.

3.5 Code Coverage

  • Because the authors find many bugs, they hypothesized that randomly generated programs exercise large parts of the compilers that were not covered by existing test suites.
  • To test this, the authors enabled code-coverage monitoring in GCC and LLVM.
  • The authors then used each compiler to build its own test suite, and also to build its test suite plus 10,000 Csmith-generated programs.
  • The authors best guess is that these metrics are too shallow to capture Csmith’s effects, and that the authors would generate useful additional coverage in terms of deeper metrics such as path or value coverage.

3.6 Where Are the Bugs?

  • Table 4 characterizes the GCC and LLVM bugs the authors found by compiler part.
  • Tables 5 and 6 show the ten buggiest files in LLVM and GCC as measured by their experiment in Section 3.1.
  • Most of the bugs the authors found in GCC were in the middle end: the machine-independent optimizers.
  • LLVM is a younger compiler and the authors' testing shook out some front-end and back-end bugs that would probably not be present in a more mature software base.

3.7 Examples of Wrong-Code Bugs

  • This section characterizes a few of the bugs that were revealed by miscompilation of programs generated by Csmith.
  • These bugs fit into a simple model in which optimizations are structured as: analysis; if (safety check) { transformation }.
  • If x is a variable and c1 and c2 are constants, the expression (x/c1)!=c2 can be profitably rewritten as (x-(c1*c2))>(c1-1), using unsigned arithmetic to avoid problems with negative values (see the sketch at the end of this section).
  • Prior to performing the transformation, expressions such as c1*c2 and (c1*c2)+(c1-1) are checked for overflow.
  • The authors found a bug that caused GCC to miscompile code in which a static pointer p is initialized to &g[0] (full listing elided in this summary).

  • The problem occurred when the compiler failed to recognize that p and q are aliases; this happened because q was mistakenly identified as a read-only memory location, which is defined not to alias a mutable location.
  • The wrong not-alias fact caused the store in line 7 to be marked as dead so that a subsequent dead-store elimination pass removed it.
  • A version of GCC also miscompiled another function (listing elided in this summary).
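
As the sketch promised above, the following program exhaustively checks the division rewrite on a small range. The bounds are our choice, picked so that c1*c2 + (c1-1) cannot overflow, mirroring the safety check a compiler must perform before applying the transformation:

  #include <stdio.h>

  /* Checks that, in unsigned arithmetic,
       (x / c1) != c2   is equivalent to   (x - c1*c2) > (c1 - 1).
     When x < c1*c2, the subtraction wraps to a huge value, so the
     comparison correctly reports inequality. */
  int main(void)
  {
      unsigned c1, c2, x;
      for (c1 = 1; c1 <= 16; c1++)
          for (c2 = 0; c2 <= 16; c2++)
              for (x = 0; x <= 4096; x++) {
                  int orig = (x / c1) != c2;
                  int rewr = (x - c1 * c2) > (c1 - 1);
                  if (orig != rewr) {
                      printf("mismatch: x=%u c1=%u c2=%u\n", x, c1, c2);
                      return 1;
                  }
              }
      printf("rewrite verified on the tested range\n");
      return 0;
  }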

4. Discussion

  • One might suspect that random testing finds bugs that do not matter in practice.
  • Undoubtedly this happens sometimes, but in a number of instances the authors have direct confirmation that Csmith is finding bugs that matter, because bugs that they have found and reported have been independently rediscovered and re-reported by application developers.
  • By a very conservative estimate—counting only the times that a compiler.

6. Conclusion

  • Using randomized differential testing, the authors found and reported hundreds of previously unknown bugs in widely used C compilers, both commercial and open source.
  • Most of their reported defects have been fixed, meaning that compiler implementers found them important enough to track down, and 25 of the bugs the authors reported against GCC were classified as release-blocking.
  • To create a random program generator with high bug-finding power, the key problem the authors solved was the expressive generation of C programs that are free of undefined behavior and independent of unspecified behavior.
  • The incremental cost of a new bug that the authors find today is much lower.
  • Software Csmith is open source and available for download at http://embed.cs.utah.edu/csmith/.


Finding and Understanding Bugs in C Compilers
Xuejun Yang Yang Chen Eric Eide John Regehr
University of Utah, School of Computing
{ jxyang, chenyang, eeide, regehr }@cs.utah.edu
Abstract
Compilers should be correct. To improve the quality of C compilers,
we created Csmith, a randomized test-case generation tool, and
spent three years using it to find compiler bugs. During this period
we reported more than 325 previously unknown bugs to compiler
developers. Every compiler we tested was found to crash and also
to silently generate wrong code when presented with valid input.
In this paper we present our compiler-testing tool and the results
of our bug-hunting study. Our first contribution is to advance the
state of the art in compiler testing. Unlike previous tools, Csmith
generates programs that cover a large subset of C while avoiding the
undefined and unspecified behaviors that would destroy its ability
to automatically find wrong-code bugs. Our second contribution is a
collection of qualitative and quantitative results about the bugs we
have found in open-source C compilers.
Categories and Subject Descriptors D.2.5 [Software Engineering]: Testing and Debugging—testing tools; D.3.2 [Programming Languages]: Language Classifications—C; D.3.4 [Programming Languages]: Processors—compilers
General Terms Languages, Reliability
Keywords compiler testing, compiler defect, automated testing, random testing, random program generation
1. Introduction
The theory of compilation is well developed, and there are compiler frameworks in which many optimizations have been proved correct. Nevertheless, the practical art of compiler construction involves a morass of trade-offs between compilation speed, code quality, code debuggability, compiler modularity, compiler retargetability, and other goals. It should be no surprise that optimizing compilers—like all complex software systems—contain bugs.
Miscompilations often happen because optimization safety checks are inadequate, static analyses are unsound, or transformations are flawed. These bugs are out of reach for current and future automated program-verification tools because the specifications that need to be checked were never written down in a precise way, if they were written down at all. Where verification is impractical, however, other methods for improving compiler quality can succeed. This paper reports our experience in using testing to make C compilers better.
© ACM, 2011. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Proceedings of the 2011 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), San Jose, CA, Jun. 2011, http://doi.acm.org/10.1145/NNNNNNN.NNNNNNN
1 int foo (void) {
2 signed char x = 1;
3 unsigned char y = 255;
4 return x > y;
5 }
Figure 1. We found a bug in the version of GCC that shipped with Ubuntu Linux 8.04.1 for x86. At all optimization levels it compiles this function to return 1; the correct result is 0. The Ubuntu compiler was heavily patched; the base version of GCC did not have this bug.
We created Csmith, a randomized test-case generator that supports compiler bug-hunting using differential testing. Csmith generates a C program; a test harness then compiles the program using several compilers, runs the executables, and compares the outputs. Although this compiler-testing approach has been used before [6, 16, 23], Csmith's test-generation techniques substantially advance the state of the art by generating random programs that are expressive—containing complex code using many C language features—while also ensuring that every generated program has a single interpretation. To have a unique interpretation, a program must not execute any of the 191 kinds of undefined behavior, nor depend on any of the 52 kinds of unspecified behavior, that are described in the C99 standard.
For the past three years, we have used Csmith to discover bugs in C compilers. Our results are perhaps surprising in their extent: to date, we have found and reported more than 325 bugs in mainstream C compilers including GCC, LLVM, and commercial tools. Figure 1 shows a representative example. Every compiler that we have tested, including several that are routinely used to compile safety-critical embedded systems, has been crashed and also shown to silently miscompile valid inputs. As measured by the responses to our bug reports, the defects discovered by Csmith are important. Most of the bugs we have reported against GCC and LLVM have been fixed. Twenty-five of our reported GCC bugs have been classified as P1, the maximum, release-blocking priority for GCC defects. Our results suggest that fixed test suites—the main way that compilers are tested—are an inadequate mechanism for quality control.
We claim that Csmith is an effective bug-finding tool in part because it generates tests that explore atypical combinations of C language features. Atypical code is not unimportant code, however; it is simply underrepresented in fixed compiler test suites. Developers who stray outside the well-tested paths that represent a compiler's "comfort zone"—for example by writing kernel code or embedded systems code, using esoteric compiler options, or automatically generating code—can encounter bugs quite frequently. This is a significant problem for complex systems. Wolfe [30], talking about independent software vendors (ISVs), says: "An ISV with a complex code can work around correctness, turn off the optimizer in one or two files, and usually they have to do that for any of the compilers they use" (emphasis ours). As another example, the front page of the Web site for GMP, the GNU Multiple Precision Arithmetic Library, states, "Most problems with compiling GMP these days are due to problems not in GMP, but with the compiler."
Improving the correctness of C compilers is a worthy goal: C code is part of the trusted computing base for almost every modern computer system including mission-critical financial servers and life-critical pacemaker firmware. Large-scale source-code verification efforts such as the seL4 OS kernel [12] and Airbus's verification of fly-by-wire software [24] can be undermined by an incorrect C compiler. The need for correct compilers is amplified because operating systems are almost always written in C and because C is used as a portable assembly language. It is targeted by code generators from a wide variety of high-level languages including Matlab/Simulink, which is used to generate code for industrial control systems.
Despite recent advances in compiler verification, testing is still needed. First, a verified compiler is only as good as its specification of the source and target language semantics, and these specifications are themselves complex and error-prone. Second, formal verification seldom provides end-to-end guarantees: "details" such as parsers, libraries, and file I/O usually remain in the trusted computing base. This second point is illustrated by our experience in testing CompCert [14], a verified C compiler. Using Csmith, we found previously unknown bugs in unproved parts of CompCert—bugs that cause this compiler to silently produce incorrect code.
Our goal was to discover serious, previously unknown bugs:
  • in mainstream C compilers like GCC and LLVM;
  • that manifest when compiling core language constructs such as arithmetic, arrays, loops, and function calls;
  • targeting ubiquitous architectures such as x86 and x86-64; and
  • using mundane optimization flags such as –O and –O2.
This paper reports our experience in achieving this goal. Our first contribution is to advance the state of the art in compiler test-case generation, finding—as far as we know—many more previously unknown compiler bugs than any similar effort has found. Our second contribution is to qualitatively and quantitatively characterize the bugs found by Csmith: What do they look like? In what parts of the compilers are they primarily found? How are they distributed across a range of compiler versions?
2. Csmith
Csmith began as a fork of Randprog [27], an existing random C program generator about 1,600 lines long. In earlier work, we extended and adapted Randprog to find bugs in C compilers' translation of accesses to volatile-qualified objects [6], resulting in a 7,000-line program. Our previous paper showed that in many cases, these bugs could be worked around by turning volatile-object accesses into calls to helper functions. The key observation was this: while the rules regarding the addition, elimination, and reordering of accesses to volatile objects are not at all like the rules governing ordinary variable accesses in C, they are almost identical to the rules governing function calls.
For some test programs generated by Randprog, our rewriting procedure was insufficient to correct a defect that we had found in the C compiler. Our hypothesis was that this was always due to "regular" compiler bugs not related to the volatile qualifier. To investigate these compiler defects, we shifted our research emphasis toward looking for generic wrong-code bugs. We turned Randprog into Csmith, a 40,000-line C++ program for randomly generating C programs. Compared to Randprog, Csmith can generate C programs that utilize a much wider range of C features including complex control flow and data structures such as pointers, arrays, and structs.
Most of Csmith's complexity arises from the requirement that it interleave static analysis with code generation in order to produce meaningful test cases, as described below.

Figure 2. Finding bugs in three compilers using randomized differential testing. (The figure shows Csmith feeding one program to three compilers; the resulting executables are run, their outputs compared, and a minority result is flagged as a bug.)
2.1 Randomized Differential Testing using Csmith
Random testing [9], also called fuzzing [17], is a black-box testing method in which test inputs are generated randomly. Randomized differential testing [16] has the advantage that no oracle for test results is needed. It exploits the idea that if one has multiple, deterministic implementations of the same specification, all implementations must produce the same result from the same valid input. When two implementations produce different outputs, one of them must be faulty. Given three or more implementations, a tester can use voting to heuristically determine which implementations are wrong. Figure 2 shows how we use these ideas to find compiler bugs.
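
As a simplified illustration, the harness below compiles one generated file with three compilers, runs the three binaries, and compares the checksum lines they print. The compiler commands, file names, and buffer sizes are our assumptions; a production harness would also need timeouts, crash detection, and test-case reduction.

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* Compile test.c with each compiler, run the binaries, and compare
     the single checksum line each one prints. */
  int main(void)
  {
      const char *cc[] = { "gcc -O2", "clang -O2", "tcc" }; /* assumed */
      char out[3][128];
      for (int i = 0; i < 3; i++) {
          char cmd[256];
          snprintf(cmd, sizeof cmd, "%s -o test%d test.c && ./test%d > out%d.txt",
                   cc[i], i, i, i);
          if (system(cmd) != 0) { printf("compiler %d failed\n", i); return 1; }
          snprintf(cmd, sizeof cmd, "out%d.txt", i);
          FILE *f = fopen(cmd, "r");
          if (!f || !fgets(out[i], sizeof out[i], f)) return 1;
          fclose(f);
      }
      if (strcmp(out[0], out[1]) || strcmp(out[1], out[2]))
          printf("outputs differ: possible wrong-code bug\n");
      else
          printf("all compilers agree\n");
      return 0;
  }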
2.2 Design Goals
Csmith has two main design goals. First and most important, every generated program must be well formed and have a single meaning according to the C standard. The meaning of a C program is the sequence of side effects it performs. The principal side effect of a Csmith-generated program is to print a value summarizing the computation performed by the program.[1] This value is a checksum of the program's non-pointer global variables at the end of the program's execution. Thus, if changing the compiler or compiler options causes the checksum emitted by a Csmith-generated program to change, a compiler bug has been found.
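
To make this concrete, here is a sketch of the epilogue a generated program might end with; the global variables and the hash are invented for illustration, and Csmith's real checksum routine differs in detail:

  #include <stdio.h>

  /* Hypothetical non-pointer globals of a generated program. */
  static int g1 = 3;
  static unsigned short g2 = 0xbeef;
  static long g3[2] = { -1L, 7L };

  static unsigned long crc; /* running checksum */

  static void checksum(unsigned long v)
  {
      crc = crc * 31 + v;   /* simple hash; stands in for Csmith's CRC */
  }

  int main(void)
  {
      /* ...randomly generated computation would run here... */
      checksum((unsigned long)g1);
      checksum((unsigned long)g2);
      checksum((unsigned long)g3[0]);
      checksum((unsigned long)g3[1]);
      printf("checksum = %lx\n", crc);
      return 0;
  }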
The C99 language [11] has 191 undefined behaviors—e.g., dereferencing a null pointer or overflowing a signed integer—that destroy the meaning of a program. It also has 52 unspecified behaviors—e.g., the order of evaluation of arguments to a function—where a compiler may choose from a set of options with no requirement that the choice be made consistently. Programs emitted by Csmith must avoid all of these behaviors or, in certain cases such as argument-evaluation order, be independent of the choices that will be made by the compiler. Many undefined and unspecified behaviors can be avoided structurally by generating programs in such a way that problems never arise. However, a number of important undefined and unspecified behaviors are not easy to avoid in a structural fashion. In these cases, Csmith solves the problem using static analysis and by adding run-time checks to the generated code. Section 2.4 describes the hazards that Csmith must avoid and its strategies for avoiding them.
Csmith's second design goal is to maximize expressiveness subject to constraints imposed by the first goal. An "expressive" generator supports many language features and combinations of features. Our hypothesis is that expressiveness is correlated with bug-finding power.

[1] Accesses to volatile objects are also side effects as described in the C standard. We do not discuss these "secondary" side effects of Csmith-generated programs further in this paper.

Csmith creates programs with the following features:
  • function definitions, and global and local variable definitions
  • most kinds of C expressions and statements
  • control flow: if/else, function calls, for loops, return, break, continue, goto
  • signed and unsigned integers of all standard widths
  • arithmetic, logical, and bitwise operations on integers
  • structs: nested, and with bit-fields
  • arrays of and pointers to all supported types, including pointers and arrays
  • the const and volatile type qualifiers, including at different levels of indirection for pointer-typed variables

The most important language features not currently supported by Csmith are strings, dynamic memory allocation, floating-point types, unions, recursion, and function pointers. We plan to add some of these features to future versions of our tool.
2.3 Randomly Generating Programs
The shape of a program generated by Csmith is governed by a grammar for a subset of C. A program is a collection of type, variable, and function definitions; a function body is a block; a block contains a list of declarations and a list of statements; and a statement is an expression, control-flow construct (e.g., if, return, goto, or for), assignment, or block. Assignments are modeled as statements—not expressions—which reflects the most common idiom for assignments in C code. We leverage our grammar to produce other idiomatic code as well: in particular, we include a statement kind that represents a loop iterating over an array. The grammar is implemented by a collection of hand-coded C++ classes.
Csmith maintains a global environment that holds top-level definitions: i.e., types, global variables, and functions. The global environment is extended as new entities are defined during program generation. To hold information relevant to the current program-generation point, Csmith also maintains a local environment with three primary kinds of information. First, the local environment describes the current call chain, supporting context-sensitive pointer analysis. Second, it contains effect information describing objects that may have been read or written since (1) the start of the current function, (2) the start of the current statement, and (3) the previous sequence point.[2] Third, the local environment carries points-to facts about all in-scope pointers. These elements and their roles in program generation are further described in Section 2.4.

[2] As explained in Section 3.8 of the C FAQ [25], "A sequence point is a point in time at which the dust has settled and all side effects which have been seen so far are guaranteed to be complete." The sequence points listed in the C standard are: at the end of the evaluation of a full expression (a full expression is an expression statement, or any other expression which is not a subexpression within any larger expression); at the ||, &&, ?:, and comma operators; and at a function call (after the evaluation of all the arguments, and just before the actual call).
Csmith begins by randomly creating a collection of struct type declarations. For each, it randomly decides on a number of members and the type of each member. The type of a member may be a (possibly qualified) integral type, a bit-field, or a previously generated struct type.
After the preliminary step of producing type definitions, Csmith begins to generate C program code. Csmith generates a program top-down, starting from a single function called by main. Each step of the program generator involves the following sub-steps:
1. Csmith randomly selects an allowable production from its grammar for the current program point. To make the choice, it consults a probability table and a filter function specific to the current point: there is a table/filter pair for statements, another for expressions, and so on. The table assigns a probability to each of the alternatives, where the sum of the probabilities is one. After choosing a production from the table, Csmith executes the filter, which decides if the choice is acceptable in the current context. Filters enforce basic semantic restrictions (e.g., continue can only appear within a loop), user-controllable limits (e.g., maximum statement depth and number of functions), and other user-controllable options. If the filter rejects the selected production, Csmith simply loops back, making selections from the table until the filter succeeds. (A sketch of this table-plus-filter loop appears after this list of sub-steps.)
2. If the selected production requires a target—e.g., a variable or function—then the generator randomly selects an appropriate target or defines a new one. In essence, Csmith dynamically constructs a probability table for the potential targets and includes an option to create a new target. Function and variable definitions are thus created "on demand" at the time that Csmith decides to refer to them.
3. If the selected production allows the generator to select a type, Csmith randomly chooses one. Depending on the current context, the choice may be restricted (e.g., while generating the operands of an integral-typed expression) or unrestricted (e.g., while generating the types of parameters to a new function). Random choices are guided by the grammar, probability tables, and filters as already described.
4. If the selected production is nonterminal, the generator recurses. It calls a function to generate the program fragment that corresponds to the nonterminal production. More generally, Csmith recurses for each nonterminal element of the current production: e.g., for each subcomponent of a compound statement, or for each parameter in a function call.
5. Csmith executes a collection of dataflow transfer functions. It passes the points-to facts from the local environment to the transfer functions, which produce a new set of points-to facts. Csmith updates the local environment with these facts.
6. Csmith executes a collection of safety checks. If the checks succeed, the new code fragment is committed to the generated program. Otherwise, the fragment is dropped and any changes to the local environment are rolled back.
When Csmith creates a call to a new function—one whose body does not yet exist—generation of the current function is suspended until the new function is finished. Thus, when the top-level function has been completely generated, Csmith is finished. At that point it pretty-prints all of the randomly generated definitions in an appropriate order: types, globals, prototypes, and functions. Finally, Csmith outputs a main function. The main function calls the top-level randomly generated function, computes a checksum of the non-pointer global variables, prints the checksum, and exits.
2.4 Safety Mechanisms
Table 1 lists the mechanisms that Csmith uses to avoid generating C programs that execute undefined behaviors or depend on unspecified behaviors. This section provides additional detail about the hazards that Csmith must avoid and its strategies for avoiding them.
Problem                                        Code-Generation-Time Solution          Code-Execution-Time Solution
use without initialization                     explicit initializers; avoid
                                               jumping over initializers
qualifier mismatch                             static analysis
infinite recursion                             disallow recursion
signed integer overflow                        bounded loop vars                      safe math wrappers
OOB array access                               bounded loop vars                      force index in bounds
unspecified eval. order of function arguments  effect analysis
R/W and W/W conflicts betw. sequence points    effect analysis
access to out-of-scope stack variable          pointer analysis
null pointer dereference                       pointer analysis                       null pointer checks

Table 1. Summary of Csmith's strategies for avoiding undefined and unspecified behaviors. When both a code-generation-time and a code-execution-time solution are listed, Csmith uses both.

Integer safety. More and more, compilers are aggressively exploiting the undefined nature of integer behaviors such as signed overflow and shift-past-bitwidth. For example, recent versions of Intel CC, GCC, and LLVM evaluate (x+1)>x to 1 while also evaluating (INT_MAX+1) to INT_MIN. In another example, discovered by the authors of Google's Native Client software [3], routine refactoring of C code caused the expression 1<<32 to be evaluated on a platform with 32-bit integers. The compiler exploited this undefined behavior to turn a sandboxing safety check into a nop.
To keep Csmith-generated programs from executing integer undefined behaviors, we implemented a family of wrapper functions for arithmetic operators whose (promoted) operands might overflow. This was not difficult, but had a few tricky aspects. For example, the C99 standard does not explicitly identify the evaluation of INT_MIN%-1 as being an undefined behavior, but most compilers treat it as such. The C99 standard also has very restrictive semantics for signed left-shift: it is illegal (for implementations using 2's complement integers) to shift a 1-bit into or past the sign bit. Thus, evaluating 1<<31 destroys the meaning of a C99 program on a platform with 32-bit ints. Several safe math libraries for C that we examined themselves execute operations with undefined behavior while performing checks. Apparently, avoiding such behavior is indeed a tricky business.
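
For illustration, here is a sketch of two wrappers in the spirit described above; the exact fallback behavior (returning the first operand) is our assumption, and Csmith's actual wrappers may differ:

  #include <limits.h>
  #include <stdio.h>

  /* Signed addition that never overflows: when the true sum is not
     representable, fall back to returning the first operand. The
     checks use only comparisons, never an operation that could
     itself overflow. */
  static int safe_add(int a, int b)
  {
      if (b > 0 && a > INT_MAX - b) return a; /* would overflow  */
      if (b < 0 && a < INT_MIN - b) return a; /* would underflow */
      return a + b;
  }

  /* Signed remainder that avoids division by zero and INT_MIN % -1,
     which most compilers treat as undefined. */
  static int safe_mod(int a, int b)
  {
      if (b == 0 || (a == INT_MIN && b == -1)) return a;
      return a % b;
  }

  int main(void)
  {
      printf("%d %d\n", safe_add(INT_MAX, 1), safe_mod(INT_MIN, -1));
      return 0;
  }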
Type safety. The aspect of C's type system that required the most care was qualifier safety: ensuring that const and volatile qualifiers attached to pointers at various levels of indirection are not removed by implicit casts. Accessing a const- or volatile-qualified object through a non-qualified pointer results in undefined behavior.
Pointer safety. Null-pointer dereferences are easy to avoid using dynamic checks. There is, on the other hand, no portable run-time method for detecting references to a function-scoped variable whose lifetime has ended. (Hacks involving the stack pointer are not robust under inlining.) Although there are obvious ways to structurally avoid this problem, such as using a type system to ensure that a pointer to a function-scoped variable never outlives the function, we judged this kind of strategy to be too restrictive. Instead, Csmith freely permits pointers to local variables to escape (e.g., into global variables) but uses a whole-program pointer analysis to ensure that such pointers are not dereferenced or used in comparisons once they become invalid.
Csmith's pointer analysis is flow sensitive, field sensitive, context sensitive, path insensitive, and array-element insensitive. A points-to fact is an explicit set of locations that may be referenced, and may include two special elements: the null pointer and the invalid (out-of-scope) pointer. Points-to sets containing a single element serve as must-alias facts unless the pointed-to object is an array element. Because Csmith does not generate programs that use the heap, assigning names to storage locations is trivial.
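
As an illustration only (the names and representation below are ours, not Csmith's), a points-to fact of this kind might be encoded as follows:

  #include <stdbool.h>
  #include <stddef.h>

  /* An explicit set of possible targets plus the two special elements
     described above. */
  #define MAX_TARGETS 8

  struct points_to_fact {
      const char *targets[MAX_TARGETS]; /* locations possibly referenced */
      size_t      ntargets;
      bool        may_be_null;          /* special element: null pointer */
      bool        may_be_invalid;       /* special element: out-of-scope pointer */
  };

  /* A singleton set acts as a must-alias fact unless the pointed-to
     object is an array element. */
  static bool is_must_alias(const struct points_to_fact *f, bool is_array_elem)
  {
      return f->ntargets == 1 && !f->may_be_null && !f->may_be_invalid
          && !is_array_elem;
  }

  int main(void)
  {
      struct points_to_fact f = { { "g" }, 1, false, false };
      return is_must_alias(&f, false) ? 0 : 1;
  }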
Effect safety. The C99 standard states that "[t]he order of evaluation of the function designator, the actual arguments, and subexpressions within the actual arguments is unspecified." Also, undefined behavior occurs if "[b]etween two sequence points, an object is modified more than once, or is modified and the prior value is read other than to determine the value to be stored."
To avoid these problems, Csmith uses its pointer analysis to perform a conservative interprocedural analysis and determine the effect of every expression, statement, and function that it generates. An effect consists of two sets: locations that may be read and locations that may be written. Csmith ensures that no location is both read and written, or written more than once, between any pair of sequence points. As a special case, in an assignment, a location can be read on the RHS and also written on the LHS.
Effects are computed, and effect safety guaranteed, incrementally. At each sequence point, Csmith resets the current effect (i.e., may-read and may-write sets). As fragments of code are generated, Csmith tests if the new code has a read/write or write/write conflict with the current effect. If a conflict is detected, the new code is thrown away and the process restarts. For example, if Csmith is generating an expression p + func() and it happens that func may modify p, the call to func is discarded and a new subexpression is generated. If there is no conflict, the read and write sets are updated and the process continues. Probabilistic progress is guaranteed: by design, Csmith always has a non-zero chance of generating code that introduces no new conflicts, such as a constant expression.
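
A minimal sketch of this incremental conflict check, with locations abstracted to bits in a word (our simplification; Csmith tracks sets of real storage locations):

  #include <stdio.h>

  /* May-read and may-write sets accumulated since the last sequence
     point, with each abstract location assigned one bit. */
  struct effect { unsigned may_read, may_write; };

  /* A candidate fragment conflicts if it writes anything already read
     or written, or reads anything already written. */
  static int conflicts(const struct effect *cur, const struct effect *frag)
  {
      return (frag->may_write & (cur->may_read | cur->may_write)) != 0
          || (frag->may_read & cur->may_write) != 0;
  }

  static void commit(struct effect *cur, const struct effect *frag)
  {
      cur->may_read  |= frag->may_read;
      cur->may_write |= frag->may_write;
  }

  int main(void)
  {
      enum { X = 1u << 0 };
      struct effect cur   = { 0, 0 };
      struct effect readx = { X, 0 };  /* e.g., the expression "x"   */
      struct effect wrx   = { 0, X };  /* e.g., a call that writes x */

      commit(&cur, &readx);            /* first fragment: accepted   */
      printf(conflicts(&cur, &wrx)
             ? "conflict: regenerate the fragment\n"
             : "no conflict: commit\n");
      return 0;
  }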
Array safety. Csmith uses several methods to ensure that array indices are in bounds. First, it generates index variables that are modified only in the "increment" parts of for loops and whose values never exceed the bounds of the arrays being indexed. Second, variables with arbitrary value are forced to be in bounds using the modulo operator. Finally, as needed, Csmith emits explicit checks against array lengths.
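
A minimal illustration of the second method (our example, not Csmith output):

  #include <stdio.h>

  /* An arbitrary index is forced into bounds with the modulo
     operator, so the access can never go out of range. */
  static int a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };

  static int read_elem(unsigned i)
  {
      return a[i % 8]; /* always in [0, 7] */
  }

  int main(void)
  {
      printf("%d\n", read_elem(4294967295u)); /* reads a[7] */
      return 0;
  }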
Initializer safety
A C program must not use an uninitialized
function-scoped variable. For the most part, initializer safety is
easy to ensure structurally by initializing variables close to where
they are declared. Gotos introduce the possibility that initializers
may be jumped over; Csmith solves this by forbidding gotos from
spanning initialization code.
2.5 Efficient Global Safety
Csmith never commits to a code fragment unless it has been shown
to be safe. However, loops and function calls threaten to invalidate
previously validated code. For example, consider the following code,
in which Csmith has just added the loop back-edge at line 7.
1 int i;
2 int *p = &i;
3 while (...) {
4 *p = 3;
5 ...
6 p = 0;
7 }
The assignment through p at line 4 was safe when it was generated. However, the newly added line 7 makes line 4 unsafe, due to the back-edge carrying a null-valued p.
One solution to this problem is to be conservative: run the whole-program dataflow analysis before committing any new statement to the program. This is not efficient. We therefore restrict the analysis to local scope except when function calls and loops are involved. For a function call, the callee is re-analyzed at each call site immediately.
Csmith uses a different strategy for loops. This is because so many statements are inside loops, and the extra calls to the dataflow analysis add substantial overhead to the code generator. Csmith's strategy is to optimistically generate code that is locally safe. Local safety includes running a single step of the dataflow engine (which reaches a sound result when generating code not inside any loop). The global fixpoint analysis is run when a loop is closed by adding its back-edge. If Csmith finds that the program contains unsafe statements, it deletes code starting from the tail of the loop until the program becomes globally safe. This strategy is about three times faster than pessimistically running the global dataflow analysis before adding every piece of code.
2.6 Design Trade-offs
Allow implementation-defined behavior. An ideally portable test program would be "strictly conforming" to the C language standard. This means that the program's output would be independent of all undefined and unspecified behaviors and, in addition, be independent of any implementation-defined behavior. C99 has 114 kinds of implementation-defined behavior, and they have pervasive impact on the behavior of real C programs. For example, the result of performing a bitwise operation on a signed integer is implementation-defined, and operands to arithmetic operations are implicitly cast to int (which has implementation-defined width) before performing the operation. We believe it is impossible to generate realistically expressive C code that retains a single interpretation across all possible choices of implementation-defined behaviors.
Programs generated by Csmith do not generate the same output across compilers that differ in (1) the width and representation of integers, (2) behavior when casting to a signed integer type when the value cannot be represented in an object of the target type, and (3) the results of bitwise operations on signed integers. In practice there is not much diversity in how C implementations define these behaviors. For mainstream desktop and embedded targets, there are roughly three equivalence classes of compiler targets: those where int is 32 bits and long is 64 bits (e.g., x86-64), those where int and long are 32 bits (e.g., x86, ARM, and PowerPC), and those where int is 16 bits and long is 32 bits (e.g., MSP430 and AVR). Using Csmith, we can perform differential testing within an equivalence class but not across classes.
No ground truth. Csmith's programs are not self-checking: we are
unable to predict their outputs without running them. This is not a
problem when we use Csmith for randomized differential testing.
We have never seen an “interesting” split vote where randomized
differential testing of a collection of C compilers fails to produce
a clear consensus answer, nor have we seen any cases in which a
majority of tested compilers produces the same incorrect result.
(We would catch the problem by hand as part of verifying the
failure-inducing program.) In fact, we have not seen even two
unrelated compilers produce the same incorrect output for a Csmith-
generated test case. It therefore seems unlikely that all compilers
under test would produce the same incorrect output for a test case.
Of course, if that did happen we would not detect that problem; this
is an inherent limitation of differential testing without an oracle.
In summary, despite the fact that Knight and Leveson [13] found a substantial number of correlated errors in an experiment on N-version programming, Csmith has yielded no evidence of correlated failures among unrelated C compilers. Our hypothesis is that the observed lack of correlation stems from the fact that most compiler bugs are in passes that operate on an intermediate representation and there is substantial diversity among IRs.
No guarantee of termination. It is not difficult to generate random programs that always terminate. However, we judged that this would limit Csmith's expressiveness too much: for example, it would force loops to be highly structured. Additionally, always-terminating tests cannot find compiler bugs that wrongfully terminate a non-terminating program. (We have found bugs of this kind.) About 10% of the programs generated by Csmith are (apparently) non-terminating. In practice, during testing, they are easy to deal with using timeouts.
Target middle-end bugs. Commercial test suites for C compilers [1, 19, 20] are primarily aimed at checking standards conformance. Csmith, on the other hand, is mainly intended to find bugs in the parts of a compiler that perform transformations on an intermediate representation—the so-called "middle end" of a compiler. As a result, we have found large numbers of middle-end bugs missed by existing testing techniques (Section 3.6). At the same time, Csmith is rather poor at finding gaps in standards conformance. For example, it makes no attempt to test a compiler's handling of trigraphs, long identifier names, or variadic functions.
Targeting the middle end has several aspects. First, all generated programs pass the lexer, parser, and typechecker. Second, we performed substantial manual tuning of the 80 probabilities that govern Csmith's random choices. Our goal was to make the generated programs "look right"—to contain a balanced mix of arithmetic and bitwise operations, of references to scalars and aggregates, of loops and straight-line code, of single-level and multi-level indirections, and so on. Third, Csmith specifically generates idiomatic code (e.g., loops that access all elements of an array) to stress-test parts of the compiler we believe to be error-prone. Fourth, we designed Csmith with an eye toward generating programs that exercise the constructs of a compiler's intermediate representation, and we decided to avoid generating source-level diversity that is unlikely to improve the "coverage" of a compiler's intermediate representations. For example, since additional levels of parentheses around expressions are stripped away early in the compilation process, we do not generate them, nor do we generate all of C's syntactic loop forms since they are typically all lowered to the same IR constructs. Finally, Csmith was designed to be fast enough that it can generate programs that are a few tens of thousands of lines long in a few seconds. Large programs are preferred because (empirically—see Section 3.3) they find more bugs. In summary, many aspects of Csmith's design and implementation were informed by our understanding of how modern compilers work and how they break.
3. Results
We conducted five experiments using Csmith, our random program
generator. This section summarizes our findings.
Our first experiment was uncontrolled and unstructured: over a
three-year period, we opportunistically found and reported bugs in
a variety of C compilers. We found bugs in all the compilers we
tested—hundreds of defects, many classified as high-priority bugs. (§3.1)
In the second experiment, we compiled and ran one million
random programs using several years’ worth of versions of GCC
and LLVM, to understand how their robustness is evolving over time.
As measured by our tests over the programs that Csmith produces,
the quality of both compilers is generally improving. (§3.2)
Third, we evaluated Csmith’s bug-finding power as a function of
the size of the generated C programs. The largest number of bugs is
found at a surprisingly large program size: about 81 KB. (§3.3)
Fourth, we compared Csmith’s bug-finding power to that of four
previous random C program generators. Over a week, Csmith was
able to find significantly more distinct compiler crash errors than
previous program generators could. (§3.4)
Finally, we investigated the effect of testing random programs on
branch, function, and line coverage of the GCC and LLVM source
code. We found that these metrics did not significantly improve
when we added randomly generated programs to the compilers’
existing test suites. Nevertheless, as shown by our other results,
Csmith-generated programs allowed us to discover bugs that are
missed by the compilers’ standard test suites. (§3.5)
We conclude the presentation of results by analyzing some of
the bugs we found in GCC and LLVM. (§3.6, §3.7)

Citations
Proceedings ArticleDOI
14 Oct 2017
TL;DR: DeepXplore efficiently finds thousands of incorrect corner case behaviors in state-of-the-art DL models with thousands of neurons trained on five popular datasets including ImageNet and Udacity self-driving challenge data.
Abstract: Deep learning (DL) systems are increasingly deployed in safety- and security-critical domains including self-driving cars and malware detection, where the correctness and predictability of a system's behavior for corner case inputs are of great importance. Existing DL testing depends heavily on manually labeled data and therefore often fails to expose erroneous behaviors for rare inputs. We design, implement, and evaluate DeepXplore, the first whitebox framework for systematically testing real-world DL systems. First, we introduce neuron coverage for systematically measuring the parts of a DL system exercised by test inputs. Next, we leverage multiple DL systems with similar functionality as cross-referencing oracles to avoid manual checking. Finally, we demonstrate how finding inputs for DL systems that both trigger many differential behaviors and achieve high neuron coverage can be represented as a joint optimization problem and solved efficiently using gradient-based search techniques. DeepXplore efficiently finds thousands of incorrect corner case behaviors (e.g., self-driving cars crashing into guard rails and malware masquerading as benign software) in state-of-the-art DL models with thousands of neurons trained on five popular datasets including ImageNet and Udacity self-driving challenge data. For all tested DL models, on average, DeepXplore generated one test input demonstrating incorrect behavior within one second while running only on a commodity laptop. We further show that the test inputs generated by DeepXplore can also be used to retrain the corresponding DL model to improve the model's accuracy by up to 3%.

884 citations


Cites methods from "Finding and understanding bugs in C..."

  • ...Such differential testing techniques have been applied successfully in the past for detecting logic bugs without manual specifications in a wide variety of traditional software [6, 11, 14, 15, 45, 86]....


  • ...Differential testing has been widely used for successfully testing various types of traditional software including JVMs [14], C compilers [45, 86], SSL/TLS certification validation logic [11, 15, 56, 67], PDF viewers [56], space flight software [28], mobile applications [36], and Web application firewalls [6]....



Proceedings ArticleDOI
09 Jun 2014
TL;DR: This work introduces equivalence modulo inputs (EMI), a simple, widely applicable methodology for validating optimizing compilers, and creates a practical implementation by profiling a program's test executions and stochastically pruning its unexecuted code.
Abstract: We introduce equivalence modulo inputs (EMI), a simple, widely applicable methodology for validating optimizing compilers. Our key insight is to exploit the close interplay between (1) dynamically executing a program on some test inputs and (2) statically compiling the program to work on all possible inputs. Indeed, the test inputs induce a natural collection of the original program's EMI variants, which can help differentially test any compiler and specifically target the difficult-to-find miscompilations. To create a practical implementation of EMI for validating C compilers, we profile a program's test executions and stochastically prune its unexecuted code. Our extensive testing in eleven months has led to 147 confirmed, unique bug reports for GCC and LLVM alone. The majority of those bugs are miscompilations, and more than 100 have already been fixed. Beyond testing compilers, EMI can be adapted to validate program transformation and analysis systems in general. This work opens up this exciting, new direction.

363 citations


Cites background from "Finding and understanding bugs in C..."

  • ...Others, however, can cause compilers to silently miscompile a program and produce wrong code, subverting the programmer’s intent....


  • ...It is clear that EMI is a relaxed notion of semantic equivalence: JPK = JQK =⇒ JPK =I JQK. 3 Note that we may also force a non-deterministic language to assume deterministic behavior....


  • ...C; H.3.4 [Programming Languages]: Processors—compilers General Terms Algorithms, Languages, Reliability, Verification Keywords Compiler testing, miscompilation, equivalent program variants, automated testing...


Proceedings Article
08 Aug 2012
TL;DR: LangFuzz is an effective tool for security testing: Applied on the Mozilla JavaScript interpreter, it discovered a total of 105 new severe vulnerabilities within three months of operation (and thus became one of the top security bug bounty collectors within this period); applied on the PHP interpreter, It discovered 18 new defects causing crashes.
Abstract: Fuzz testing is an automated technique providing random data as input to a software system in the hope to expose a vulnerability. In order to be effective, the fuzzed input must be common enough to pass elementary consistency checks; a JavaScript interpreter, for instance, would only accept a semantically valid program. On the other hand, the fuzzed input must be uncommon enough to trigger exceptional behavior, such as a crash of the interpreter. The LangFuzz approach resolves this conflict by using a grammar to randomly generate valid programs; the code fragments, however, partially stem from programs known to have caused invalid behavior before. LangFuzz is an effective tool for security testing: Applied on the Mozilla JavaScript interpreter, it discovered a total of 105 new severe vulnerabilities within three months of operation (and thus became one of the top security bug bounty collectors within this period); applied on the PHP interpreter, it discovered 18 new defects causing crashes.

316 citations


Cites background or methods from "Finding and understanding bugs in C..."

  • ...To address this issue, fuzzing frameworks include strategies to model the structure of the desired input data; for fuzz testing a JavaScript interpreter, this would require a built-in JavaScript grammar....

    [...]

  • ...Section 6 describes the application of LangFuzz on PHP. Section 7 discusses threats to validity, and Section 8 closes with conclusion and future work....

    [...]

  • ...All other software defects (e.g., defects that produce false output without abnormal termination) will be disregarded, although such defects might be detected under certain circumstances....

    [...]

Proceedings ArticleDOI
03 Sep 2018
TL;DR: FairFuzz, as discussed by the authors, proposes a two-pronged approach to increasing the coverage achieved by American Fuzzy Lop (AFL) by automatically identifying branches exercised by few AFL-produced inputs (rare branches).
Abstract: In recent years, fuzz testing has proven itself to be one of the most effective techniques for finding correctness bugs and security vulnerabilities in practice. One particular fuzz testing tool, American Fuzzy Lop (AFL), has become popular thanks to its ease-of-use and bug-finding power. However, AFL remains limited in the bugs it can find since it simply does not cover large regions of code. If it does not cover parts of the code, it will not find bugs there. We propose a two-pronged approach to increase the coverage achieved by AFL. First, the approach automatically identifies branches exercised by few AFL-produced inputs (rare branches), which often guard code that is empirically hard to cover by naively mutating inputs. The second part of the approach is a novel mutation mask creation algorithm, which allows mutations to be biased towards producing inputs hitting a given rare branch. This mask is dynamically computed during fuzz testing and can be adapted to other testing targets. We implement this approach on top of AFL in a tool named FairFuzz. We conduct an evaluation on real-world programs against state-of-the-art versions of AFL. We find that on these programs FairFuzz achieves high branch coverage at a faster rate than state-of-the-art versions of AFL. In addition, on programs with nested conditional structure, it achieves sustained increases in branch coverage after 24 hours (average 10.6% increase). In a qualitative analysis, we find that FairFuzz has an increased capacity to automatically discover keywords.
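
A rough Python sketch of the mutation-mask idea follows; `hits_target` is an assumed coverage oracle (in the real tool this would be an instrumented execution of the program under test), and the single byte-flip probe is a simplification of FairFuzz's per-position checks.

    # Rough sketch of the mutation-mask idea (illustrative, not FairFuzz
    # itself). For a seed that hits a rare branch, find which byte positions
    # can be mutated while still hitting it, then mutate only those.
    import random

    def compute_mask(seed, hits_target):
        mask = []
        for i in range(len(seed)):
            trial = bytearray(seed)
            trial[i] ^= 0xFF  # clobber one byte
            mask.append(hits_target(bytes(trial)))  # True: safe to mutate here
        return mask

    def masked_mutate(seed, mask, rng, n_mutations=4):
        out = bytearray(seed)
        mutable = [i for i, ok in enumerate(mask) if ok]
        for _ in range(min(n_mutations, len(mutable))):
            out[rng.choice(mutable)] = rng.randrange(256)
        return bytes(out)

    # Hypothetical target: the rare branch requires an intact "MAGI" header.
    hits = lambda d: d[:4] == b"MAGI"
    seed = b"MAGIpayload"
    mask = compute_mask(seed, hits)               # first four positions are False
    assert hits(masked_mutate(seed, mask, random.Random(0)))  # header preserved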

280 citations

References
More filters
Proceedings ArticleDOI
11 Oct 2009
TL;DR: To the authors' knowledge, this is the first formal proof of functional correctness of a complete, general-purpose operating-system kernel.
Abstract: Complete formal verification is the only known way to guarantee that a system is free of programming errors. We present our experience in performing the formal, machine-checked verification of the seL4 microkernel from an abstract specification down to its C implementation. We assume correctness of compiler, assembly code, and hardware, and we used a unique design approach that fuses formal and operating systems techniques. To our knowledge, this is the first formal proof of functional correctness of a complete, general-purpose operating-system kernel. Functional correctness means here that the implementation always strictly follows our high-level abstract specification of kernel behaviour. This encompasses traditional design and implementation safety properties such as the kernel will never crash, and it will never perform an unsafe operation. It also proves much more: we can predict precisely how the kernel will behave in every possible situation. seL4, a third-generation microkernel of L4 provenance, comprises 8,700 lines of C code and 600 lines of assembler. Its performance is comparable to other high-performance L4 kernels.

1,629 citations


"Finding and understanding bugs in C..." refers methods in this paper

  • ...As measured by the responses to our bug reports, the defects discovered by Csmith are important....

    [...]

Journal ArticleDOI
09 Dec 2002
TL;DR: The overall design and implementation of Netbed is presented and its ability to improve experimental automation and efficiency is demonstrated, leading to new methods of experimentation, including automated parameter-space studies within emulation and straightforward comparisons of simulated, emulated, and wide-area scenarios.
Abstract: Three experimental environments traditionally support network and distributed systems research: network emulators, network simulators, and live networks. The continued use of multiple approaches highlights both the value and inadequacy of each. Netbed, a descendant of Emulab, provides an experimentation facility that integrates these approaches, allowing researchers to configure and access networks composed of emulated, simulated, and wide-area nodes and links. Netbed's primary goals are ease of use, control, and realism, achieved through consistent use of virtualization and abstraction. By providing operating system-like services, such as resource allocation and scheduling, and by virtualizing heterogeneous resources, Netbed acts as a virtual machine for network experimentation. This paper presents Netbed's overall design and implementation and demonstrates its ability to improve experimental automation and efficiency. These, in turn, lead to new methods of experimentation, including automated parameter-space studies within emulation and straightforward comparisons of simulated, emulated, and wide-area scenarios.

1,398 citations


"Finding and understanding bugs in C..." refers methods in this paper

  • ...5 weeks on 20 machines in the Utah Emulab testbed [28]....

    [...]

Journal ArticleDOI
TL;DR: This paper reports on the development and formal verification of CompCert, a compiler from Clight (a large subset of the C programming language) to PowerPC assembly code, using the Coq proof assistant both for programming the compiler and for proving its correctness.
Abstract: This paper reports on the development and formal verification (proof of semantic preservation) of CompCert, a compiler from Clight (a large subset of the C programming language) to PowerPC assembly code, using the Coq proof assistant both for programming the compiler and for proving its correctness. Such a verified compiler is useful in the context of critical software and its formal verification: the verification of the compiler guarantees that the safety properties proved on the source code hold for the executable compiled code as well.

1,124 citations


"Finding and understanding bugs in C..." refers background in this paper

  • ...Wolfe [30], talking about independent software vendors (ISVs), says: An ISV with a complex code can work around correctness, turn off the optimizer in one or two files, and usually they have to do that for any of the compilers they use (emphasis ours)....

    [...]

Journal ArticleDOI
TL;DR: The following section describes the tools built to test the utilities, including the fuzz (random character) generator, ptyjig (to test interactive utilities), and scripts to automate the testing process.
Abstract: The following section describes the tools we built to test the utilities. These tools include the fuzz (random character) generator, ptyjig (to test interactive utilities), and scripts to automate the testing process. Next, we will describe the tests we performed, giving the types of input we presented to the utilities. Results from the tests will follow, along with an analysis of the results, including identification and classification of the program bugs that caused the crashes. The final section presents concluding remarks, including suggestions for avoiding the types of problems detected by our study and some commentary on the bugs we found. We include an Appendix with the user manual pages for fuzz and ptyjig.
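
In the spirit of the fuzz generator described above, a present-day minimal equivalent might look like the following Python sketch; it is not the original tool, the printable-only option loosely models the paper's description, and the names are invented.

    # Minimal modern sketch of a fuzz-style random byte-stream generator,
    # intended to be piped into a utility's stdin.
    import random
    import sys

    def fuzz_stream(length, printable=False, seed=None):
        rng = random.Random(seed)
        lo, hi = (32, 126) if printable else (0, 255)
        return bytes(rng.randint(lo, hi) for _ in range(length))

    if __name__ == "__main__":
        # e.g.: python fuzz.py | some_utility
        sys.stdout.buffer.write(fuzz_stream(10000, seed=42))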

1,110 citations

Journal ArticleDOI
TL;DR: The delta debugging algorithm generalizes and simplifies the failing test case to a minimal test case that still produces the failure, and isolates the difference between a passing and a failing test case.
Abstract: Given some test case, a program fails. Which circumstances of the test case are responsible for the particular failure? The delta debugging algorithm generalizes and simplifies the failing test case to a minimal test case that still produces the failure. It also isolates the difference between a passing and a failing test case. In a case study, the Mozilla Web browser crashed after 95 user actions. Our prototype implementation automatically simplified the input to three relevant user actions. Likewise, it simplified 896 lines of HTML to the single line that caused the failure. The case study required 139 automated test runs or 35 minutes on a 500 MHz PC.
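
The core simplification loop can be sketched compactly. The following Python is an illustrative rendering of the ddmin idea (partition the input, test complements, refine granularity), not Zeller and Hildebrandt's implementation, and `fails` is an assumed test oracle.

    # Compact illustration of the ddmin simplification loop (a sketch of the
    # published algorithm). fails(candidate) reports whether the failure
    # still occurs for a candidate input.
    def ddmin(data, fails):
        assert fails(data), "ddmin needs a failing input to start from"
        n = 2  # current number of chunks
        while len(data) >= 2:
            chunk = len(data) // n
            reduced = False
            for i in range(n):
                # Remove the i-th chunk, i.e. test its complement.
                complement = data[:i * chunk] + data[(i + 1) * chunk:]
                if complement and fails(complement):
                    data, n = complement, max(n - 1, 2)
                    reduced = True
                    break
            if not reduced:
                if n >= len(data):
                    break  # single-element granularity reached
                n = min(n * 2, len(data))  # refine granularity
        return data

    # Failure needs both 'a' and 'b'; everything else is stripped away:
    print(ddmin("xxaxxbxx", lambda s: "a" in s and "b" in s))  # -> "ab"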

980 citations