Finding and understanding bugs in C compilers
Summary
1. Introduction
- The theory of compilation is well developed, and there are compiler frameworks in which many optimizations have been proved correct.
- It should be no surprise that optimizing compilers—like all complex software systems—contain bugs.
- This is the author’s version of the work.
unsigned char y = 255;
- The authors created Csmith, a randomized test-case generator that supports compiler bug-hunting using differential testing.
- For the past three years, the authors have used Csmith to discover bugs in C compilers.
- This is a significant problem for complex systems.
- Large-scale source-code verification efforts such as the seL4 OS kernel [12] and Airbus’s verification of fly-by-wire software [24] can be undermined by an incorrect C compiler.
2. Csmith
- Csmith began as a fork of Randprog [27], an existing random C program generator about 1,600 lines long.
- The authors' previous paper showed that, in many cases, these bugs could be worked around by turning volatile-object accesses into calls to helper functions.
- For some test programs generated by Randprog, their rewriting procedure was insufficient to correct a defect that the authors had found in the C compiler.
- The authors turned Randprog into Csmith, a 40,000-line C++ program for randomly generating C programs.
- Most of Csmith’s complexity arises from the requirement that it interleave static analysis with code generation in order to produce meaningful test cases, as described below.
2.1 Randomized Differential Testing using Csmith
- Random testing [9], also called fuzzing [17], is a black-box testing method in which test inputs are generated randomly.
- Randomized differential testing [16] has the advantage that no oracle for test results is needed.
- It exploits the idea that if one has multiple, deterministic implementations of the same specification, all implementations must produce the same result from the same valid input.
- When two implementations produce different outputs, one of them must be faulty.
- Given three or more implementations, a tester can use voting to heuristically determine which implementations are wrong.
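The voting step can be sketched in C. This is a minimal illustration, not Csmith's actual harness: it assumes each compiler's build of the same test program prints a single 32-bit checksum, and the function name `find_outlier` and its return codes are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: checksums[i] is the value printed by the test program
 * when built with compiler i. Returns the index of the first compiler that
 * disagrees with the strict-majority value, -1 if all compilers agree, or -2
 * if no value is held by a strict majority (the vote is inconclusive). */
int find_outlier(const uint32_t *checksums, size_t n)
{
    assert(checksums != NULL && n >= 3);

    /* Pass 1: Boyer-Moore majority vote picks a candidate value. */
    uint32_t candidate = checksums[0];
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (count == 0) {
            candidate = checksums[i];
            count = 1;
        } else if (checksums[i] == candidate) {
            count++;
        } else {
            count--;
        }
    }

    /* Pass 2: confirm the candidate really is a strict majority. */
    size_t votes = 0;
    for (size_t i = 0; i < n; i++)
        if (checksums[i] == candidate)
            votes++;
    if (votes * 2 <= n)
        return -2;

    /* Pass 3: report the first dissenting implementation, if any. */
    for (size_t i = 0; i < n; i++)
        if (checksums[i] != candidate)
            return (int)i;
    return -1;
}
```

The vote is only a heuristic, as the text notes: a majority of compilers could share the same bug, so a flagged outlier still needs manual confirmation.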
2.2 Design Goals
- First and most important, every generated program must be well formed and have a single meaning according to the C standard.
- The C99 language [11] has 191 undefined behaviors—e.g., dereferencing a null pointer or overflowing a signed integer—that destroy the meaning of a program.
- Programs emitted by Csmith must avoid all of these behaviors or, in certain cases such as argument-evaluation order, be independent of the choices that will be made by the compiler.
- Section 2.4 describes the hazards that Csmith must avoid and its strategies for avoiding them.
- Csmith’s second design goal is to maximize expressiveness subject to constraints imposed by the first goal.
2.3 Randomly Generating Programs
- Assignments are modeled as statements—not expressions—which reflects the most common idiom for assignments in C code.
- Third, the local environment carries points-to facts about all in-scope pointers.
- After choosing a production from the table, Csmith executes the filter, which decides if the choice is acceptable in the current context.
- It calls a function to generate the program fragment that corresponds to the nonterminal production.
- Thus, when the top-level function has been completely generated, Csmith is finished.
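The choose-then-filter step described above might look roughly like the following sketch. The production names, the weights, the `struct context` fields, and the retry-on-rejection loop are all assumptions made for illustration; they are not taken from Csmith's source.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical statement productions and their selection weights (sum = 100). */
enum production { PROD_ASSIGN, PROD_IF, PROD_FOR, PROD_CALL, PROD_RETURN, PROD_COUNT };
static const int weights[PROD_COUNT] = { 40, 20, 15, 15, 10 };

/* Hypothetical generation context: current and maximum block nesting depth. */
struct context {
    int depth;
    int max_depth;
};

/* Filter: decide whether a chosen production is acceptable in this context.
 * Here the only rule is that nesting constructs are rejected at max depth. */
static int filter_accepts(enum production p, const struct context *ctx)
{
    if ((p == PROD_IF || p == PROD_FOR) && ctx->depth >= ctx->max_depth)
        return 0;
    return 1;
}

/* Pick a production at random according to the weight table, then run the
 * filter; on rejection, choose again (an assumed policy). */
enum production choose_production(const struct context *ctx)
{
    assert(ctx != NULL);
    for (;;) {
        int r = rand() % 100;   /* 0..99, matching the total weight */
        int p = 0;
        while (r >= weights[p]) {
            r -= weights[p];
            p++;
        }
        if (filter_accepts((enum production)p, ctx))
            return (enum production)p;
    }
}
```

Keeping the filter separate from the weight table means context-dependent restrictions can be added without touching the probabilities themselves.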
2.4 Safety Mechanisms
- Table 1 lists the mechanisms that Csmith uses to avoid generating C programs that execute undefined behaviors or depend on unspecified behaviors.
- Integer safety: More and more, compilers are aggressively exploiting the undefined nature of integer behaviors such as signed overflow and shift-past-bitwidth.
- This was not difficult, but had a few tricky aspects.
- The aspect of C’s type system that required the most care was qualifier safety: ensuring that const and volatile qualifiers attached to pointers at various levels of indirection are not removed by implicit casts.
- As fragments of code are generated, Csmith tests if the new code has a read/write or write/write conflict with the current effect.
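One common way to keep generated arithmetic free of the integer hazards named above (signed overflow and shift-past-bitwidth) is to route each operation through a checked wrapper. The sketch below is an illustration of that idea only: the names `safe_add_int` and `safe_shl_int`, and the policy of returning the first operand when the operation would be undefined, are assumptions rather than Csmith's documented behavior.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical wrapper: returns a + b when the sum is representable in int,
 * otherwise returns a unchanged, so the addition never overflows. */
int safe_add_int(int a, int b)
{
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
        return a;
    return a + b;
}

/* Hypothetical wrapper: returns a << n only when the shift is well defined
 * (a is non-negative, n is less than the width of int, and no set bit is
 * shifted into or past the sign bit); otherwise returns a unchanged. */
int safe_shl_int(int a, unsigned n)
{
    if (a < 0 || n >= sizeof(int) * CHAR_BIT || a > (INT_MAX >> n))
        return a;
    int result = a << n;
    assert(result >= a);   /* holds for every shift that passed the checks */
    return result;
}
```

Because every check is performed before the risky operation, the wrappers themselves never execute undefined behavior, and all conforming compilers must agree on their results.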
2.5 Efficient Global Safety
- Loops and function calls threaten to invalidate previously validated code.
- Consider the following code, in which Csmith has just added the loop back-edge at line 7.
3 while (...) {
- The newly added line 7 makes line 4 unsafe, due to the back-edge carrying a null-valued p.
- The authors therefore restrict the analysis to local scope except when function calls and loops are involved.
- The global fixpoint analysis is run when a loop is closed by adding its back-edge.
- If Csmith finds that the program contains unsafe statements, it deletes code starting from the tail of the loop until the program becomes globally safe.
- This strategy is about three times faster than pessimistically running the global dataflow analysis before adding every piece of code.
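The close-the-loop-then-trim strategy can be modeled in a few lines. The following is a toy model under stated assumptions, not Csmith's analysis: it tracks a single pointer `p`, uses a two-point nullness lattice, assumes `p` is non-null on loop entry, and represents the loop body as an array of three hypothetical statement kinds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical statement kinds in a loop body, all acting on one pointer p. */
enum stmt { DEREF_P, SET_P_NULL, SET_P_VALID };

/* Two-point lattice for what is known about p. */
enum null_state { NONNULL, MAYBE_NULL };

/* Fixpoint analysis over the loop: the state at the loop head is the join of
 * the entry state (assumed NONNULL) and the state carried by the back-edge.
 * Returns 1 if no dereference can see a possibly-null p, else 0. */
static int loop_is_safe(const enum stmt *body, size_t n)
{
    enum null_state head = NONNULL;
    for (;;) {
        enum null_state s = head;
        for (size_t i = 0; i < n; i++) {
            if (body[i] == DEREF_P && s != NONNULL)
                return 0;
            if (body[i] == SET_P_NULL)
                s = MAYBE_NULL;
            else if (body[i] == SET_P_VALID)
                s = NONNULL;
        }
        enum null_state joined =
            (head == MAYBE_NULL || s == MAYBE_NULL) ? MAYBE_NULL : NONNULL;
        if (joined == head)
            return 1;        /* fixpoint reached with no unsafe dereference */
        head = joined;       /* back-edge changed the head state: iterate */
    }
}

/* Delete statements from the tail of the loop until the whole loop is safe.
 * Returns the number of statements that survive. */
size_t trim_until_safe(const enum stmt *body, size_t n)
{
    assert(body != NULL || n == 0);
    while (n > 0 && !loop_is_safe(body, n))
        n--;
    return n;
}
```

For the body `{DEREF_P, SET_P_VALID, SET_P_NULL}`, the first pass looks safe, but the back-edge carries a possibly-null `p` to the dereference at the top; trimming the final statement restores safety, which mirrors the situation described in the text.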
2.6 Design Trade-offs
- An ideally portable test program would be “strictly conforming” to the C language standard.
- In summary, despite the fact that Knight and Leveson [13] found a substantial number of correlated errors in an experiment on N-version programming, Csmith has yielded no evidence of correlated failures among unrelated C compilers.
- It is not difficult to generate random programs that always terminate.
- The authors' goal was to make the generated programs “look right”: to contain a balanced mix of arithmetic and bitwise operations, of references to scalars and aggregates, of loops and straight-line code, of single-level and multi-level indirections, and so on.
- In summary, many aspects of Csmith’s design and implementation were informed by their understanding of how modern compilers work and how they break.
3. Results
- The authors conducted five experiments using Csmith, their random program generator.
- The authors' first experiment was uncontrolled and unstructured: over a three-year period, they opportunistically found and reported bugs in a variety of C compilers (§3.1).
- In the second experiment, the authors compiled and ran one million random programs using several years’ worth of versions of GCC and LLVM, to understand how their robustness is evolving over time.
- The authors found that these metrics did not significantly improve when they added randomly generated programs to the compilers’ existing test suites.
- Nevertheless, as shown by their other results, Csmith-generated programs allowed the authors to discover bugs that are missed by the compilers’ standard test suites.
3.1 Opportunistic Bug Finding
- Five of these compilers (GCC, LLVM, CIL, TCC, and Open64) were open source and five were commercial products.
- Errors that manifest at run time include the computation of a wrong result; a crash or other abnormal termination of the generated code; termination of a program that should have executed forever; and non-termination of a program that should have terminated.
- Thus, for the most part, the authors simply tested these compilers until they found a few crash errors and a few wrong-code errors, reported them, and moved on.
- Both the GCC and LLVM teams were responsive to their bug reports.
- The second reason the authors prefer dealing with open-source compilers is that their development process is transparent: the authors can watch the mailing lists, participate in discussions, and see fixes as they are committed.
int bar (unsigned x) {
- This bug and five others like it were in CompCert’s unverified front-end code.
- Here, a large PowerPC stack frame is being allocated.
- CompCert’s PPC semantics failed to specify a constraint on the width of this immediate value, on the assumption that the assembler would catch out-of-range values.
- The striking thing about the CompCert results is that the middle-end bugs the authors found in all other compilers are absent.
- This is not for lack of trying: the authors have devoted about six CPU-years to the task.
3.2 Quantitative Comparison of GCC and LLVM Versions
- Running these tests took about 1.5 weeks on 20 machines in the Utah Emulab testbed [28].
- (Note that the y-axes of these graphs are logarithmic.)
- These graphs also indicate the number of crash bugs that were fixed in response to their bug reports.
- The middle row of graphs in Figure 3 shows the number of distinct assertion failures in LLVM and the number of distinct internal compiler errors in GCC induced by their tests.
3.3 Bug-Finding Performance as a Function of Test-Case Size
- There are many ways in which a random test-case generator might be “tuned” for particular goals, e.g., to focus on certain kinds of compiler defects.
- Other factors being equal, small test cases are preferable because they are closer to being reportable to compiler developers.
- The authors repeated this for various ranges of test-input sizes.
- First, throughput is increased because compiler startup costs are better amortized.
3.4 Bug-Finding Performance Compared to Other Tools
- And otherwise-idle machines, using one CPU on each host.
- Each generator repeatedly produced programs that the authors compiled and tested using the same compilers and optimization options that were used for the experiments in Section 3.2.
- Figure 5 plots the cumulative number of distinct crash errors found by these program generators during the one-week test.
3.5 Code Coverage
- Because the authors find many bugs, they hypothesized that randomly generated programs exercise large parts of the compilers that were not covered by existing test suites.
- To test this, the authors enabled code-coverage monitoring in GCC and LLVM.
- The authors then used each compiler to build its own test suite, and also to build its test suite plus 10,000 Csmith-generated programs.
- The authors' best guess is that these metrics are too shallow to capture Csmith’s effects, and that Csmith-generated programs would yield useful additional coverage in terms of deeper metrics such as path or value coverage.
3.6 Where Are the Bugs?
- Table 4 characterizes the GCC and LLVM bugs the authors found by compiler part.
- Tables 5 and 6 show the ten buggiest files in LLVM and GCC as measured by their experiment in Section 3.1.
- Most of the bugs the authors found in GCC were in the middle end: the machine-independent optimizers.
- LLVM is a younger compiler and their testing shook out some front-end and back-end bugs that would probably not be present in a more mature software base.
3.7 Examples of Wrong-Code Bugs
- This section characterizes a few of the bugs that were revealed by miscompilation of programs generated by Csmith.
- These bugs fit into a simple model in which optimizations are structured as an analysis followed by a guarded rewrite: analysis; if (safety check) { transformation }.
- If x is a variable and c1 and c2 are constants, the expression (x/c1)!=c2 can be profitably rewritten as (x-(c1*c2))>(c1-1), using unsigned arithmetic to avoid problems with negative values.
- Prior to performing the transformation, expressions such as c1*c2 and (c1*c2)+(c1-1) are checked for overflow.
- The authors found a bug that caused GCC to miscompile this code:
2 static int *p = &g[0];
- The problem occurred when the compiler failed to recognize that p and q are aliases; this happened because q was mistakenly identified as a read-only memory location, which is defined not to alias a mutable location.
- The wrong not-alias fact caused the store in line 7 to be marked as dead so that a subsequent dead-store elimination pass removed it.
- A version of GCC miscompiled this function:
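The function itself is not shown in this summary. As a separate illustration of the division rewrite discussed earlier in this section, the sketch below compares (x/c1)!=c2 with (x-(c1*c2))>(c1-1) in unsigned arithmetic and shows why the overflow check on c1*c2 and (c1*c2)+(c1-1) matters. All function names are hypothetical, and the code is a check of the arithmetic identity, not a reproduction of any compiler's implementation.

```c
#include <assert.h>
#include <limits.h>

/* The original comparison: integer division followed by an inequality. */
int original_form(unsigned x, unsigned c1, unsigned c2)
{
    return (x / c1) != c2;
}

/* The rewritten comparison: one subtraction and one unsigned compare.
 * It is only equivalent when rewrite_is_safe(c1, c2) holds. */
int rewritten_form(unsigned x, unsigned c1, unsigned c2)
{
    return (x - c1 * c2) > (c1 - 1);
}

/* Safety check: c1 must be nonzero, and neither c1*c2 nor (c1*c2)+(c1-1)
 * may overflow the unsigned range. */
int rewrite_is_safe(unsigned c1, unsigned c2)
{
    if (c1 == 0)
        return 0;
    if (c2 > UINT_MAX / c1)
        return 0;                          /* c1*c2 would overflow */
    return c1 * c2 <= UINT_MAX - (c1 - 1); /* the sum must also fit */
}

/* Count disagreements between the two forms over the lowest and highest
 * 50,000 values of x, where wraparound effects are most likely to show. */
unsigned count_mismatches(unsigned c1, unsigned c2)
{
    assert(rewrite_is_safe(c1, c2));
    unsigned bad = 0;
    for (unsigned k = 0; k < 50000u; k++) {
        unsigned lo = k;
        unsigned hi = UINT_MAX - k;
        bad += original_form(lo, c1, c2) != rewritten_form(lo, c1, c2);
        bad += original_form(hi, c1, c2) != rewritten_form(hi, c1, c2);
    }
    return bad;
}
```

When the safety check is skipped the two forms can disagree: with c1 = UINT_MAX and c2 = 1, the input x = 0 satisfies the original comparison but not the rewritten one, which is the kind of failure the "analysis; safety check; transformation" model is meant to explain.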
4. Discussion
- One might suspect that random testing finds bugs that do not matter in practice.
- Undoubtedly this happens sometimes, but in a number of instances the authors have direct confirmation that Csmith is finding bugs that matter, because bugs that they have found and reported have been independently rediscovered and re-reported by application developers.
- By a very conservative estimate—counting only the times that a compiler.
6. Conclusion
- Using randomized differential testing, the authors found and reported hundreds of previously unknown bugs in widely used C compilers, both commercial and open source.
- Most of their reported defects have been fixed, meaning that compiler implementers found them important enough to track down, and 25 of the bugs the authors reported against GCC were classified as release-blocking.
- To create a random program generator with high bug-finding power, the key problem the authors solved was the expressive generation of C programs that are free of undefined behavior and independent of unspecified behavior.
- The incremental cost of a new bug that the authors find today is much lower.
- Software: Csmith is open source and available for download at http://embed.cs.utah.edu/csmith/.