A Concrete Memory Model for CompCert
Summary
1 Introduction
- Yet a theorem about the source code of safety-critical software is not sufficient on its own, since it is the compiled code that runs.
- The CompCert compiler [17] fills this verification gap: its semantics preservation theorem ensures that when the source program has a defined semantics, program invariants proved at source level still hold for the compiled code.
- Yet these approaches are, by nature, limited by the formal semantics of CompCert C: programs exhibiting undefined behaviours cannot benefit from any semantics preservation guarantee.
- The authors prove that the existing memory model of CompCert is an abstraction of their model, thus validating the soundness of the existing semantics.
- The authors adapt the proofs of CompCert’s front-end passes, from CompCert C down to Cminor, thus demonstrating the feasibility of their endeavour.
2 A More Concrete Memory Model for CompCert
- In previous work [3], the authors proposed an enhanced memory model (with symbolic expressions) for CompCert.
- The authors empirically verify, using the reference interpreter of CompCert, that their extension is sound with respect to the existing semantics and that it captures low-level C idioms out of reach of the existing memory model.
- This section first recalls the main features of the current CompCert memory model and then explains their extension to this memory model.
2.1 CompCert’s Memory Model
- Leroy et al. [18] give a thorough presentation of the existing memory model of CompCert, which is shared by all the languages of the compiler.
- The authors give a brief overview of its design in order to highlight the differences with their own model.
- Pointer arithmetic modifies the offset part of a location, keeping its block identifier part unchanged.
- The free operation may also fail (e.g. when the locations to be freed have been freed already).
- In the memory model, the byte-level, in-memory representation of integers and floats is exposed, while pointers are kept abstract [18].
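The block-based view described above can be pictured with a small sketch. This is an illustrative simplification, not CompCert's Coq definitions: the names `Ptr`, `Memory` and `ptr_add` are hypothetical, permissions and block contents are omitted, and `free` releases a whole block rather than a range.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class Ptr:
    block: int   # abstract block identifier
    offset: int  # byte offset inside the block


@dataclass
class Memory:
    next_block: int = 1
    bounds: Dict[int, Tuple[int, int]] = field(default_factory=dict)

    def alloc(self, lo: int, hi: int) -> int:
        """Always succeeds and returns a fresh block identifier."""
        b = self.next_block
        self.bounds[b] = (lo, hi)
        self.next_block += 1
        return b

    def free(self, b: int) -> bool:
        """Fails (returns False) when the block is not currently allocated."""
        if b not in self.bounds:
            return False
        del self.bounds[b]
        return True


def ptr_add(p: Ptr, delta: int) -> Ptr:
    # Pointer arithmetic changes only the offset, never the block.
    return Ptr(p.block, p.offset + delta)
```

Because the block identifier never changes under arithmetic, a pointer can never be turned into a pointer to another block in this model.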
2.2 Motivation for an Enhanced Memory Model
- The authors' memory model with symbolic expressions [3] gives a precise semantics to low-level C idioms which cannot be modelled by the existing memory model.
- Other examples are robust implementations of malloc: for the sake of checking the integrity of pointers, their trailing bits store a checksum.
- This is possible because those pointers are also aligned and therefore the trailing bits are necessarily 0s.
- The expected semantics is therefore that the program returns 1.
- The transformation is correct and the target code generated by CompCert correctly returns 1.
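The trailing-bits idiom mentioned above can be sketched on plain integers standing in for concrete addresses. The 8-byte alignment and the helper names `tag_pointer` and `untag_pointer` are assumptions made for illustration; they are not taken from any particular malloc implementation.

```python
from typing import Tuple

ALIGN_BITS = 3                       # assumption: addresses are 8-byte aligned
LOW_MASK = (1 << ALIGN_BITS) - 1     # 0b111


def tag_pointer(addr: int, bits: int) -> int:
    """Store `bits` in the trailing bits of an aligned address."""
    if addr & LOW_MASK != 0:
        raise ValueError("address is not aligned")
    if not 0 <= bits <= LOW_MASK:
        raise ValueError("tag does not fit in the trailing bits")
    return addr | bits


def untag_pointer(tagged: int) -> Tuple[int, int]:
    """Recover the original address and the stored bits."""
    return tagged & ~LOW_MASK, tagged & LOW_MASK
```

The round trip only works because alignment guarantees the low bits are 0 to begin with, which is exactly the kind of fact an abstract (block, offset) pointer cannot express.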
2.3 A Memory Model with Symbolic Expressions
- This model lacks an essential property of CompCert’s semantics: determinism.
- Determinism is instrumental for the simulation proofs of the compiler passes and its absence is a show stopper.
- The authors define the evaluation of expressions as the function ⟦·⟧_cm, parametrised by the concrete mapping cm.
- Pointers are turned into their concrete value, as dictated by cm.
- The value of the expression is 1 whatever the value of undef and therefore the normalisation succeeds and returns, as expected, the value 1.
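A minimal sketch of evaluation under a concrete mapping and of normalisation follows. It rests on several assumptions: expressions are encoded as nested tuples (a hypothetical encoding), arithmetic is 32-bit, and where the authors' normalisation quantifies over all compatible concrete memories, this sketch only samples a finite list of mappings and undef values.

```python
from itertools import product

MASK32 = 0xFFFFFFFF

OPS = {
    "add": lambda a, b: (a + b) & MASK32,
    "sub": lambda a, b: (a - b) & MASK32,
    "and": lambda a, b: a & b,
    "or": lambda a, b: a | b,
}


def evaluate(e, cm, undef_value):
    """Evaluate expression e; cm maps each block to its concrete address."""
    kind = e[0]
    if kind == "int":
        return e[1] & MASK32
    if kind == "ptr":
        _, block, offset = e
        return (cm[block] + offset) & MASK32
    if kind == "undef":
        return undef_value
    _, op, lhs, rhs = e
    return OPS[op](evaluate(lhs, cm, undef_value), evaluate(rhs, cm, undef_value))


def normalise(e, mappings, undef_samples):
    """Return a value that agrees with e under every sampled mapping, else undef."""
    runs = [(cm, evaluate(e, cm, u)) for cm, u in product(mappings, undef_samples)]
    values = {v for _, v in runs}
    if len(values) == 1:                      # same integer everywhere
        return ("int", values.pop())
    for block in mappings[0]:                 # same offset from one block everywhere
        offsets = {(v - cm[block]) & MASK32 for cm, v in runs}
        if len(offsets) == 1:
            return ("ptr", block, offsets.pop())
    return ("undef",)
```

For instance, the hypothetical expression `(undef | 1) & 1` evaluates to 1 for every sampled value of undef, so it normalises to the integer 1, whereas `ptr(b, 0) + undef` has no single value and stays undefined.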
3 Proving the Operations of the Memory Model
- CompCert’s memory model exports an interface summarising all the properties of the memory operations necessary to prove the compiler passes.
- This section details how the properties and the proofs need to be adapted to accommodate symbolic expressions.
- Second, the authors introduce an equivalence relation between symbolic expressions.
3.1 Precise Handling of Undefined Values
- Symbolic expressions (as presented in Section 2.3) feature a unique undef token.
- This is a shortcoming that the authors have identified during the proof.
- With a single undef, the authors do not capture the fact that different occurrences of undef may represent the same unknown value, or different ones.
- To overcome this problem, each byte of a newly allocated memory chunk is initialised with a fresh undef value.
- Hence, x − x constructs the symbolic expression undef(b, o) − undef(b, o) for some b and o, which normalises to 0 because undef(b, o) now represents a unique value rather than the set of all values.
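The difference between an anonymous undef and a labelled undef(b, o) can be illustrated by enumerating possible results of x − x on bytes. This is only a sketch: the three sample values are an assumption standing in for "all byte values", and the function names are hypothetical.

```python
from itertools import product

SAMPLES = (0, 1, 255)  # assumption: a few byte values stand in for all of them


def sub_results_single_undef():
    # Each occurrence of the anonymous undef may take any value independently.
    return {(a - b) & 0xFF for a, b in product(SAMPLES, SAMPLES)}


def sub_results_labelled_undef():
    # Both occurrences name the same undef(b, o), so they share one value.
    return {(v - v) & 0xFF for v in SAMPLES}
```

With labelled undefs the result set collapses to the single value 0, which is what allows the normalisation to succeed; with the anonymous token several results remain possible.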
3.2 Memory Allocation
- CompCert’s alloc operation always allocates a memory chunk of the requested size and returns a fresh block identifying the newly allocated memory (i.e. it models an infinite memory).
- The first guarantee is that for every memory m there exists at least one concrete memory compatible with the abstract CompCert block-based memory.
- To get this property, the alloc function runs a greedy algorithm constructing a compatible cm mapping.
- Given a memory m, size_mem(m) returns the size of the constructed memory (i.e. the first fresh address as computed by the allocation).
- The algorithm makes the pessimistic assumption that the allocated blocks are maximally aligned; for CompCert, this maximum is 3 bits (addresses are divisible by 2^3 = 8).
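A greedy construction of a concrete mapping, in the spirit of the description above, might look like the following sketch. The input format (a dict from block identifiers to sizes), the starting address of 8 (chosen so that no block sits at address 0) and the final rounding of the fresh address are assumptions; the summary does not give these details.

```python
MAX_ALIGN = 8  # 2**3: blocks are pessimistically assumed to be 8-byte aligned


def align_up(addr: int, alignment: int = MAX_ALIGN) -> int:
    """Round addr up to the next multiple of alignment."""
    return (addr + alignment - 1) // alignment * alignment


def build_concrete_mapping(block_sizes, start=MAX_ALIGN):
    """Greedily place blocks one after another at aligned addresses.

    Returns (cm, first_fresh_address); the second component plays the
    role that size_mem plays in the text.
    """
    cm = {}
    addr = start
    for block, size in block_sizes.items():
        addr = align_up(addr)
        cm[block] = addr
        addr += size
    return cm, align_up(addr)
```

An allocation could then be rejected when the first fresh address no longer fits in the address space, which is how such a model can account for a finite memory.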
3.3 Good Variable Properties
- In CompCert, the so-called good variable properties axiomatise the behaviour of the memory operations.
- The reverse operation is the concatenation of a symbolic expression sv1 with a symbolic expression sv2 representing a byte.
- The authors have generalised and proved the axioms of the memory model using the same principle.
- Moreover, although the structure of the proofs is similar, the proofs are complicated by the need to reason modulo normalisation of expressions.
4 Cross-validation of Memory Models
- The semantics of the CompCert C language is part of the trusted computing base of the compiler.
- If the resulting offset is outside the bounds, their normalisation returns undef.
- After the easy fix, the authors found two interesting semantic discrepancies with the current semantics of CompCert C. However, when running the compiled program, the pointer is a mere integer: it eventually overflows, wraps around and becomes 0.
- After adjusting both memory models, the authors are able to prove that both semantics agree when the existing CompCert C semantics is defined, thus cross-validating the semantics of operators.
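The wrap-around mentioned above can be illustrated with modular arithmetic on concrete addresses. The 32-bit width and the helper names are assumptions made for this sketch.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit machine addresses


def concrete_address(base: int, offset: int) -> int:
    """Concrete value of a pointer whose block starts at `base`."""
    return (base + offset) & MASK32


def is_null_concrete(base: int, offset: int) -> bool:
    """True when the concrete pointer value has wrapped around to 0."""
    return concrete_address(base, offset) == 0
```

A pointer incremented far enough past its block therefore compares equal to the null pointer at machine level, even though an abstract (block, offset) pair never would.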
5 Redesign of Memory Injections
- Memory injections are instrumental for proving the correctness of several compiler passes of CompCert.
- A memory injection defines a mapping between memories; it is a versatile tool to explain how passes reorganise the memory (e.g. construct an activation record from local variables).
- This section explains how to generalise this concept for symbolic expressions.
- It requires a careful handling of undefined values undef(l) which are absent from the existing memory model.
5.1 Memory Injections in CompCert
- The injection relation is defined over values (and called val_inject) and then lifted to memories (and called inject).
- The val_inject relation distinguishes three cases:
  1. For concrete values (i.e. integers or floating-point numbers), the relation is reflexive: e.g. int(i) is in relation with int(i).
  2. ptr(b, i) is in relation with ptr(b′, i + δ) when f(b) = ⌊(b′, δ)⌋.
  3. undef is in relation with any value (including undef).
- The purpose of the injection is twofold: it establishes a relation between pointers using the function f but it can also specialise undef by a defined value.
- In CompCert, so-called generic memory injections state that every valid location in memory m1 is mapped by function f into a valid location in memory m2; the corresponding location in m2 must be properly aligned with respect to the size of the block; and the values stored at corresponding locations must be in injection.
- Among other conditions, if several blocks in m1 are mapped to the same block in m2, the mapping must ensure that their images do not overlap.
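The three cases of val_inject can be written down as a small predicate. This is a sketch under assumptions: values are encoded as tuples, f is a Python dict from source blocks to (target block, δ) pairs with missing keys meaning "not injected", and offsets are plain integers rather than machine integers.

```python
def val_inject(f, v1, v2) -> bool:
    """Decide whether v1 is in injection with v2 under the block mapping f."""
    if v1[0] == "undef":
        return True                          # case 3: undef relates to any value
    if v1[0] in ("int", "float"):
        return v1 == v2                      # case 1: reflexive on concrete values
    if v1[0] == "ptr" and v2[0] == "ptr":
        _, b1, o1 = v1
        _, b2, o2 = v2
        # case 2: ptr(b, i) relates to ptr(b', i + delta) when f(b) = (b', delta)
        return f.get(b1) == (b2, o2 - o1)
    return False
```

Note the asymmetry: undef on the left relates to anything, but a defined value on the left never relates to undef on the right, so an injection can only make values more defined.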
5.2 Memory Injection with Symbolic Expressions
- The function f is still present and serves the same purpose.
- The authors' injection expr_inject is therefore defined as the composition of the function apply_spe spe, which specialises undef(l) into concrete bytes, and the function apply_inj f, which injects locations.
- This model makes the implicit assumption that memory blocks are always sufficiently aligned.
- The existing formalisation of inject has a property mi_representable which states that the offset o + δ obtained after injection does not overflow.
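The composition behind expr_inject can be sketched as two tree traversals. Several points here are assumptions rather than the authors' definitions: the tuple encoding of expressions, spe being a partial dict from locations to bytes, apply_inj relocating both pointers and any remaining undef labels, and the relation being checked as syntactic equality after both steps.

```python
def apply_spe(spe, e):
    """Replace undef(b, o) by the byte that spe assigns to it, if any."""
    if e[0] == "undef":
        label = (e[1], e[2])
        return ("int", spe[label]) if label in spe else e
    if e[0] == "op":
        return ("op", e[1], apply_spe(spe, e[2]), apply_spe(spe, e[3]))
    return e


def apply_inj(f, e):
    """Relocate every location through f; None if some block is not injected."""
    if e[0] in ("ptr", "undef"):
        if e[1] not in f:
            return None
        block, delta = f[e[1]]
        return (e[0], block, e[2] + delta)
    if e[0] == "op":
        lhs, rhs = apply_inj(f, e[2]), apply_inj(f, e[3])
        if lhs is None or rhs is None:
            return None
        return ("op", e[1], lhs, rhs)
    return e


def expr_inject(spe, f, e1, e2) -> bool:
    """e1 injects into e2 when specialising then relocating e1 yields e2."""
    return apply_inj(f, apply_spe(spe, e1)) == e2
```

Splitting the relation this way keeps the two roles of an injection separate: one step makes undefined bytes more defined, the other moves locations.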
5.3 Memory Injection and Normalisation
- The authors' normalisation is defined w.r.t. all the concrete memories compatible with the CompCert block-based memory (see Section 2.3).
- Theorem norm_inject shows that under the condition that all blocks are injected, if e and e′ are in injection, then their normalisations are in injection too.
- Thus, the normalisation can only get more defined after injection.
- This is expected as the injection can merge blocks and therefore makes pointer arithmetic more defined.
- A consequence of this theorem is that the compiler is not allowed to reduce the memory usage.
6 Proving the Front-end of the CompCert Compiler
- Later compiler passes are architecture dependent and are therefore part of the back-end.
- This section explains how to adapt the semantics preservation proofs of the front-end to their memory model with symbolic expressions.
6.1 CompCert Front-end with Symbolic Expressions
- The semantics of all intermediate languages need to be modified in order to account for symbolic expressions.
- In reality, the transformation is more subtle because, for instance, certain intermediate semantic functions explicitly require locations represented as pairs (b, o).
- This solution proves wrong and breaks semantics preservation proofs because introduced normalisations may be absent in subsequent intermediate languages.
- This pass does not transform the memory and therefore the existing proof can be reused.
- The pass also performs type-directed transformations and removes redundant casts.
2. allocation of local variables
- This relation is too weak and fails to pass the induction step.
- The problem is related to the preservation of the memory injection when allocating and de-allocating the variables in C#minor and the stack frame in Cminor.
- Once again, the authors adapt the two-step proof with a direct induction over the number of variables.
- To carry out this proof and establish an injection, the authors have to reason about the relative sizes of the memories.
- Here, the authors have to deal with the opposite situation where the stack frame could use less memory than the variables.
8 Conclusion
- This work is a milestone towards a CompCert compiler proved correct with respect to a more concrete memory model.
- A by-product of their work is that the authors have uncovered and fixed a problem in the existing semantics of the comparison with the null pointer.
- The authors are confident that program optimisations based on static analyses will not be problematic.
- Notwithstanding the remaining difficulties, the authors believe that the full CompCert compiler can be ported to their novel memory model.
- This would improve further the confidence in the generated code.
Frequently Asked Questions
Q2. What future work do the authors describe in "A Concrete Memory Model for CompCert"?
As future work, the authors plan to study how to adapt the back-end of CompCert. Notwithstanding the remaining difficulties, the authors believe that the full CompCert compiler can be ported to their novel memory model. This would further improve confidence in the generated code.