Journal ArticleDOI

Exploring C semantics and pointer provenance

TL;DR: This paper aims to reconcile the ISO C standard, mainstream compiler behaviour, and the semantics relied on by the corpus of existing C code, and presents two coherent proposals, one tracking provenance via integers and one not; both address many design questions.
Abstract: The semantics of pointers and memory objects in C has been a vexed question for many years. C values cannot be treated as either purely abstract or purely concrete entities: the language exposes their representations, but compiler optimisations rely on analyses that reason about provenance and initialisation status, not just runtime representations. The ISO WG14 standard leaves much of this unclear, and in some respects differs from de facto standard usage, which itself is difficult to investigate. In this paper we explore the possible source-language semantics for memory objects and pointers, in ISO C and in C as it is used and implemented in practice, focussing especially on pointer provenance. We aim to, as far as possible, reconcile the ISO C standard, mainstream compiler behaviour, and the semantics relied on by the corpus of existing C code. We present two coherent proposals, one tracking provenance via integers and one not; both address many design questions. We highlight some pros and cons and open questions, and illustrate the discussion with a library of test cases. We make our semantics executable as a test oracle, integrating it with the Cerberus semantics for much of the rest of C, which we have made substantially more complete and robust, and equipped with a web-interface GUI. This allows us to experimentally assess our proposals on those test cases. To assess their viability with respect to larger bodies of C code, we analyse the changes required and the resulting behaviour for a port of FreeBSD to CHERI, a research architecture supporting hardware capabilities, which (roughly speaking) traps on the memory safety violations which our proposals deem undefined behaviour. We also develop a new runtime instrumentation tool to detect possible provenance violations in normal C code, and apply it to some of the SPEC benchmarks. We compare our proposal with a source-language variant of the twin-allocation LLVM semantics proposal of Lee et al. Finally, we describe ongoing interactions with WG14, exploring how our proposals could be incorporated into the ISO standard.
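
To make the abstract's point concrete, here is a minimal sketch in C of the kind of situation a provenance semantics has to rule on. It is written in the spirit of the test cases the abstract mentions, not copied from them, and the function names (`same_representation`, `roundtrip_via_integer`, `demo_one_past`) are hypothetical. The sketch assumes an implementation that provides `uintptr_t`. Two pointers can have identical runtime representations while being derived from different objects. The code deliberately performs only the accesses that are defined regardless of how that question is settled.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if the two pointer values have byte-identical representations.
 * Under a provenance-aware semantics this does NOT imply that the two
 * pointers may be used interchangeably for memory accesses. */
int same_representation(const int *p, const int *q) {
    return memcmp(&p, &q, sizeof p) == 0;
}

/* Round-trips a pointer through an integer type. The casts go via void *
 * because that is the conversion ISO C describes for uintptr_t. A simple
 * round trip like this is expected to yield a usable pointer whether or
 * not provenance is tracked through integers. */
int *roundtrip_via_integer(int *p) {
    uintptr_t i = (uintptr_t)(void *)p;
    return (int *)(void *)i;
}

/* x and y may or may not be adjacent in memory; the layout is up to the
 * implementation. If they are, the one-past pointer p and the pointer q
 * share an address but were derived from different objects. */
int demo_one_past(void) {
    static int x = 1, y = 2;
    int *p = &x + 1;   /* legal to form, but not to dereference */
    int *q = &y;
    if (same_representation(p, q)) {
        /* Writing through q is fine: q was derived from y.
         * Writing through p here is what a provenance semantics
         * would classify as undefined behaviour. */
        *q = 11;
    }
    (void)x;           /* x is only used to form p */
    return y;          /* 2 or 11, depending on memory layout */
}
```

Calling `demo_one_past()` returns either 2 or 11 depending on how the implementation lays out `x` and `y`; both outcomes are consistent with the program being well defined, because only `q` is ever dereferenced.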
Citations
Proceedings ArticleDOI
04 Apr 2019
TL;DR: This work describes the first adaptation of a full C-language operating system (FreeBSD) with an enterprise database (PostgreSQL) for complete spatial and referential memory safety and shows that awareness of abstract capabilities, coupled with CHERI architectural capabilities, can provide more complete protection, strong compatibility, and acceptable performance overhead compared with the pre-CHERI baseline and software-only approaches.
Abstract: The CHERI architecture allows pointers to be implemented as capabilities (rather than integer virtual addresses) in a manner that is compatible with, and strengthens, the semantics of the C language. In addition to the spatial protections offered by conventional fat pointers, CHERI capabilities offer strong integrity, enforced provenance validity, and access monotonicity. The stronger guarantees of these architectural capabilities must be reconciled with the real-world behavior of operating systems, run-time environments, and applications. When the process model, user-kernel interactions, dynamic linking, and memory management are all considered, we observe that simple derivation of architectural capabilities is insufficient to describe appropriate access to memory. We bridge this conceptual gap with a notional abstract capability that describes the accesses that should be allowed at a given point in execution, whether in the kernel or userspace. To investigate this notion at scale, we describe the first adaptation of a full C-language operating system (FreeBSD) with an enterprise database (PostgreSQL) for complete spatial and referential memory safety. We show that awareness of abstract capabilities, coupled with CHERI architectural capabilities, can provide more complete protection, strong compatibility, and acceptable performance overhead compared with the pre-CHERI baseline and software-only approaches. Our observations also have potentially significant implications for other mitigation techniques.

43 citations

Proceedings ArticleDOI
19 Jun 2021
TL;DR: RefinedC is a type system that combines ownership types (for modular reasoning about shared state and concurrency) with refinement types (for encoding precise invariants on C data types and Hoare-style specifications for C functions).
Abstract: Given the central role that C continues to play in systems software, and the difficulty of writing safe and correct C code, it remains a grand challenge to develop effective formal methods for verifying C programs. In this paper, we propose a new approach to this problem: a type system we call RefinedC, which combines ownership types (for modular reasoning about shared state and concurrency) with refinement types (for encoding precise invariants on C data types and Hoare-style specifications for C functions). RefinedC is both automated (requiring minimal user intervention) and foundational (producing a proof of program correctness in Coq), while at the same time handling a range of low-level programming idioms such as pointer arithmetic. In particular, following the approach of RustBelt, the soundness of the RefinedC type system is justified semantically by interpretation into the Coq-based Iris framework for higher-order concurrent separation logic. However, the typing rules of RefinedC are also designed to be encodable in a new “separation logic programming” language we call Lithium. By restricting to a carefully chosen (yet expressive) fragment of separation logic, Lithium supports predictable, automatic, goal-directed proof search without backtracking. We demonstrate the effectiveness of RefinedC on a range of representative examples of C code.

33 citations

Journal ArticleDOI
20 Dec 2019
TL;DR: Stacked Borrows is proposed, an operational semantics for memory accesses in Rust that defines an aliasing discipline and declares programs violating it to have undefined behavior, meaning the compiler does not have to consider such programs when performing optimizations.
Abstract: Type systems are useful not just for the safety guarantees they provide, but also for helping compilers generate more efficient code by simplifying important program analyses. In Rust, the type system imposes a strict discipline on pointer aliasing, and it is an express goal of the Rust compiler developers to make use of that alias information for the purpose of program optimizations that reorder memory accesses. The problem is that Rust also supports unsafe code, and programmers can write unsafe code that bypasses the usual compiler checks to violate the aliasing discipline. To strike a balance between optimizations and unsafe code, the language needs to provide a set of rules such that unsafe code authors can be sure, if they are following these rules, that the compiler will preserve the semantics of their code despite all the optimizations it is doing. In this work, we propose Stacked Borrows, an operational semantics for memory accesses in Rust. Stacked Borrows defines an aliasing discipline and declares programs violating it to have undefined behavior, meaning the compiler does not have to consider such programs when performing optimizations. We give formal proofs (mechanized in Coq) showing that this rules out enough programs to enable optimizations that reorder memory accesses around unknown code and function calls, based solely on intraprocedural reasoning. We also implemented this operational model in an interpreter for Rust and ran large parts of the Rust standard library test suite in the interpreter to validate that the model permits enough real-world unsafe Rust code.

29 citations


Cites background from "Exploring C semantics and pointer p..."

  • ...…of work on formalizing the semantics of C or LLVM (as representative examples of highly optimized “low-level” languages) and in particular their handling of pointers and pointer provenance [Memarian et al. 2019; Krebbers 2015; Kang et al. 2015; Lee et al. 2018; Hathhorn et al. 2015; Norrish 1998]....

    [...]

Proceedings ArticleDOI
18 May 2020
TL;DR: This paper formalises key intended security properties of the design, and establishes that these hold with mechanised proof for CHERI, an architecture with hardware capabilities that supports fine-grained memory protection and scalable secure compartmentalisation, while offering a smooth adoption path for existing software.
Abstract: The root causes of many security vulnerabilities include a pernicious combination of two problems, often regarded as inescapable aspects of computing. First, the protection mechanisms provided by the mainstream processor architecture and C/C++ language abstractions, dating back to the 1970s and before, provide only coarse-grain virtual-memory-based protection. Second, mainstream system engineering relies almost exclusively on test-and-debug methods, with (at best) prose specifications. These methods have historically sufficed commercially for much of the computer industry, but they fail to prevent large numbers of exploitable bugs, and the security problems that this causes are becoming ever more acute. In this paper we show how more rigorous engineering methods can be applied to the development of a new security-enhanced processor architecture, with its accompanying hardware implementation and software stack. We use formal models of the complete instruction-set architecture (ISA) at the heart of the design and engineering process, both in lightweight ways that support and improve normal engineering practice - as documentation, in emulators used as a test oracle for hardware and for running software, and for test generation - and for formal verification. We formalise key intended security properties of the design, and establish that these hold with mechanised proof. This is for the same complete ISA models (complete enough to boot operating systems), without idealisation. We do this for CHERI, an architecture with hardware capabilities that supports fine-grained memory protection and scalable secure compartmentalisation, while offering a smooth adoption path for existing software. CHERI is a maturing research architecture, developed since 2010, with work now underway on an Arm industrial prototype to explore its possible adoption in mass-market commercial processors. The rigorous engineering work described here has been an integral part of its development to date, enabling more rapid and confident experimentation, and boosting confidence in the design.

22 citations

Proceedings ArticleDOI
23 Jun 2019
TL;DR: This paper presents MS-Wasm, an extension to Wasm that lets developers capture low-level C/C++ memory semantics, such as pointers and memory allocation, in Wasm at compile time, so that Wasm compilers and JITs can use hardware memory-safety features that plain Wasm cannot exploit.
Abstract: WebAssembly (Wasm) is a low-level platform-independent bytecode language. Today, developers can compile C/C++ to Wasm and run it everywhere, at almost native speeds. Unfortunately, this compilation from C/C++ to Wasm also preserves classic memory safety vulnerabilities, such as buffer overflows and use-after-frees. New processor features (e.g., tagged memory, pointer authentication, and fine grain capabilities) are making it increasingly possible to detect, mitigate, and prevent such vulnerabilities with low overhead. Unfortunately, Wasm JITs and compilers cannot exploit these features. Critical high-level information (e.g., the size of an array) is lost when lowering to Wasm. We present MS-Wasm, an extension to Wasm that bridges this gap by allowing developers to capture low-level C/C++ memory semantics such as pointers and memory allocation in Wasm, at compile time. At deployment time, Wasm compilers and JITs can leverage these added semantics to enforce different models of memory safety depending on user preferences and what hardware is available on the target platform. This way, MS-Wasm offers a range of security-performance trade-offs, and enables users to move to progressively stronger models of memory safety as hardware evolves.

19 citations


Cites background from "Exploring C semantics and pointer p..."

  • ...both for performance—it eliminates unnecessary checks during pointer arithmetic—and compatibility—as pointers that temporarily point out of bounds are common [11] and benign behavior in C programs [30, 31]....

    [...]

References
Book ChapterDOI
29 Mar 2008
TL;DR: Z3 is a new and efficient SMT Solver freely available from Microsoft Research that is used in various software verification and analysis applications.
Abstract: The Satisfiability Modulo Theories (SMT) problem is a decision problem for logical first-order formulas with respect to combinations of background theories such as arithmetic, bit-vectors, arrays, and uninterpreted functions. Z3 is a new and efficient SMT Solver freely available from Microsoft Research. It is used in various software verification and analysis applications.

6,859 citations

Book ChapterDOI
08 Apr 2002
TL;DR: The structure of CIL is described, with a focus on how it disambiguates those features of C that were found to be most confusing for program analysis and transformation, allowing a complete project to be viewed as a single compilation unit.
Abstract: This paper describes the C Intermediate Language: a high-level representation along with a set of tools that permit easy analysis and source-to-source transformation of C programs. Compared to C, CIL has fewer constructs. It breaks down certain complicated constructs of C into simpler ones, and thus it works at a lower level than abstract-syntax trees. But CIL is also more high-level than typical intermediate languages (e.g., three-address code) designed for compilation. As a result, what we have is a representation that makes it easy to analyze and manipulate C programs, and emit them in a form that resembles the original source. Moreover, it comes with a front-end that translates to CIL not only ANSI C programs but also those using Microsoft C or GNU C extensions. We describe the structure of CIL with a focus on how it disambiguates those features of C that we found to be most confusing for program analysis and transformation. We also describe a whole-program merger based on structural type equality, allowing a complete project to be viewed as a single compilation unit. As a representative application of CIL, we show a transformation aimed at making code immune to stack-smashing attacks. We are currently using CIL as part of a system that analyzes and instruments C programs with run-time checks to ensure type safety. CIL has served us very well in this project, and we believe it can usefully be applied in other situations as well.

1,065 citations


"Exploring C semantics and pointer p..." refers background or methods in this paper

  • ...These are things that are handled by any full-fledged C front-end implementation, and by CIL [Necula et al. 2002], but Cerberus aims to have a clear relationship to the standard, to capture exactly what it says (where that is well-defined), and to report all undefined behaviours, and so we do not…...

    [...]

  • ...It is broadly similar to other dynamic analyses such as SoftBound [Nagarakatte et al. 2009] or Memcheck [Nethercote and Seward 2007], but it is implemented by source-to-source translation using CIL [Necula et al. 2002] (giving access to C source features)....

    [...]

  • ...Analysis tools such as tis-interpreter [Cuoq et al. 2017; TrustInSoft 2017] and CBMC [Kroening and Tautschnig 2014] also have to deal with much of the semantics of C, although with implicit rather than explicit semantic models, as did CIL [Necula et al. 2002]....

    [...]

Proceedings ArticleDOI
15 Jun 2009
TL;DR: Inspired by HardBound, a previously proposed hardware-assisted approach, SoftBound similarly records base and bound information for every pointer as disjoint metadata, which enables SoftBound to provide spatial safety without requiring changes to C source code.
Abstract: The serious bugs and security vulnerabilities facilitated by C/C++'s lack of bounds checking are well known, yet C and C++ remain in widespread use. Unfortunately, C's arbitrary pointer arithmetic, conflation of pointers and arrays, and programmer-visible memory layout make retrofitting C/C++ with spatial safety guarantees extremely challenging. Existing approaches suffer from incompleteness, have high runtime overhead, or require non-trivial changes to the C source code. Thus far, these deficiencies have prevented widespread adoption of such techniques. This paper proposes SoftBound, a compile-time transformation for enforcing spatial safety of C. Inspired by HardBound, a previously proposed hardware-assisted approach, SoftBound similarly records base and bound information for every pointer as disjoint metadata. This decoupling enables SoftBound to provide spatial safety without requiring changes to C source code. Unlike HardBound, SoftBound is a software-only approach and performs metadata manipulation only when loading or storing pointer values. A formal proof shows that this is sufficient to provide spatial safety even in the presence of arbitrary casts. SoftBound's full checking mode provides complete spatial violation detection with 67% runtime overhead on average. To further reduce overheads, SoftBound has a store-only checking mode that successfully detects all the security vulnerabilities in a test suite at the cost of only 22% runtime overhead on average.
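
The idea of keeping base and bound information separate from the pointer itself can be illustrated with a small hand-written sketch in C. This is not SoftBound's implementation, which is a compiler transformation. The type `bounds_meta` and the functions `make_meta` and `access_ok` are hypothetical names chosen for illustration, and the sketch assumes a flat address space in which converting pointers to `uintptr_t` preserves their ordering.

```c
#include <stddef.h>
#include <stdint.h>

/* Bounds metadata kept apart from the pointer it describes, so the
 * pointer itself keeps its ordinary representation. (Hypothetical type.) */
typedef struct {
    uintptr_t base;   /* address of the first byte of the object */
    uintptr_t bound;  /* address one past the last byte of the object */
} bounds_meta;

/* Builds metadata for an object of the given size in bytes. */
bounds_meta make_meta(const void *obj, size_t size) {
    bounds_meta m;
    m.base = (uintptr_t)obj;
    m.bound = (uintptr_t)obj + size;
    return m;
}

/* Returns 1 if an access of access_size bytes starting at ptr lies
 * entirely within [base, bound), and 0 otherwise. The range test on
 * the start address comes first so that bound - a cannot wrap. */
int access_ok(bounds_meta m, const void *ptr, size_t access_size) {
    uintptr_t a = (uintptr_t)ptr;
    return a >= m.base && a <= m.bound && access_size <= m.bound - a;
}
```

For an array `int arr[4]` with `m = make_meta(arr, sizeof arr)`, the check accepts a `sizeof(int)` access at `&arr[3]` and rejects one at the one-past-the-end pointer `arr + 4`, which may be formed but not dereferenced.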

563 citations


"Exploring C semantics and pointer p..." refers methods in this paper

  • ...It is broadly similar to other dynamic analyses such as SoftBound [Nagarakatte et al. 2009] or Memcheck [Nethercote and Seward 2007], but it is implemented by source-to-source translation using CIL [Necula et al. 2002] (giving access to C source features)....

    [...]

Proceedings ArticleDOI
07 Jun 2008
TL;DR: This paper describes an effort to explicitly provide semantics for threads in the next revision of the C++ standard. It presents the simple model the authors would like to provide for C++ threads programmers and explains how this, together with some practical but often under-appreciated implementation constraints, drives their design decisions.
Abstract: Currently multi-threaded C or C++ programs combine a single-threaded programming language with a separate threads library. This is not entirely sound [7]. We describe an effort, currently nearing completion, to address these issues by explicitly providing semantics for threads in the next revision of the C++ standard. Our approach is similar to that recently followed by Java [25], in that, at least for a well-defined and interesting subset of the language, we give sequentially consistent semantics to programs that do not contain data races. Nonetheless, a number of our decisions are often surprising even to those familiar with the Java effort: We (mostly) insist on sequential consistency for race-free programs, in spite of implementation issues that came to light after the Java work. We give no semantics to programs with data races. There are no benign C++ data races. We use weaker semantics for trylock than existing languages or libraries, allowing us to promise sequential consistency with an intuitive race definition, even for programs with trylock. This paper describes the simple model we would like to be able to provide for C++ threads programmers, and explains how this, together with some practical, but often under-appreciated implementation constraints, drives us towards the above decisions.

491 citations

Journal ArticleDOI
TL;DR: This article describes the development and formal verification of a compiler back-end from Cminor (a simple imperative intermediate language) to PowerPC assembly code, using the Coq proof assistant both for programming the compiler and for proving its soundness.
Abstract: This article describes the development and formal verification (proof of semantic preservation) of a compiler back-end from Cminor (a simple imperative intermediate language) to PowerPC assembly code, using the Coq proof assistant both for programming the compiler and for proving its soundness. Such a verified compiler is useful in the context of formal methods applied to the certification of critical software: the verification of the compiler guarantees that the safety properties proved on the source code hold for the executable compiled code as well.

489 citations


"Exploring C semantics and pointer p..." refers methods in this paper

  • ...The project page includes data for various compilers and other tools for these tests: GCC 8.1, Clang 6.0, ICC 19, UBSAN, ASAN, MSAN, CompCert [Leroy 2009; Leroy et al. 2018], RV-Match [Guth et al. 2016], CH2O [Krebbers 2015], and CHERI [Chisnall et al. 2015; Watson et al. 2018, 2015; Woodruff et…...

    [...]