TL;DR: This paper presents a system-level co-design and co-verification case study for a non-trivial design in which multiple high-performance x86 processors and custom hardware are connected through a coherent interconnection fabric, and a processor bus functional model is used to combine native software execution with a cycle-accurate interconnect simulator and an HDL simulator.
Abstract: This paper presents an interesting system-level co-design and co-verification case study for a non-trivial design where multiple high-performance x86 processors and custom hardware were connected through a coherent interconnection fabric. In functional verification of such a system, we used a processor bus functional model (BFM) to combine native software execution with a cycle-accurate interconnect simulator and an HDL simulator. However, we found that significant extensions need to be made to the conventional BFM methodology in order to capture various data-race cases in simulation, which inevitably occur in modern multi-processor systems. Especially essential were faithful implementations of the memory consistency model and cache coherence protocol, as well as timing randomization. We demonstrate how such a co-simulation environment can be constructed from existing tools and software. Lessons from our study can similarly be applied to the design and verification of other tightly coupled systems.
Modern digital systems are moving increasingly towards heterogeneity.
The authors discuss which among the many conventional co-design/verification methods would best serve their purposes and why.
The authors show the effectiveness of their methodology and draw out general lessons from it (Section 4), before they conclude in Section 5.
The authors found that combining the software model with an interconnection simulator and HDL simulator via a processor BFM is the most effective method for functional verification.
The authors also explain, however, that the conventional way of constructing a processor BFM must be revised to reflect modern multi-core processors and interconnection architectures; it is especially important to accurately implement the memory consistency model and cache coherence protocol.
2.1 Target Design
This section outlines the design of their system to provide enough background to understand the co-design and co-verification issues; the detailed design of the system is outside the scope of the paper.
A typical software transactional memory (STM), which implements such a runtime system entirely in software, tends to incur substantial performance overhead.
(6) Based on the read/write addresses received from all the cores, the TM hardware determines which cores have conflicting reads and writes and sends the requisite messages to those cores.
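The exact conflict-detection logic is not spelled out here; the sketch below only illustrates the kind of read/write-set intersection check this step describes, assuming per-core address sets. The names (CoreSets, conflicts) and data layout are illustrative, not the hardware's implementation.

```cpp
// Illustrative sketch of read/write-set conflict detection of the kind the
// TM hardware is described as performing; not the paper's implementation.
#include <cstdint>
#include <iostream>
#include <set>
#include <vector>

using AddrSet = std::set<uint64_t>;

struct CoreSets {
    AddrSet reads;
    AddrSet writes;
};

// Two cores conflict if one core's write set intersects the other core's
// read set or write set (read-write and write-write conflicts).
static bool conflicts(const CoreSets& a, const CoreSets& b) {
    auto intersects = [](const AddrSet& x, const AddrSet& y) {
        for (uint64_t addr : x)
            if (y.count(addr)) return true;
        return false;
    };
    return intersects(a.writes, b.reads) ||
           intersects(a.writes, b.writes) ||
           intersects(b.writes, a.reads);
}

int main() {
    std::vector<CoreSets> cores = {
        {{0x1000, 0x2000}, {0x3000}},   // core 0 writes 0x3000
        {{0x3000},         {0x4000}},   // core 1 reads  0x3000 -> conflict
    };
    for (size_t i = 0; i < cores.size(); ++i)
        for (size_t j = i + 1; j < cores.size(); ++j)
            if (conflicts(cores[i], cores[j]))
                std::cout << "conflict between core " << i
                          << " and core " << j << "\n";
}
```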
The system is composed of two quad-core x86 CPUs and an FPGA, connected coherently via a chain of point-to-point links.
Messages from the CPU are sent to the custom hardware via a non-coherent interface, while responses to the CPU go through the coherent cache.
2.2 Our Initial Failure and Issues with Co-Verification
Since their system was composed of tightly coupled hardware and software, the authors had to deal with a classic chicken-and-egg co-verification problem.
A crash observed after one final memory access from one core (e.g. dereferencing a dangling pointer) could be the result of a write, falsely allowed to commit, from another core millions of cycles earlier.
The new STM software needed to be intensively validated (together with the new hardware), especially under parallel execution, variable latency, and out-of-order message delivery, as shown in Figure 2.
An alternative was to use a detailed architecture simulator (e.g. [22]) but its simulation speed was insufficient.
This method brought its own challenges which the authors discuss in detail in the following section.
3.1 General Issues and Solutions
The authors discuss general issues that arise when using a software model for HW/SW co-verification and how they overcame those issues.
Hardware simulation can consist of two different components: a cycle-accurate interconnection network simulation and HDL simulation.
Instead, the authors rely on the single-threaded simulator to interleave multiple software execution contexts.
(Issue #4) The memory consistency model must be carefully considered.
Otherwise, the contents of the store buffer, up to the entry that has been matched, are flushed before the new packet is injected.
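As a concrete illustration (not the paper's code), the sketch below models an x86-TSO-style FIFO store buffer inside the BFM: before injecting a packet whose address matches a buffered store, the buffer is drained up to and including the matching entry, so store order is preserved on the interconnect. All names and the exact drain policy are assumptions.

```cpp
// Hedged sketch of a FIFO store buffer in the BFM. Older stores are drained
// (in program order) up to a matching entry before the new packet is injected.
#include <cstdint>
#include <deque>
#include <iostream>

struct Store { uint64_t addr; uint64_t data; };

class StoreBuffer {
public:
    void buffer_store(uint64_t addr, uint64_t data) {
        fifo_.push_back({addr, data});
    }

    // Call before injecting a packet that accesses `addr`: if any buffered
    // store matches, drain the buffer up to and including that entry first.
    void flush_up_to(uint64_t addr) {
        bool match = false;
        for (const Store& s : fifo_)
            if (s.addr == addr) { match = true; break; }
        if (!match) return;
        while (!fifo_.empty()) {
            Store s = fifo_.front();
            fifo_.pop_front();
            inject_store_packet(s);          // becomes a packet in the network
            if (s.addr == addr) break;       // flushed up to the matched entry
        }
    }

private:
    void inject_store_packet(const Store& s) {
        // Stand-in for the real packet-injection call into the simulator.
        std::cout << "inject store to 0x" << std::hex << s.addr << std::dec << "\n";
    }
    std::deque<Store> fifo_;
};

int main() {
    StoreBuffer sb;
    sb.buffer_store(0x100, 1);
    sb.buffer_store(0x200, 2);
    sb.flush_up_to(0x200);   // drains the stores to 0x100 and 0x200, in order
}
```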
3.2 Our Co-simulation Environment: Implementation
This subsection details the implementation of their co-simulation environment, in which all the issues discussed in the preceding subsection are resolved.
Notably, the API provides separate methods for normal, non-coherent, and uncacheable accesses, as well as flush and atomic operations.
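Only SIM_Noncoh_write is named later in the text; the header-style sketch below shows what such an API surface might look like. All signatures, and every declaration other than SIM_Noncoh_write's name, are hypothetical additions for illustration.

```cpp
// Hypothetical header-style sketch of a BFM access API; only the name
// SIM_Noncoh_write appears in the text, and all signatures are assumptions.
#include <cstddef>
#include <cstdint>

extern "C" {
// Normal (coherent, cacheable) accesses.
void SIM_read (uint64_t addr, void* buf, size_t len);                 // hypothetical
void SIM_write(uint64_t addr, const void* buf, size_t len);           // hypothetical

// Non-coherent accesses (e.g. messages to the custom hardware).
void SIM_Noncoh_write(uint64_t addr, const void* buf, size_t len);    // name from the text, signature assumed
void SIM_Noncoh_read (uint64_t addr, void* buf, size_t len);          // hypothetical

// Uncacheable accesses, cache-line flush, and atomic read-modify-write.
void SIM_UC_write(uint64_t addr, const void* buf, size_t len);        // hypothetical
void SIM_flush   (uint64_t addr);                                     // hypothetical
uint64_t SIM_atomic_cmpxchg(uint64_t addr, uint64_t expected,
                            uint64_t desired);                        // hypothetical
}
```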
Figure 5 shows how execution flows from a SW context (i.e. a fiber executing the SW model) to the simulator context (i.e. the main fiber for simulation).
Instead of actually sending a transactional read message to the FPGA, the HAL part of the STM calls into the BFM API (SIM_Noncoh_write) which eventually injects a packet into the simulator (BUS_Inject) and switches context to simulator execution (SIM_return_simulator).
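A minimal sketch of this hand-off using POSIX ucontext fibers is shown below. The call chain mirrors the one described (SIM_Noncoh_write -> BUS_Inject -> SIM_return_simulator), but the function bodies, the single-argument signatures, and the stack setup are assumptions made for illustration.

```cpp
// Minimal fiber hand-off sketch (POSIX ucontext). The SW model yields to the
// simulator fiber whenever it injects a packet; names mirror the text, but
// bodies and signatures are assumptions.
#include <ucontext.h>
#include <cstdint>
#include <iostream>
#include <vector>

static ucontext_t simulator_ctx;   // main fiber: the network/HDL simulator
static ucontext_t sw_ctx;          // fiber running the SW model (STM HAL)

static void BUS_Inject(uint64_t addr) {
    std::cout << "packet queued for 0x" << std::hex << addr << std::dec << "\n";
}

// Switch from the SW fiber back to the simulator fiber.
static void SIM_return_simulator() {
    swapcontext(&sw_ctx, &simulator_ctx);
}

// BFM API entry used by the STM's HAL instead of a real transactional message.
static void SIM_Noncoh_write(uint64_t addr) {
    BUS_Inject(addr);          // hand the packet to the network simulator
    SIM_return_simulator();    // yield until the simulator resumes this fiber
}

static void sw_model() {
    std::cout << "SW model: issuing non-coherent write\n";
    SIM_Noncoh_write(0xF000);
    std::cout << "SW model: resumed after simulation advanced\n";
    SIM_return_simulator();
}

int main() {
    std::vector<char> stack(64 * 1024);
    getcontext(&sw_ctx);
    sw_ctx.uc_stack.ss_sp   = stack.data();
    sw_ctx.uc_stack.ss_size = stack.size();
    sw_ctx.uc_link = &simulator_ctx;
    makecontext(&sw_ctx, sw_model, 0);

    swapcontext(&simulator_ctx, &sw_ctx);   // run the SW model until it yields
    std::cout << "simulator: advancing cycles for the injected packet\n";
    swapcontext(&simulator_ctx, &sw_ctx);   // resume the SW model
}
```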
The network simulator, which performs simple cycle-based simulation, calls a clock function on each BFM at every simulation cycle.
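The cycle-based driver loop implied by this description might look like the sketch below, assuming a per-BFM clock() hook; the class and method names are illustrative, not the simulator's actual interface.

```cpp
// Illustrative cycle-based driver: each cycle, advance the network one step,
// then call every attached BFM's clock hook. Names are assumptions.
#include <cstdint>
#include <memory>
#include <vector>

struct ProcessorBFM {
    virtual ~ProcessorBFM() = default;
    // Called once per simulation cycle: drain store buffers, pick up
    // responses addressed to this processor, resume suspended SW fibers, etc.
    virtual void clock(uint64_t cycle) = 0;
};

class NetworkSimulator {
public:
    void attach(std::shared_ptr<ProcessorBFM> bfm) { bfms_.push_back(std::move(bfm)); }

    void run(uint64_t cycles) {
        for (uint64_t c = 0; c < cycles; ++c) {
            step_network(c);                         // move packets one hop
            for (auto& bfm : bfms_) bfm->clock(c);   // per-BFM clock callback
        }
    }

private:
    void step_network(uint64_t /*cycle*/) { /* cycle-accurate link/router model */ }
    std::vector<std::shared_ptr<ProcessorBFM>> bfms_;
};

struct NullBFM : ProcessorBFM {
    void clock(uint64_t) override {}
};

int main() {
    NetworkSimulator sim;
    sim.attach(std::make_shared<NullBFM>());
    sim.run(100);
}
```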
4. RESULTS AND DISCUSSION
The authors' new co-simulation environment (Section 3.2) was extremely useful for verifying the functional correctness of their system.
Randomized timing, in particular, helped to explore corner cases in data-race conditions.
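One simple way to realize such randomization, sketched below under assumed parameters, is to hold each packet for a small uniformly random number of cycles before injecting it, so repeated runs exercise different data-access interleavings. The delay range and seeding policy are illustrative, not the paper's values.

```cpp
// Illustrative randomized packet-injection timing: each packet waits a small
// uniformly random number of cycles before entering the network.
#include <cstdint>
#include <iostream>
#include <random>

class RandomizedInjector {
public:
    explicit RandomizedInjector(uint32_t seed, uint32_t max_delay = 16)
        : rng_(seed), dist_(0, max_delay) {}

    // Cycle at which a packet created at `now` should actually be injected.
    uint64_t injection_cycle(uint64_t now) { return now + dist_(rng_); }

private:
    std::mt19937 rng_;
    std::uniform_int_distribution<uint32_t> dist_;
};

int main() {
    RandomizedInjector inj(/*seed=*/42);
    for (int i = 0; i < 3; ++i)
        std::cout << "packet ready at cycle 100, injected at cycle "
                  << inj.injection_cycle(100) << "\n";
}
```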
Since most of the software model was executed natively, no valuable simulation cycles were wasted executing instructions that were unnecessary for functional verification.
Fourth, the co-simulation environment provided a very helpful error detection mechanism, which was impossible in native execution on the FPGA.
Note that the last row points out which address is violating serializability. From this log, the authors were able to relate SW context and HW status, since the simulation cycle is shared by both the SW simulation and the HDL simulation.
5. CONCLUSION
The authors presented their HW/SW co-verification experience on a commodity multi-processor system with custom hardware.
For the sake of functional verification of such a system, it was most effective to combine native SW execution with cycle-based interconnect simulation and HDL simulation by means of a processor BFM.
Their experiences showed that such BFMs should faithfully reflect the memory consistency models of their target processors and would benefit greatly from randomized packet injection timing in their network simulations.
These requirements enable the co-simulation to generate a wide variety of data access interleavings, which is essential for co-verification of modern multi-processor systems.
TL;DR: This work proposes selective caching, which disallows GPU caching of any memory that would require coherence updates to propagate between the CPU and GPU, thereby decoupling the GPU from vendor-specific CPU coherence protocols.
Abstract: Cache coherence is ubiquitous in shared memory multiprocessors because it provides a simple, high performance memory abstraction to programmers. Recent work suggests extending hardware cache coherence between CPUs and GPUs to help support programming models with tightly coordinated sharing between CPU and GPU threads. However, implementing hardware cache coherence is particularly challenging in systems with discrete CPUs and GPUs that may not be produced by a single vendor. Instead, we propose selective caching, wherein we disallow GPU caching of any memory that would require coherence updates to propagate between the CPU and GPU, thereby decoupling the GPU from vendor-specific CPU coherence protocols. We propose several architectural improvements to offset the performance penalty of selective caching: aggressive request coalescing, CPU-side coherent caching for GPU-uncacheable requests, and a CPU-GPU interconnect optimization to support variable-size transfers. Moreover, current GPU workloads access many read-only memory pages; we exploit this property to allow promiscuous GPU caching of these pages, relying on page-level protection, rather than hardware cache coherence, to ensure correctness. These optimizations bring a selective caching GPU implementation to within 93% of a hardware cache-coherent implementation without the need to integrate CPUs and GPUs under a single hardware coherence protocol.
44 citations
Cites background from "A case of system-level hardware/sof..."
...Such approaches also incur highly coordinated design and verification effort by both CPU and GPU vendors [24] that is challenging when multiple vendors wish to integrate existing CPU and GPU designs in a timely manner....
[...]
...Building scalable, high-performance cache coherence requires a holistic system that strikes a balance between directory storage overhead, cache probe bandwidth, and application characteristics [8, 24, 33, 36, 54, 55, 58]....
TL;DR: An ideal security verification solution that naturally handles both hardware and software components is sketched, and formal verification methods that have already been proposed for mixed hardware/software systems are evaluated against this ideal method.
Abstract: Critical and privacy-sensitive applications of smart and connected objects such as health-related objects are now common, thus raising the need to design these objects with strong security guarantees. Many recent works offer practical hardware-assisted security solutions that take advantage of a tight cooperation between hardware and software to provide system-level security guarantees. Formally and consistently proving the efficiency of these solutions raises challenges since software and hardware verifications approaches generally rely on different representations. The paper first sketches an ideal security verification solution naturally handling both hardware and software components. Next, it proposes an evaluation of formal verification methods that have already been proposed for mixed hardware/software systems, with regards to the ideal method. At last, the paper presents a conceptual approach to this ideal method relying on ProVerif, and applies this approach to a remote attestation system (SMART).
14 citations
Cites methods from "A case of system-level hardware/sof..."
...As disjoint verification of hardware and software naturally suffers from the considerable manual effort needed for finding a good abstraction that could both be proved to be a refinement of the hardware and be used as a base for verifying the software, some research work has been done to verify hardware/software co-designs as a whole [20,23]....
TL;DR: A SystemC transaction level modelling wrapping library that can be used for the assertion of system properties, protocol compliance, or fault injection and has been successfully applied to the robustness verification of the on-board boot software of the Instrument Control Unit of the Solar Orbiter's Energetic Particle Detector.
Abstract: This paper presents the design of a SystemC transaction level modelling wrapping library that can be used for the assertion of system properties, protocol compliance, or fault injection. The library uses C++ virtual table hooks as a dynamic binary instrumentation technique to inline wrappers in the TLM2 transaction path. This technique can be applied after the elaboration phase and needs neither source code modifications nor recompilation of the top level SystemC modules. The proposed technique has been successfully applied to the robustness verification of the on-board boot software of the Instrument Control Unit of the Solar Orbiter's Energetic Particle Detector.
3 citations
Cites background from "A case of system-level hardware/sof..."
...Another work [16] presents a system-level codesign and coverification case study....
TL;DR: A software profiler called AddressTracer is proposed that is able to accurately evaluate performance metrics of any specific software portion; it provides up to a 50.15% improvement in profiling accuracy compared to Gprof and 6.89% compared to Airwolf.
Abstract: Embedded systems are a mixture of software running on a microprocessor and application-specific hardware. There are many co-design methodologies that are used to design embedded systems. One of them is Hardware/Software co-design methodology which requires an appropriate profiler to detect the software portions that contribute to a large percentage of program execution and cause performance bottleneck. Detecting these software portions improves the system efficiency where these portions are either reprogrammed to eliminate the performance bottleneck or moved to the hardware domain gaining the advantages of this domain. There are profiling tools used to profile software programs such as GNU Gprof profiler. GNU Gprof integrates an extra code with the software program to be profiled causing inaccurate results and a significant execution time overhead. To address these issues, this paper proposes a software profiler called AddressTracer that is accurately able to evaluate performance metrics of any specific software portion. A set of benchmarks, Dijkstra, Secure Hash Algorithm, and Bitcount are profiled using AddressTracer, Airwolf and GNU software profiling tool (Gprof), for a quantitative comparison. The achieved results show that AddressTracer gives accurate profiling results compared to Gprof and Airwolf profilers. AddressTracer provides up to 50.15% improvement in accuracy of profiling software compared to Gprof and 6.89% compared to Airwolf. Furthermore, AddressTracer is a non-intrusive profiler which does not cause any performance overhead.
2 citations
Additional excerpts
...modules of embedded systems are realized in software [1]....
TL;DR: The goals are to provide easy-to-use, portable, transparent, and efficient instrumentation; to illustrate Pin's versatility, two Pintools in daily use to analyze production software are described.
Abstract: Robust and powerful software instrumentation tools are essential for program analysis tasks such as profiling, performance evaluation, and bug detection. To meet this need, we have developed a new instrumentation system called Pin. Our goals are to provide easy-to-use, portable, transparent, and efficient instrumentation. Instrumentation tools (called Pintools) are written in C/C++ using Pin's rich API. Pin follows the model of ATOM, allowing the tool writer to analyze an application at the instruction level without the need for detailed knowledge of the underlying instruction set. The API is designed to be architecture independent whenever possible, making Pintools source compatible across different architectures. However, a Pintool can access architecture-specific details when necessary. Instrumentation with Pin is mostly transparent as the application and Pintool observe the application's original, uninstrumented behavior. Pin uses dynamic compilation to instrument executables while they are running. For efficiency, Pin uses several techniques, including inlining, register re-allocation, liveness analysis, and instruction scheduling to optimize instrumentation. This fully automated approach delivers significantly better instrumentation performance than similar tools. For example, Pin is 3.3x faster than Valgrind and 2x faster than DynamoRIO for basic-block counting. To illustrate Pin's versatility, we describe two Pintools in daily use to analyze production software. Pin is publicly available for Linux platforms on four architectures: IA32 (32-bit x86), EM64T (64-bit x86), Itanium®, and ARM. In the ten months since Pin 2 was released in July 2004, there have been over 3000 downloads from its website.
TL;DR: This book offers a detailed and comprehensive presentation of the basic principles of interconnection network design, clearly illustrating them with numerous examples, chapter exercises, and case studies, allowing a designer to see all the steps of the process from abstract design to concrete implementation.
Abstract: One of the greatest challenges faced by designers of digital systems is optimizing the communication and interconnection between system components. Interconnection networks offer an attractive and economical solution to this communication crisis and are fast becoming pervasive in digital systems. Current trends suggest that this communication bottleneck will be even more problematic when designing future generations of machines. Consequently, the anatomy of an interconnection network router and science of interconnection network design will only grow in importance in the coming years.
This book offers a detailed and comprehensive presentation of the basic principles of interconnection network design, clearly illustrating them with numerous examples, chapter exercises, and case studies. It incorporates hardware-level descriptions of concepts, allowing a designer to see all the steps of the process from abstract design to concrete implementation.
·Case studies throughout the book draw on extensive author experience in designing interconnection networks over a period of more than twenty years, providing real world examples of what works, and what doesn't.
·Tightly couples concepts with implementation costs to facilitate a deeper understanding of the tradeoffs in the design of a practical network.
·A set of examples and exercises in every chapter help the reader to fully understand all the implications of every design decision.
Table of Contents
Chapter 1 Introduction to Interconnection Networks
1.1 Three Questions About Interconnection Networks
1.2 Uses of Interconnection Networks
1.3 Network Basics
1.4 History
1.5 Organization of this Book
Chapter 2 A Simple Interconnection Network
2.1 Network Specifications and Constraints
2.2 Topology
2.3 Routing
2.4 Flow Control
2.5 Router Design
2.6 Performance Analysis
2.7 Exercises
Chapter 3 Topology Basics
3.1 Nomenclature
3.2 Traffic Patterns
3.3 Performance
3.4 Packaging Cost
3.5 Case Study: The SGI Origin 2000
3.6 Bibliographic Notes
3.7 Exercises
Chapter 4 Butterfly Networks
4.1 The Structure of Butterfly Networks
4.2 Isomorphic Butterflies
4.3 Performance and Packaging Cost
4.4 Path Diversity and Extra Stages
4.5 Case Study: The BBN Butterfly
4.6 Bibliographic Notes
4.7 Exercises
Chapter 5 Torus Networks
5.1 The Structure of Torus Networks
5.2 Performance
5.3 Building Mesh and Torus Networks
5.4 Express Cubes
5.5 Case Study: The MIT J-Machine
5.6 Bibliographic Notes
5.7 Exercises
Chapter 6 Non-Blocking Networks
6.1 Non-Blocking vs. Non-Interfering Networks
6.2 Crossbar Networks
6.3 Clos Networks
6.4 Benes Networks
6.5 Sorting Networks
6.6 Case Study: The Velio VC2002 (Zeus) Grooming Switch
6.7 Bibliographic Notes
6.8 Exercises
Chapter 7 Slicing and Dicing
7.1 Concentrators and Distributors
7.2 Slicing and Dicing
7.3 Slicing Multistage Networks
7.4 Case Study: Bit Slicing in the Tiny Tera
7.5 Bibliographic Notes
7.6 Exercises
Chapter 8 Routing Basics
8.1 A Routing Example
8.2 Taxonomy of Routing Algorithms
8.3 The Routing Relation
8.4 Deterministic Routing
8.5 Case Study: Dimension-Order Routing in the Cray T3D
8.6 Bibliographic Notes
8.7 Exercises
Chapter 9 Oblivious Routing
9.1 Valiant's Randomized Routing Algorithm
9.2 Minimal Oblivious Routing
9.3 Load-Balanced Oblivious Routing
9.4 Analysis of Oblivious Routing
9.5 Case Study: Oblivious Routing in the Avici Terabit Switch Router (TSR)
9.6 Bibliographic Notes
9.7 Exercises
Chapter 10 Adaptive Routing
10.1 Adaptive Routing Basics
10.2 Minimal Adaptive Routing
10.3 Fully Adaptive Routing
10.4 Load-Balanced Adaptive Routing
10.5 Search-Based Routing
10.6 Case Study: Adaptive Routing in the Thinking Machines CM-5
10.7 Bibliographic Notes
10.8 Exercises
Chapter 11 Routing Mechanics
11.1 Table-Based Routing
11.2 Algorithmic Routing
11.3 Case Study: Oblivious Source Routing in the IBM Vulcan Network
11.4 Bibliographic Notes
11.5 Exercises
Chapter 12 Flow Control Basics
12.1 Resources and Allocation Units
12.2 Bufferless Flow Control
12.3 Circuit Switching
12.4 Bibliographic Notes
12.5 Exercises
Chapter 13 Buffered Flow Control
13.1 Packet-Buffer Flow Control
13.2 Flit-Buffer Flow Control
13.3 Buffer Management and Backpressure
13.4 Flit-Reservation Flow Control
13.5 Bibliographic Notes
13.6 Exercises
Chapter 14 Deadlock and Livelock
14.1 Deadlock
14.2 Deadlock Avoidance
14.3 Adaptive Routing
14.4 Deadlock Recovery
14.5 Livelock
14.6 Case Study: Deadlock Avoidance in the Cray T3E
14.7 Bibliographic Notes
14.8 Exercises
Chapter 15 Quality of Service
15.1 Service Classes and Service Contracts
15.2 Burstiness and Network Delays
15.3 Implementation of Guaranteed Services
15.4 Implementation of Best-Effort Services
15.5 Separation of Resources
15.6 Case Study: ATM Service Classes
15.7 Case Study: Virtual Networks in the Avici TSR
15.8 Bibliographic Notes
15.9 Exercises
Chapter 16 Router Architecture
16.1 Basic Router Architecture
16.2 Stalls
16.3 Closing the Loop with Credits
16.4 Reallocating a Channel
16.5 Speculation and Lookahead
16.6 Flit and Credit Encoding
16.7 Case Study: The Alpha 21364 Router
16.8 Bibliographic Notes
16.9 Exercises
Chapter 17 Router Datapath Components
17.1 Input Buffer Organization
17.2 Switches
17.3 Output Organization
17.4 Case Study: The Datapath of the IBM Colony Router
17.5 Bibliographic Notes
17.6 Exercises
Chapter 18 Arbitration
18.1 Arbitration Timing
18.2 Fairness
18.3 Fixed Priority Arbiter
18.4 Variable Priority Iterative Arbiters
18.5 Matrix Arbiter
18.6 Queuing Arbiter
18.7 Exercises
Chapter 19 Allocation
19.1 Representations
19.2 Exact Algorithms
19.3 Separable Allocators
19.4 Wavefront Allocator
19.5 Incremental vs. Batch Allocation
19.6 Multistage Allocation
19.7 Performance of Allocators
19.8 Case Study: The Tiny Tera Allocator
19.9 Bibliographic Notes
19.10 Exercises
Chapter 20 Network Interfaces
20.1 Processor-Network Interface
20.2 Shared-Memory Interface
20.3 Line-Fabric Interface
20.4 Case Study: The MIT M-Machine Network Interface
20.5 Bibliographic Notes
20.6 Exercises
Chapter 21 Error Control
21.1 Know Thy Enemy: Failure Modes and Fault Models
21.2 The Error Control Process: Detection, Containment, and Recovery
21.3 Link Level Error Control
21.4 Router Error Control
21.5 Network-Level Error Control
21.6 End-to-end Error Control
21.7 Bibliographic Notes
21.8 Exercises
Chapter 22 Buses
22.1 Bus Basics
22.2 Bus Arbitration
22.3 High Performance Bus Protocol
22.4 From Buses to Networks
22.5 Case Study: The PCI Bus
22.6 Bibliographic Notes
22.7 Exercises
Chapter 23 Performance Analysis
23.1 Measures of Interconnection Network Performance
23.2 Analysis
23.3 Validation
23.4 Case Study: Efficiency and Loss in the BBN Monarch Network
23.5 Bibliographic Notes
23.6 Exercises
Chapter 24 Simulation
24.1 Levels of Detail
24.2 Network Workloads
24.3 Simulation Measurements
24.4 Simulator Design
24.5 Bibliographic Notes
24.6 Exercises
Chapter 25 Simulation Examples
25.1 Routing
25.2 Flow Control Performance
25.3 Fault Tolerance
Appendix A Nomenclature
Appendix B Glossary
Appendix C Network Simulator
TL;DR: The sources of these performance and energy overheads in general-purpose processing systems are explored by quantifying the overheads of a 720p HD H.264 encoder running on a general-purpose CMP system and exploring methods to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding.
Abstract: Due to their high volume, general-purpose processors, and now chip multiprocessors (CMPs), are much more cost effective than ASICs, but lag significantly in terms of performance and energy efficiency. This paper explores the sources of these performance and energy overheads in general-purpose processing systems by quantifying the overheads of a 720p HD H.264 encoder running on a general-purpose CMP system. It then explores methods to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding. We evaluate the gains from customizations useful to broad classes of algorithms, such as SIMD units, as well as those specific to particular computation, such as customized storage and functional units. The ASIC is 500x more energy efficient than our original four-processor CMP. Broadly applicable optimizations improve performance by 10x and energy by 7x. However, the very low energy costs of actual core ops (100s fJ in 90nm) mean that over 90% of the energy used in these solutions is still "overhead". Achieving ASIC-like performance and efficiency requires algorithm-specific optimizations. For each sub-algorithm of H.264, we create a large, specialized functional unit that is capable of executing 100s of operations per instruction. This improves performance and energy by an additional 25x and the final customized CMP matches an ASIC solution's performance within 3x of its energy and within comparable area.
TL;DR: This book presents an overview of the state of the art in the design and implementation of transactional memory systems, as of early summer 2006.
Abstract: The advent of multicore processors has renewed interest in the idea of incorporating transactions into the programming model used to write parallel programs. This approach, known as transactional memory, offers an alternative, and hopefully better, way to coordinate concurrent threads. The ACI (atomicity, consistency, isolation) properties of transactions provide a foundation to ensure that concurrent reads and writes of shared data do not produce inconsistent or incorrect results. At a higher level, a computation wrapped in a transaction executes atomically – either it completes successfully and commits its result in its entirety or it aborts. In addition, isolation ensures the transaction produces the same result as if no other transactions were executing concurrently. Although transactions are not a parallel programming panacea, they shift much of the burden of synchronizing and coordinating parallel computations from a programmer to a compiler, runtime system, and hardware. The challenge for the system implementers is to build an efficient transactional memory infrastructure. This book presents an overview of the state of the art in the design and implementation of transactional memory systems, as of early summer 2006.
442 citations
"A case of system-level hardware/sof..." refers background in this paper
...Transactional Memory (TM) [9] is an abstract programming model that aims to greatly simplify parallel programming....
TL;DR: The paper describes why PTLsim's x86 focus is highly relevant and uses full-system simulation results to demonstrate the pitfalls of userspace-only simulation.
Abstract: In this paper, we introduce PTLsim, a cycle accurate full system x86-64 microprocessor simulator and virtual machine. PTLsim models a modern superscalar out of order x86-64 processor core at a configurable level of detail ranging from RTL-level models of all key pipeline structures, caches and devices up to full-speed native execution on the host CPU. Unlike other microarchitectural simulators, PTLsim targets the real commercially available x86 ISA, rather than a discontinued architecture with limited tools and an uncertain future. PTLsim supports several flavors: a single threaded userspace version and a full system version providing an SMT model and the infrastructure for multi-core support. We first describe what it takes to perform cycle accurate modeling of a complete x86 machine at the muop (micro-operation) level, along with the challenges and requirements for effective full system multi-processor capable simulation. We then describe the internal architecture of full system PTLsim and how it interacts with the Xen hypervisor and PTLsim's native mode co-simulation technology. We experimentally evaluate PTLsim's real world accuracy by configuring it like an AMD Athlon 64 machine before running a demanding full system client-server networked benchmark inside PTLsim. We compare the statistics generated by our model with the actual numbers from the real processor to demonstrate PTLsim is accurate to within 5% across all major parameters. We provide a discussion of prior simulation tools, along with their strengths and weaknesses. We describe why PTLsim's x86 focus is highly relevant, and we use our full system simulation results to demonstrate the pitfalls of userspace only simulation. Finally, we conclude by detailing future work
Q1. What contributions have the authors mentioned in the paper "A case of system-level hardware/software co-design and co-verification of a commodity multi-processor system with custom hardware" ?
This paper presents an interesting system-level co-design and co-verification case study for a non-trivial design where multiple high-performance x86 processors and custom hardware were connected through a coherent interconnection fabric. The authors demonstrate how a co-simulation environment that combines native software execution with a cycle-accurate interconnect simulator and an HDL simulator, by means of a processor bus functional model (BFM), can be constructed from existing tools and software. However, the authors found that significant extensions need to be made to the conventional BFM methodology in order to capture various data-race cases in simulation, which inevitably occur in modern multi-processor systems.