Proceedings ArticleDOI

A case of system-level hardware/software co-design and co-verification of a commodity multi-processor system with custom hardware

07 Oct 2012, pp. 513-520
TL;DR: This paper presents an interesting system-level co-design and co-verification case study for a non-trivial design where multiple high-performing x86 processors and custom hardware were connected through a coherent interconnection fabric; a processor bus functional model was used to combine native software execution with a cycle-accurate interconnect simulator and an HDL simulator.
Abstract: This paper presents an interesting system-level co-design and co-verification case study for a non-trivial design where multiple high-performing x86 processors and custom hardware were connected through a coherent interconnection fabric. In functional verification of such a system, we used a processor bus functional model (BFM) to combine native software execution with a cycle-accurate interconnect simulator and an HDL simulator. However, we found that significant extensions need to be made to the conventional BFM methodology in order to capture various data-race cases in simulation, which eventually happen in modern multi-processor systems. Especially essential were faithful implementations of the memory consistency model and cache coherence protocol, as well as timing randomization. We demonstrate how such a co-simulation environment can be constructed from existing tools and software. Lessons from our study can similarly be applied to design and verification of other tightly-coupled systems.

Summary (2 min read)

1. INTRODUCTION

  • Modern digital systems are moving increasingly towards heterogeneity.
  • The authors discuss which among the many conventional co-design/verification methods would best serve their purposes and why.
  • The authors show the effectiveness of their methodology and draw out general lessons from it (Section 4), before they conclude in Section 5.
  • The authors found that combining the software model with an interconnection simulator and HDL simulator via a processor BFM is the most effective method for functional verification.
  • The authors also explain, however, that conventional ways of constructing a processor BFM should be revised in accordance with modern multi-core processors and interconnection architecture; it is especially important to accurately implement the memory consistency model and cache coherence protocol.

2.1 Target Design

  • This section outlines the design of their system for the sake of providing sufficient background context for one to understand their co-design and co-verification issues, while the detailed design of the system is outside the scope of this paper.
  • A typical software transactional memory (STM), an implementation of such a runtime system solely with software, tends to exhibit huge performance overhead.
  • (6) The TM hardware, based on all the read/write addresses received from all the cores, now determines which cores have conflicting reads and writes and sends out the requisite messages to those cores.
  • The system is composed of two quad-core x86 CPUs and an FPGA that is connected coherently via a chain of point-to-point links.
  • Messages from the CPU are sent to their HW via a non-coherent interface, while responses to the CPU go through the coherent cache.

2.2 Our Initial Failure and Issues with Co-Verification

  • Since their system was composed of tightly coupled hardware and software, the authors had to deal with a classic chicken-and-egg co-verification problem.
  • A crash observed after one last memory access from one core (e.g. de-referencing a dangling pointer), could be a result of a write, falsely allowed to commit, from another core millions of cycles before.
  • The new STM software needed to be intensely validated (with the new hardware), especially under the assumption of parallel execution, variable latency and out-of-order message delivery as shown in Figure 2.
  • An alternative was to use a detailed architecture simulator (e.g. [22]) but its simulation speed was insufficient.
  • This method brought its own challenges which the authors discuss in detail in the following section.

3.1 General Issues and Solutions

  • The authors discuss general issues that arise when using a software model for HW/SW co-verification and how they overcame those issues.
  • Hardware simulation can consist of two different components: a cycle-accurate interconnection network simulation and HDL simulation.
  • Instead, the authors rely on the single-threaded simulator to interleave multiple software execution contexts.
  • (Issue #4) The memory consistency model must be carefully considered.
  • Otherwise, the contents in the store buffer, up to the entry that has been matched, are flushed before the new packet is injected.

3.2 Our Co-simulation Environment: Implementation

  • This subsection details the implementation of their co-simulation environment where all the issues discussed in previous subsections are resolved.
  • Noticeably, the API provides separate methods for normal (cached), non-coherent, and uncached accesses as well as flush and atomic operations.
  • Figure 5 shows how execution flows from a SW context (i.e. a fiber executing the SW model) to the simulator context (i.e. the main fiber for simulation).
  • Instead of actually sending a transactional read message to the FPGA, the HAL part of the STM calls into the BFM API (SIM_Noncoh_write) which eventually injects a packet into the simulator (BUS_Inject) and switches context to simulator execution (SIM_return_simulator).
  • The network simulator, which performs simple cycle-based simulation, calls the clock() function for each BFM at each simulation cycle.

4. RESULTS AND DISCUSSION

  • The authors' new co-simulation environment (Section 3.2) was extremely useful for verifying the functional correctness of their system.
  • On one hand, randomized timing helped to explore corner cases in data-race conditions.
  • Since most of the software model was executed natively, there was no waste of valuable simulation cycles to execute instructions that were not necessary for functional verification.
  • Fourth, the co-simulation environment provided a very helpful error detection mechanism, which was impossible in native execution on FPGA.
  • Note that the last row points out which address is violating serializability. From this log, the authors were able to relate SW context and HW status, since the simulation cycle is shared by both SW simulation and HDL simulation.

5. CONCLUSION

  • The authors presented their HW/SW co-verification experience on a commodity multi-processor system with custom hardware.
  • For the sake of functional verification of such a system, it was most effective to combine native SW execution with cycle-based interconnect simulation and HDL simulation by means of a processor BFM.
  • Their experiences showed that such BFMs should faithfully reflect the memory consistency models of their target processors and would benefit greatly from randomized packet injection timing in their network simulations.
  • These requirements enable the co-simulation to generate a wide variety of data access interleavings, which is essential for co-verification of modern multi-processor systems.


A Case of System-level Hardware/Software Co-design and
Co-verification of a Commodity Multi-Processor System
with Custom Hardware
Sungpack Hong
Oracle Labs
sungpack.hong@oracle.com
Tayo Oguntebi
Stanford University
tayo@stanford.edu
Jared Casper
Stanford University
jaredc@stanford.edu
Nathan Bronson*
Facebook, Inc.
nbronson@stanford.edu
Christos Kozyrakis
Stanford University
kozyraki@stanford.edu
Kunle Olukotun
Stanford University
kunle@stanford.edu
ABSTRACT
This paper presents an interesting system-level co-design
and co-verification case study for a non-trivial design where
multiple high-performing x86 processors and custom hard-
ware were connected through a coherent interconnection fab-
ric. In functional verification of such a system, we used a
processor bus functional model (BFM) to combine native
software execution with a cycle-accurate interconnect simu-
lator and an HDL simulator. However, we found that signif-
icant extensions need to be made to the conventional BFM
methodology in order to capture various data-race cases
in simulation, which eventually happen in modern multi-
processor systems. Especially essential were faithful im-
plementations of the memory consistency model and cache
coherence protocol, as well as timing randomization. We
demonstrate how such a co-simulation environment can be
constructed from existing tools and software. Lessons from
our study can similarly be applied to design and verification
of other tightly-coupled systems.
Categories and Subject Descriptors
B.4.4 [Performance Analysis and Design Aids]: Simulation, Verification
Keywords
Co-Verification, Co-Simulation, Bus Functional Model, FPGA
Prototyping, Transactional Memory
1. INTRODUCTION
Modern digital systems are moving increasingly towards
heterogeneity. Today, many digital systems feature multiple
This work was done when the authors were at Stanford
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
CODES+ISSS’12, October 7-12, 2012, Tampere, Finland.
Copyright 2012 ACM 978-1-4503-1426-8/12/09 ...$15.00.
(heterogeneous) processors, advanced interconnect topolo-
gies beyond simple buses, and often specialized hardware to
improve the performance-per-watt for specific tasks [5].
Development of such heterogeneous systems, however, re-
quires intensive system-level verification. System-wide co-
design and co-verification of hardware and software compo-
nents [16] is especially essential for systems where multiple
heterogeneous components are executing a single task in a
tightly-coupled manner, orchestrated by the software. In
such systems, for example, data races between multiple com-
putational components should be thoroughly verified since
they can induce various unexpected system behaviors.
In this paper, we present our experiences designing and
verifying a tightly-coupled heterogeneous system in which
multiple x86 processors are accompanied by specialized hard-
ware that accelerates software transactional memory (Sec-
tion 2.1). We discuss which among the many conventional
co-design/verification methods would best serve our pur-
poses and why. Our chosen method was one which combined
an un-timed software model with cycle-accurate intercon-
nection and HDL simulation through a processor bus func-
tional model (BFM), since this provided sufficient visibility
and simulation speed (Section 2.2). However, we found that
when using a BFM for verification, conventional method-
ology needs to be significantly extended in order to capture
data-race corner cases induced by modern multi-processor
architectures. Especially crucial were correct implementa-
tions of memory consistency and cache coherence as well as a
need to introduce timing randomization (Section 3.1). This
paper also demonstrates how such a co-simulation environ-
ment can be constructed on top of existing tools and software
(Section 3.2). We show the effectiveness of our methodology
and draw out general lessons from it (Section 4), before we
conclude in Section 5.
Our contributions can be summarized as follows:
We present a non-trivial system-level co-design and co-
verification experience with x86 processors connected
to custom hardware. In this scenario, we found that
combining the software model with an interconnection
simulator and HDL simulator via a processor BFM is
the most effective method for functional verification.
We also explain, however, that conventional ways of constructing a
processor BFM should be revised in accordance with modern multi-core
processors and interconnection architecture; it is especially important
to accurately implement the memory consistency model and cache
coherence protocol. Introducing timing randomization is also very
important.

[Figure 1: Outline of the STM acceleration protocol.]
2. DESIGN AND VERIFICATION
2.1 Target Design
This section outlines the design of our system for the sake
of providing sufficient background context for one to un-
derstand our co-design and co-verification issues, while the
detailed design of the system is outside scope of this paper.
Transactional Memory (TM) [9] is an abstract program-
ming model that aims to greatly simplify parallel program-
ming. In this model, the programmer declares atomicity
boundaries (transactions), while the runtime system ensures
a consistent ordering among concurrent reads and writes to
the shared memory. However, a typical software transac-
tional memory (STM), an implementation of such a runtime
system solely with software, tends to exhibit huge perfor-
mance overhead.
Our system is composed of specialized hardware which is
externally connected to commodity CPUs and accelerates an
STM system [2]. The motivation was reduction of the per-
formance overhead generally experienced by STM systems
without modifying the core and thus increasing processor
design complexity. The TM abstraction is enforced through
the following protocol, which is outlined in Figure 1:
(1) Whenever a core reads a shared variable, (2) the core
fires a notification message to the TM hardware. (3) Writes
to shared variables are kept in a software buffer, and are not
visible to other cores until (4) execution reaches the end of a
transaction. At this point, the core (5) sends the addresses
of all the writes that it wants to commit to the hardware
and waits for a response. (6) The TM hardware, based on
all the read/write addresses received from all the cores, now
determines which cores (transactions) have conflicting reads
and writes and sends out the requisite messages to those
cores. (7) Depending on these messages, each core proceeds
to commit or restart its current transaction from the begin-
ning.
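To make the software side of steps (1)-(7) concrete, the following is a minimal sketch in C++-style code. The helper names (send_to_tm_hw, wait_for_tm_hw_ok, tm_read, tm_write, tm_commit) are hypothetical stand-ins for the STM and HAL routines described above, not the authors' actual API.

    #include <cstdint>
    #include <unordered_map>

    // Hypothetical HAL hooks: in the real system these post messages to the
    // TM hardware over the non-coherent interface and read back its verdict.
    void send_to_tm_hw(uint64_t msg) { /* ... */ }
    bool wait_for_tm_hw_ok()         { /* ... */ return true; }

    // (3) Per-thread software write buffer: writes stay invisible until commit.
    thread_local std::unordered_map<uintptr_t, uint64_t> write_buffer;

    uint64_t tm_read(volatile uint64_t* addr) {
        send_to_tm_hw((uintptr_t)addr);                   // (1)+(2) notify "read A"
        auto it = write_buffer.find((uintptr_t)addr);
        return it != write_buffer.end() ? it->second      // read own buffered write
                                        : *addr;
    }

    void tm_write(volatile uint64_t* addr, uint64_t val) {
        write_buffer[(uintptr_t)addr] = val;              // (3) buffered locally
    }

    bool tm_commit() {                                    // (4) end of transaction
        for (auto& w : write_buffer)
            send_to_tm_hw(w.first);                       // (5) send write-set addresses
        bool ok = wait_for_tm_hw_ok();                    // (6) HW conflict detection
        if (ok)
            for (auto& w : write_buffer)
                *(volatile uint64_t*)w.first = w.second;  // (7) commit: make visible
        write_buffer.clear();
        return ok;                                        // false => restart transaction
    }

A caller would wrap each transaction in a retry loop around tm_commit(), restarting the transaction body whenever it returns false, which corresponds to step (7).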
The above protocol is implemented as a tightly-coupled
software and hardware system. Software sends messages and
manages the local write buffer, while the external hardware
accelerator performs fast conflict detection. For our development
environment, we used an instance of an FPGA-based rapid prototyping
system [14].

[Figure 2: A failure case in our first design.]

[Figure 3: Block Diagram of Implementation Environment: Our design is
connected to the rest of the system via (a) non-coherent interface,
(b) memory-mapped register interface, and (c) coherent cache interface.
The prototype pairs two quad-core AMD Barcelona x86 CPUs (1.8 GHz, 2MB
shared caches) with an Altera Stratix II FPGA over coherent
HyperTransport.]

Figure 3 illustrates the block
diagram of the prototyping system. The system is com-
posed of two quad-core x86 CPUs and an FPGA that is
connected coherently via a chain of point-to-point links. We
implemented our custom hardware (TM Accelerator) on the
FPGA of the system (Figure 3). Messages from the CPU are
sent to our HW via a non-coherent interface, while responses
to the CPU go through the coherent cache. Note that the
former non-coherent mechanism enables hiding communica-
tion latency between the CPU and FPGA while the latter
coherent communication avoids interrupts and long-latency
polling.
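As a rough illustration of this split (not the authors' code; the mapping, variable names, and flag protocol are assumptions), the outgoing message path can be a non-temporal store into a non-coherent mapping of the FPGA, while the reply path is an ordinary cacheable location that the FPGA updates through coherent HyperTransport, so the CPU can spin on it cheaply out of its own cache:

    #include <immintrin.h>
    #include <cstdint>

    // Assumed to point at a write-combining, non-coherent mapping of the FPGA;
    // set up elsewhere by the driver.
    volatile uint32_t* tm_hw_mailbox = nullptr;
    // Ordinary cacheable memory that the FPGA writes via coherent HyperTransport.
    volatile uint32_t  tm_hw_reply   = 0;

    inline void post_message(uint32_t msg) {
        // Non-temporal (movnti) store: bypasses the cache and may linger in the
        // store buffer for a while -- the very behavior noted in footnote 1.
        _mm_stream_si32((int*)tm_hw_mailbox, (int)msg);
    }

    inline uint32_t wait_reply() {
        // Spinning on a cached line generates no link traffic until the FPGA's
        // coherent write invalidates it, which avoids interrupts as well as
        // long-latency uncached polling.
        while (tm_hw_reply == 0)
            _mm_pause();
        return tm_hw_reply;
    }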
2.2 Our Initial Failure and
Issues with Co-Verification
Since our system was composed of tightly coupled hard-
ware and software, we had to deal with a classic chicken-
and-egg co-verification problem: How can we verify the cor-
rectness of the hardware without the correctly-working soft-
ware and vice versa? Our initial approach was to decouple
hardware and software verification; we verified the software
with an instruction set simulator (ISS) and virtual hard-
ware model while targeting the hardware using unit-level
HDL simulation and direct debugging on the FPGA.
Method                                | Strength                                                 | Weakness
(A) Prototyping                       | Fastest execution on real HW                             | Limited visibility regarding hardware status
(B) Full HW simulation                | Full visibility and control                              | Extremely low simulation speed
(C) ISS + HW simulation [15, 12]      | Faster simulation than (B)                               | Waste of simulation cycles for unrelated instructions
(D) ISS + Virtual HW model [7, 18]    | Faster simulation than (C); can predate HDL development  | Same issues as (C); fidelity of virtual HW model; requires a separate HW verification step
(E) SW model + HW sim. [20, 17, 21] (possibly through BFM) | Verification of target HDL with real SW execution | Overhead of SW re-writing; lack of timing information for SW execution
(F) Emulation [13, 1]                 | Fast execution of target HDL                             | Cost of the systems; limited support of core types
(G) Binary translation [10]           | Fast SW execution on host HW                             | Lack of deterministic replay of concurrent execution
Table 1: Comparisons of HW-SW co-verification methods.

It was only after simulation in our ISS environment and unit tests in
the FPGA environment had all passed successfully, and the whole system
was tested altogether for the first time, that we realized there was a
flaw in our original
design. The specific error scenario is shown in Figure 2. The
figure depicts the same read-write sequence as in Figure 1
except that in this case, the delivery of the read message
from core N is delayed until after delivery of the commit
message from core 1. As such a case was not considered in our original
design, the software execution simply crashed after a failure of
consistency enforcement. Note
that since the ISS assumed in-order instruction execution
and the HW model assumed in-order packet delivery, our en-
vironment was not able to generate the scenario described
in Figure 2. As it happened, this scenario occurred quite
frequently in real HW.¹
That being said, we had spent a significant amount of
engineering effort to confirm that Figure 2 was really the reason
for the observed failure. Single-threaded SW executions
never failed, while multi-threaded ones crashed occasionally
but non-deterministically. A logic analyzer, tapped onto our
FPGA, was not very helpful because a typical execution in-
volves billions of memory accesses, each access generating
tens of memory packets; it was extremely hard to identify
which of those packets were relevant to the error and how
they related to SW execution. For example, a crash
observed after one last memory access from one core (e.g.
de-referencing a dangling pointer), could be a result of a
write, falsely allowed to commit, from another core millions
of cycles before. Simple step-by-step execution of a single
processor was not helpful either due to our parallel execu-
tion requirement. As a matter of fact, we identified this issue
through deep speculation about our initial design, which was
confirmed only later, after adopting another co-design/co-
verification methodology (Section 3).
Fixing up this issue, we properly augmented the proto-
col in Figure 1 and re-designed our hardware and software
accordingly. This time, however, in order to save ourselves
from repeating the same mistake, we considered applying
other methodologies and tools proposed for HW/SW co-
design and co-verification [16, 1, 18, 12, 20, 15, 7, 17, 13].
Our requirements could be summarized as follows:
The new STM software needed to be intensely vali-
dated (with the new hardware), especially under the
assumption of parallel execution, variable latency and
out-of-order message delivery as shown in Figure 2.
¹ There are two major reasons for this. (1) The underlying network enforces no delivery ordering between different cores. (2) We used a non-temporal x86 store instruction to implement asynchronous message transmission; however, the semantics of the instruction allowed the message to stay in the store buffer for an indefinite amount of time.
Such an intensive SW validation naturally demanded
fast execution.
We were more interested in functional verification of
the new hardware than in architecture exploration.
Furthermore, as our new hardware was becoming avail-
able soon (after small modifications from the initial ver-
sion), we wanted direct verification of the target RTL as
well.
We wanted a mechanism for error analysis better than
manual inspection of GBs of logs/waveforms from sim-
ulation or logic analyzer. At a minimum, we wanted
to associate such logs with software execution context.
Table 1 summarizes the advantages and disadvantages
of conventional approaches that we considered. Our ini-
tial approach used prototyping (Method A in Table 1) and
ISS combined with virtual models (Method D), but as de-
scribed previously failed to serve our purpose. Full simula-
tion (Method B) was never an option since we didn’t have ac-
cess to the HDL source of the CPUs and it would have surely
failed to meet the fast execution requirement. We have al-
ready described how ISS (Method C) initially failed to find
the erroneous case in the first place. An alternative was to
use a detailed architecture simulator (e.g. [22]) but its simu-
lation speed was insufficient. Also, there were no emulators
(Method F) available to us which supported our core (AMD
x86) and interconnection type (HyperTransport). Finally,
we were not able to use fast software simulation techniques
based on binary instrumentation (Method G), because it is
not trivial, with this method, to replay every load/store in-
struction of every processor in exactly the same order. Note that
such a feature is crucial by nature for verifying multi-processor
systems like ours.
The only remaining option was to construct a model that
faithfully reflected the behavior of the software while inter-
facing with the HDL simulation (Method E). However, this
method brought its own challenges which we discuss in detail
in the following section.
3. APPLYING SW MODEL-BASED
CO-SIMULATION
3.1 General Issues and Solutions
In this section, we discuss general issues that arise when
using a software model for HW/SW co-verification and how
we overcame those issues. For each issue, we either introduce
lessons learned from previous research or discuss how they
were not applicable in our case.

(Issue #1) Software has to be re-written into a
form that can interface with HDL simulation.
The first issue can be minimized by using the BFM of
the processor [17] (also known as processor abstraction with
transaction-level (TL) interface) which works as a bridge be-
tween software execution and hardware simulation. Specif-
ically, instead of creating a separate software model, the
whole software is executed natively; however every read and
write instruction that may potentially affect the target hard-
ware is replaced with a simul_read() and simul_write()
function call. These function calls invoke data transfers in
simulation, suspending the software context until the under-
lying hardware simulation finishes processing the requested
data transfer (in a cycle-accurate way). In the worst case,
every load and store should be replaced; in most cases, it
is enough to change only the Hardware Abstraction Layer
(HAL), a small well-encapsulated portion of software [21, 6].
Hardware simulation can consist of two different compo-
nents: a cycle-accurate interconnection network simulation
and HDL simulation. The interconnection simulator can
easily interact with HDL through simple function-to-signal
translators [3] as long as the simulation is packet-wise ac-
curate. This decoupled approach provides more flexibility
and better simulation speed than doing whole HDL simula-
tion [19].
In our case, only a small portion of code inside the STM
was identified as HAL; the user application was entirely
built upon the STM interface. For the interconnect, we
used a cycle-accurate HyperTransport simulator, developed
by AMD, which can interact with the HDL simulator (e.g.
ModelSim) via PLI.
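A minimal sketch of what this HAL-level substitution can look like is shown below. SIM_Noncoh_write mirrors the BFM call that appears later in Figure 5; the CO_SIMULATION switch, SIM_Coh_read, and the hal_* helper names are assumptions made for illustration.

    #include <cstdint>

    // BFM entry points provided by the co-simulation library (cf. Figure 5).
    extern void     SIM_Noncoh_write(uint64_t sim_addr, uint64_t val);
    extern uint64_t SIM_Coh_read(uint64_t sim_addr);   // assumed coherent-read call

    #ifdef CO_SIMULATION
    // Simulation build: accesses that can reach the custom hardware are routed
    // into the BFM, which suspends this SW context until the interconnect/HDL
    // simulation has processed the transfer cycle-accurately.
    inline void hal_post_message(uint64_t hw_addr, uint64_t msg) {
        SIM_Noncoh_write(hw_addr, msg);
    }
    inline uint64_t hal_read_status(uint64_t hw_addr) {
        return SIM_Coh_read(hw_addr);
    }
    #else
    // Native build: the same HAL functions touch the real FPGA mappings, so
    // nothing above the HAL (application or STM algorithm) has to change.
    inline void hal_post_message(uint64_t hw_addr, uint64_t msg) {
        *(volatile uint64_t*)hw_addr = msg;
    }
    inline uint64_t hal_read_status(uint64_t hw_addr) {
        return *(volatile uint64_t*)hw_addr;
    }
    #endif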
(Issue #2) The multi-threaded software should be
executed concurrently, but also be re-playable in a
deterministic manner.
The second issue arises because we are combining execu-
tion of multi-threaded software with a simulator which is ex-
ecuted sequentially by nature. Note that we cannot let each
thread in the original application freely interact with the
simulator for the purpose of deterministic re-execution, i.e.
simulators based on binary instrumentation such as Pin [10]
cannot be used. Instead, we rely on the single-threaded sim-
ulator to interleave multiple software execution contexts.
SystemC [8] is a standard simulation engine that can han-
dle multiple execution contexts for such use cases [21, 17];
SystemC allows for trivial implementation of blocking calls.
Unfortunately, there are still plenty of (in-house) simulators
which do not adhere to the SystemC standard, including
popular CPU simulators [22] and interconnection simula-
tors [4]. Our interconnection simulator² was not compliant
with SystemC, either. We therefore implemented our own
blocking-call mechanism using co-routines, or fibers.
(Issue #3) Software models lose timing informa-
tion.
Being natively executed, software models lose timing
information; they simply continuously inject read/write
requests into the BFM simulator. The conventional solution
is to insert delay calls explicitly before each call-site of
simul_read and simul_write in the user application, which
compensates for the CPU cycles between the previous block-
ing call and the current one [21, 6].
² The simulator is a proprietary implementation by AMD and requires an NDA to access.
[Figure 4: Block diagram of our co-simulation environment. The SW model
(Application SW / STM Algorithm / STM HAL, x4 per node) sits on an
abstract memory interface provided by a timing-randomized bus functional
model with a cache simulator; the HW model (2 nodes) connects through
the network simulator to the HDL simulator, which hosts the library HW
and our HW.]
Our approach differs from the conventional solution in
two ways: (1) we insert the delay inside the simul_read
method rather than at the call-sites in the user-application
code, and (2) we (pseudo-)randomize the delay values. We
justify this for the following reasons: First, our method re-
quires no further modification of the user application code.
Second, user-provided delay information at the call-sites is
already inaccurate. For instance, CPU cycles between two
call-sites cannot be accurately compensated if there are (ex-
ponentially) many execution paths between them. Finally,
for the purpose of functional verification, the exact num-
ber of execution cycles for non-relevant SW sections is of
little interest. Rather, for functional verification, we want
to interleave packet injection from multiple cores in varying
orders as much as possible.
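A small sketch of this timing randomization is given below (the names are assumptions; the paper's own BFM does the equivalent via SIM_wait_cycles(RandInt(MAX_WAIT)), as later shown in Figure 5). The important detail is that the generator is seeded per run, so any failing interleaving remains exactly replayable.

    #include <cstdint>
    #include <random>

    // One generator per simulated core, derived from the run seed, so a failure
    // can be reproduced deterministically by rerunning with the same seed.
    struct CoreTiming {
        std::mt19937 rng;
        CoreTiming(uint32_t run_seed, int core_id) : rng(run_seed * 31u + core_id) {}
        unsigned random_idle_cycles(unsigned max_wait) {
            return std::uniform_int_distribution<unsigned>(0, max_wait)(rng);
        }
    };

    // Inside the BFM's read/write entry point (sketch):
    //   SIM_wait_cycles(timing[core_id].random_idle_cycles(MAX_WAIT));
    //   BUS_Inject(...);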
(Issue #4) The memory consistency model must
be carefully considered.
This is one issue that has not been discussed extensively in
the literature. In previous studies [17, 21, 6], the BFM sim-
ply injected network packets to the interconnection network
in the order requested by the software, as is the behavior of
classic embedded processors. However, this does not closely
approximate the memory packet generation pattern of our
modern x86 processor; it fails to account for the aggressive
reordering of memory requests.
Instead, we implement a more realistic memory consis-
tency model in our processor BFM, namely Total Store Or-
dering (TSO) [11]. This is the model on which many modern
processors (e.g. x86 and SPARC) are based. To ensure TSO,
we keep a per-core store buffer inside our BFM. The write
request goes to the buffer without injecting a packet into the
network as long as there is an available slot in the buffer.
On a read request, we first search for the target address in
the store buffer. If not found, a new (read-request) packet
is injected into the network. Otherwise, the contents in the
store buffer, up to the entry that has been matched, are
flushed before the new packet is injected. The store buffer
is always flushed in FIFO order.
In addition, cache coherency should be implemented cor-
rectly as well, simply for the sake of correct parallel execu-
tion. However, we were able to leverage the cache simulator
already embedded in our network simulator.
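A hedged reconstruction of this per-core TSO store buffer is sketched below; StoreEntry, the buffer depth, and the inject_* hooks are assumptions, but the flush-on-match and FIFO-drain rules follow the description above.

    #include <cstdint>
    #include <deque>

    // Assumed hooks into the interconnection network simulator.
    extern void inject_write_packet(uint64_t addr, uint64_t val);
    extern void inject_read_packet(uint64_t addr);

    struct StoreEntry { uint64_t addr, val; };

    class TsoCoreBFM {
        std::deque<StoreEntry> store_buffer;    // per core, drained in FIFO order
        static const size_t kSlots = 8;         // assumed buffer depth

        void drain_one() {
            inject_write_packet(store_buffer.front().addr, store_buffer.front().val);
            store_buffer.pop_front();
        }
    public:
        void write(uint64_t addr, uint64_t val) {
            if (store_buffer.size() == kSlots)   // no free slot: drain the oldest
                drain_one();
            store_buffer.push_back({addr, val}); // buffered; no packet injected yet
        }
        void read(uint64_t addr) {
            // If the address is pending in the store buffer, flush the buffer up
            // to (and including) the matching entry before the read packet goes
            // out, which preserves TSO ordering; otherwise just issue the read.
            size_t match = store_buffer.size();
            for (size_t i = 0; i < store_buffer.size(); ++i)
                if (store_buffer[i].addr == addr) match = i;   // last match wins
            if (match != store_buffer.size())
                for (size_t n = match + 1; n > 0; --n)
                    drain_one();
            inject_read_packet(addr);
        }
    };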
(Issue #5) There should be an easy way of error
analysis.
In the previous section, we explained how painful and unsuccessful it
was to manually inspect massive amounts of logs blindly generated by a
logic analyzer (or simulator) when identifying a breach of the
consistency protocol.

[Figure 5: Execution sequence: From SW-context to simulator context.

  Application:
    ...
    int index = foo(x);
    int v = TM_READ(&array[index]);
    ...

  STM (Algorithm):
    TM_Read(addr) {
      if (write_buffer.check(addr))
        return write_buffer.get(addr);
      else {
        Send_read_message(addr, tid);
        return get_value(addr);
      }
    }

  STM (HAL):
    Send_read_message(...) {
      unsigned MSG = ...;
      SIM_Noncoh_write(HW_ADDR, MSG);
    }

  BFM (HW side, feeding the Network Simulator):
    SIM_Noncoh_write(SIM_ADDR, VAL) {
      SIM_wait_cycles(RandInt(MAX_WAIT));
      ... // HW store buffer function
      BUS_Inject(NCWR, SIM_ADDR, VAL);
      ... // in-flight packet bookkeeping
      SIM_return_to_simul_context(core_id);
    }
]
On the contrary, our BFM-based approach enabled a better scheme for
off-line analysis; we exploit the facts that (1) the multi-processor
simulation is actually being executed single-threaded on a workstation,
that (2) the simulated address space is separated from the simulator's
address space, and that (3) the simulated execution context and native
execution context are also clearly divided. Specifically, we further
instrumented the STM (i.e. our HAL) such that we add a log entry to a
global shadow data structure whenever there is a relevant activity from
the current core, such as a transactional memory access or commit
request. Whenever a new entry is appended to the shadow data structure,
a global check of the consistency protocol is performed as well; for
instance, if the log indicates that the current transaction has a
conflicting memory access with another transaction but both are allowed
to commit, the simulation immediately reports an error with accurate
conflict information. Note that the global shadow data structure is
kept inside the simulator's context, and therefore such an error check
is thread-safe and does not consume any simulation cycles.
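The shadow-log check can be pictured roughly as follows. This is a sketch under assumed names (shadow_init, shadow_on_read, shadow_on_commit); the real instrumentation lives in the HAL and the real check encodes the STM's full conflict rule rather than this simplified intersection test.

    #include <cstdint>
    #include <cstdio>
    #include <set>
    #include <vector>

    // Kept in the simulator's address space, not the simulated one, so logging
    // and checking consume no simulation cycles and need no locks: all SW
    // fibers are interleaved by the single-threaded simulator.
    static std::vector<std::set<uint64_t>> read_set;   // per core, current transaction
    static uint64_t current_cycle = 0;                 // shared with HDL simulation

    void shadow_init(int num_cores) { read_set.assign(num_cores, {}); }

    void shadow_on_read(int core, uint64_t addr) { read_set[core].insert(addr); }

    void shadow_on_commit(int core, const std::vector<uint64_t>& write_set) {
        // Conservatively flag any overlap between a committing write set and
        // another core's live read set; the full check also tracks whether the
        // conflicting reader is itself later allowed to commit.
        for (size_t other = 0; other < read_set.size(); ++other) {
            if ((int)other == core) continue;
            for (uint64_t w : write_set)
                if (read_set[other].count(w))
                    std::fprintf(stderr,
                        "cycle %llu: core %d commits 0x%llx, read by core %zu\n",
                        (unsigned long long)current_cycle, core,
                        (unsigned long long)w, other);
        }
        read_set[core].clear();   // this core's transaction is finished
    }

Because the logged simulation cycle is the same one driving the HDL simulator, an entry like the error line above can be lined up directly against the hardware waveforms, which is the kind of SW-to-HW correlation discussed in Section 4.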
3.2 Our Co-simulation Environment: Imple-
mentation
This subsection details the implementation of our
co-simulation environment where all the issues discussed in
previous subsections are resolved. Figure 4 depicts the block
diagram of our co-simulation environment. Our HDL de-
sign is simulated alongside the library HDL of our FPGA
framework, whose interconnect pin-outs are connected to
the network simulator through the PLI mechanism. Our
BFM implementation is treated as a traffic generator by
the interconnection network simulator, which is dynamically
linked at runtime. The BFM module, network simulator and
HDL simulator represent the hardware part of the system in
this simulation environment. The software part is the whole
application and STM software, unmodified except for the HAL, which is
now built upon the BFM API (Table 2).

[Figure 6: Execution sequence: From simulator context to SW-context.

  BFM (clock callback driven by the Network Simulator):
    Clock() {
      for (i = 0; i < Cores_Per_Node; i++) {
        if (wait_counter[i]) {              // idle cycles
          if (--wait_counter[i] == 0)
            SIM_return_core_context(i);
        } else {                            // waiting for packet
          if (BUS_is_packet_done(packet[i]))
            SIM_return_core_context(i);
        }
      }
    }
]

Each application thread is implemented as a fiber (co-routine)
whose context switching is managed by BFM. Specifically,
we used the POSIX makecontext(3) and swapcontext(3)
mechanisms for fiber implementation. We instantiate two
BFM modules (CPU nodes) on the simulator with each BFM
module executing four SW threads (CPU cores) at a time,
which faithfully models our system configuration (see Fig-
ure 3).
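The fiber mechanism can be sketched with the same POSIX calls the paper names. The structure below is an illustrative reconstruction, not the authors' code, and omits error handling; the two SIM_return_* names come from Figures 5 and 6.

    #include <ucontext.h>
    #include <vector>

    static const int kCoresPerNode = 4;
    static ucontext_t simulator_ctx;                  // the main (simulator) fiber
    static std::vector<ucontext_t> core_ctx(kCoresPerNode);
    static std::vector<char*>      core_stack(kCoresPerNode);

    extern void sw_model_entry();                     // runs application + STM + HAL

    void create_core_fibers() {
        for (int i = 0; i < kCoresPerNode; ++i) {
            core_stack[i] = new char[256 * 1024];
            getcontext(&core_ctx[i]);
            core_ctx[i].uc_stack.ss_sp   = core_stack[i];
            core_ctx[i].uc_stack.ss_size = 256 * 1024;
            core_ctx[i].uc_link          = &simulator_ctx;  // resume simulator on exit
            makecontext(&core_ctx[i], sw_model_entry, 0);
        }
    }

    // Called by the BFM when a SW context issues a blocking SIM_* call (Figure 5).
    void SIM_return_to_simul_context(int core_id) {
        swapcontext(&core_ctx[core_id], &simulator_ctx);
    }

    // Called from the BFM's Clock() when core_id's wait is over (Figure 6).
    void SIM_return_core_context(int core_id) {
        swapcontext(&simulator_ctx, &core_ctx[core_id]);
    }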
Table 2 summarizes the API of our processor BFM, which
is called by the software model. Noticeably, the API pro-
vides separate methods for normal (cached), non-coherent,
and uncached accesses as well as flush and atomic opera-
tions. Separation of these methods is required for accurate
implementation of TSO memory consistency as explained in
the previous subsection.
Figure 5 shows how execution flows from a SW context
(i.e. a fiber executing the SW model) to the simulator con-
text (i.e. the main fiber for simulation). As in normal execu-
tion, whenever the user application reads a shared variable,
the code jumps to and executes the TM_READ function in the
STM library. Note that the application has executed na-
tively up to this point and has not yet consumed a single
simulation cycle. However, instead of actually sending a
transactional read message to the FPGA, the HAL part of
the STM calls into the BFM API (SIM_Noncoh_write) which
eventually injects a packet into the simulator (BUS_Inject)
and switches context to simulator execution
(SIM_return_simulator). However, before injecting each
packet, the BFM adds random idle cycles (SIM_wait_cycles).
Figure 6 shows execution flow from the simulator context
to SW context. The network simulator, which performs sim-
ple cycle-based simulation, calls clock() function for each
BFM at each simulation cycle. Since the software context is
either idle-waiting or waiting for packet transfer, the BFM
checks those conditions and resumes any software model that
is ready by context-switching back to the software model
(SIM_return_core_context).
4. RESULTS AND DISCUSSION
Our new co-simulation environment (Section 3.2) was ex-
tremely useful for verifying the functional correctness of our
system. With this environment, we first simulated our old
design and confirmed that the old system fails at cases like
Figure 2. Note that the new environment is able to generate
such cases, while our previous ISS-based simulation wasn’t;
also it is easy to track down errors in this environment, which

Citations
Proceedings ArticleDOI
12 Mar 2016
TL;DR: This work proposes, selective caching, wherein it disallow GPU caching of any memory that would require coherence updates to propagate between the CPU and GPU, thereby decoupling the GPU from vendor-specific CPU coherence protocols.
Abstract: Cache coherence is ubiquitous in shared memory multiprocessors because it provides a simple, high performance memory abstraction to programmers. Recent work suggests extending hardware cache coherence between CPUs and GPUs to help support programming models with tightly coordinated sharing between CPU and GPU threads. However, implementing hardware cache coherence is particularly challenging in systems with discrete CPUs and GPUs that may not be produced by a single vendor. Instead, we propose, selective caching, wherein we disallow GPU caching of any memory that would require coherence updates to propagate between the CPU and GPU, thereby decoupling the GPU from vendor-specific CPU coherence protocols. We propose several architectural improvements to offset the performance penalty of selective caching: aggressive request coalescing, CPU-side coherent caching for GPU-uncacheable requests, and a CPU-GPU interconnect optimization to support variable-size transfers. Moreover, current GPU workloads access many read-only memory pages; we exploit this property to allow promiscuous GPU caching of these pages, relying on page-level protection, rather than hardware cache coherence, to ensure correctness. These optimizations bring a selective caching GPU implementation to within 93% of a hardware cache-coherent implementation without the need to integrate CPUs and GPUs under a single hardware coherence protocol.

44 citations


Cites background from "A case of system-level hardware/sof..."

  • ...Such approaches also incur highly coordinated design and verification effort by both CPU and GPU vendors [24] that is challenging when multiple vendors wish to integrate existing CPU and GPU designs in a timely manner....


  • ...Building scalable, high-performance cache coherence requires a holistic system that strikes a balance between directory storage overhead, cache probe bandwidth, and application characteristics [8, 24, 33, 36, 54, 55, 58]....


Journal ArticleDOI
TL;DR: An ideal security verification solution naturally handling both hardware and software components is sketched, and an evaluation of formal verification methods that have already been proposed for mixed hardware/software systems are proposed with regards to the ideal method.
Abstract: Critical and privacy-sensitive applications of smart and connected objects such as health-related objects are now common, thus raising the need to design these objects with strong security guarantees. Many recent works offer practical hardware-assisted security solutions that take advantage of a tight cooperation between hardware and software to provide system-level security guarantees. Formally and consistently proving the efficiency of these solutions raises challenges since software and hardware verifications approaches generally rely on different representations. The paper first sketches an ideal security verification solution naturally handling both hardware and software components. Next, it proposes an evaluation of formal verification methods that have already been proposed for mixed hardware/software systems, with regards to the ideal method. At last, the paper presents a conceptual approach to this ideal method relying on ProVerif, and applies this approach to a remote attestation system (SMART).

14 citations


Cites methods from "A case of system-level hardware/sof..."

  • ...As disjoint verification of hardware and software naturally suffers from the considerable manual effort needed for finding a good abstraction that could both be proved to be a refinement of the hardware and be used as a base for verifying the software, some research work has been done to verify hardware/software co-designs as a whole [20,23]....


Journal ArticleDOI
TL;DR: A SystemC transaction level modelling wrapping library that can be used for the assertion of system properties, protocol compliance, or fault injection and has been successfully applied to the robustness verification of the on-board boot software of the Instrument Control Unit of the Solar Orbiter's Energetic Particle Detector.
Abstract: This paper presents the design of a SystemC transaction level modelling wrapping library that can be used for the assertion of system properties, protocol compliance, or fault injection. The library uses C++ virtual table hooks as a dynamic binary instrumentation technique to inline wrappers in the TLM2 transaction path. This technique can be applied after the elaboration phase and needs neither source code modifications nor recompilation of the top level SystemC modules. The proposed technique has been successfully applied to the robustness verification of the on-board boot software of the Instrument Control Unit of the Solar Orbiter's Energetic Particle Detector.

3 citations


Cites background from "A case of system-level hardware/sof..."

  • ...Another work [16] presents a system-level codesign and coverification case study....


01 Jan 2013
TL;DR: A software profiler called AddressTracer is proposed that is accurately able to evaluate performance matrices of any specific software portion and provides up to 50.15% improvement in accuracy of profiling software compared to Gprof and 6.89% compared to Airwolf.
Abstract: Embedded systems are a mixture of software running on a microprocessor and application-specific hardware. There are many co-design methodologies that are used to design embedded systems. One of them is Hardware/Software co-design methodology which requires an appropriate profiler to detect the software portions that contribute to a large percentage of program execution and cause performance bottleneck. Detecting these software portions improves the system efficiency where these portions are either reprogrammed to eliminate the performance bottleneck or moved to the hardware domain gaining the advantages of this domain. There are profiling tools used to profile software programs such as GNU Gprof profiler. GNU Gprof integrates an extra code with the software program to be profiled causing inaccurate results and a significant execution time overhead. To address these issues, this paper proposes a software profiler called AddressTracer that is accurately able to evaluate performance matrices of any specific software portion. A set of benchmarks, Dijkstra, Secure Hash Algorithm, and Bitcount are profiled using AddressTracer, Airwolf and GNU software profiling tool (Gprof), for a quantitative comparison. The achieved results show that AddressTracer gives accurate profiling results compared to Gprof and Airwolf profilers. AddressTracer provides up to 50.15% improvement in accuracy of profiling software compared to Gprof and 6.89% compared to Airwolf. Furthermore, AddressTracer is a non-intrusive profiler which does not cause any performance overhead.

2 citations


Additional excerpts

  • ...modules of embedded systems are realized in software [1]....


Florian Lugou
01 Sep 2015
TL;DR: In this paper, an ideal security verification solution for mixed hardware/software systems is presented, relying on ProVerif, and applied to a remote at-testation system (SMART).
Abstract: Critical and private applications of smart and connected objects such as health-related objects are now common, thus raising the need to design these objects with strong security guarantees. Many re- cent works offer practical hardware-assisted security solutions that take advantage of a tight cooperation between hardware and software to provide system-level security guarantees. Formally and consistently proving the efficiency of these solutions raises challenges since software and hardware verifications approaches generally rely on different representations. The paper first sketches an ideal security verification solution naturally handling both hardware and software components. Next, it proposes an evaluation of formal verification methods that have already been pro- posed for mixed hardware/software systems, with regards to the ideal method. At last, the paper presents a conceptual approach to this ideal method relying on ProVerif, and applies this approach to a remote at- testation system (SMART).

2 citations

References
Journal ArticleDOI
12 Jun 2005
TL;DR: The goals are to provide easy-to-use, portable, transparent, and efficient instrumentation, and to illustrate Pin's versatility, two Pintools in daily use to analyze production software are described.
Abstract: Robust and powerful software instrumentation tools are essential for program analysis tasks such as profiling, performance evaluation, and bug detection. To meet this need, we have developed a new instrumentation system called Pin. Our goals are to provide easy-to-use, portable, transparent, and efficient instrumentation. Instrumentation tools (called Pintools) are written in C/C++ using Pin's rich API. Pin follows the model of ATOM, allowing the tool writer to analyze an application at the instruction level without the need for detailed knowledge of the underlying instruction set. The API is designed to be architecture independent whenever possible, making Pintools source compatible across different architectures. However, a Pintool can access architecture-specific details when necessary. Instrumentation with Pin is mostly transparent as the application and Pintool observe the application's original, uninstrumented behavior. Pin uses dynamic compilation to instrument executables while they are running. For efficiency, Pin uses several techniques, including inlining, register re-allocation, liveness analysis, and instruction scheduling to optimize instrumentation. This fully automated approach delivers significantly better instrumentation performance than similar tools. For example, Pin is 3.3x faster than Valgrind and 2x faster than DynamoRIO for basic-block counting. To illustrate Pin's versatility, we describe two Pintools in daily use to analyze production software. Pin is publicly available for Linux platforms on four architectures: IA32 (32-bit x86), EM64T (64-bit x86), Itanium®, and ARM. In the ten months since Pin 2 was released in July 2004, there have been over 3000 downloads from its website.

4,019 citations

Book
01 Jan 2004
TL;DR: This book offers a detailed and comprehensive presentation of the basic principles of interconnection network design, clearly illustrating them with numerous examples, chapter exercises, and case studies, allowing a designer to see all the steps of the process from abstract design to concrete implementation.
Abstract: One of the greatest challenges faced by designers of digital systems is optimizing the communication and interconnection between system components. Interconnection networks offer an attractive and economical solution to this communication crisis and are fast becoming pervasive in digital systems. Current trends suggest that this communication bottleneck will be even more problematic when designing future generations of machines. Consequently, the anatomy of an interconnection network router and science of interconnection network design will only grow in importance in the coming years. This book offers a detailed and comprehensive presentation of the basic principles of interconnection network design, clearly illustrating them with numerous examples, chapter exercises, and case studies. It incorporates hardware-level descriptions of concepts, allowing a designer to see all the steps of the process from abstract design to concrete implementation. ·Case studies throughout the book draw on extensive author experience in designing interconnection networks over a period of more than twenty years, providing real world examples of what works, and what doesn't. ·Tightly couples concepts with implementation costs to facilitate a deeper understanding of the tradeoffs in the design of a practical network. ·A set of examples and exercises in every chapter help the reader to fully understand all the implications of every design decision.

3,233 citations

Proceedings ArticleDOI
19 Jun 2010
TL;DR: The sources of these performance and energy overheads in general-purpose processing systems are explored by quantifying the overheads of a 720p HD H.264 encoder running on a general- Purpose CMP system and exploring methods to eliminate these overheads by transforming the CPU into a specialized system for H. 264 encoding.
Abstract: Due to their high volume, general-purpose processors, and now chip multiprocessors (CMPs), are much more cost effective than ASICs, but lag significantly in terms of performance and energy efficiency. This paper explores the sources of these performance and energy overheads in general-purpose processing systems by quantifying the overheads of a 720p HD H.264 encoder running on a general-purpose CMP system. It then explores methods to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding. We evaluate the gains from customizations useful to broad classes of algorithms, such as SIMD units, as well as those specific to particular computation, such as customized storage and functional units. The ASIC is 500x more energy efficient than our original four-processor CMP. Broadly applicable optimizations improve performance by 10x and energy by 7x. However, the very low energy costs of actual core ops (100s fJ in 90nm) mean that over 90% of the energy used in these solutions is still "overhead". Achieving ASIC-like performance and efficiency requires algorithm-specific optimizations. For each sub-algorithm of H.264, we create a large, specialized functional unit that is capable of executing 100s of operations per instruction. This improves performance and energy by an additional 25x and the final customized CMP matches an ASIC solution's performance within 3x of its energy and within comparable area.

460 citations

Book
12 Jan 2007
TL;DR: This book presents an overview of the state of the art in the design and implementation of transactional memory systems, as of early summer 2006.
Abstract: The advent of multicore processors has renewed interest in the idea of incorporating transactions into the programming model used to write parallel programs. This approach, known as transactional memory, offers an alternative, and hopefully better, way to coordinate concurrent threads. The ACI (atomicity, consistency, isolation) properties of transactions provide a foundation to ensure that concurrent reads and writes of shared data do not produce inconsistent or incorrect results. At a higher level, a computation wrapped in a transaction executes atomically – either it completes successfully and commits its result in its entirety or it aborts. In addition, isolation ensures the transaction produces the same result as if no other transactions were executing concurrently. Although transactions are not a parallel programming panacea, they shift much of the burden of synchronizing and coordinating parallel computations from a programmer to a compiler, runtime system, and hardware. The challenge for the system implementers is to build an efficient transactional memory infrastructure. This book presents an overview of the state of the art in the design and implementation of transactional memory systems, as of early summer 2006.

442 citations


"A case of system-level hardware/sof..." refers background in this paper

  • ...Transactional Memory (TM) [9] is an abstract programming model that aims to greatly simplify parallel programming....


Proceedings ArticleDOI
25 Apr 2007
TL;DR: Why PTLsim's x86 focus is highly relevant, and the full system simulation results are used to demonstrate the pitfalls of userspace only simulation, are described.
Abstract: In this paper, we introduce PTLsim, a cycle accurate full system x86-64 microprocessor simulator and virtual machine. PTLsim models a modern superscalar out of order x86-64 processor core at a configurable level of detail ranging from RTL-level models of all key pipeline structures, caches and devices up to full-speed native execution on the host CPU. Unlike other microarchitectural simulators, PTLsim targets the real commercially available x86 ISA, rather than a discontinued architecture with limited tools and an uncertain future. PTLsim supports several flavors: a single threaded userspace version and a full system version providing an SMT model and the infrastructure for multi-core support. We first describe what it takes to perform cycle accurate modeling of a complete x86 machine at the muop (micro-operation) level, along with the challenges and requirements for effective full system multi-processor capable simulation. We then describe the internal architecture of full system PTLsim and how it interacts with the Xen hypervisor and PTLsim's native mode co-simulation technology. We experimentally evaluate PTLsim's real world accuracy by configuring it like an AMD Athlon 64 machine before running a demanding full system client-server networked benchmark inside PTLsim. We compare the statistics generated by our model with the actual numbers from the real processor to demonstrate PTLsim is accurate to within 5% across all major parameters. We provide a discussion of prior simulation tools, along with their strengths and weaknesses. We describe why PTLsim's x86 focus is highly relevant, and we use our full system simulation results to demonstrate the pitfalls of userspace only simulation. Finally, we conclude by detailing future work

389 citations

Frequently Asked Questions (1)
Q1. What contributions have the authors mentioned in the paper "A case of system-level hardware/software co-design and co-verification of a commodity multi-processor system with custom hardware" ?

This paper presents an interesting system-level co-design and co-verification case study for a non-trivial design where multiple high-performing x86 processors and custom hardware were connected through a coherent interconnection fabric. The authors found that significant extensions need to be made to the conventional BFM methodology in order to capture various data-race cases in simulation, which eventually happen in modern multi-processor systems. The authors demonstrate how such a co-simulation environment can be constructed from existing tools and software.