
2009 Winter Simulation Conference – Modeling Methodology Track, Austin, Texas, December 13-16, 2009, pp. 899-907.
Copyright © 2009 IEEE (DOI 10.1109/WSC.2009.5429718).
PERFORMANCE LIMITATIONS OF BLOCK–MULTITHREADED
DISTRIBUTED–MEMORY SYSTEMS
W.M. Zuberek
Department of Computer Science, Memorial University, St. John's, Canada A1B 3X5
Department of Applied Informatics, University of Life Sciences, 02-787 Warsaw, Poland
ABSTRACT
The performance of modern computer systems is increasingly often limited by long latencies of accesses to the
memory subsystems. Instruction–level multithreading is an architectural approach to tolerating such long latencies
by switching instruction threads rather than waiting for the completion of memory operations. The paper studies
performance limitations in distributed–memory block multithreaded systems and determines conditions for such
systems to be balanced. Event–driven simulation of a timed Petri net model of a simple distributed–memory
system confirms the derived performance results.
1 INTRODUCTION
In modern computer systems, the performance of memory is increasingly often becoming the factor limiting
the performance of the system. Due to continuous progress in manufacturing technologies, until recent years
the performance of processors has been doubling every 18 months (the so–called Moore's law (Hamilton 1999)).
However, the performance of memory chips has been improving by only 10% per year (Rixner et al. 2000), creating
a "performance gap" in matching the processor's performance with the required memory bandwidth. Detailed studies
have shown that the number of processor cycles required to access main memory has been doubling approximately
every six years (Sinharoy 1997). In effect, it is becoming more and more often the case that the performance of
applications depends on the performance of the machine's memory hierarchy, and it is not unusual that as much as
60% of the processor's time is spent waiting for the completion of memory operations (Sinharoy 1997).
Memory hierarchies, and in particular multi–level cache memories, have been introduced to reduce the effective
latency of memory accesses. Cache memories provide efficient access to information when the information is
available at lower levels of the memory hierarchy; occasionally, however, long–latency memory operations are needed
to transfer the information from the higher levels of the memory hierarchy to the lower ones. Extensive research has
focused on reducing and tolerating these large memory access latencies. Techniques for reducing the frequency
and impact of cache misses include hardware and software prefetching (Chen and Baer 1994; Klaiber and Levy
1991), speculative loads and execution (Rogers et al. 1992), and multithreading (Agarwal 1992; Byrd and Holliday
1995; Ungerer et al. 2003; Chaudhry et al. 2005; Emer et al. 2007).
Instruction–level multithreading, and in particular block–multithreading (Agarwal 1992; Boothe and Ranade
1992; Byrd and Holliday 1995), tolerates long–latency memory accesses and synchronization delays by switching
the threads rather than waiting for the completion of a long–latency operation, which can require hundreds or even
thousands of processor cycles (Emer et al. 2007). It is believed that the return on multithreading is among the
highest of computer microarchitectural techniques (Chaudhry et al. 2005).
In distributed–memory systems, the latency of memory accesses is even more pronounced than in centralized–
memory systems because memory access requests need to be forwarded through several intermediate nodes before
they reach their destinations, and then the results need to be sent back to the original nodes. Each of these
"hops" introduces some delay, typically attributed to the switches that control the traffic between the nodes of the
interconnecting network (Govindarajan et al. 1995).
The mismatch of performance among different components of a computer system significantly impacts the
overall performance. If different components of a system are utilized at significantly different levels, the component
which is utilized most intensively will be the first to reach its limit (i.e., utilization close to 100%), and will restrict
the utilization of all other components as well as the performance of the whole system; such a component is called a
bottleneck. A system which contains a bottleneck is unbalanced. In balanced systems, the utilizations of all com-
ponents are (approximately) equal, so the performance of the system is maximized because all system components
reach their performance limits at the same time.
The purpose of this paper is to study performance limitations in distributed–memory block multithreaded
systems by comparing service demands for different components of the system. The derived results are confirmed
by event–driven simulation of a timed Petri net instruction–level model of the analyzed system.
A distributed–memory system with 16 processors connected by a 2–dimensional torus–like network is used as
a running example in this paper; an outline of such a system is shown in Figure 1.
Figure 1: Outline of a 16–processor system.
It is assumed that all messages are routed along the shortest paths. It is also assumed that this routing is done
in a nondeterministic way, i.e., if there are several shortest paths between two nodes, each of them is equally likely
to be used. The average length of the shortest path between two nodes, or the average number of hops (from
one node to another) that a message must perform to reach its destination, is usually determined assuming that
the memory accesses are uniformly distributed over the nodes of the system (a non-uniform distribution of memory
accesses, representing a sort of spatial locality of memory accesses, can be taken into account by adjusting the
average number of hops).
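For the 16–processor torus of Figure 1, this average is easy to verify directly; the following Python sketch (not part of the original paper, included here only as an illustration) enumerates the wrap–around distances from one node to all others:

    # Average shortest-path length (number of hops) in a d x d torus,
    # assuming uniform distribution of accesses over the other nodes.
    def average_hops(d=4):
        total, pairs = 0, 0
        for dx in range(d):
            for dy in range(d):
                if dx == 0 and dy == 0:
                    continue              # skip the source node itself
                # each dimension wraps around, so the distance is the
                # shorter of the two directions along the ring
                total += min(dx, d - dx) + min(dy, d - dy)
                pairs += 1
        return total / pairs

    print(average_hops(4))   # 2.133..., i.e., n_h is approximately 2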
Although many specific details refer to this 16–processor system, most of them can easily be adjusted to other
systems by changing the values of a few model parameters.
Each node of the network shown in Figure 1 is a block multithreaded node composed of a processor,
local memory, and two network interfaces, as shown in Figure 2. The outbound interface (or switch) handles
outgoing traffic, i.e., requests to remote memories originating at this node as well as results of remote accesses
to the memory at this node; the inbound interface handles incoming traffic, i.e., results of remote requests that
"return" to this node and remote requests to access memory at this node.
Figure 2: Outline of a single multithreaded processor.
Figure 2 also shows a queue of ready threads; whenever the processor performs context switching (i.e., switches
from one thread to another), a thread from this queue is selected for execution, and the execution continues until
another context switching is performed. In block multithreading, context switching is performed for all long–
latency memory accesses by 'suspending' the current thread, forwarding the memory access request to the relevant
memory module (local, or remote using the interconnecting network), and selecting another thread for execution.
When the result of this request is received, the status of the thread changes from 'suspended' to 'ready', and the
thread joins the queue of ready threads, waiting for another execution phase on the processor.
The average number of instructions executed between context switchings is called the runlength of a thread, ℓ_t,
which is one of the main modeling parameters. It is directly related to the probability that an instruction requests a
long–latency memory operation.
Another important modeling parameter is the probability of long–latency accesses to local memory, p_ℓ (or remote
memory, p_r = 1 − p_ℓ); in Figure 2 it corresponds to the "decision point" between the Processor and the Memory
Queue. As the value of p_ℓ decreases (or p_r increases), the effects of communication overhead and congestion in the
interconnecting network (and its switches) become more pronounced; for p_ℓ close to 1, the nodes can practically
be considered in isolation.
The (average) number of available threads, n_t, is yet another basic modeling parameter. For very small values
of n_t, queueing effects can be practically neglected, so the performance can be predicted by taking into account
only the delays of the system's components. On the other hand, for large values of n_t, the system can be considered
in saturation, which means that one of its components will be utilized at almost 100%, limiting the utilization
of the other components as well as of the whole system. Identification of such limiting components and improving their
performance is the key to improved performance of the entire system (Jain 1991).
2 TIMED PETRI NET MODEL
Petri nets have become a popular formalism for modeling systems that exhibit parallel and concurrent activities
(Reisig 1985; Murata 1989). In timed nets (Zuberek 1991; Wang 1998), deterministic or stochastic (exponentially
distributed) firing times are associated with transitions, and transition firings are timed events, i.e., tokens are
removed from input places at the beginning of the firing period, and they are deposited in the output places at
the end of this period.
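As an illustration of this firing rule (a minimal sketch with hypothetical names, not the simulator used for the experiments reported here), an event–driven simulation only needs to separate the start of a firing, when input tokens are removed, from its scheduled completion, when output tokens are deposited:

    # Minimal sketch of timed firing semantics: input tokens are removed
    # when a firing starts; output tokens appear when the firing ends.
    import heapq

    marking = {'Pin': 1, 'Pout': 0}
    events = []                      # pending completions: (time, place)
    clock, firing_time = 0.0, 10.0   # e.g., a 10-cycle memory access

    if marking['Pin'] > 0:           # the timed transition is enabled
        marking['Pin'] -= 1          # tokens removed at start of firing
        heapq.heappush(events, (clock + firing_time, 'Pout'))

    while events:
        clock, place = heapq.heappop(events)  # advance simulated time
        marking[place] += 1          # tokens deposited at end of firing

    print(clock, marking)            # 10.0 {'Pin': 0, 'Pout': 1}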
A timed Petri net model of a multithreaded processor at the level of instruction execution is shown in Figure
3. As usual, timed transitions are represented by "thick" bars, and immediate ones by "thin" bars.
The execution of each instruction of the 'running' thread is modeled by transition Trun, a timed transition
with the firing time representing one processor cycle. Place Proc represents the (available) processor (if marked)
and place Ready – the queue of threads waiting for execution. The initial marking of Ready represents the (average)
number of available threads, n_t.
If the processor is available (i.e., Proc is marked) and Ready is not empty, a thread is selected for execution
by firing the immediate transition Tsel. Execution of consecutive instructions of the selected thread is performed
in the loop Pnxt, Trun, Pend and Tnxt. Pend is a free–choice place with the choice probabilities determined by
the runlength, ℓ_t, of the thread. In general, the free–choice probability assigned to Tnxt is equal to (ℓ_t − 1)/ℓ_t,
so if ℓ_t is equal to 10, the probability of Tnxt is 0.9; if ℓ_t is equal to 5, this probability is 0.8, and so on. The
free–choice probability of Tend is just 1/ℓ_t.
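With these probabilities, the runlength is geometrically distributed with mean ℓ_t; a quick sanity check (an illustrative sketch, not the model itself):

    # The free choice between Tnxt and Tend makes the number of executed
    # instructions geometric with mean l_t; verify this by sampling.
    import random

    def sample_runlength(l_t):
        instructions = 1                          # Trun fires at least once
        while random.random() < (l_t - 1) / l_t:  # Tnxt: continue the thread
            instructions += 1
        return instructions                       # Tend: context switch

    samples = [sample_runlength(10) for _ in range(100_000)]
    print(sum(samples) / len(samples))            # close to 10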
If Tend is chosen for firing rather than Tnxt, the execution of the thread ends, a request for a long–latency
access to (local or remote) memory is placed in Mem, and a token is also deposited in Pcsw. The timed transition
Tcsw represents the context switching and is associated with the time required for switching to a new thread,
t_cs. When its firing is finished, another thread is selected for execution (if it is available).
Mem is a free–choice place, with a random choice of either accessing local memory (Tloc) or remote memory
(Trem); in the first case, the request is directed to Lmem where it waits for the availability of Memory, and after
accessing the memory (Tlmem), the thread returns to the queue of waiting threads, Ready. Memory is a shared
place with two conflicting transitions, Trmem (for remote accesses) and Tlmem (for local accesses); the resolution
of this conflict (if both requests are waiting) is based on marking–dependent (relative) frequencies determined by
the numbers of tokens in Lmem and Rmem, respectively.
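This marking–dependent resolution can be sketched as follows (an illustration with hypothetical names; the frequencies are assumed proportional to the queue lengths):

    # Resolve the Tlmem/Trmem conflict with relative frequencies given
    # by the numbers of waiting tokens in Lmem and Rmem.
    import random

    def resolve_memory_conflict(lmem_tokens, rmem_tokens):
        total = lmem_tokens + rmem_tokens
        if total == 0:
            return None                    # no request is waiting
        if random.random() < lmem_tokens / total:
            return 'Tlmem'                 # serve a local request
        return 'Trmem'                     # serve a remote request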
The free–choice probability of Trem, p_r, is the probability of long–latency accesses to remote memory; the
free–choice probability of Tloc is p_ℓ = 1 − p_r.
Requests for remote accesses are directed to Rem, and then, after a sequential delay (the outbound switch
modeled by Sout and Tsout), forwarded to Out, where a random selection is made of one of the four (in this case)
adjacent nodes (all nodes are selected with equal probabilities). Similarly, the incoming traffic is collected from all
neighboring nodes in Inp, and, after a sequential delay (the inbound switch Sinp and Tsinp), forwarded to Dec.
Figure 3: Instruction–level Petri net model of a block multithreaded processor.
Dec is a free–choice place with three transitions sharing it: Tret, which represents the satisfied requests reaching
their "home" nodes; Tgo, which represents requests as well as responses forwarded to another node (another
'hop' in the interconnecting network); and Tmem, which represents remote requests accessing the memory at the
destination node; these remote requests are queued in Rmem and served by Trmem when the memory module
Memory becomes available. The free–choice probabilities associated with Tret, Tgo and Tmem characterize the
interconnecting network (Govindarajan 1997). For a 16–processor system (as in Figure 1), and for memory accesses
uniformly distributed among the nodes of the system, the free–choice probabilities of Tmem and Tgo are 0.5 each for
forward–moving requests, and those of Tret and Tgo are 0.5 each for returning requests.
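These probabilities are consistent with the average number of hops: if Tgo is chosen with probability 0.5 at every visited node, the number of hops of a forward–moving request is geometrically distributed with mean 2, which agrees with n_h ≈ 2 for the 16–processor torus. A quick check (an illustrative sketch only):

    # With Tgo chosen with probability 0.5 at each node, the number of
    # forward hops is geometric with mean 1/0.5 = 2, matching n_h.
    import random

    def forward_hops():
        hops = 1                         # at least one hop to a neighbor
        while random.random() < 0.5:     # Tgo: forwarded yet again
            hops += 1
        return hops

    samples = [forward_hops() for _ in range(100_000)]
    print(sum(samples) / len(samples))   # close to 2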
The traffic outgoing from a node (place Out) is composed of requests and responses forwarded to another
node (transition Tgo), responses to requests from other nodes (transition Trmem), and remote memory requests
originating at this node (transition Trem).
It can be observed that the model does not guarantee that the results of remote memory
accesses return to the requesting ('home') nodes. Although a more detailed model representing the routing of
messages can be developed using colored Petri nets (Zuberek et al. 1998), such a model provides
results which are insignificantly different from those of the simpler model discussed in this section. Consequently, only the
simpler model is used in the performance analysis that follows.
3 PERFORMANCE ANALYSIS
The parameters which characterize the model of the block multithreaded distributed–memory system include:

n_p – the number of processors,
n_t – the number of threads,
ℓ_t – the thread runlength,
t_p – the processor cycle time,
t_m – the memory cycle time,
t_s – the switch delay,
n_h – the average number of hops,
p_ℓ – the probability to access local memory,
p_r – the probability to access remote memory, p_r = 1 − p_ℓ.
For performance analysis, it is convenient to represent all timing information in relative rather than absolute
units, and the processor cycle, t_p, has been assumed as the unit of time. Consequently, all temporal data are
expressed in processor cycles; e.g., t_m = 10 means that the memory cycle time (t_m) is equal to 10 processor cycles,
and t_s = 5 means that the switch delay (t_s) is equal to 5 processor cycles.
For a single cycle of state changes of a thread (i.e., a thread going through the phases of execution, suspension,
and then waiting for another execution), the service demand at the processor is simply the thread runlength, ℓ_t.
The service demand at the memory subsystem has two components, one due to local memory requests and the
other due to requests coming from remote processors. The component due to local requests is the product of the
visit rate (which is the probability of local accesses), p_ℓ, and the memory cycle, t_m. Likewise, the component due to
remote accesses is p_r ∗ t_m; this expression is obtained by taking into account that for each node the requests are
coming from the (N − 1) remote processors, and that remote memory requests are uniformly distributed over the (N − 1)
processors, so the service demand due to remote requests is p_r ∗ t_m.
The service demand due to a single thread (in each processor) at the inbound switch is obtained as follows.
The visit rate to an inbound switch due to a single processor is the product of the probability of remote accesses, p_r,
the average number of hops (in both directions), 2 ∗ n_h, and the switch delay, t_s. Remote memory requests from all N
processors are distributed across the N inbound switches in the multiprocessor system. Hence, the service demand
due to a single thread at an inbound switch is 2 ∗ p_r ∗ n_h ∗ t_s. For the outbound switch, the service demand is
d_so = 2 ∗ p_r ∗ t_s; the number of hops, n_h, does not affect d_so since each remote request and its response go through
the outbound switch once at the source and once at the destination processor (this also explains the factor 2 in
the above formula).
The service demands are thus:

d_p = ℓ_t;
d_m = p_ℓ ∗ t_m + p_r ∗ t_m = t_m;
d_si = 2 ∗ p_r ∗ n_h ∗ t_s;
d_so = 2 ∗ p_r ∗ t_s.
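These formulas translate directly into a small routine that locates the bottleneck as the component with the largest service demand (a sketch with illustrative names, not code from the paper):

    # Service demands of the four components; the largest demand
    # identifies the bottleneck of the system.
    def service_demands(l_t, t_m, t_s, n_h, p_r):
        return {
            'processor':       l_t,                  # d_p
            'memory':          t_m,                  # d_m = p_l*t_m + p_r*t_m
            'inbound switch':  2 * p_r * n_h * t_s,  # d_si
            'outbound switch': 2 * p_r * t_s,        # d_so
        }

    demands = service_demands(l_t=10, t_m=10, t_s=10, n_h=2, p_r=0.5)
    print(max(demands, key=demands.get))   # 'inbound switch' (d_si = 20)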
A balanced system is usually defined as a system in which the service demands for all components are equal
(Jain 1991). So, in a balanced system, ℓ_t = t_m = 2 ∗ p_r ∗ n_h ∗ t_s (since d_so is always smaller than d_si for n_h ≥ 1,
the outbound switch cannot become the system's bottleneck and is therefore disregarded in balancing the system; in
the discussed system this switch is simply "underutilized").
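For ℓ_t = t_m, the balance condition can be solved for the probability of remote accesses, p_r = t_m/(2 ∗ n_h ∗ t_s); a short check under the parameter values used below (an illustrative sketch):

    # Probability of remote accesses balancing the system, obtained
    # from l_t = t_m = 2 * p_r * n_h * t_s (assuming l_t = t_m).
    def balanced_p_r(t_m, n_h, t_s):
        return t_m / (2 * n_h * t_s)

    print(balanced_p_r(t_m=10, n_h=2, t_s=10))   # 0.25, i.e., p_l = 0.75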
Figure 4 shows the utilization of the processors in a 16–processor system as a function of the number of
available threads, n_t, and the probability of long–latency accesses to local memory, p_ℓ, for fixed values of other
parameters.
Since for a 16–processor system n_h ≈ 2 (Zuberek 2000), for t_s = 10 the balance is obtained for p_r = 0.25
(or p_ℓ = 0.75), which is very clearly demonstrated in Figure 4 as the "edge" of the high utilization region. Figure
5 shows the utilization of the processor and the switch as functions of p_r, the probability of accessing remote
memory (the processor utilization plot corresponds to the cross–section of Figure 4 at n_t = 10). It can be observed
that the only region of high utilization of the processor is the one in which the switch is utilized at less than 100% (and then it
is not the bottleneck). The balance corresponds to the intersection point of the two plots.
If the information is uniformly distributed among the nodes of the distributed–memory system, the value of
p_r = (n_p − 1)/n_p, so the utilization of processors is rather low in this case (close to 0.3 in Figure 4). This indicates
that the switches are simply too slow for this system.
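The value close to 0.3 follows from the service demands: when the inbound switch saturates, the processor utilization cannot exceed d_p/d_si (a sketch of this bound, assuming n_h = 2):

    # Upper bound on processor utilization when the inbound switch is
    # the bottleneck: d_p / d_si = l_t / (2 * p_r * n_h * t_s).
    l_t, t_s, n_h, p_r = 10, 10, 2, 15 / 16  # p_r = (n_p - 1)/n_p, n_p = 16
    print(l_t / (2 * p_r * n_h * t_s))       # about 0.27, i.e., close to 0.3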
There are two basic ways to reduce the limiting effects of the switches; one is to use switches with a smaller
switch delay (for example, t_s = 5), and the other is to use parallel switches and to distribute the workload among
them.