Proceedings ArticleDOI

Performance limitations of block-multithreaded distributed-memory systems

01 Jan 2009, pp. 899-907
About: This article is published in Winter Simulation Conference. The article was published on 2009-01-01 and is currently open access. It has received 3 citations to date. The article focuses on the topics: Block (telecommunications) & Memory management.

Summary (2 min read)

1 INTRODUCTION

  • In modern computer systems, the performance of memory is increasingly often becoming the factor limiting the performance of the system.
  • In effect, it is becoming more and more often the case that the performance of applications depends on the performance of the machine's memory hierarchy, and it is not unusual that as much as 60% of the processor's time is spent waiting for the completion of memory operations (Sinharoy 1997).
  • If different components of a system are utilized at significantly different levels, the component which is utilized most intensively will first reach its limit (i.e., utilization close to 100%), and will restrict the utilization of all other elements as well as the performance of the whole system; such an element is called a bottleneck.
  • In block multithreading, context switching is performed for all long-latency memory accesses by 'suspending' the current thread, forwarding the memory access request to the relevant memory module (local, or remote using the interconnecting network) and selecting another thread for execution.

2 TIMED PETRI NET MODEL

  • Petri nets have become a popular formalism for modeling systems that exhibit parallel and concurrent activities (Reisig 1985, Murata 1989).
  • If the processor is available (i.e., Proc is marked) and Ready is not empty, a thread is selected for execution by firing the immediate transition Tsel.
  • The free-choice probability of Tend is just 1/ℓ_t.
  • The timed transition Tcsw represents the context switching and is associated with the time required for the switching to a new thread, tcs.

3 PERFORMANCE ANALYSIS

  • For performance analysis, it is convenient to represent all timing information in relative rather than absolute units, and the processor cycle, tp, has been assumed as the unit of time.
  • Likewise, the component due to remote accesses is p_r * t_m; this expression is obtained by taking into account that for each node the requests are coming from (N - 1) remote processors, and that remote memory requests are uniformly distributed over the (N - 1) processors, so the service demand due to remote requests is p_r * t_m.
  • The service demand due to a single thread (in each processor) at the inbound switch is obtained as follows.
  • There are two basic ways to reduce the limiting effects of the switches; one is to use switches with smaller switch delay (for example, ts = 5), and the other is to use parallel switches and to distribute the workload among them.
  • The balance is now obtained for p_r = 0.5, which is still quite distant from the values corresponding to the uniform distribution of accesses among the nodes of the system.

4 CONCLUDING REMARKS

  • The paper presents a timed Petri net model of a block multithreaded multiprocessor system at the instruction execution level, and analyzes the effects of system bottlenecks on the performance of system components.
  • Balancing the system by improving performance characteristics of its components may sometimes be difficult because the components with improved characteristics may not be available.
  • Since the utilization of processors is probably the simplest indicator of the performance of the whole system, there may be a tendency to keep this utilization high.
  • The results obtained for a 2–dimensional torus–like network are also valid for other interconnecting networks with the same connectivity characteristics.
  • The model needs only a few small changes to represent other multiprocessor systems.



2009 Winter Simulation Conference, Modeling Methodology Track, Austin, Texas, December 13-16, 2009, pp. 899-907.
Copyright © 2009 IEEE (DOI 10.1109/WSC.2009.5429718).
PERFORMANCE LIMITATIONS OF BLOCK–MULTITHREADED
DISTRIBUTED–MEMORY SYSTEMS
W.M. Zuberek
Department of Computer Science, Memorial University, St. John's, Canada A1B 3X5
Department of Applied Informatics, University of Life Sciences, 02-787 Warsaw, Poland
ABSTRACT
The performance of modern computer systems is increasingly often limited by long latencies of accesses to the memory subsystems. Instruction-level multithreading is an architectural approach to tolerating such long latencies by switching instruction threads rather than waiting for the completion of memory operations. The paper studies performance limitations in distributed-memory block multithreaded systems and determines conditions for such systems to be balanced. Event-driven simulation of a timed Petri net model of a simple distributed-memory system confirms the derived performance results.
1 INTRODUCTION
In modern computer systems, the performance of memory is increasingly often becoming the factor limiting the performance of the system. Due to continuous progress in manufacturing technologies, until recent years the performance of processors has been doubling every 18 months (the so-called Moore's law (Hamilton 1999)). However, the performance of memory chips has been improving by only 10% per year (Rixner et al. 2000), creating a "performance gap" in matching the processor's performance with the required memory bandwidth. Detailed studies have shown that the number of processor cycles required to access main memory has been doubling approximately every six years (Sinharoy 1997). In effect, it is becoming more and more often the case that the performance of applications depends on the performance of the machine's memory hierarchy, and it is not unusual that as much as 60% of the processor's time is spent waiting for the completion of memory operations (Sinharoy 1997).
Memory hierarchies, and in particular multi-level cache memories, have been introduced to reduce the effective latency of memory accesses. Cache memories provide efficient access to information when the information is available at lower levels of the memory hierarchy; occasionally, however, long-latency memory operations are needed to transfer the information from the higher levels of the memory hierarchy to the lower ones. Extensive research has focused on reducing and tolerating these large memory access latencies. Techniques for reducing the frequency and impact of cache misses include hardware and software prefetching (Chen and Bauer 1994, Klaiber and Levy 1991), speculative loads and execution (Rogers et al. 1992) and multithreading (Agarwal 1992; Byrd and Holliday 1995; Ungerer et al. 2003; Chaudhry et al. 2005; Emer et al. 2007).
Instruction-level multithreading, and in particular block multithreading (Agarwal 1992; Boothe and Ranada 1992; Byrd and Holliday 1995), tolerates long-latency memory accesses and synchronization delays by switching threads rather than waiting for the completion of a long-latency operation, which can require hundreds or even thousands of processor cycles (Emer et al. 2007). It is believed that the return on multithreading is among the highest in computer microarchitectural techniques (Chaudhry et al. 2005).
In distributed-memory systems, the latency of memory accesses is even more pronounced than in centralized-memory systems because memory access requests need to be forwarded through several intermediate nodes before they reach their destinations, and then the results need to be sent back to the original nodes. Each of the "hops" introduces some delay, typically assigned to the switches that control the traffic between the nodes of the interconnecting network (Govindarajan et al. 1995).
The mismatch of performance among different components of a computer system significantly impacts the overall performance. If different components of a system are utilized at significantly different levels, the component which is utilized most intensively will first reach its limit (i.e., utilization close to 100%), and will restrict the utilization of all other elements as well as the performance of the whole system; such an element is called a bottleneck. A system which contains a bottleneck is unbalanced. In balanced systems, the utilizations of all components are (approximately) equal, so the performance of the system is maximized because all system components reach their performance limits at the same time.
The purpose of this paper is to study performance limitations in distributed-memory block multithreaded systems by comparing service demands for different components of the system. The derived results are confirmed by event-driven simulation of a timed Petri net instruction-level model of the analyzed system.

A distributed memory system with 16 processors connected by a 2-dimensional torus-like network is used as a running example in this paper; an outline of such a system is shown in Figure 1.
Figure 1: Outline of a 16–processor system.
It is assumed that all messages are routed along the shortest paths. It is also assumed that this routing is done in a nondeterministic way, i.e., if there are several shortest paths between two nodes, each of them is equally likely to be used. The average length of the shortest path between two nodes, or the average number of hops (from one node to another) that a message must perform to reach its destination, is usually determined assuming that the memory accesses are uniformly distributed over the nodes of the system (non-uniform distribution of memory accesses, representing a sort of spatial locality of memory accesses, can be taken into account by adjusting the average number of hops).
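
For the 16-processor torus of Figure 1 this average can be computed by direct enumeration; the sketch below (an illustration added to this summary, with hypothetical helper names, rather than code from the paper) yields n_h ≈ 2.13, consistent with the value n_h ≈ 2 used in Section 3.

    # Illustrative sketch: average number of hops n_h in the 4x4 torus of
    # Figure 1, assuming shortest-path routing and memory accesses uniformly
    # distributed over the other 15 nodes.

    def torus_distance(a, b, rows=4, cols=4):
        """Hop count of a shortest path between nodes a and b on a torus."""
        ra, ca = divmod(a, cols)
        rb, cb = divmod(b, cols)
        dr = min(abs(ra - rb), rows - abs(ra - rb))  # wrap-around links
        dc = min(abs(ca - cb), cols - abs(ca - cb))
        return dr + dc

    pairs = [(a, b) for a in range(16) for b in range(16) if a != b]
    n_h = sum(torus_distance(a, b) for a, b in pairs) / len(pairs)
    print(f"n_h = {n_h:.2f}")  # 2.13, i.e. approximately 2
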
Although many specific details refer to this 16-processor system, most of them can easily be adjusted to other systems by changing the values of a few model parameters.
Each node in the network shown in Figure 1 is a block multithreaded processor which contains a processor, local memory, and two network interfaces, as shown in Figure 2. The outbound interface (or switch) handles outgoing traffic, i.e., requests to remote memories originating at this node as well as results of remote accesses to the memory at this node; the inbound interface handles incoming traffic, i.e., results of remote requests that "return" to this node and remote requests to access memory at this node.
Figure 2: Outline of a single multithreaded processor (processor and its ready queue, memory and its memory queue, and the outbound and inbound interfaces connecting the node to the interconnecting network).
Figure 2 also shows a queue of ready threads; whenever the processor performs a context switching (i.e., switches from one thread to another), a thread from this queue is selected for execution, and the execution continues until another context switching is performed. In block multithreading, context switching is performed for all long-latency memory accesses by 'suspending' the current thread, forwarding the memory access request to the relevant memory module (local, or remote using the interconnecting network) and selecting another thread for execution. When the result of this request is received, the status of the thread changes from 'suspended' to 'ready', and the thread joins the queue of ready threads, waiting for another execution phase on the processor.
The average number of instructions executed between context switchings is called the runlength of a thread, ℓ_t, which is one of the main modeling parameters. It is directly related to the probability that an instruction requests a long-latency memory operation.
Another important modeling parameter is the probability of long-latency accesses to local memory, p_ℓ (or remote memory, p_r = 1 - p_ℓ); in Figure 2 it corresponds to the "decision point" between the Processor and the Memory Queue. As the value of p_ℓ decreases (or p_r increases), the effects of communication overhead and congestion in the interconnecting network (and its switches) become more pronounced; for p_ℓ close to 1, the nodes can be practically considered in isolation.
The (average) number of available threads, n_t, is yet another basic modeling parameter. For very small values of n_t, queueing effects can be practically neglected, so the performance can be predicted by taking into account only the delays of the system's components. On the other hand, for large values of n_t, the system can be considered in saturation, which means that one of its components will be utilized at almost 100%, limiting the utilization of the other components as well as of the whole system. Identification of such limiting components and improving their performance is the key to improved performance of the entire system (Jain 1991).
2 TIMED PETRI NET MODEL
Petri nets have become a popular formalism for modeling systems that exhibit parallel and concurrent activities (Reisig 1985, Murata 1989). In timed nets (Zuberek 1991, Wang 1998), deterministic or stochastic (exponentially distributed) firing times are associated with transitions, and transition firings are timed events, i.e., tokens are removed from input places at the beginning of the firing period, and they are deposited to the output places at the end of this period.
A timed Petri net model of a multithreaded processor at the level of instruction execution is shown in Figure 3. As usual, timed transitions are represented by "thick" bars, and immediate ones by "thin" bars.

The execution of each instruction of the 'running' thread is modeled by transition Trun, a timed transition with the firing time representing one processor cycle. Place Proc represents the (available) processor (if marked) and place Ready the queue of threads waiting for execution. The initial marking of Ready represents the (average) number of available threads, n_t.
If the processor is available (i.e., Proc is marked) and Ready is not empty, a thread is selected for execution by firing the immediate transition Tsel. Execution of consecutive instructions of the selected thread is performed in the loop Pnxt, Trun, Pend and Tnxt. Pend is a free-choice place with the choice probabilities determined by the runlength, ℓ_t, of the thread. In general, the free-choice probability assigned to Tnxt is equal to (ℓ_t - 1)/ℓ_t, so if ℓ_t is equal to 10, the probability of Tnxt is 0.9; if ℓ_t is equal to 5, this probability is 0.8, and so on. The free-choice probability of Tend is just 1/ℓ_t.
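
This free choice makes the effective runlength geometrically distributed with mean ℓ_t. A quick sketch (added here for illustration; the helper name is ours, not the paper's) confirms the mean:

    # Quick illustrative check that ending the run with probability 1/l_t
    # after each instruction yields an average runlength of l_t instructions.
    import random

    def sample_runlength(l_t, rng):
        run = 1
        while rng.random() > 1.0 / l_t:   # Tnxt chosen with prob (l_t-1)/l_t
            run += 1
        return run

    rng = random.Random(7)
    for l_t in (5, 10):
        mean = sum(sample_runlength(l_t, rng) for _ in range(100_000)) / 1e5
        print(l_t, round(mean, 2))        # ~5.0 and ~10.0
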
If Tend is chosen for firing rather than Tnxt, the execution of the thread ends, a request for a long-latency access to (local or remote) memory is placed in Mem, and a token is also deposited in Pcsw. The timed transition Tcsw represents the context switching and is associated with the time required for switching to a new thread, t_cs. When its firing is finished, another thread is selected for execution (if one is available).
Mem is a free-choice place, with a random choice of either accessing local memory (Tloc) or remote memory (Trem); in the first case, the request is directed to Lmem where it waits for the availability of Memory, and after accessing the memory (Tlmem), the thread returns to the queue of waiting threads, Ready. Memory is a shared place with two conflicting transitions, Trmem (for remote accesses) and Tlmem (for local accesses); the resolution of this conflict (if both requests are waiting) is based on marking-dependent (relative) frequencies determined by the numbers of tokens in Lmem and Rmem, respectively.

The free-choice probability of Trem, p_r, is the probability of long-latency accesses to remote memory; the free-choice probability of Tloc is p_ℓ = 1 - p_r.
Requests for remote accesses are directed to Rem, and then, after a sequential delay (the outbound switch modeled by Sout and Tsout), forwarded to Out, where a random selection is made of one of the four (in this case) adjacent nodes (all nodes are selected with equal probabilities). Similarly, the incoming traffic is collected from all neighboring nodes in Inp and, after a sequential delay (the inbound switch Sinp and Tsinp), forwarded to Dec. Dec is a free-choice place with three transitions sharing it: Tret, which represents the satisfied requests reaching their "home" nodes; Tgo, which represents requests as well as responses forwarded to another node (another 'hop' in the interconnecting network); and Tmem, which represents remote requests accessing the memory at the destination node; these remote requests are queued in Rmem and served by Trmem when the memory module Memory becomes available. The free-choice probabilities associated with Tret, Tgo and Tmem characterize the interconnecting network (Govindarajan 1997). For a 16-processor system (as in Figure 1), and for memory accesses uniformly distributed among the nodes of the system, the free-choice probabilities of Tmem and Tgo are 0.5 for forward-moving requests, and 0.5 for Tret and Tgo for returning requests.

Figure 3: Instruction-level Petri net model of a block multithreaded processor.
The traffic outgoing from a node (place Out) is composed of requests and responses forwarded to another node (transition Tgo), responses to requests from other nodes (transition Trmem) and remote memory requests originating in this node (transition Trem).
It can be observed that the remote memory access requests do not guarantee that the results of memory accesses return to the requesting ('home') nodes. Although a more detailed model representing the routing of messages can be developed using colored Petri nets (Zuberek et al. 1998), such a more detailed model provides results which are insignificantly different from those of the simpler model discussed in this section. Consequently, only this simpler model is used in the performance analysis that follows.
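
To illustrate how an event-driven simulation of such a model can proceed, the sketch below simulates a single node in isolation (p_ℓ = 1, so the network and its switches are ignored). It is a simplified illustration written for this summary, not the author's simulator: runlengths are sampled geometrically with mean ℓ_t (matching the free-choice probability 1/ℓ_t of Tend), t_m and t_cs are deterministic, and all names and default values are assumptions.

    import heapq
    import random

    def simulate_node(n_t, l_t=10, t_m=10, t_cs=1, horizon=200_000, seed=42):
        """One block-multithreaded node in isolation (p_l = 1): n_t threads
        cycle through execution (Trun), context switching (Tcsw), a local
        memory access (Tlmem) and the Ready queue; returns the processor
        utilization. All times are in processor cycles (t_p = 1)."""
        rng = random.Random(seed)
        ready = n_t - 1            # threads in Ready (one starts running now)
        mem_queue = 0              # requests waiting in Lmem
        mem_busy = False           # state of the Memory place
        proc_busy = True
        busy = 0.0                 # cycles spent firing Trun

        def runlength():           # fire Tnxt until Tend: geometric, mean l_t
            run = 1
            while rng.random() > 1.0 / l_t:
                run += 1
            return run

        run = runlength(); busy += run
        events = [(float(run), 'run_end')]   # min-heap of (time, event)
        now = 0.0
        while now < horizon:
            now, kind = heapq.heappop(events)
            if kind == 'run_end':      # suspend thread, request local memory
                if mem_busy:
                    mem_queue += 1
                else:
                    mem_busy = True
                    heapq.heappush(events, (now + t_m, 'mem_end'))
                heapq.heappush(events, (now + t_cs, 'csw_end'))
            elif kind == 'csw_end':    # Tsel: pick the next ready thread
                if ready > 0:
                    ready -= 1
                    run = runlength(); busy += run
                    heapq.heappush(events, (now + run, 'run_end'))
                else:
                    proc_busy = False
            else:                      # 'mem_end': the thread becomes ready
                if mem_queue > 0:
                    mem_queue -= 1
                    heapq.heappush(events, (now + t_m, 'mem_end'))
                else:
                    mem_busy = False
                if proc_busy:
                    ready += 1
                else:                  # an idle processor grabs it at once
                    proc_busy = True
                    run = runlength(); busy += run
                    heapq.heappush(events, (now + run, 'run_end'))
        return busy / now

    for n_t in (1, 2, 5, 10):          # utilization grows with n_t toward
        print(n_t, round(simulate_node(n_t), 3))   # l_t/(l_t+t_cs) ~ 0.91
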
3 PERFORMANCE ANALYSIS
The parameters which characterize the model of the block multithreaded distributed-memory system include:

  • n_p - the number of processors,
  • n_t - the number of threads,
  • ℓ_t - the thread runlength,
  • t_p - the processor cycle time,
  • t_m - the memory cycle time,
  • t_s - the switch delay,
  • n_h - the average number of hops,
  • p_ℓ - the probability of accessing local memory,
  • p_r - the probability of accessing remote memory, p_r = 1 - p_ℓ.
For performance analysis, it is convenient to represent all timing information in relative rather than absolute units, and the processor cycle, t_p, has been assumed as the unit of time. Consequently, all temporal data are expressed in processor cycles; e.g., t_m = 10 means that the memory cycle time (t_m) is equal to 10 processor cycles, and t_s = 5 means that the switch delay (t_s) is equal to 5 processor cycles.
For a single cycle of state changes of a thread (i.e., a thread going through the phases of execution, suspension, and then waiting for another execution), the service demand at the processor is simply the thread runlength, ℓ_t. The service demand at the memory subsystem has two components, one due to local memory requests and the other due to requests coming from remote processors. The component due to local requests is the product of the visit rate (which is the probability of local accesses), p_ℓ, and the memory cycle, t_m. Likewise, the component due to remote accesses is p_r * t_m; this expression is obtained by taking into account that for each node the requests are coming from (N - 1) remote processors, and that remote memory requests are uniformly distributed over the (N - 1) processors, so the service demand due to remote requests is p_r * t_m.
The service demand due to a single thread (in each processor) at the inbound switch is obtained as follows. The visit rate at the inbound switches due to a single processor is the product of the probability of remote accesses, p_r, and the average number of hops (in both directions), 2 * n_h; each visit occupies a switch for the switch delay, t_s. Remote memory requests from all N processors are distributed across the N inbound switches of the multiprocessor system. Hence, the service demand due to a single thread at an inbound switch is 2 * p_r * n_h * t_s. For the outbound switch, the service demand is d_so = 2 * p_r * t_s; the number of hops, n_h, does not affect d_so, since each remote request and its response go through an outbound switch only once at the source and once at the destination processor (this also explains the factor of 2 in the formula).
The service demands are thus:

  d_p  = ℓ_t ;
  d_m  = p_ℓ * t_m + p_r * t_m = t_m ;
  d_si = 2 * p_r * n_h * t_s ;
  d_so = 2 * p_r * t_s .
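
As a quick plausibility check (added here; the helper and its names are illustrative, not from the paper), the snippet below evaluates the four demands and reports the bottleneck, using the parameter values discussed later in this section:

    # Illustrative helper evaluating the four service demands;
    # all times are in processor cycles (t_p = 1).

    def service_demands(l_t, t_m, t_s, n_h, p_r):
        return {
            "processor (d_p)":        l_t,
            "memory (d_m)":           t_m,                 # p_l*t_m + p_r*t_m
            "inbound switch (d_si)":  2 * p_r * n_h * t_s,
            "outbound switch (d_so)": 2 * p_r * t_s,
        }

    # Accesses uniformly distributed over 16 nodes: p_r = 15/16 (see below).
    d = service_demands(l_t=10, t_m=10, t_s=10, n_h=2, p_r=15 / 16)
    bottleneck = max(d, key=d.get)
    print(bottleneck, d[bottleneck])              # inbound switch, d_si = 37.5
    print(d["processor (d_p)"] / d[bottleneck])   # ~0.27: the bound on the
                                                  # processor utilization
                                                  # ("close to 0.3" below)
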
A balanced system is usually defined as a system in which the service demands for all components are equal (Jain 1991). So, in a balanced system, ℓ_t = t_m = 2 * p_r * n_h * t_s (since d_so is always smaller than d_si for n_h ≥ 1, the outbound switch cannot become the system's bottleneck and is therefore disregarded in balancing the system; in the discussed system this switch is simply "underutilized").
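
Since the first equality requires ℓ_t = t_m, the balance condition can be solved for p_r; a short illustration (assumed helper name, with the values matching the figures discussed next):

    # Solving the balance condition l_t = t_m = 2*p_r*n_h*t_s for p_r.

    def balanced_p_r(t_m, n_h, t_s):
        return t_m / (2 * n_h * t_s)

    print(balanced_p_r(t_m=10, n_h=2, t_s=10))   # 0.25, the "edge" in Figure 4
    print(balanced_p_r(t_m=10, n_h=2, t_s=5))    # 0.50, with faster switches
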
Figure 4 shows the utilization of the processors in a 16-processor system as a function of the number of available threads, n_t, and the probability of long-latency accesses to local memory, p_ℓ, for fixed values of the other parameters.

Since for a 16-processor system n_h ≈ 2 (Zuberek 2000), for t_s = 10 the balance is obtained for p_r = 0.25 (or p_ℓ = 0.75), which is very clearly demonstrated in Figure 4 as the "edge" of the high-utilization region. Figure 5 shows the utilization of the processor and the switch as functions of p_r, the probability of accessing remote memory (the processor utilization plot corresponds to the cross-section of Figure 4 at n_t = 10). It can be observed that the only region of high utilization of the processor is where the switch is utilized at less than 100% (and then it is not the bottleneck). The balance corresponds to the intersection point of the two plots.
If the information is uniformly distributed among the nodes of the distributed-memory system, the value of p_r is (n_p - 1)/n_p, so the utilization of processors is rather low in this case (close to 0.3 in Figure 4). This indicates that the switches are simply too slow for this system.
There are two basic ways to reduce the limiting effects of the switches; one is to use switches with a smaller switch delay (for example, t_s = 5), and the other is to use parallel switches and to distribute the workload among them.

Citations
Proceedings ArticleDOI
09 Mar 2015
TL;DR: This paper proposes a multi-agent distributed hybrid reactive re-enforcement learning technique based on selected agent intermediary sub-goals using a learning reward scheme in a distributed-computing memory setting for faster and efficient agent learning.
Abstract: Collaborative monitoring of large infrastructures, such as military, transportation and maritime systems are decisive issues in many surveillance, protection, and security applications. In many of these applications, dynamic multi-agent systems using reinforcement learning for agents’ autonomous path planning, where agents could be moving randomly to reach their respective goals and avoiding topographical obstacles intelligently, becomes a challenging problem. This is specially so in a dynamic agent environment. In our prior work we presented an intelligent multi-agent hybrid reactive and reinforcement learning technique for collaborative autonomous agent path planning for monitoring Critical Key Infrastructures and Resources (CKIR) in a geographically and a computationally distributed systems. Here agent monitoring of large environments is reduced to monitoring of relatively smaller track-able geographically distributed agent environment regions. In this paper we tackle this problem in the challenging case of complex and cluttered environments, where agents’ initial random-walk paths become challenging and relatively nonconverging. Here we propose a multi-agent distributed hybrid reactive re-enforcement learning technique based on selected agent intermediary sub-goals using a learning reward scheme in a distributed-computing memory setting. Various case study scenarios are presented for convergence study to the shortest minimum-amount-of-time exploratory steps for faster and efficient agent learning. In this work the distributed dynamic agent communication is done via a Message Passing Interface (MPI).

6 citations

Proceedings ArticleDOI
27 Jun 2016
TL;DR: This paper presents a hybrid master-slave and peer-to-peer system architecture, where each distributed agent knows only of a given master node, is only concerned with its assigned work load, has a limited knowledge of the environment and can, collaboratively with other agents, share learned information of the environments over a communication network.
Abstract: In many large infrastructures, such as military battlefields, transportation and maritime systems spanning hundreds of miles at a time, collaborative multi-agent based monitoring is important. Agent Reinforcement Learning (RL), in general, becomes more challenging in a dynamic complex cluttered environment for autonomous path planning, where agents could be moving randomly to reach their respective goals. In our previous work we presented a hybrid master-slave and peer-to-peer system architecture, where each distributed agent knows only of a given master node, is only concerned with its assigned work load, has a limited knowledge of the environment and can, collaboratively with other agents, share learned information of the environment over a communication network. In this paper we extend our previous work and focus on (a) the study of the performance of said system and the effect of the agents' random walks on the overall system agent learning speed, when each of the distributed agents, after the random walk phase, starts its exploratory trials independently of the other agents, asynchronously, and immediately after it finishes its first exploratory trial towards a sub-goal or after its random walk phase, without waiting for the slowest agent to finish its first random walk or its first exploratory phase toward a sub-goal. (b) the effect on the agent learning speed, of using an environment-clutter-index to select agent sub-goals with the aim of reducing the agent initial random walk steps and (c) the effect of agent sharing/or not sharing environment information on the agent learning speed in such scenarios.

3 citations

Journal ArticleDOI
TL;DR: The paper studies performance limitations in distributed-memory block multithreaded systems and determines conditions for such systems to be balanced by event-driven simulation of a timed Petri net model of a simple distributed- memory system.
Abstract: The performance of modern computer systems is increasingly often limited by long latencies of accesses to the memory subsystems. Instruction-level multithreading is an architectural approach to tolerating such long latencies by switching instruction threads rather than waiting for the completion of memory operations. The paper studies performance limitations in distributed-memory block multithreaded systems and determines conditions for such systems to be balanced. Event-driven simulation of a timed Petri net model of a simple distributed-memory system confirms the derived performance results.

1 citation

References
Journal ArticleDOI
01 Apr 1989
TL;DR: The author proceeds with introductory modeling examples, behavioral and structural properties, three methods of analysis, subclasses of Petri nets and their analysis, and one section is devoted to marked graphs, the concurrent system model most amenable to analysis.
Abstract: Starts with a brief review of the history and the application areas considered in the literature. The author then proceeds with introductory modeling examples, behavioral and structural properties, three methods of analysis, subclasses of Petri nets and their analysis. In particular, one section is devoted to marked graphs, the concurrent system model most amenable to analysis. Introductory discussions on stochastic nets with their application to performance modeling, and on high-level nets with their application to logic programming, are provided. Also included are recent results on reachability criteria. Suggestions are provided for further reading on many subject areas of Petri nets. >

10,755 citations

Journal ArticleDOI
TL;DR: This survey paper explains and classifies the explicit multithreaded techniques in research and in commercial microprocessors.
Abstract: Hardware multithreading is becoming a generally applied technique in the next generation of microprocessors. Several multithreaded processors are announced by industry or already into production in the areas of high-performance microprocessors, media, and network processors.A multithreaded processor is able to pursue two or more threads of control in parallel within the processor pipeline. The contexts of two or more threads of control are often stored in separate on-chip register sets. Unused instruction slots, which arise from latencies during the pipelined execution of single-threaded programs by a contemporary microprocessor, are filled by instructions of other threads within a multithreaded processor. The execution units are multiplexed between the thread contexts that are loaded in the register sets.Underutilization of a superscalar processor due to missing instruction-level parallelism can be overcome by simultaneous multithreading, where a processor can issue multiple instructions from multiple threads each cycle. Simultaneous multithreaded processors combine the multithreading technique with a wide-issue superscalar processor to utilize a larger part of the issue bandwidth by issuing instructions from different threads simultaneously.Explicit multithreaded processors are multithreaded processors that apply processes or operating system threads in their hardware thread slots. These processors optimize the throughput of multiprogramming workloads rather than single-thread performance. We distinguish these processors from implicit multithreaded processors that utilize thread-level speculation by speculatively executing compiler- or machine-generated threads of control that are part of a single sequential program.This survey paper explains and classifies the explicit multithreading techniques in research and in commercial microprocessors.

319 citations

Journal ArticleDOI
TL;DR: Any "state" description of timed nets must take into account the distribution of tokens in places as well as in (firing) transitions, and the state space of timed nets can be quite different from the space of reachable markings.
Abstract: In timed Petri nets, the transitions fire in “real-time”, i.e., there is a (deterministic or random) firing time associated with each transition, the tokens are removed from input places at the beginning of firing, and are deposited into output places when the firing terminates (they may be considered as remaining “in” the transitions for the firing time). Any “state” description of timed nets must thus take into account the distribution of tokens in places as well as in (firing) transitions, and the state space of timed nets can be quite different from the space of reachable markings. Performance analysis of timed nets is based on stationary probabilities of states. For bounded nets stationary probabilities are determined from a finite set of simultaneous linear equilibrium equations. For unbounded nets the state space is infinite, the set of linear equilibrium equations is also infinite and it must be reduced to a finite set of nonlinear equations for effective solution. Simple examples illustrate capabilities of timed Petri net models.

236 citations

Journal ArticleDOI
TL;DR: An analytical performance model for multithreaded processors that includes cache interference, network contention, context-switching overhead, and data-sharing effects is presented and indicates that processors can substantially benefit from multithreading, even in systems with small caches, provided sufficient network bandwidth exists.
Abstract: An analytical performance model for multithreaded processors that includes cache interference, network contention, context-switching overhead, and data-sharing effects is presented. The model is validated through the author's simulations and by comparison with previously published simulation results. The results indicate that processors can substantially benefit from multithreading, even in systems with small caches, provided sufficient network bandwidth exists. Caches that are much larger than the working-set sizes of individual processes yield close to full processor utilization with as few as two to four contexts. Smaller caches require more contexts to keep the processor busy, while caches that are comparable in size to the working-sets of individual processes cannot achieve a high utilization regardless of the number of contexts. Increased network contention due to multithreading has a major effect on performance. The available network bandwidth and the context-switching overhead limits the best possible utilization. >

188 citations

BookDOI
01 Jan 1998

125 citations

Frequently Asked Questions (1)
Q1. What contributions have the authors mentioned in the paper "Performance limitations of block-multithreaded distributed-memory systems"?

The paper studies performance limitations in distributed–memory block multithreaded systems and determines conditions for such systems to be balanced.