Performance limitations of block-multithreaded distributed-memory systems
Summary
- In modern computer systems, memory performance is increasingly the factor that limits the performance of the whole system.
- In effect, application performance increasingly depends on the performance of the machine's memory hierarchy, and it is not unusual for as much as 60% of a processor's time to be spent waiting for memory operations to complete (Sinharoy 1997).
- If different components of a system are utilized at significantly different levels, the component which is utilized most intensively will first reach its limit (i.e., utilization close to 100%), and will restrict the utilization of all other elements as well as the performance of the whole system; such an element is called a bottleneck.
- In block multithreading, a context switch is performed on every long–latency memory access: the current thread is suspended, the memory access request is forwarded to the relevant memory module (local, or remote through the interconnecting network), and another thread is selected for execution.
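The switch-on-long-latency-access policy described above can be illustrated with a small, deliberately simplified simulation. This is only a sketch under stated assumptions: a single processor, a fixed run length between memory accesses, a fixed memory latency with no network or switch delays, and a switch cost charged every time a thread is dispatched. The function and parameter names (`simulate_block_multithreading`, `run_length`, `mem_latency`, `switch_time`, `horizon`) are hypothetical and do not come from the paper.

```python
from collections import deque
import heapq


def simulate_block_multithreading(n_threads, run_length, mem_latency,
                                  switch_time, horizon):
    """Return the fraction of time the processor spends executing threads.

    Assumptions (not from the paper): fixed run length, fixed memory
    latency, one processor, FIFO selection of ready threads.
    """
    ready = deque(range(n_threads))   # threads waiting for the processor
    pending = []                      # heap of (completion_time, thread_id)
    now = 0
    busy = 0
    while now < horizon:
        # Threads whose memory access has completed become ready again.
        while pending and pending[0][0] <= now:
            _, tid = heapq.heappop(pending)
            ready.append(tid)
        if not ready:
            # Processor idles until the earliest memory access completes.
            now = pending[0][0]
            continue
        tid = ready.popleft()
        now += switch_time            # cost of switching to the thread
        now += run_length             # thread runs until its next access
        busy += run_length
        # The thread is suspended; its request completes after the latency.
        heapq.heappush(pending, (now + mem_latency, tid))
    return busy / now


if __name__ == "__main__":
    for n in (1, 2, 4):
        u = simulate_block_multithreading(n, 10, 10, 0, 1000)
        print(n, "threads -> processor utilization", u)
```

With a single thread the processor waits out every memory access, while a few additional threads are enough to hide the latency in this idealized setting. A real system also has to account for switch and network delays.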
2 TIMED PETRI NET MODEL
- Petri nets have become a popular formalism for modeling systems that exhibit parallel and concurrent activities (Reisig 1985, Murata 1989).
- If the processor is available (i.e., Proc is marked) and Ready is not empty, a thread is selected for execution by firing the immediate transition Tsel.
- The free–choice probability of Tend is just 1/ℓt.
- The timed transition Tcsw represents context switching and is associated with the time required to switch to a new thread, tcs.
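One way to read the free–choice probability 1/ℓt is as the probability that a thread ends its run after any given instruction. The sketch below assumes exactly that interpretation, that the choice is made independently after each instruction, and that ℓt is expressed in instructions. Under those assumptions the run length is geometrically distributed with mean ℓt. The helper name `sample_run_length` is hypothetical.

```python
import random


def sample_run_length(mean_run_length, rng):
    """Sample the number of instructions a thread executes before ending.

    After each instruction the thread ends with probability
    1 / mean_run_length (the assumed reading of the free-choice
    probability of Tend); otherwise it executes another instruction.
    """
    p_end = 1.0 / mean_run_length
    steps = 1
    while rng.random() >= p_end:
        steps += 1
    return steps


if __name__ == "__main__":
    rng = random.Random(1)
    samples = [sample_run_length(10, rng) for _ in range(20000)]
    print("average run length:", sum(samples) / len(samples))
```

The sample average should settle close to the chosen mean run length, which is the sense in which ℓt parameterizes the model.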
3 PERFORMANCE ANALYSIS
- For performance analysis, it is convenient to represent all timing information in relative rather than absolute units, so the processor cycle, tp, is taken as the unit of time.
- Likewise, the component due to remote accesses is pr ∗ tm: each node receives requests from (N − 1) remote processors, and each of those processors distributes its remote memory requests uniformly over (N − 1) nodes, so the service demand due to remote requests is (N − 1) ∗ pr ∗ tm / (N − 1) = pr ∗ tm.
- The service demand due to a single thread (in each processor) at the inbound switch is obtained as follows.
- There are two basic ways to reduce the limiting effects of the switches: one is to use switches with a smaller switch delay (for example, ts = 5), and the other is to use parallel switches and distribute the workload among them.
- The balance is now obtained for pr = 0.5, which is still quite distant from the values corresponding to the uniform distribution of accesses among the nodes of the system.
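The bottleneck reasoning in this section can be sketched numerically. The code below assumes the standard utilization law: each component's utilization is proportional to its service demand, so once the most heavily loaded component saturates, every other component is limited to the ratio of its demand to the largest demand. The local memory component (1 − pr) ∗ tm is an assumption inferred from the remote component pr ∗ tm, and the demand values in the example, including the one for the switch, are hypothetical and are not taken from the paper.

```python
def memory_demand(p_remote, t_mem):
    """Memory service demand: assumed local part plus the remote part."""
    local = (1.0 - p_remote) * t_mem   # assumption: local accesses
    remote = p_remote * t_mem          # remote part as given in the text
    return local + remote


def utilization_bounds(demands):
    """Return the bottleneck and each component's utilization bound.

    `demands` maps component names to service demands in the same
    time unit. The bottleneck saturates at utilization 1.0.
    """
    bottleneck = max(demands, key=demands.get)
    peak = demands[bottleneck]
    bounds = {name: d / peak for name, d in demands.items()}
    return bottleneck, bounds


if __name__ == "__main__":
    # Hypothetical demands, in processor cycles.
    demands = {
        "processor": 10.0,
        "memory": memory_demand(0.3, 10.0),
        "switch": 20.0,
    }
    name, bounds = utilization_bounds(demands)
    print("bottleneck:", name)
    for component, bound in bounds.items():
        print(component, "utilization bound:", round(bound, 3))
```

In this picture a system is balanced when all demands are equal, so that no single component caps the utilization of the others.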
4 CONCLUDING REMARKS
- The paper presents a timed Petri net model of a block–multithreaded multiprocessor system at the instruction execution level, and analyzes the effects of system bottlenecks on the performance of system components.
- Balancing the system by improving performance characteristics of its components may sometimes be difficult because the components with improved characteristics may not be available.
- Since the utilization of processors is probably the simplest indicator of the performance of the whole system, there may be a tendency to keep this utilization high.
- The results obtained for a 2–dimensional torus–like network are also valid for other interconnecting networks with the same connectivity characteristics.
- The model needs only a few small changes to represent other multiprocessor systems.
Frequently Asked Questions (1)
Q1. What contributions have the authors mentioned in the paper "Performance limitations of block–multithreaded distributed–memory systems"?
The paper studies performance limitations in distributed–memory block multithreaded systems and determines conditions for such systems to be balanced.