
Showing papers by "Per Stenström published in 1993"


Proceedings ArticleDOI
01 May 1993
TL;DR: An adaptive protocol is proposed that effectively eliminates most single invalidations and improves the performance by reducing the shared access penalty and the network traffic.
Abstract: Parallel programs that use critical sections and are executed on a shared-memory multiprocessor with a write-invalidate protocol result in invalidation actions that could be eliminated. For this type of sharing, called migratory sharing, each processor typically causes a cache miss followed by an invalidation request which could be merged with the preceding cache-miss request. In this paper we propose an adaptive protocol that invokes this optimization dynamically for migratory blocks. For other blocks, the protocol works as an ordinary write-invalidate protocol. We show that the protocol is a simple extension to a write-invalidate protocol. Based on a program-driven simulation model of an architecture similar to the Stanford DASH, and a set of four benchmarks, we evaluate the potential performance improvements of the protocol. We find that it effectively eliminates most single invalidations, which improves the performance by reducing the shared access penalty and the network traffic.

239 citations
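The merging the abstract describes can be sketched as a toy directory model (a hedged illustration, not the paper's exact mechanism; the two-sharer detection heuristic, the class names, and the infinite-cache assumption are all mine):

```python
# Toy directory for a write-invalidate protocol with adaptive migratory
# detection, in the spirit of the abstract above. Simplifying assumptions:
# infinite caches, no transient states, one directory for all blocks.
class Block:
    def __init__(self):
        self.sharers = set()       # processors holding a copy
        self.last_writer = None
        self.migratory = False

class Directory:
    def __init__(self):
        self.blocks = {}
        self.misses = 0
        self.invalidations = 0     # explicit invalidation requests sent

    def _b(self, addr):
        return self.blocks.setdefault(addr, Block())

    def read(self, cpu, addr):
        b = self._b(addr)
        if cpu in b.sharers:
            return                 # cache hit
        self.misses += 1
        if b.migratory:
            b.sharers = {cpu}      # read-exclusive: invalidation merged with miss
        else:
            b.sharers.add(cpu)

    def write(self, cpu, addr):
        b = self._b(addr)
        if b.sharers != {cpu}:
            # Heuristic: exactly two copies, and the other copy belongs to
            # the previous writer => the block looks migratory.
            if (len(b.sharers) == 2 and cpu in b.sharers
                    and b.last_writer in b.sharers and b.last_writer != cpu):
                b.migratory = True
            self.invalidations += len(b.sharers - {cpu})
            if cpu not in b.sharers:
                self.misses += 1
            b.sharers = {cpu}
        b.last_writer = cpu
```

Once cpu1's write has classified the block as migratory, each later migration costs one miss and no separate invalidation request, which is exactly the saving the protocol targets.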


Proceedings ArticleDOI
16 Aug 1993
TL;DR: This work proposes to adapt the number of prefetched blocks according to a dynamic measure of prefetching effectiveness, and shows significant reductions of the read penalty and of the overall execution time.
Abstract: To offset the effect of read miss penalties on processor utilization in shared-memory multiprocessors, several software- and hardware-based data prefetching schemes have been proposed. A major advantage of hardware techniques is that they need no support from the programmer or compiler. Sequential prefetching is a simple hardware-controlled prefetching technique which relies on the automatic prefetch of consecutive blocks following the block that misses in the cache. In its simplest form, the number of prefetched blocks on each miss is fixed throughout the execution. However, since the prefetching efficiency varies during the execution of a program, we propose to adapt the number of prefetched blocks according to a dynamic measure of prefetching effectiveness. Simulations of this adaptive scheme show significant reductions of the read penalty and of the overall execution time.

149 citations
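One way to realise the adaptive degree the abstract mentions is a counter-based scheme like the following sketch (the window size, thresholds, and set-based cache model are illustrative assumptions, not the paper's exact design):

```python
# Minimal adaptive sequential prefetcher: on each demand miss to block b,
# prefetch b+1..b+degree; every `window` prefetches, raise or lower the
# degree from the fraction that were actually referenced.
class AdaptivePrefetcher:
    def __init__(self, degree=1, window=16, hi=0.75, lo=0.25, max_degree=8):
        self.degree = degree          # blocks prefetched per miss
        self.window = window          # prefetches per adaptation interval
        self.hi, self.lo = hi, lo     # useful-ratio thresholds
        self.max_degree = max_degree
        self.pending = set()          # prefetched but not yet referenced
        self.issued = self.useful = 0

    def access(self, block, cache):
        if block in self.pending:     # a prefetch paid off
            self.pending.discard(block)
            self.useful += 1
        if block not in cache:        # demand miss: fetch + sequential prefetch
            cache.add(block)
            for nxt in range(block + 1, block + 1 + self.degree):
                if nxt not in cache:
                    cache.add(nxt)
                    self.pending.add(nxt)
                    self.issued += 1
        if self.issued >= self.window:
            ratio = self.useful / self.issued
            if ratio > self.hi:
                self.degree = min(self.degree + 1, self.max_degree)
            elif ratio < self.lo:
                self.degree = max(self.degree - 1, 0)
            self.issued = self.useful = 0
            self.pending.clear()
```

A known weakness of this simplification is that once the degree reaches zero no prefetches are issued, so the ratio can never recover; a real design would keep a minimum degree or a separate re-enable trigger.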


Proceedings ArticleDOI
01 May 1993
TL;DR: A new classification of misses in shared-memory multiprocessors based on interprocessor communication is introduced, which identifies the set of essential misses, i.e., the smallest set of misses necessary for correct execution.
Abstract: In this paper we introduce a new classification of misses in shared-memory multiprocessors based on interprocessor communication. We identify the set of essential misses, i.e., the smallest set of misses necessary for correct execution. Essential misses include cold misses and true sharing misses. All other misses are useless misses and can be ignored without affecting the correctness of program execution. Based on the new classification we compare the effectiveness of five different protocols which delay and combine invalidations leading to useless misses. In cache-based systems the protocols are very effective and have miss rates close to the essential miss rate. In virtual shared memory systems the techniques are also effective but leave room for improvement.

121 citations
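The taxonomy can be made concrete with a small trace-driven classifier (a hedged sketch: infinite caches, word-granularity bookkeeping, and classification by the first word referenced on each miss, whereas the paper's definition considers the block's whole lifetime):

```python
def classify_misses(trace, block_size=4):
    """Word-granularity miss classification in the spirit of the taxonomy
    above: cold and true-sharing misses are essential, false-sharing misses
    are useless. Trace entries are (cpu, op, word_addr) with op 'r'/'w'."""
    cold = true_sh = false_sh = 0
    valid = set()    # (cpu, block) pairs currently holding a valid copy
    seen = set()     # (cpu, block) pairs that ever held a copy
    stale = {}       # (cpu, block) -> words others wrote since the copy was lost
    for cpu, op, addr in trace:
        blk = addr // block_size
        key = (cpu, blk)
        if key not in valid:                    # this access misses
            if key not in seen:
                cold += 1
            elif addr in stale.get(key, set()):
                true_sh += 1                    # fetched word was communicated
            else:
                false_sh += 1                   # invalidation was useless here
            valid.add(key)
            seen.add(key)
            stale[key] = set()
        if op == "w":                           # invalidate all other copies
            for c in {k[0] for k in seen if k[1] == blk} - {cpu}:
                valid.discard((c, blk))
                stale.setdefault((c, blk), set()).add(addr)
    return {"cold": cold, "true_sharing": true_sh, "false_sharing": false_sh}
```

On a trace where cpu1 repeatedly writes word 1 of a block while cpu0 re-reads word 0, cpu0's misses come out as false sharing; only when cpu0 finally reads word 1 does a true-sharing (essential) miss appear.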


Proceedings ArticleDOI
01 Jan 1993
TL;DR: The CacheMire Test Bench – A Flexible and Effective Approach for Simulation of Multiprocessors is presented, which aims to provide a flexible and efficient test bench for simulation of multiprocessors.
Abstract: The CacheMire Test Bench – A Flexible and Effective Approach for Simulation of Multiprocessors

83 citations


Proceedings ArticleDOI
05 Jan 1993
TL;DR: It was found that tree-based and linear-list protocols performed almost as well as full-map protocols but with a considerably lower implementation cost; however, if the sharing set is large, linear-list schemes may suffer because of the large write latency while tree-based protocols still perform well.
Abstract: The authors evaluate the implementation and performance tradeoffs between three directory-based cache coherence protocols. They study two link-based approaches, called tree-based and linear-list protocols, and contrast their performance and implementation cost with that of a full-map protocol. Using program-driven simulation and a set of three benchmark programs, they found that tree-based and linear-list protocols perform almost as well as full-map protocols but with a considerably lower implementation cost. However, if the sharing set is large, linear-list schemes may suffer because of the large write latency, while tree-based protocols still perform well.

10 citations
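A back-of-envelope latency model makes the tradeoff visible (the hop counts are illustrative assumptions of mine, not measurements from the paper):

```python
import math

def write_latency_hops(sharers):
    """Rough invalidation latency on a write, in network hops, for the
    three directory organisations compared above: a full map invalidates
    all copies in parallel, a linear list walks the copies one at a time,
    and a tree invalidates subtrees in logarithmic depth."""
    return {
        "full-map": 2,                                   # parallel multicast + ack
        "linear-list": 2 * sharers,                      # serial list walk
        "tree": 2 * math.ceil(math.log2(sharers + 1)),   # subtree fan-out
    }
```

For one sharer all three cost about the same, while at sixteen sharers the serial list walk is already several times slower than the logarithmic tree, matching the abstract's observation about large sharing sets.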


Book
Per Stenström
01 Feb 1993

2 citations


01 Jan 1993
TL;DR: A graphical tool that uses animation and other graphical techniques to visualize how a pipelined datapath and control unit work and outline a laboratory that makes use of it is described.
Abstract: The breakthrough of pipelined microprocessors has brought about a need to teach instruction pipelining in electrical and computer engineering curricula at the undergraduate level to a considerable depth. Although the idea of pipelining is conceptually simple, students often find pipelining difficult to visualize. Only the most talented students fully grasp how pipeline hazards are detected and resolved. Based on the pedagogical approach used in the landmark book “Computer Architecture—A Quantitative Approach” by John Hennessy and David Patterson, we have developed a graphical tool that uses animation and other graphical techniques to visualize how a pipelined datapath and control unit work. In this paper, we describe the graphical tool and outline a laboratory that makes use of it.

2 citations
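The stalls such a tool has to animate can be computed in a few lines (a sketch of the textbook pipeline model, not of the tool itself; the dependence encoding `deps` is a simplified assumption of mine):

```python
def id_cycles(deps):
    """Cycle in which each instruction's ID stage runs, for a 5-stage
    pipeline (IF ID EX MEM WB) with no forwarding and a register file
    written in the first half of a cycle and read in the second.
    deps[i] is the index of an earlier instruction whose result
    instruction i reads, or None."""
    decode = []
    for i, dep in enumerate(deps):
        d = decode[i - 1] + 1 if i else 1    # in-order, one issue per cycle
        if dep is not None:
            d = max(d, decode[dep] + 3)      # ID may overlap producer's WB
        decode.append(d)
    return decode
```

For `[None, 0, None]` the dependent instruction decodes in cycle 4 instead of 2: the two bubble cycles a no-forwarding pipeline inserts for a RAW hazard, which is exactly what the animation shows.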


01 Jan 1993
TL;DR: This paper uses an unblocked matrix multiplication and two different ways of parallelising a blocked matrix multiplication algorithm to illustrate the problems involved and how the sharing behaviour can aid in choosing the right algorithm.
Abstract: Shared memory multiprocessors are becoming more and more important, but one major problem is how to keep the processor caches coherent. Many different solutions to this problem exist, and the performance of a given program depends largely on the access pattern observed to shared data, the sharing behaviour. We discuss in this paper how to characterise and visualise sharing behaviour. In terms of cache coherence, the degree of sharing, the access mode and the temporal granularity are found to be essential in order to describe and understand sharing behaviour. The sharing behaviour can be measured by simulation and visualised in a sharing profile diagram. We use an unblocked matrix multiplication and two different ways of parallelising a blocked matrix multiplication algorithm to illustrate the problems involved and how the sharing profile can aid in choosing the right algorithm.

2 citations
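A minimal version of the measurement behind such a diagram could look like this (a hedged sketch; the window size and the read-only/read-write mode split are my simplifications of the characteristics the abstract names):

```python
from collections import defaultdict

def sharing_profile(trace, window=100):
    """Tabulate, per block and time window, the degree of sharing
    (distinct processors touching the block) and the access mode.
    Trace entries are (cpu, op, block) with op 'r' or 'w'; the window
    index captures the temporal granularity dimension."""
    slots = defaultdict(lambda: {"r": set(), "w": set()})
    for t, (cpu, op, blk) in enumerate(trace):
        slots[(blk, t // window)][op].add(cpu)
    rows = []
    for (blk, win), s in sorted(slots.items()):
        degree = len(s["r"] | s["w"])                 # distinct processors
        mode = "read-write" if s["w"] else "read-only"
        rows.append((blk, win, degree, mode))
    return rows
```

Plotting degree against window index per block gives a diagram in the spirit of the sharing profile: widely read-shared blocks, migratory read-write blocks, and private blocks each produce a recognisable shape.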