
Showing papers by "Per Stenström published in 1993"


Proceedings ArticleDOI
01 May 1993
TL;DR: An adaptive protocol is proposed that effectively eliminates most single invalidations and improves the performance by reducing the shared access penalty and the network traffic.
Abstract: Parallel programs that use critical sections and are executed on a shared-memory multiprocessor with a write-invalidate protocol result in invalidation actions that could be eliminated. For this type of sharing, called migratory sharing, each processor typically causes a cache miss followed by an invalidation request which could be merged with the preceding cache-miss request. In this paper we propose an adaptive protocol that invokes this optimization dynamically for migratory blocks. For other blocks, the protocol works as an ordinary write-invalidate protocol. We show that the protocol is a simple extension to a write-invalidate protocol. Based on a program-driven simulation model of an architecture similar to the Stanford DASH, and a set of four benchmarks, we evaluate the potential performance improvements of the protocol. We find that it effectively eliminates most single invalidations, which improves the performance by reducing the shared access penalty and the network traffic.

239 citations
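The merging the abstract describes can be sketched as a toy directory model (a hedged illustration, not the paper's exact mechanism; the two-sharer detection heuristic, the class names, and the infinite-cache assumption are all mine):

```python
# Toy directory for a write-invalidate protocol with adaptive migratory
# detection, in the spirit of the abstract above. Simplifying assumptions:
# infinite caches, no transient states, one directory for all blocks.
class Block:
    def __init__(self):
        self.sharers = set()       # processors holding a copy
        self.last_writer = None
        self.migratory = False

class Directory:
    def __init__(self):
        self.blocks = {}
        self.misses = 0
        self.invalidations = 0     # explicit invalidation requests sent

    def _b(self, addr):
        return self.blocks.setdefault(addr, Block())

    def read(self, cpu, addr):
        b = self._b(addr)
        if cpu in b.sharers:
            return                 # cache hit
        self.misses += 1
        if b.migratory:
            b.sharers = {cpu}      # read-exclusive: invalidation merged with miss
        else:
            b.sharers.add(cpu)

    def write(self, cpu, addr):
        b = self._b(addr)
        if b.sharers != {cpu}:
            # Heuristic: exactly two copies, and the other copy belongs to
            # the previous writer => the block looks migratory.
            if (len(b.sharers) == 2 and cpu in b.sharers
                    and b.last_writer in b.sharers and b.last_writer != cpu):
                b.migratory = True
            self.invalidations += len(b.sharers - {cpu})
            if cpu not in b.sharers:
                self.misses += 1
            b.sharers = {cpu}
        b.last_writer = cpu
```

Once cpu1's write has classified the block as migratory, each later migration costs one miss and no separate invalidation request, which is exactly the saving the protocol targets.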


Proceedings ArticleDOI
16 Aug 1993
TL;DR: This work proposes to adapt the number of prefetched blocks according to a dynamic measure of prefetching effectiveness, and shows significant reductions of the read penalty and of the overall execution time.
Abstract: To offset the effect of read miss penalties on processor utilization in shared-memory multiprocessors, several software- and hardware-based data prefetching schemes have been proposed. A major advantage of hardware techniques is that they need no support from the programmer or compiler. Sequential prefetching is a simple hardware-controlled prefetching technique which relies on the automatic prefetch of consecutive blocks following the block that misses in the cache. In its simplest form, the number of prefetched blocks on each miss is fixed throughout the execution. However, since the prefetching efficiency varies during the execution of a program, we propose to adapt the number of prefetched blocks according to a dynamic measure of prefetching effectiveness. Simulations of this adaptive scheme show significant reductions of the read penalty and of the overall execution time.

149 citations
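One way to realise the adaptive degree the abstract mentions is a counter-based scheme like the following sketch (the window size, thresholds, and set-based cache model are illustrative assumptions, not the paper's exact design):

```python
# Minimal adaptive sequential prefetcher: on each demand miss to block b,
# prefetch b+1..b+degree; every `window` prefetches, raise or lower the
# degree from the fraction that were actually referenced.
class AdaptivePrefetcher:
    def __init__(self, degree=1, window=16, hi=0.75, lo=0.25, max_degree=8):
        self.degree = degree          # blocks prefetched per miss
        self.window = window          # prefetches per adaptation interval
        self.hi, self.lo = hi, lo     # useful-ratio thresholds
        self.max_degree = max_degree
        self.pending = set()          # prefetched but not yet referenced
        self.issued = self.useful = 0

    def access(self, block, cache):
        if block in self.pending:     # a prefetch paid off
            self.pending.discard(block)
            self.useful += 1
        if block not in cache:        # demand miss: fetch + sequential prefetch
            cache.add(block)
            for nxt in range(block + 1, block + 1 + self.degree):
                if nxt not in cache:
                    cache.add(nxt)
                    self.pending.add(nxt)
                    self.issued += 1
        if self.issued >= self.window:
            ratio = self.useful / self.issued
            if ratio > self.hi:
                self.degree = min(self.degree + 1, self.max_degree)
            elif ratio < self.lo:
                self.degree = max(self.degree - 1, 0)
            self.issued = self.useful = 0
            self.pending.clear()
```

A known weakness of this simplification is that once the degree reaches zero no prefetches are issued, so the ratio can never recover; a real design would keep a minimum degree or a separate re-enable trigger.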


Proceedings ArticleDOI
01 May 1993
TL;DR: A new classification of misses in shared-memory multiprocessors based on interprocessor communication is introduced, which identifies the set of essential misses, i.e., the smallest set of misses necessary for correct execution.
Abstract: In this paper we introduce a new classification of misses in shared-memory multiprocessors based on interprocessor communication. We identify the set of essential misses, i.e., the smallest set of misses necessary for correct execution. Essential misses include cold misses and true sharing misses. All other misses are useless misses and can be ignored without affecting the correctness of program execution. Based on the new classification we compare the effectiveness of five different protocols which delay and combine invalidations leading to useless misses. In cache-based systems the protocols are very effective and have miss rates close to the essential miss rate. In virtual shared memory systems the techniques are also effective but leave room for improvement.

121 citations
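The taxonomy can be made concrete with a small trace-driven classifier (a hedged sketch: infinite caches, word-granularity bookkeeping, and classification by the first word referenced on each miss, whereas the paper's definition considers the block's whole lifetime):

```python
def classify_misses(trace, block_size=4):
    """Word-granularity miss classification in the spirit of the taxonomy
    above: cold and true-sharing misses are essential, false-sharing misses
    are useless. Trace entries are (cpu, op, word_addr) with op 'r'/'w'."""
    cold = true_sh = false_sh = 0
    valid = set()    # (cpu, block) pairs currently holding a valid copy
    seen = set()     # (cpu, block) pairs that ever held a copy
    stale = {}       # (cpu, block) -> words others wrote since the copy was lost
    for cpu, op, addr in trace:
        blk = addr // block_size
        key = (cpu, blk)
        if key not in valid:                    # this access misses
            if key not in seen:
                cold += 1
            elif addr in stale.get(key, set()):
                true_sh += 1                    # fetched word was communicated
            else:
                false_sh += 1                   # invalidation was useless here
            valid.add(key)
            seen.add(key)
            stale[key] = set()
        if op == "w":                           # invalidate all other copies
            for c in {k[0] for k in seen if k[1] == blk} - {cpu}:
                valid.discard((c, blk))
                stale.setdefault((c, blk), set()).add(addr)
    return {"cold": cold, "true_sharing": true_sh, "false_sharing": false_sh}
```

On a trace where cpu1 repeatedly writes word 1 of a block while cpu0 re-reads word 0, cpu0's misses come out as false sharing; only when cpu0 finally reads word 1 does a true-sharing (essential) miss appear.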


Proceedings ArticleDOI
01 Jan 1993
TL;DR: The CacheMire Test Bench – A Flexible and Effective Approach for Simulation of Multiprocessors is presented, which aims to provide a flexible and efficient test bench for simulation of multiprocessors.
Abstract: The CacheMire Test Bench – A Flexible and Effective Approach for Simulation of Multiprocessors

83 citations


Proceedings ArticleDOI
05 Jan 1993
TL;DR: It was found that tree-based and linear-list protocols performed almost as well as full-map protocols but with a considerably lower implementation cost; however, if the sharing set is large, linear-list schemes may suffer because of the large write latency while tree-based protocols still perform well.
Abstract: The authors evaluate the implementation and performance tradeoffs between three directory-based cache coherence protocols. They study two link-based approaches, called tree-based and linear-list protocols, and contrast their performance and implementation cost with that of a full-map protocol. Using program-driven simulation and a set of three benchmark programs, they found that tree-based and linear-list protocols perform almost as well as full-map protocols but with a considerably lower implementation cost. However, if the sharing set is large, linear-list schemes may suffer because of the large write latency, while tree-based protocols still perform well.

10 citations
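A back-of-envelope latency model makes the tradeoff visible (the hop counts are illustrative assumptions of mine, not measurements from the paper):

```python
import math

def write_latency_hops(sharers):
    """Rough invalidation latency on a write, in network hops, for the
    three directory organisations compared above: a full map invalidates
    all copies in parallel, a linear list walks the copies one at a time,
    and a tree invalidates subtrees in logarithmic depth."""
    return {
        "full-map": 2,                                   # parallel multicast + ack
        "linear-list": 2 * sharers,                      # serial list walk
        "tree": 2 * math.ceil(math.log2(sharers + 1)),   # subtree fan-out
    }
```

For one sharer all three cost about the same, while at sixteen sharers the serial list walk is already several times slower than the logarithmic tree, matching the abstract's observation about large sharing sets.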


Book
Per Stenström
01 Feb 1993

2 citations


01 Jan 1993
TL;DR: A graphical tool that uses animation and other graphical techniques to visualize how a pipelined datapath and control unit work and outline a laboratory that makes use of it is described.
Abstract: The breakthrough of pipelined microprocessors has brought about a need to teach instruction pipelining in electrical and computer engineering curricula at the undergraduate level to a considerable depth. Although the idea of pipelining is conceptually simple, students often find pipelining difficult to visualize. Only the most talented students fully grasp how pipeline hazards are detected and resolved. Based on the pedagogical approach used in the landmark book “Computer Architecture—A Quantitative Approach” by John Hennessy and David Patterson, we have developed a graphical tool that uses animation and other graphical techniques to visualize how a pipelined datapath and control unit work. In this paper, we describe the graphical tool and outline a laboratory that makes use of it.

2 citations
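The stalls such a tool has to animate can be computed in a few lines (a sketch of the textbook pipeline model, not of the tool itself; the dependence encoding `deps` is a simplified assumption of mine):

```python
def id_cycles(deps):
    """Cycle in which each instruction's ID stage runs, for a 5-stage
    pipeline (IF ID EX MEM WB) with no forwarding and a register file
    written in the first half of a cycle and read in the second.
    deps[i] is the index of an earlier instruction whose result
    instruction i reads, or None."""
    decode = []
    for i, dep in enumerate(deps):
        d = decode[i - 1] + 1 if i else 1    # in-order, one issue per cycle
        if dep is not None:
            d = max(d, decode[dep] + 3)      # ID may overlap producer's WB
        decode.append(d)
    return decode
```

For `[None, 0, None]` the dependent instruction decodes in cycle 4 instead of 2: the two bubble cycles a no-forwarding pipeline inserts for a RAW hazard, which is exactly what the animation shows.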


01 Jan 1993
TL;DR: This paper uses an unblocked matrix multiplication and two different ways of parallelising a blocked matrix multiplication algorithm to illustrate the problems involved and how the sharing behaviour can aid in choosing the right algorithm.
Abstract: Shared memory multiprocessors are becoming more and more important, but one major problem is how to keep the processor caches coherent. Many different solutions to this problem exist, and the performance of a given program depends largely on the access pattern observed to shared data, the sharing behaviour. We discuss in this paper how to characterise and visualise sharing behaviour. In terms of cache coherence, the degree of sharing, the access mode and the temporal granularity are found to be essential in order to describe and understand sharing behaviour. The sharing behaviour can be measured by simulation and visualised in a sharing profile diagram. We use an unblocked matrix multiplication and two different ways of parallelising a blocked matrix multiplication algorithm to illustrate the problems involved and how the sharing profile can aid in choosing the right algorithm.

2 citations
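A minimal version of the measurement behind such a diagram could look like this (a hedged sketch; the window size and the read-only/read-write mode split are my simplifications of the characteristics the abstract names):

```python
from collections import defaultdict

def sharing_profile(trace, window=100):
    """Tabulate, per block and time window, the degree of sharing
    (distinct processors touching the block) and the access mode.
    Trace entries are (cpu, op, block) with op 'r' or 'w'; the window
    index captures the temporal granularity dimension."""
    slots = defaultdict(lambda: {"r": set(), "w": set()})
    for t, (cpu, op, blk) in enumerate(trace):
        slots[(blk, t // window)][op].add(cpu)
    rows = []
    for (blk, win), s in sorted(slots.items()):
        degree = len(s["r"] | s["w"])                 # distinct processors
        mode = "read-write" if s["w"] else "read-only"
        rows.append((blk, win, degree, mode))
    return rows
```

Plotting degree against window index per block gives a diagram in the spirit of the sharing profile: widely read-shared blocks, migratory read-write blocks, and private blocks each produce a recognisable shape.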