Author

Per Stenström

Other affiliations: Stanford University, Ericsson, Lund University
Bio: Per Stenström is an academic researcher from Chalmers University of Technology. The author has contributed to research in the topics of Cache and Cache coherence, has an h-index of 43, and has co-authored 245 publications receiving 8193 citations. Previous affiliations of Per Stenström include Stanford University and Ericsson.


Papers
Journal Article
TL;DR: Different approaches to the determination of upper bounds on execution times are described, and several commercially available tools and research prototypes are surveyed.
Abstract: The determination of upper bounds on execution times, commonly called worst-case execution times (WCETs), is a necessary step in the development and validation process for hard real-time systems. This problem is hard if the underlying processor architecture has features such as caches, pipelines, branch prediction, and other speculative components. This article describes different approaches to this problem and surveys several commercially available tools and research prototypes.
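A rough flavour of the static approach surveyed here: in timing-schema style analysis, per-basic-block worst-case cycle counts are combined bottom-up over sequences, conditionals, and bounded loops. The sketch below is not taken from the article or from any of the surveyed tools; the block costs and loop bounds are hypothetical.

```python
# Minimal sketch (hypothetical, not a surveyed tool): a timing-schema style
# WCET bound. Each program fragment is reduced to an upper bound on its
# execution time, assuming per-basic-block worst-case cycle counts and
# known loop bounds are given.

def wcet_seq(*bounds):
    """Sequential composition: the bound is the sum of the parts."""
    return sum(bounds)

def wcet_if(cond_cycles, then_bound, else_bound):
    """Conditional: condition cost plus the worse of the two branches."""
    return cond_cycles + max(then_bound, else_bound)

def wcet_loop(iteration_bound, body_bound, overhead_cycles=0):
    """Bounded loop: worst-case iteration count times the body bound."""
    return iteration_bound * (body_bound + overhead_cycles)

# Example: a loop of at most 100 iterations whose body is an if-then-else.
body = wcet_if(cond_cycles=2, then_bound=wcet_seq(5, 3), else_bound=12)
program_bound = wcet_seq(4, wcet_loop(100, body), 6)
print(program_bound)  # an upper bound in cycles, pessimistic by construction
```

A real tool also needs hardware-level analysis (cache and pipeline models) to obtain safe per-block costs; the composition step above only shows how the bounds are assembled.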

1,946 citations

Proceedings Article
01 Dec 1999
TL;DR: This work provides necessary conditions for timing anomalies to show up, identifies the architectural features that may cause such anomalies, and proposes simple code modification techniques that make it impossible for any anomalies to occur.
Abstract: Previous timing analysis methods have assumed that the worst-case instruction execution time necessarily corresponds to the worst-case behavior. We show that this assumption is wrong in dynamically scheduled processors. A cache miss, for example, can in some cases result in a shorter execution time than a cache hit. Many examples of such timing anomalies are provided. We first provide necessary conditions for timing anomalies to show up and identify the architectural features that may cause such anomalies. We also show that analyzing the effect of these anomalies with known techniques results in prohibitive computational complexity. Instead, we propose some simple code modification techniques that make it impossible for any anomalies to occur. These modifications make it possible to estimate the WCET with known techniques. Our evaluation shows that the pessimism imposed by these techniques is fairly limited; it is less than 27% for the programs in our benchmark suite.
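To make the counter-intuitive effect concrete, here is a small, hypothetical model (not the machine model from the paper): instructions dispatch in program order at one per cycle, issue out of order once their operands are ready and their functional unit is free, and a multiply that depends on a load competes with an independent multiply for a single multiplier. With the assumed latencies below, the cache-hit case finishes later than the cache-miss case.

```python
# Hypothetical sketch of a timing anomaly (not the paper's exact machine model).
# Model: in-order dispatch (one instruction per cycle), out-of-order issue as
# soon as operands are ready and the functional unit is free, ties broken in
# program order.

def run(load_latency):
    # (name, functional unit, latency in cycles, producing instruction)
    program = [
        ("A", "LSU", load_latency, None),  # load: 1 cycle on a hit, 3 on a miss
        ("B", "MUL", 4,            "A"),   # multiply that waits for the load
        ("C", "MUL", 2,            None),  # independent multiply, same unit as B
        ("D", "ALU", 6,            "C"),   # long operation that waits for C
    ]
    dispatch = {ins[0]: cycle for cycle, ins in enumerate(program)}
    unit_free = {"LSU": 0, "MUL": 0, "ALU": 0}
    finish = {}

    pending = list(program)
    while pending:
        # among instructions whose producer has finished, pick the one that
        # can start earliest (ties go to the older instruction)
        candidates = []
        for name, unit, lat, dep in pending:
            if dep is not None and dep not in finish:
                continue
            ready = dispatch[name] if dep is None else max(dispatch[name], finish[dep])
            start = max(ready, unit_free[unit])
            candidates.append((start, dispatch[name], name, unit, lat))
        start, _, name, unit, lat = min(candidates)
        finish[name] = start + lat
        unit_free[unit] = finish[name]
        pending = [p for p in pending if p[0] != name]
    return max(finish.values())

print("total with cache hit :", run(load_latency=1))  # 13 cycles in this model
print("total with cache miss:", run(load_latency=3))  # 10 cycles: the miss is faster
```

The anomaly comes from resource allocation: the fast load lets the dependent multiply grab the shared unit first, pushing the independent multiply and its long dependent chain later, whereas the miss lets that independent chain start early and overlap with the miss latency.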

380 citations

Journal Article
Per Stenström
TL;DR: Schemes for cache coherence that exhibit various degrees of hardware complexity, ranging from protocols that maintain coherence in hardware, to software policies that prevent the existence of copies of shared, writable data, are surveyed.
Abstract: Schemes for cache coherence that exhibit various degrees of hardware complexity, ranging from protocols that maintain coherence in hardware, to software policies that prevent the existence of copies of shared, writable data, are surveyed. Some examples of the use of shared data are examined. These examples help point out a number of performance issues. Hardware protocols are considered. It is seen that consistency can be maintained efficiently, although in some cases with considerable hardware complexity, especially for multiprocessors with many processors. Software schemes are investigated as an alternative capable of reducing the hardware cost.
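As a concrete illustration of the hardware end of that spectrum, the following is a minimal sketch (my own simplification, not a scheme from the survey) of a snooping write-invalidate protocol with the classic Invalid/Shared/Modified states: a write by one processor invalidates all other cached copies of the block.

```python
# Minimal sketch of a snooping write-invalidate protocol (MSI states),
# simplified for illustration; timing, write-back traffic, and memory
# itself are not modelled.

INVALID, SHARED, MODIFIED = "I", "S", "M"

class Cache:
    def __init__(self):
        self.state = {}                     # block address -> MSI state

    def read(self, addr, bus):
        if self.state.get(addr, INVALID) == INVALID:
            bus.broadcast_read(addr, self)  # miss: fetch a shared copy
            self.state[addr] = SHARED

    def write(self, addr, bus):
        if self.state.get(addr, INVALID) != MODIFIED:
            bus.broadcast_invalidate(addr, self)  # gain exclusive ownership
            self.state[addr] = MODIFIED

    # snoop handlers -------------------------------------------------------
    def snoop_read(self, addr):
        if self.state.get(addr) == MODIFIED:
            self.state[addr] = SHARED       # supply data, drop to shared

    def snoop_invalidate(self, addr):
        if addr in self.state:
            self.state[addr] = INVALID      # another writer now owns the block

class Bus:
    def __init__(self, caches):
        self.caches = caches

    def broadcast_read(self, addr, requester):
        for c in self.caches:
            if c is not requester:
                c.snoop_read(addr)

    def broadcast_invalidate(self, addr, requester):
        for c in self.caches:
            if c is not requester:
                c.snoop_invalidate(addr)

# Two processors sharing block 0x40: P1's write invalidates P0's copy.
p0, p1 = Cache(), Cache()
bus = Bus([p0, p1])
p0.read(0x40, bus)
p1.write(0x40, bus)
print(p0.state[0x40], p1.state[0x40])  # I M
```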

338 citations

Book Chapter
29 Aug 2000
TL;DR: Computer architecture is a truly fascinating field in which improvements in the basic technology and innovations in how to make best use of the underlying technology have yielded a performance growth exceeding a million times over the past 50 years.
Abstract: Computer architecture is a truly fascinating field in that improvements in the basic technology and innovations in how to make best use of the underlying technology have yielded a performance growth exceeding a million times over the past 50 years. What is even more amazing is the fact that the pressure on maintaining this rate of performance growth shows no decline. In fact, as performance thresholds are passed, application designers face new opportunities that give new challenging problems to work on for computer architects.

310 citations

Proceedings Article
01 May 1993
TL;DR: An adaptive protocol is proposed that effectively eliminates most single invalidations and improves the performance by reducing the shared access penalty and the network traffic.
Abstract: Parallel programs that use critical sections and are executed on a shared-memory multiprocessor with a write-invalidate protocol result in invalidation actions that could be eliminated. For this type of sharing, called migratory sharing, each processor typically causes a cache miss followed by an invalidation request, which could be merged with the preceding cache-miss request. In this paper we propose an adaptive protocol that invokes this optimization dynamically for migratory blocks. For other blocks, the protocol works as an ordinary write-invalidate protocol. We show that the protocol is a simple extension to a write-invalidate protocol. Based on a program-driven simulation model of an architecture similar to the Stanford DASH, and a set of four benchmarks, we evaluate the potential performance improvements of the protocol. We find that it effectively eliminates most single invalidations, which improves performance by reducing the shared access penalty and the network traffic.
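The key idea can be sketched as follows. This is a hypothetical simplification, not the paper's hardware design: a directory observes the migratory pattern (each new reader of a block soon writes it) and, once a block is classified as migratory, answers a read miss with exclusive ownership so that the separate invalidation (upgrade) request disappears.

```python
# Hypothetical sketch of adapting to migratory sharing, simplified from the
# paper's idea: once a block shows the read-then-write-by-one-processor
# pattern, serve read misses with exclusive ownership so the separate
# invalidation request is no longer needed.

class DirectoryEntry:
    def __init__(self):
        self.owner = None          # last processor holding the block
        self.migratory = False     # classification of the sharing pattern
        self.last_reader = None    # reader we expect to see write next

class Directory:
    def __init__(self):
        self.entries = {}

    def entry(self, addr):
        return self.entries.setdefault(addr, DirectoryEntry())

    def read_miss(self, addr, proc):
        e = self.entry(addr)
        e.last_reader = proc
        if e.migratory:
            e.owner = proc
            return "exclusive"      # merged reply: data plus ownership
        return "shared"             # ordinary write-invalidate behaviour

    def write_request(self, addr, proc):
        e = self.entry(addr)
        if proc == e.last_reader and e.owner not in (None, proc):
            # the new reader immediately writes: looks migratory, adapt
            e.migratory = True
        e.owner = proc
        return "invalidate-others"

d = Directory()
print(d.read_miss(0x100, "P0"))      # shared
d.write_request(0x100, "P0")
print(d.read_miss(0x100, "P1"))      # shared (pattern not yet detected)
print(d.write_request(0x100, "P1"))  # pattern detected -> block is migratory
print(d.read_miss(0x100, "P2"))      # exclusive: no separate invalidation later
```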

239 citations


Cited by
Proceedings Article
01 May 1995
TL;DR: This paper quantitatively characterize the SPLASH-2 programs in terms of fundamental properties and architectural interactions that are important to understand them well, including the computational load balance, communication to computation ratio and traffic needs, important working set sizes, and issues related to spatial locality.
Abstract: The SPLASH-2 suite of parallel applications has recently been released to facilitate the study of centralized and distributed shared-address-space multiprocessors. In this context, this paper has two goals. One is to quantitatively characterize the SPLASH-2 programs in terms of fundamental properties and architectural interactions that are important to understand them well. The properties we study include the computational load balance, communication to computation ratio and traffic needs, important working set sizes, and issues related to spatial locality, as well as how these properties scale with problem size and the number of processors. The other, related goal is methodological: to assist people who will use the programs in architectural evaluations to prune the space of application and machine parameters in an informed and meaningful way. For example, by characterizing the working sets of the applications, we describe which operating points in terms of cache size and problem size are representative of realistic situations, which are not, and which are redundant. Using SPLASH-2 as an example, we hope to convey the importance of understanding the interplay of problem size, number of processors, and working sets in designing experiments and interpreting their results.
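One ingredient of that characterization, locating working-set knees from the miss rate as a function of cache size, can be approximated directly from an address trace. The sketch below (a fully associative LRU cache over an invented toy trace, not the SPLASH-2 methodology or tooling) illustrates the idea.

```python
# Hypothetical sketch: estimate working-set knees by measuring the miss rate
# of a fully associative LRU cache over a trace of block addresses, for a
# range of cache sizes (in blocks).
from collections import OrderedDict

def miss_rate(trace, capacity_blocks):
    cache = OrderedDict()          # ordered by recency of use
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)           # hit: refresh LRU position
        else:
            misses += 1
            cache[block] = True
            if len(cache) > capacity_blocks:
                cache.popitem(last=False)      # evict least recently used
    return misses / len(trace)

# Toy trace: a hot inner working set of 8 blocks, a colder outer one of 64.
trace = []
for i in range(1000):
    trace.extend(range(8))                 # inner loop data, touched often
    trace.append(8 + (i % 64))             # larger structure, touched rarely

for size in (4, 8, 16, 64, 128):
    print(f"{size:4d} blocks -> miss rate {miss_rate(trace, size):.3f}")
```

The miss rate drops sharply once the cache covers the inner working set and again once it covers the outer one; those knees are the operating points the paper argues should guide the choice of cache and problem sizes.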

4,002 citations

Proceedings Article
10 Apr 2005
TL;DR: QEMU supports full system emulation in which a complete and unmodified operating system is run in a virtual machine and Linux user mode emulation where a Linux process compiled for one target CPU can be run on another CPU.
Abstract: We present the internals of QEMU, a fast machine emulator using an original portable dynamic translator. It emulates several CPUs (x86, PowerPC, ARM and Sparc) on several hosts (x86, PowerPC, ARM, Sparc, Alpha and MIPS). QEMU supports full system emulation in which a complete and unmodified operating system is run in a virtual machine and Linux user mode emulation where a Linux process compiled for one target CPU can be run on another CPU.
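The translate-once-and-cache mechanism at the heart of such a dynamic translator can be sketched compactly. The toy guest "ISA" below is invented for illustration and is not QEMU's TCG intermediate representation; only the structure (a translation cache keyed by guest PC, block-at-a-time translation) mirrors what the paper describes.

```python
# Hypothetical sketch of the translate-and-cache idea behind a dynamic binary
# translator: each guest basic block is translated once into a host-executable
# form (here, a Python closure) and reused on later executions.

def translate_block(guest_code, pc):
    """Translate one guest basic block starting at pc into a callable."""
    ops = []
    while True:
        op, arg = guest_code[pc]
        ops.append((op, arg))
        pc += 1
        if op in ("jmp", "halt"):          # block terminators
            break

    def run(state):
        for op, arg in ops:
            if op == "addi":
                state["acc"] += arg
            elif op == "jmp":
                return arg                  # next guest pc
            elif op == "halt":
                return None
        return pc                           # fall through to the next block
    return run

def execute(guest_code, state):
    tb_cache = {}                           # guest pc -> translated block
    pc = 0
    while pc is not None:
        if pc not in tb_cache:              # translation miss: translate once
            tb_cache[pc] = translate_block(guest_code, pc)
        pc = tb_cache[pc](state)            # run the cached translation
    return state

# Toy guest program: add 1, jump over an unused slot, add 10, halt.
guest = [("addi", 1), ("jmp", 3), None, ("addi", 10), ("halt", 0)]
print(execute(guest, {"acc": 0}))           # {'acc': 11}
```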

2,420 citations

Journal Article
TL;DR: Different approaches to the determination of upper bounds on execution times are described, and several commercially available tools and research prototypes are surveyed.
Abstract: The determination of upper bounds on execution times, commonly called worst-case execution times (WCETs), is a necessary step in the development and validation process for hard real-time systems. This problem is hard if the underlying processor architecture has features such as caches, pipelines, branch prediction, and other speculative components. This article describes different approaches to this problem and surveys several commercially available tools and research prototypes.

1,946 citations

Journal Article
TL;DR: The aim of this paper is to present the reader with a perspective on how JFNK may be applicable to applications of interest and to provide sources of further practical information.

1,803 citations

Book
15 Aug 1998
TL;DR: This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures and provides comprehensive discussions of parallel programming for high performance and of workload-driven evaluation, based on understanding hardware-software interactions.
Abstract: The most exciting development in parallel computer architecture is the convergence of traditionally disparate approaches on a common machine structure. This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures. It then examines the design issues that are critical to all parallel architecture across the full range of modern design, covering data access, communication performance, coordination of cooperative work, and correct implementation of useful semantics. It not only describes the hardware and software techniques for addressing each of these issues but also explores how these techniques interact in the same system. Examining architecture from an application-driven perspective, it provides comprehensive discussions of parallel programming for high performance and of workload-driven evaluation, based on understanding hardware-software interactions.

* Synthesizes a decade of research and development for practicing engineers, graduate students, and researchers in parallel computer architecture, system software, and applications development
* Presents in-depth application case studies from computer graphics, computational science and engineering, and data mining to demonstrate sound quantitative evaluation of design trade-offs
* Describes the process of programming for performance, including both the architecture-independent and architecture-dependent aspects, with examples and case studies
* Illustrates bus-based and network-based parallel systems with case studies of more than a dozen important commercial designs

Table of Contents: 1. Introduction; 2. Parallel Programs; 3. Programming for Performance; 4. Workload-Driven Evaluation; 5. Shared Memory Multiprocessors; 6. Snoop-based Multiprocessor Design; 7. Scalable Multiprocessors; 8. Directory-based Cache Coherence; 9. Hardware-Software Tradeoffs; 10. Interconnection Network Design; 11. Latency Tolerance; 12. Future Directions; Appendix A: Parallel Benchmark Suites.

1,571 citations