The SPLASH-2 suite of parallel applications has recently been released to facilitate the study of centralized and distributed shared-address-space multiprocessors. In this context, this paper has two goals. One is to quantitatively characterize the SPLASH-2 programs in terms of fundamental properties and architectural interactions that are important to understand them well. The properties we study include the computational load balance, communication-to-computation ratio and traffic needs, important working set sizes, and issues related to spatial locality, as well as how these properties scale with problem size and the number of processors. The other, related goal is methodological: to assist people who will use the programs in architectural evaluations to prune the space of application and machine parameters in an informed and meaningful way. For example, by characterizing the working sets of the applications, we describe which operating points in terms of cache size and problem size are representative of realistic situations, which are not, and which are redundant. Using SPLASH-2 as an example, we hope to convey the importance of understanding the interplay of problem size, number of processors, and working sets in designing experiments and interpreting their results.

/pdf/the-splash-2-programs-characterization-and-methodological-14b6i8va45.pdf

The SPLASH-2 programs: characterization and methodological considerations

We present the internals of QEMU, a fast machine emulator using an original portable dynamic translator. It emulates several CPUs (x86, PowerPC, ARM and Sparc) on several hosts (x86, PowerPC, ARM, Sparc, Alpha and MIPS). QEMU supports full system emulation in which a complete and unmodified operating system is run in a virtual machine and Linux user mode emulation where a Linux process compiled for one target CPU can be run on another CPU.

/pdf/qemu-a-fast-and-portable-dynamic-translator-xpb07ophlg.pdf

QEMU, a fast and portable dynamic translator

The determination of upper bounds on execution times, commonly called worst-case execution times (WCETs), is a necessary step in the development and validation process for hard real-time systems. This problem is hard if the underlying processor architecture has components such as caches, pipelines, branch prediction, and other speculative components. This article describes different approaches to this problem and surveys several commercially available tools and research prototypes.

/pdf/the-worst-case-execution-time-problem-overview-of-methods-2kdq3gxkoy.pdf

The worst-case execution-time problem—overview of methods and survey of tools

http://www.inf.ufes.br/~luciac/comcie/JFNK-Knoll.pdf

Jacobian-free Newton-Krylov methods: a survey of approaches and applications

The most exciting development in parallel computer architecture is the convergence of traditionally disparate approaches on a common machine structure. This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures. It then examines the design issues that are critical to all parallel architecture across the full range of modern design, covering data access, communication performance, coordination of cooperative work, and correct implementation of useful semantics. It not only describes the hardware and software techniques for addressing each of these issues but also explores how these techniques interact in the same system. Examining architecture from an application-driven perspective, it provides comprehensive discussions of parallel programming for high performance and of workload-driven evaluation, based on understanding hardware-software interactions.


* synthesizes a decade of research and development for practicing engineers, graduate students, and researchers in parallel computer architecture, system software, and applications development

* presents in-depth application case studies from computer graphics, computational science and engineering, and data mining to demonstrate sound quantitative evaluation of design trade-offs 

* describes the process of programming for performance, including both the architecture-independent and architecture-dependent aspects, with examples and case-studies

* illustrates bus-based and network-based parallel systems with case studies of more than a dozen important commercial designs

Table of Contents

1 Introduction 
2 Parallel Programs 
3 Programming for Performance 
4 Workload-Driven Evaluation 
5 Shared Memory Multiprocessors 
6 Snoop-based Multiprocessor Design 
7 Scalable Multiprocessors
8 Directory-based Cache Coherence
9 Hardware-Software Tradeoffs 
10 Interconnection Network Design 
11 Latency Tolerance 
12 Future Directions 
APPENDIX A Parallel Benchmark Suites

Parallel Computer Architecture: A Hardware/Software Approach

Applications that are run on multicore systems without performance targets can waste significant energy. This paper considers, for the first time, a QoS-driven coordinated resource management algorithm (RMA) that dynamically adjusts the size of the per-core last-level cache (LLC) partitions and the per-core voltage-frequency settings to save energy while respecting the QoS requirements of individual applications in multi-programmed workloads run on multi-core systems. It does so by exploring the configuration space across the spectrum of LLC partition sizes and DVFS settings at runtime at negligible overhead. We show that the proposed coordinated RMA saves, on average, 20% energy, compared with 15% for DVFS alone and 7% for cache partitioning alone, when the performance target is set to 70% of the baseline system performance.
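The configuration-space exploration described above can be illustrated with a minimal sketch for a single application. This is not the paper's algorithm: it is a plain exhaustive search, and the function `pick_setting`, the models `toy_perf` and `toy_energy`, and all numeric values are hypothetical placeholders chosen only to make the example runnable.

```python
from itertools import product


def pick_setting(cache_ways, freqs_ghz, perf_model, energy_model,
                 baseline_perf, qos_fraction=0.7):
    """Return the (ways, freq) pair with the lowest modeled energy that
    still meets the QoS target, or None if no setting meets it."""
    target = qos_fraction * baseline_perf
    best = None
    for ways, freq in product(cache_ways, freqs_ghz):
        if perf_model(ways, freq) < target:
            continue  # violates the QoS requirement, skip
        energy = energy_model(ways, freq)
        if best is None or energy < best[0]:
            best = (energy, ways, freq)
    return None if best is None else (best[1], best[2])


# Hypothetical models (assumptions, not measurements): performance grows
# with frequency and cache share; energy grows with frequency squared
# plus a small cost per allocated cache way.
def toy_perf(ways, freq):
    return freq * (1.0 - 0.4 / ways)


def toy_energy(ways, freq):
    return freq ** 2 + 0.1 * ways


WAYS = [1, 2, 4, 8]
FREQS = [1.0, 1.5, 2.0]
BASELINE = toy_perf(8, 2.0)  # full cache at the highest frequency

chosen = pick_setting(WAYS, FREQS, toy_perf, toy_energy, BASELINE)
print(chosen)
```

Under these toy models the search trades a lower frequency against a larger cache partition, which is the kind of interaction a coordinated scheme can exploit and that tuning either knob alone cannot.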

/pdf/qos-driven-coordinated-management-of-resources-to-save-4x39ash7fj.pdf

QoS-Driven Coordinated Management of Resources to Save Energy in Multi-core Systems

The authors have evaluated the implementation and performance tradeoffs between three directory-based cache coherence protocols. They study two link-based approaches, called tree-based and linear-list protocols, and contrast their performance and implementation cost with that of a full-map protocol. Using program-driven simulation and a set of three benchmark programs, it was found that tree-based and linear-list protocols performed almost as well as full-map protocols but with a considerably lower implementation cost. However, if the sharing set is large, linear-list schemes may suffer because of the large write latency while tree-based protocols still perform well.

Performance evaluation of link-based cache coherence schemes

Hardware prefetching on IBM's latest POWER8 processor is able to improve performance of many applications significantly, but it can also cause performance loss for others. The IBM POWER8 processor provides one of the most sophisticated hardware prefetching designs, which supports 225 different configurations. Finding the optimal or near-optimal hardware prefetching configuration for a specific application is therefore a big challenge. We present a dynamic prefetching tuning scheme in this paper, named prefetch automatic tuner (PATer). PATer uses a prediction model based on machine learning to dynamically tune the prefetch configuration based on the values of hardware performance monitoring counters (PMCs). By developing a two-phase prefetching selection algorithm and a prediction accuracy optimization algorithm in this tool, we identify a set of key hardware prefetch configurations that matter most to performance as well as a set of PMCs that maximize the machine learning prediction accuracy. We show that PATer is able to accelerate the execution of diverse workloads by up to 1.4×.

PATer: A Hardware Prefetching Automatic Tuner on IBM POWER8 Processor

This paper proposes to transform the branch outcome history from the time domain to the frequency domain. With our proposed Fourier Analysis Branch (FAB) predictor, we can represent long periodic branch history patterns - as long as 2^13 bits - with a realistic number of bits (52 bits). We evaluate the potential gains of the FAB predictor by considering a hybrid branch predictor in which each branch is predicted using a static scheme, the 2-bit dynamic scheme, the PAp and GAp schemes, and our FAB predictor. By including our FAB predictor in the hybrid predictor, it is possible to cut the misprediction rate of integer applications in the SPEC95 suite by between 5 and 50% with an average of 20%. Besides evaluating its performance, this paper shows some key properties of our FAB predictor and presents some possible implementation approaches.
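The intuition behind moving branch history to the frequency domain can be sketched as follows: a periodic taken/not-taken pattern has only a few nonzero frequency components, so a long history can be summarized compactly. The code below is an illustration of that general idea using a plain discrete Fourier transform, not the FAB predictor's actual design; the helper names, the example pattern, and the ±1 encoding of outcomes are assumptions.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def significant_bins(spectrum, tol=1e-6):
    """Indices of frequency bins whose magnitude is not numerically zero."""
    return [k for k, c in enumerate(spectrum) if abs(c) > tol]


def reconstruct(spectrum, bins, t):
    """Evaluate the inverse transform at time t using only the kept bins,
    then map the result back to an outcome (+1 taken, -1 not taken)."""
    n = len(spectrum)
    value = sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in bins) / n
    return 1 if value.real > 0 else -1


# Hypothetical branch whose outcome repeats with period 8.
pattern = [1, 1, -1, 1, -1, -1, 1, -1]
history = pattern * 8  # 64 recorded outcomes

spectrum = dft(history)
bins = significant_bins(spectrum)

# Extrapolate the next period from the retained components only.
predicted = [reconstruct(spectrum, bins, t) for t in range(64, 72)]
print(len(bins), predicted)
```

Because the history length is a multiple of the period, every retained bin is a multiple of 8, and extrapolating past the end of the recorded history reproduces the pattern.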

Per Stenström

Papers

QoS-Driven Coordinated Management of Resources to Save Energy in Multi-core Systems

Performance evaluation of link-based cache coherence schemes

PATer: A Hardware Prefetching Automatic Tuner on IBM POWER8 Processor

The FAB predictor: using Fourier analysis to predict the outcome of conditional branches

Coordinated management of DVFS and cache partitioning under QoS constraints to save energy in multi-core systems