The SPLASH-2 suite of parallel applications has recently been released to facilitate the study of centralized and distributed shared-address-space multiprocessors. In this context, this paper has two goals. One is to quantitatively characterize the SPLASH-2 programs in terms of fundamental properties and architectural interactions that are important to understand them well. The properties we study include the computational load balance, communication-to-computation ratio and traffic needs, important working set sizes, and issues related to spatial locality, as well as how these properties scale with problem size and the number of processors. The other, related goal is methodological: to assist people who will use the programs in architectural evaluations to prune the space of application and machine parameters in an informed and meaningful way. For example, by characterizing the working sets of the applications, we describe which operating points in terms of cache size and problem size are representative of realistic situations, which are not, and which are redundant. Using SPLASH-2 as an example, we hope to convey the importance of understanding the interplay of problem size, number of processors, and working sets in designing experiments and interpreting their results.

/pdf/the-splash-2-programs-characterization-and-methodological-14b6i8va45.pdf

The SPLASH-2 programs: characterization and methodological considerations

We present the internals of QEMU, a fast machine emulator using an original portable dynamic translator. It emulates several CPUs (x86, PowerPC, ARM and Sparc) on several hosts (x86, PowerPC, ARM, Sparc, Alpha and MIPS). QEMU supports full system emulation in which a complete and unmodified operating system is run in a virtual machine and Linux user mode emulation where a Linux process compiled for one target CPU can be run on another CPU.

/pdf/qemu-a-fast-and-portable-dynamic-translator-xpb07ophlg.pdf

QEMU, a fast and portable dynamic translator

The determination of upper bounds on execution times, commonly called worst-case execution times (WCETs), is a necessary step in the development and validation process for hard real-time systems. This problem is hard if the underlying processor architecture has components, such as caches, pipelines, branch prediction, and other speculative components. This article describes different approaches to this problem and surveys several commercially available tools and research prototypes.

/pdf/the-worst-case-execution-time-problem-overview-of-methods-2kdq3gxkoy.pdf

The worst-case execution-time problem—overview of methods and survey of tools

http://www.inf.ufes.br/~luciac/comcie/JFNK-Knoll.pdf

Jacobian-free Newton-Krylov methods: a survey of approaches and applications

The most exciting development in parallel computer architecture is the convergence of traditionally disparate approaches on a common machine structure. This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures. It then examines the design issues that are critical to all parallel architecture across the full range of modern design, covering data access, communication performance, coordination of cooperative work, and correct implementation of useful semantics. It not only describes the hardware and software techniques for addressing each of these issues but also explores how these techniques interact in the same system. Examining architecture from an application-driven perspective, it provides comprehensive discussions of parallel programming for high performance and of workload-driven evaluation, based on understanding hardware-software interactions.


* synthesizes a decade of research and development for practicing engineers, graduate students, and researchers in parallel computer architecture, system software, and applications development

* presents in-depth application case studies from computer graphics, computational science and engineering, and data mining to demonstrate sound quantitative evaluation of design trade-offs 

* describes the process of programming for performance, including both the architecture-independent and architecture-dependent aspects, with examples and case-studies

* illustrates bus-based and network-based parallel systems with case studies of more than a dozen important commercial designs

Table of Contents

1 Introduction 
2 Parallel Programs 
3 Programming for Performance 
4 Workload-Driven Evaluation 
5 Shared Memory Multiprocessors 
6 Snoop-based Multiprocessor Design 
7 Scalable Multiprocessors
8 Directory-based Cache Coherence
9 Hardware-Software Tradeoffs 
10 Interconnection Network Design 
11 Latency Tolerance 
12 Future Directions 
APPENDIX A Parallel Benchmark Suites

Parallel Computer Architecture: A Hardware/Software Approach

Hardware transactional memory (HTM) systems have been studied extensively along the dimensions of speculative versioning and contention management policies. The relative performance of several design policies has been discussed at length in prior work within the framework of scalable chip-multiprocessing systems. Yet, the impact of simple structural optimizations like write-buffering has not been investigated, and performance deviations due to the presence or absence of these optimizations remain unclear. This lack of insight into the effective use and impact of these interfacial structures between the processor core and the coherent memory hierarchy forms the crux of the problem we study in this paper. Through detailed modeling of various write-buffering configurations we show that they play a major role in determining the overall performance of a practical HTM system. Our study of both eager and lazy conflict resolution mechanisms in a scalable parallel architecture notes a remarkable convergence of the performance of these two diametrically opposite design points when write buffers are introduced and used well to support the common case. Mitigation of redundant actions, fewer invalidations on abort, latency-hiding and prefetch effects contribute towards reducing execution times for transactions. Shorter transaction durations also imply a lower contention probability, thereby amplifying gains even further. The insights, related to the interplay between buffering mechanisms, system policies and workload characteristics, contained in this paper clearly distinguish gains in performance to be had from write-buffering from those that can be ascribed to HTM policy. We believe that this information would facilitate sound design decisions when incorporating HTMs into parallel architectures.

/pdf/eager-meets-lazy-the-impact-of-write-buffering-on-hardware-3dzzjgfp1w.pdf

Eager Meets Lazy: The Impact of Write-Buffering on Hardware Transactional Memory

Thread-level speculative execution is a technique that makes it possible for a wider range of single-threaded applications to make use of the processing resources in a chip multiprocessor. We consider module-level speculation, i.e., speculative threads executing the code after a module (i.e., a procedure, function, or method) call. Unfortunately, previous studies have shown that indiscriminate module-level speculation results in significant overheads, mainly due to frequent misspeculations. In addition to hurting performance, excessive overhead is harmful from a resource usage and energy efficiency standpoint. We show that the overhead when spawning speculative threads for all module continuations is on average three times as large as the time spent on useful execution on our baseline 8-way chip multiprocessor. In this paper, we present and make a detailed evaluation of a technique that aims at reducing the overhead associated with misspeculations. History-based prediction is used in an attempt to prevent speculative threads from being spawned when they are expected to cause misspeculations. We find that the overhead can be reduced by a factor of six on average compared to indiscriminate speculation. The impact on speedup is small for most applications, but in several cases speedup is slightly improved.

Reducing misspeculation overhead for module-level speculative execution

Multicore architectures can provide high predictable performance through parallel processing. Unfortunately, computing the makespan of parallel applications is overly pessimistic either due to load imbalance issues plaguing static scheduling methods or due to timing anomalies plaguing dynamic scheduling methods. This paper contributes an anomaly-free dynamic scheduling method, called Lazy, which is non-preemptive and non-greedy in the sense that some ready tasks may not be dispatched for execution even if some processors are idle. Assuming parallel applications using contemporary task-based parallel programming models, such as OpenMP, the general idea of Lazy is to avoid timing anomalies by assigning fixed priorities to the tasks and then dispatching selected highest-priority ready tasks for execution at each scheduling point. We formally prove that Lazy is timing-anomaly free. Unlike all the commonly-used dynamic schedulers like breadth-first and depth-first schedulers (e.g., CilkPlus) that rely on analytical approaches to determine an upper bound on the makespan of a parallel application, a safe makespan of a parallel application is computed by simulating Lazy. Our experimental results show that the makespan computed by simulating Lazy is much tighter and scales better as demonstrated by four parallel benchmarks from a task-parallel benchmark suite in comparison to the state-of-the-art.

Timing-anomaly free dynamic scheduling of task-based parallel applications

This paper analyzes the sources of performance losses in hardware transactional memory and investigates techniques to reduce the losses. It dissects the root causes of data conflicts in hardware transactional memory (HTM) systems into four classes of conflicts: true sharing, false sharing, silent store, and write-write conflicts. These conflicts can cause performance and energy losses due to aborts and extra communication. To quantify losses, the paper first proposes the 5C cache-miss classification model that extends the well-established 4C model with a new class of cache misses known as contamination misses. The paper also contributes two techniques for removal of data conflicts: one for removal of false sharing conflicts and another for removal of silent store conflicts. In addition, it revisits and adapts a technique that is able to reduce losses due to both true and false conflicts. All of the proposed techniques can be accommodated in a lazy versioning and lazy conflict resolution HTM built on top of a MESI cache-coherence infrastructure with quite modest extensions. Their ability to reduce performance losses is quantitatively established, individually as well as in combination. Performance is improved substantially.

/pdf/classification-and-elimination-of-conflicts-in-hardware-3icg80571r.pdf

Classification and Elimination of Conflicts in Hardware Transactional Memory Systems

Hardware transactional memory (HTM) designs are very sensitive to the manner in which speculative updates from transactions are handled in the system. This study highlights how the lack of effective techniques for store management results in a quick degradation in the performance of eager HTM systems with increasing contention and, thus, lends credence to the belief that eager designs do not perform as well as their lazy counterparts when conflicts abound. In this work, we present two simple ways to improve handling of speculative stores: a way to effectively manage lines that exhibit migratory sharing and a way to hide store latency, particularly for those stores that target contended cache lines owned by other concurrent transactions. These two mechanisms yield substantial improvements in execution time when running applications with high contention, allowing eager designs to exceed the performance of lazy ones. Interestingly, the benefits that accrue from these enhancements can be on par with those achieved using more complex system-wide HTM techniques. Coupled with the fact that eager designs are easier to integrate into cache coherent architectures than lazy ones, we claim that with judicious management of stores they represent a more compelling design alternative.

Per Stenström

Papers

Eager Meets Lazy: The Impact of Write-Buffering on Hardware Transactional Memory

Reducing misspeculation overhead for module-level speculative execution

Timing-anomaly free dynamic scheduling of task-based parallel applications

Classification and Elimination of Conflicts in Hardware Transactional Memory Systems

Eager Beats Lazy: Improving Store Management in Eager Hardware Transactional Memory