Proceedings ArticleDOI

Using early phase termination to eliminate load imbalances at barrier synchronization points

Martin Rinard
- Vol. 42, Iss: 10, pp 369-386
TLDR
The paper identifies a general computational pattern that works well with early phase termination and explains why computations that exhibit this pattern can tolerate the early termination of parallel tasks without producing unacceptable results.
Abstract
We present a new technique, early phase termination, for eliminating idle processors in parallel computations that use barrier synchronization. This technique simply terminates each parallel phase as soon as there are too few remaining tasks to keep all of the processors busy. Although this technique completely eliminates the idling that would otherwise occur at barrier synchronization points, it may also change the computation and therefore the result that the computation produces. We address this issue by providing probabilistic distortion models that characterize how the use of early phase termination distorts the result that the computation produces. Our experimental results show that for our set of benchmark applications, 1) early phase termination can improve the performance of the parallel computation, 2) the distortion is small (or can be made to be small with the use of an appropriate compensation technique) and 3) the distortion models provide accurate and tight distortion bounds. These bounds can enable users to evaluate the effect of early phase termination and confidently accept results from parallel computations that use this technique if they find the distortion bounds to be acceptable. Finally, we identify a general computational pattern that works well with early phase termination and explain why computations that exhibit this pattern can tolerate the early termination of parallel tasks without producing unacceptable results.
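To make the mechanism concrete, here is a minimal sketch of the idea, not the paper's implementation: workers drain a shared task pool for one phase and meet at a barrier, and with early termination enabled they stop pulling work as soon as fewer tasks remain than there are workers, so no processor idles waiting at the barrier. The queue-based task pool, worker count, and function names are illustrative assumptions.

```python
import threading
from queue import Queue, Empty

NUM_WORKERS = 4  # stand-in for the number of processors

def run_phase(tasks, process_task, early_termination=True):
    """Run one parallel phase over `tasks`, then meet at a barrier."""
    work = Queue()
    for t in tasks:
        work.put(t)
    barrier = threading.Barrier(NUM_WORKERS)

    def worker():
        while True:
            # Early phase termination: stop handing out work as soon as
            # the remaining tasks can no longer keep every worker busy.
            if early_termination and work.qsize() < NUM_WORKERS:
                break
            try:
                task = work.get_nowait()
            except Empty:
                break
            process_task(task)
        barrier.wait()  # the barrier synchronization point

    threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()

results = []
run_phase(range(100), lambda t: results.append(t * t))
print(len(results), "of 100 tasks executed")  # a few tasks may be skipped
```

Skipping the last few tasks is exactly the source of the distortion that the paper's probabilistic distortion models are designed to bound.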



Citations
Proceedings ArticleDOI

Managing performance vs. accuracy trade-offs with loop perforation

TL;DR: The results indicate that, for a range of applications, this approach typically delivers performance increases of over a factor of two (and up to a factor of seven) while changing the result that the application produces by less than 10%.
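Loop perforation, as named in the title, transforms a loop so that it executes only a subset of its iterations. A minimal sketch under that assumption, with an arbitrary perforation rate and a hypothetical function name:

```python
def perforated_sum(data, perforation=2):
    """Execute only every `perforation`-th iteration of the loop and
    scale the partial result to approximate the exact sum."""
    partial = 0.0
    for i in range(0, len(data), perforation):  # skipped iterations are never run
        partial += data[i]
    return partial * perforation  # extrapolate over the skipped work

data = [float(i) for i in range(1000)]
exact = sum(data)
approx = perforated_sum(data, perforation=2)
print(exact, approx, abs(exact - approx) / exact)  # small relative error
```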
Proceedings ArticleDOI

Green: a framework for supporting energy-conscious programming using controlled approximation

TL;DR: Green enables programmers to approximate expensive functions and loops; it operates in two phases and can produce significant improvements in performance and energy consumption with small and controlled QoS degradation.
Proceedings ArticleDOI

Dynamic knobs for responsive power-aware computing

TL;DR: The experimental results show that PowerDial can enable benchmark applications to execute responsively in the face of power caps that would otherwise significantly impair responsiveness, and can significantly reduce the number of machines required to service intermittent load spikes, enabling reductions in power and capital costs.
Proceedings ArticleDOI

SAGE: self-tuning approximation for graphics engines

TL;DR: Across a set of machine learning and image processing kernels, SAGE's approximation yields an average of 2.5× speedup with less than 10% quality loss compared to the accurate execution on an NVIDIA GTX 560 GPU.
Proceedings ArticleDOI

Quality of service profiling

TL;DR: The experimental results from applying the implemented quality of service profiler to a challenging set of benchmark applications show that it can enable developers to identify promising optimization opportunities and deliver successful optimizations that substantially increase the performance with only small quality of service losses.
References
Journal ArticleDOI

MapReduce: simplified data processing on large clusters

TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
Journal ArticleDOI

MapReduce: simplified data processing on large clusters

TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
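The MapReduce programming model reduces to two user-supplied functions. A toy, single-process word-count sketch of that interface (the real system distributes these phases across a cluster and handles failures; the helper names here are illustrative, not the paper's API):

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply the user's map function to every input record,
    collecting the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def reduce_phase(pairs, reduce_fn):
    """Group intermediate pairs by key and apply the user's
    reduce function to each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Word count: emit (word, 1) in map, sum the counts in reduce.
docs = ["early phase termination", "barrier synchronization", "early termination"]
pairs = map_phase(docs, lambda doc: [(word, 1) for word in doc.split()])
counts = reduce_phase(pairs, lambda word, values: sum(values))
print(counts)
```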
Journal ArticleDOI

A hierarchical O(N log N) force-calculation algorithm

TL;DR: A novel method of directly calculating the force on N bodies that grows only as N log N is described, using a tree-structured hierarchical subdivision of space into cubic cells, each of which is recursively divided into eight subcells whenever more than one particle is found to occupy the same cell.
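A toy version of that recursive subdivision (insertion only; the force evaluation and opening-angle test are omitted, and all names are illustrative, not the authors' code):

```python
class Cell:
    """Cubic cell of the hierarchical subdivision."""
    def __init__(self, center, half_size):
        self.center = center        # (x, y, z) of the cell's center
        self.half_size = half_size  # half the cell's edge length
        self.bodies = []            # bodies stored in a leaf cell
        self.children = []          # eight subcells once subdivided

def subdivide(cell):
    """Split a cell into its eight octant subcells."""
    h = cell.half_size / 2.0
    cx, cy, cz = cell.center
    cell.children = [Cell((cx + dx * h, cy + dy * h, cz + dz * h), h)
                     for dx in (-1, 1) for dy in (-1, 1) for dz in (-1, 1)]

def child_for(cell, body):
    """Pick the subcell whose octant contains the body's position."""
    x, y, z = body
    cx, cy, cz = cell.center
    # children are ordered (-,-,-) ... (+,+,+) by the comprehension above
    return cell.children[(x >= cx) * 4 + (y >= cy) * 2 + (z >= cz)]

def insert(cell, body):
    """Insert a body, subdividing a cell into eight subcells whenever
    more than one body would occupy it."""
    if not cell.children and not cell.bodies:
        cell.bodies.append(body)      # empty leaf: just store the body
        return
    if not cell.children:             # occupied leaf: subdivide and push down
        existing = cell.bodies.pop()
        subdivide(cell)
        insert(child_for(cell, existing), existing)
    insert(child_for(cell, body), body)

root = Cell((0.0, 0.0, 0.0), 1.0)
for b in [(0.1, 0.2, 0.3), (-0.4, 0.5, -0.6), (0.7, -0.8, 0.9)]:
    insert(root, b)
```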
Proceedings ArticleDOI

Transactional memory: architectural support for lock-free data structures

TL;DR: Simulation results show that transactional memory matches or outperforms the best known locking techniques for simple benchmarks, even in the absence of priority inversion, convoying, and deadlock.
Journal ArticleDOI

A Hierarchical O(N) Force Calculation Algorithm

TL;DR: A novel code for the approximate computation of long-range forces between N mutually interacting bodies based on a hierarchical tree of cubic cells and features mutual cell–cell interactions which are calculated via a Cartesian Taylor expansion in a symmetric way, such that total momentum is conserved.