Proceedings Article
Using early phase termination to eliminate load imbalances at barrier synchronization points
Martin Rinard
Vol. 42, Iss. 10, pp. 369-386
TL;DR: A general computational pattern that works well with early phase termination is identified, and it is explained why computations that exhibit this pattern can tolerate the early termination of parallel tasks without producing unacceptable results.
Abstract:
We present a new technique, early phase termination, for eliminating idle processors in parallel computations that use barrier synchronization. This technique simply terminates each parallel phase as soon as there are too few remaining tasks to keep all of the processors busy. Although this technique completely eliminates the idling that would otherwise occur at barrier synchronization points, it may also change the computation and therefore the result that the computation produces. We address this issue by providing probabilistic distortion models that characterize how the use of early phase termination distorts the result that the computation produces. Our experimental results show that for our set of benchmark applications, 1) early phase termination can improve the performance of the parallel computation, 2) the distortion is small (or can be made small with the use of an appropriate compensation technique), and 3) the distortion models provide accurate and tight distortion bounds. These bounds can enable users to evaluate the effect of early phase termination and confidently accept results from parallel computations that use this technique if they find the distortion bounds to be acceptable. Finally, we identify a general computational pattern that works well with early phase termination and explain why computations that exhibit this pattern can tolerate the early termination of parallel tasks without producing unacceptable results.
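The termination rule described in the abstract can be illustrated with a small simulation. This is a minimal sketch, not the paper's implementation: the function names `run_phase` and `compensated_sum`, the greedy in-order task assignment, and the task durations are all hypothetical, and the extrapolation used for compensation is only one plausible choice, since the abstract does not specify the compensation technique or the distortion models.

```python
import heapq

def run_phase(durations, num_procs, early_termination):
    """Simulate one parallel phase; return (end_time, completed_task_ids).

    Assumption: tasks are handed out greedily, in index order, to
    whichever processor becomes free first.
    """
    pending = list(range(len(durations)))[::-1]  # pop() yields tasks in order
    running = []                                 # min-heap of (finish_time, task_id)
    completed = []
    now = 0.0
    # Give each processor an initial task.
    while pending and len(running) < num_procs:
        task = pending.pop()
        heapq.heappush(running, (now + durations[task], task))
    while running:
        now, task = heapq.heappop(running)
        completed.append(task)
        if pending:
            nxt = pending.pop()
            heapq.heappush(running, (now + durations[nxt], nxt))
        elif early_termination:
            # A processor would now sit idle at the barrier, so end the
            # phase here and discard the tasks that are still in flight.
            break
    return now, completed

def compensated_sum(values, completed):
    """Hypothetical compensation: scale the partial sum up to the full task count."""
    partial = sum(values[t] for t in completed)
    return partial * len(values) / len(completed)

durations = [1, 1, 1, 1, 1, 2, 3, 10]  # one long straggler task
barrier_end, barrier_done = run_phase(durations, 4, early_termination=False)
early_end, early_done = run_phase(durations, 4, early_termination=True)
print(barrier_end, len(barrier_done))  # barrier waits for the straggler
print(early_end, len(early_done))      # early termination drops unfinished tasks
```

In this made-up workload the barrier version waits for the straggler while three processors idle, whereas the early-termination version ends the phase as soon as the first processor runs out of work, at the cost of discarding the tasks still running. How much the result is distorted depends on what the dropped tasks would have contributed, which is what the paper's distortion models are meant to bound.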
Citations
Proceedings Article
Managing performance vs. accuracy trade-offs with loop perforation
TL;DR: The results indicate that, for a range of applications, this approach typically delivers performance increases of over a factor of two (and up to a factor of seven) while changing the result that the application produces by less than 10%.
Proceedings Article
Green: a framework for supporting energy-conscious programming using controlled approximation
Woongki Baek, Trishul Chilimbi
TL;DR: Green enables programmers to approximate expensive functions and loops. It operates in two phases and can produce significant improvements in performance and energy consumption with small and controlled QoS degradation.
Proceedings Article
Dynamic knobs for responsive power-aware computing
Henry Hoffmann, Stelios Sidiroglou, Michael Carbin, Sasa Misailovic, Anant Agarwal, Martin Rinard
TL;DR: The experimental results show that PowerDial can enable benchmark applications to execute responsively in the face of power caps that would otherwise significantly impair responsiveness, and can significantly reduce the number of machines required to service intermittent load spikes, enabling reductions in power and capital costs.
Proceedings Article
SAGE: self-tuning approximation for graphics engines
TL;DR: Across a set of machine learning and image processing kernels, SAGE's approximation yields an average of 2.5× speedup with less than 10% quality loss compared to the accurate execution on an NVIDIA GTX 560 GPU.
Proceedings Article
Quality of service profiling
TL;DR: The experimental results from applying the implemented quality of service profiler to a challenging set of benchmark applications show that it can enable developers to identify promising optimization opportunities and deliver successful optimizations that substantially increase performance with only small quality of service losses.
References
Journal Article
MapReduce: simplified data processing on large clusters
Jeffrey Dean, Sanjay Ghemawat
TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
Journal Article
MapReduce: simplified data processing on large clusters
Jeffrey Dean, Sanjay Ghemawat
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Journal Article
A hierarchical O(N log N) force-calculation algorithm
Josh Barnes, Piet Hut
TL;DR: A novel method of directly calculating the force on N bodies that grows only as N log N is described, using a tree-structured hierarchical subdivision of space into cubic cells, each of which is recursively divided into eight subcells whenever more than one particle is found to occupy the same cell.
Proceedings Article
Transactional memory: architectural support for lock-free data structures
Maurice Herlihy, J. Eliot B. Moss
TL;DR: Simulation results show that transactional memory matches or outperforms the best known locking techniques for simple benchmarks, even in the absence of priority inversion, convoying, and deadlock.
Journal Article
A Hierarchical O(N) Force Calculation Algorithm
TL;DR: A novel code for the approximate computation of long-range forces between N mutually interacting bodies is described. It is based on a hierarchical tree of cubic cells and features mutual cell–cell interactions, which are calculated via a Cartesian Taylor expansion in a symmetric way, such that total momentum is conserved.