Open Access, Journal Article

Scalable Energy Efficiency with Resilience for High Performance Computing Systems: A Quantitative Methodology

TLDR
By extending Amdahl’s Law and the Karp-Flatt Metric to take resilience into consideration, this article quantitatively models integrated energy efficiency in terms of performance per Watt and showcases the trade-offs among typical HPC parameters.
Abstract
The ever-growing performance of supercomputers places demanding requirements on energy efficiency and resilience, owing to the rapidly expanding size of large-scale computing systems and the duration of their use. Many application- and architecture-dependent parameters that individually determine energy efficiency and resilience also affect one another, which directly shapes the trade-offs among performance, energy efficiency, and resilience at scale. Managing large-scale High-Performance Computing (HPC) systems efficiently therefore requires a quantitative understanding of the entangled effects among performance, energy efficiency, and resilience. While previous work has focused on exploring energy-saving and resilience-enhancing opportunities separately, little has been done to investigate, theoretically and empirically, the interplay between energy efficiency and resilience at scale. In this article, by extending Amdahl’s Law and the Karp-Flatt Metric to take resilience into consideration, we quantitatively model the integrated energy efficiency in terms of performance per Watt and showcase the trade-offs among typical HPC parameters, such as number of cores, frequency/voltage, and failure rates. Experimental results for a wide spectrum of HPC benchmarks on two HPC systems show that the proposed models are accurate in extrapolating resilience-aware performance and energy efficiency, and are capable of capturing the interplay among various energy-saving and resilience factors. Moreover, the models can help find the optimal HPC configuration for the highest integrated energy efficiency in the presence of failures and applied resilience techniques.
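
For orientation, the two classical building blocks named in the abstract can be sketched in a few lines of Python. The functions `amdahl_speedup` and `karp_flatt` follow the standard textbook definitions. The function `toy_perf_per_watt` is not the article's model: it is a hypothetical illustration that assumes constant power per core and a fixed fractional slowdown from resilience techniques (`resilience_overhead`), only to show how core count, power, and resilience cost can interact in one performance-per-Watt figure.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Amdahl's Law: S(p) = 1 / (f + (1 - f) / p)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def karp_flatt(speedup: float, cores: int) -> float:
    """Karp-Flatt metric: e = (1/S - 1/p) / (1 - 1/p), defined for p > 1."""
    return (1.0 / speedup - 1.0 / cores) / (1.0 - 1.0 / cores)


def toy_perf_per_watt(serial_fraction: float, cores: int,
                      watts_per_core: float,
                      resilience_overhead: float = 0.0) -> float:
    """Illustrative only (assumed model, not taken from the article).

    Speedup is reduced by a fixed fractional time overhead for resilience
    (for example, checkpointing), and power is assumed to grow linearly
    with the number of cores.
    """
    speedup = amdahl_speedup(serial_fraction, cores) / (1.0 + resilience_overhead)
    return speedup / (cores * watts_per_core)


# Example values below are arbitrary and chosen only for illustration.
for p in (1, 4, 16, 64):
    ppw = toy_perf_per_watt(0.1, p, watts_per_core=10.0, resilience_overhead=0.05)
    print(p, round(amdahl_speedup(0.1, p), 3), round(ppw, 5))
```

Under these assumptions, adding cores raises speedup but lowers performance per Watt, because speedup grows more slowly than power. The article's models additionally account for frequency/voltage and failure rates, which this sketch leaves out.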


Citations
Proceedings Article

Improving communication performance in dense linear algebra via topology aware collectives

Solomonik, +2 more
TL;DR: This work maps novel 2.5D dense linear algebra algorithms to exploit rectangular collectives on cuboid partitions allocated by a Blue Gene/P supercomputer, and derives novel LogP-based performance models for rectangular broadcasts and reductions.
Journal Article

A secure and efficient file protecting system based on SHA3 and parallel AES

TL;DR: SEFPS, based on advanced SHA3 and parallel AES, protects both confidentiality and integrity, achieves high performance through GPU and/or CPU parallelism, and can be used on computers whether or not they are equipped with Nvidia GPUs.
Journal Article

Optimizing energy consumption for a performance-aware cloud data center in the public sector

TL;DR: This study presents a method to minimize energy consumption while processing the same workload, i.e., ultimately reducing the energy consumed by operating servers.
Journal Article

Energy balance between voltage-frequency scaling and resilience for linear algebra routines on low-power multicore architectures

TL;DR: This work evaluates the energy efficiency of dense linear algebra routines on several low-power multicore processors and analyzes whether the energy saved by scaling the processor to a low voltage compensates for the cost of integrating a fault tolerance mechanism that tackles SDC.
References
Proceedings Article

Validity of the single processor approach to achieving large scale computing capabilities

TL;DR: In this paper, the authors argue that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers in such a manner as to permit cooperative solution.
Journal Article

Dark Silicon and the End of Multicore Scaling

TL;DR: A comprehensive study that projects the speedup potential of future multicores and examines the underutilization of integration capacity (dark silicon) is timely and crucial.
Proceedings Article

LogP: towards a realistic model of parallel computation

TL;DR: A new parallel machine model, called LogP, is offered that reflects the critical technology trends underlying parallel computers and is intended to serve as a basis for developing fast, portable parallel algorithms and to offer guidelines to machine designers.
Proceedings Article

Dark silicon and the end of multicore scaling

TL;DR: The study shows that regardless of chip organization and topology, multicore scaling is power limited to a degree not widely appreciated by the computing community.
Proceedings Article

Scheduling for reduced CPU energy

TL;DR: This work introduces a new metric for CPU energy performance, millions-of-instructions-per-joule (MIPJ), describes several methods for varying the clock speed dynamically under control of the operating system, and examines the performance of these methods against workstation traces.