Scalable Energy Efficiency with Resilience for High Performance Computing Systems: A Quantitative Methodology

doi:10.1145/2822893



Li Tan, +2 more

16 Nov 2015

ACM Transactions on Architecture and Cod...

Vol. 12, Iss. 4, pp. 35


TLDR

By extending Amdahl’s Law and the Karp-Flatt metric to take resilience into consideration, this article quantitatively models integrated energy efficiency in terms of performance per Watt and showcases the trade-offs among typical HPC parameters.
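For reference, the classical (resilience-unaware) forms of the two laws that the article extends are shown below. The notation is chosen here and may differ from the article's: s is the serial fraction, p the number of cores, and ψ(p) the measured speedup on p cores. The article's resilience-aware extensions are not reproduced here.

```latex
% Classical Amdahl's Law: ideal speedup on p cores for serial fraction s
S(p) = \frac{1}{s + \dfrac{1 - s}{p}}

% Classical Karp-Flatt metric: experimentally determined serial fraction e,
% computed from a measured speedup \psi(p) on p > 1 cores
e = \frac{\dfrac{1}{\psi(p)} - \dfrac{1}{p}}{1 - \dfrac{1}{p}}
```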

Abstract:

The ever-growing performance of today's supercomputers brings demanding requirements for energy efficiency and resilience, due to the rapidly expanding size and operating duration of large-scale computing systems. Many application- and architecture-dependent parameters that individually determine energy efficiency and resilience are causally related to one another, and these relationships directly affect the trade-offs among performance, energy efficiency, and resilience at scale. Efficient management of today's large-scale High-Performance Computing (HPC) systems therefore requires a quantitative understanding of the entangled effects among performance, energy efficiency, and resilience. Previous work has explored energy-saving and resilience-enhancing opportunities separately, but little has been done to investigate, theoretically and empirically, the interplay between energy efficiency and resilience at scale. In this article, we extend Amdahl’s Law and the Karp-Flatt metric to take resilience into consideration. We use the extended models to quantify integrated energy efficiency in terms of performance per Watt and to showcase the trade-offs among typical HPC parameters, such as the number of cores, frequency/voltage, and failure rates. Experimental results for a wide spectrum of HPC benchmarks on two HPC systems show two things. First, the proposed models accurately extrapolate resilience-aware performance and energy efficiency. Second, they capture the interplay among various energy-saving and resilience factors. Moreover, the models can help find the HPC configuration that yields the highest integrated energy efficiency in the presence of failures and applied resilience techniques.
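The kind of configuration search the abstract describes can be illustrated with a toy model. It uses the classical Amdahl's Law and Karp-Flatt metric, not the article's resilience-aware versions. It then sweeps core count and relative frequency to find the highest performance per Watt under an assumed failure rate. Everything beyond the two classical formulas is a hypothetical assumption made only for illustration: the power model (shared power plus per-core static and cubic dynamic power), the first-order restart penalty per failure, all default constants, and the helper names `perf_per_watt` and `best_config`.

```python
"""Toy sketch of the core-count / frequency / failure-rate trade-off.

Uses the classical (resilience-unaware) Amdahl's Law and Karp-Flatt metric.
The power model, failure penalty and default constants are HYPOTHETICAL
illustrations, NOT the article's resilience-aware model.
"""
from itertools import product


def amdahl_speedup(serial_fraction: float, p: int) -> float:
    """Classical Amdahl's Law speedup on p cores."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)


def karp_flatt(measured_speedup: float, p: int) -> float:
    """Karp-Flatt metric: experimentally determined serial fraction (needs p > 1)."""
    return (1.0 / measured_speedup - 1.0 / p) / (1.0 - 1.0 / p)


def perf_per_watt(p: int, f: float, serial_fraction: float, failure_rate: float,
                  *, t_seq: float = 1000.0, p_shared: float = 200.0,
                  p_static: float = 10.0, k_dyn: float = 20.0,
                  restart_cost: float = 60.0) -> float:
    """Hypothetical performance per Watt (runs per Joule) for p cores at relative frequency f.

    Assumptions, all illustrative:
      - runtime scales as 1/f for a relative frequency f in (0, 1];
      - power = p_shared + p * (p_static + k_dyn * f**3), i.e. dynamic
        power ~ V^2 * f with voltage scaled proportionally to frequency;
      - failures arrive at failure_rate per core-second, and each one adds
        restart_cost seconds (a first-order approximation).
    """
    t_fault_free = t_seq / (f * amdahl_speedup(serial_fraction, p))
    expected_failures = failure_rate * p * t_fault_free
    t_total = t_fault_free + expected_failures * restart_cost
    power = p_shared + p * (p_static + k_dyn * f ** 3)
    # Performance (runs per second) divided by power (Watts).
    return (1.0 / t_total) / power


def best_config(cores, freqs, serial_fraction, failure_rate):
    """Exhaustively search (cores, frequency) pairs for the highest perf/W."""
    return max(product(cores, freqs),
               key=lambda cf: perf_per_watt(cf[0], cf[1], serial_fraction, failure_rate))


if __name__ == "__main__":
    cores = [1, 2, 4, 8, 16, 32, 64]
    freqs = [0.6, 0.8, 1.0]
    print("Amdahl speedup, s=0.05, p=16:", amdahl_speedup(0.05, 16))
    print("Karp-Flatt round trip:", karp_flatt(amdahl_speedup(0.05, 16), 16))
    for rate in (0.0, 1e-5, 1e-4):
        print(f"failure rate {rate:g}/core-s -> best (cores, f):",
              best_config(cores, freqs, 0.05, rate))
```

In this toy, the failure penalty grows with the core count. A higher failure rate therefore tends to favor configurations with fewer cores. The actual trade-offs depend on the models and measured parameters reported in the article.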
