Recovery of Distributed Iterative Solvers for Linear Systems Using Non-Volatile RAM

doi:10.1109/ftxs56515.2022.00007

Proceedings ArticleDOI

Recovery of Distributed Iterative Solvers for Linear Systems Using Non-Volatile RAM

TLDR

In-NVRAM ESR as discussed by the authors uses MPI One-Sided Communication (OSC) over RDMA implementation, which was optimized and applied for using NVRAM to store recovery data for iterative linear solvers.

Abstract:

HPC systems are a critical resource for scientific research and advanced industries. The increased demand for computational power and memory ushers in the exascale era, in which supercomputers are designed to provide enormous computing power to meet these needs. These complex supercomputers consist of numerous compute nodes and are consequently expected to experience frequent faults and crashes.Mathematical solvers, in particular, iterative linear solvers are key building block in numerous large-scale scientific applications. Consequently, supporting the recovery of distributed solvers is necessary for scaling scientific applications to exascale platforms. Previous recovery methods for iterative solvers are based on Checkpoint-Restart (CR), which incurs high fault tolerance overhead, or intrinsic fault tolerance, which require extra computation time to converge after failures.Exact state reconstruction (ESR) was proposed as an alternative mechanism to alleviate the impact of frequent failures on long-term computations. ESR has been shown to provide exact reconstruction of the computation state while avoiding the need for costly checkpointing. However, ESR currently relies on volatile memory for fault tolerance, and must therefore maintain redundancies in the RAM of multiple nodes. This not only incurs high memory overhead but also prevents ESR from being fully resilient, that is, resilient against a full system crash.Recent supercomputer designs feature emerging non-volatile RAM (NVRAM) technology, for example, the exascale Aurora that is planned to consist of Intel Optane™ DCPMM. This paper investigates how NVRAM can be utilized to devise an enhanced ESR-based recovery mechanism that is more efficient and provides full resilience. Our mechanism, called in-NVRAM ESR, provides full resiliency while significantly reducing both the memory footprint and the time overhead in comparison with the original ESR design (in-RAM ESR). In-NVRAM ESR is based on a novel MPI One-Sided Communication (OSC) over RDMA implementation, which was optimized and applied for using NVRAM to store recovery data for iterative linear solvers. The source code used in this work, as well as the benchmarks and other relevant sources, are available at: https://github.com/Scientific-Computing-Lab-NRCN/In-NVRAM-ESR.git.

References

PDF

Open Access

More filters

Journal ArticleDOI

GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems

Youcef Saad, +1 more

- 01 Jul 1986 -

Siam Journal on Scientific and Statistic...

TL;DR: An iterative method for solving linear systems, which has the property of minimizing at every step the norm of the residual vector over a Krylov subspace.

...read moreread less

Proceedings ArticleDOI

Enhancing lifetime and security of PCM-based main memory with start-gap wear leveling

Moinuddin K. Qureshi, +5 more

TL;DR: Start-Gap is proposed, a simple, novel, and effective wear-leveling technique that uses only two registers that boosts the achievable lifetime of the baseline 16 GB PCM-based system from 5% to 97% of the theoretical maximum, while incurring a total storage overhead of less than 13 bytes and obviating the latency overhead of accessing large tables.

...read moreread less

Journal ArticleDOI

Understanding failures in petascale computers

Bianca Schroeder, +1 more

TL;DR: This paper reviews sources of failure information for compute clusters and storage systems, projects failure rates and the corresponding decrease in application effectiveness, and discusses coping strategies such as application-level checkpoint compression and system level process-pairs fault-tolerance for supercomputing.

...read moreread less

Journal ArticleDOI

Toward Exascale Resilience

Franck Cappello, +5 more

TL;DR: This white paper synthesizes the motivations, observations and research issues considered as determinant of several complimentary experts of HPC in applications, programming models, distributed systems and system management.

...read moreread less