Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters
Paul Hargrove, Jason Duell +1 more
- Vol. 46, Iss: 1, pp 494-499
TLDR
The motivation, design and implementation of Berkeley Lab Checkpoint/Restart (BLCR), a system-level checkpoint/restart implementation for Linux clusters that targets the space of typical High Performance Computing applications, including MPI, are described.
Abstract:
This article describes the motivation, design and implementation of Berkeley Lab Checkpoint/Restart (BLCR), a system-level checkpoint/restart implementation for Linux clusters that targets the space of typical High Performance Computing applications, including MPI. Application-level solutions, including both checkpointing and fault-tolerant algorithms, are recognized as more time and space efficient than system-level checkpoints, which cannot make use of any application-specific knowledge. However, system-level checkpointing allows for preemption, making it suitable for responding to "fault precursors" (for instance, elevated error rates from ECC memory or network CRCs, or elevated temperature from sensors). Preemption can also increase the efficiency of batch scheduling; for instance, reducing idle cycles (by allowing for shutdown without any queue draining period or reallocation of resources to eliminate idle nodes when better-fitting jobs are queued), and reducing the average queued time (by limiting large jobs to running during off-peak hours, without the need to limit the length of such jobs). Each of these potential uses makes BLCR a valuable tool for efficient resource management in Linux clusters.
Citations
Proceedings Article
PLFS: a checkpoint filesystem for parallel applications
John M. Bent, Garth A. Gibson, Gary Grider, Ben McClelland, Paul Nowoczynski, James Nunez, Milo Polte, Meghan Wingate
TL;DR: A virtual parallel log-structured file system that remaps an application's preferred data layout into one optimized for the underlying file system, which can reduce checkpoint time by an order of magnitude.
Proceedings Article
DMTCP: Transparent checkpointing for cluster computations and the desktop
TL;DR: DMTCP is a transparent user-level checkpointing package for distributed applications. It is used for the runCMS experiment of the Large Hadron Collider at CERN, and it can be incorporated and distributed as a checkpoint-restart module within some larger package.
Journal Article
Toward Exascale Resilience: 2014 Update
TL;DR: This paper surveys what the community has learned in the past five years and summarizes the research problems still considered critical by the HPC community.
Proceedings Article
Combining batch execution and leasing using virtual machines
TL;DR: A scheduling approach is described in which users request resource leases with either as-soon-as-possible ("best-effort") or reserved start times; a VM-based approach can provide better performance than a scheduler that does not support task pre-emption.
Proceedings Article
Detection and correction of silent data corruption for large-scale high-performance computing
TL;DR: RedMPI is an MPI library residing in the profiling layer of any standards-compliant MPI implementation capable of both online detection and correction of soft errors that occur in MPI applications without requiring code changes to application source code.
References
Journal Article
The design and implementation of Zap: a system for migrating computing environments
TL;DR: The paper demonstrates that the Linux Zap prototype can provide general-purpose process migration functionality with low overhead and results for migrating pods show that these kinds of pods can be migrated with subsecond checkpoint and restart latencies.
Journal Article
The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing
Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, Vishal Sahay, Andrew Lumsdaine, Jason Duell, Paul Hargrove, Eric Roman
TL;DR: This work designs and implements a system for providing coordinated checkpointing and rollback recovery for MPI-based parallel applications that integrates the Berkeley Lab BLCR kernel-level process checkpoint system with the LAM implementation of MPI through a defined checkpoint/restart interface.
Report
The design and implementation of Berkeley Lab's Linux checkpoint/restart
TL;DR: BLCR can be used either as a standalone system for checkpointing applications on a single machine, or as a component by a scheduling system or parallel communication library for checkpointing and restoring parallel jobs running on multiple machines.
CRAK: Linux Checkpoint/Restart As a Kernel Module
TL;DR: CRAK is the first system for Unix/Linux that provides transparent checkpoint/restart with the following properties: (1) it does not require any modifications of existing operating system or application code and (2) it supports migrating network sockets.
Proceedings Article
BProc: the Beowulf distributed process space
TL;DR: Job startup with BProc's process migration mechanism is faster than the traditional method of logging into a node and starting the process with rsh, so the vast majority of MPI applications will experience no performance loss as a result of being managed by BProc.