Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters

doi:10.1088/1742-6596/46/1/067

Open Access Journal Article

Paul Hargrove, +1 more

Vol. 46, Iss. 1, pp. 494-499

Abstract:

This article describes the motivation, design and implementation of Berkeley Lab Checkpoint/Restart (BLCR), a system-level checkpoint/restart implementation for Linux clusters that targets the space of typical High Performance Computing applications, including MPI. Application-level solutions, including both checkpointing and fault-tolerant algorithms, are recognized as more time- and space-efficient than system-level checkpoints, which cannot make use of any application-specific knowledge. However, system-level checkpointing allows for preemption, making it suitable for responding to "fault precursors" (for instance, elevated error rates from ECC memory or network CRCs, or elevated temperature from sensors). Preemption can also increase the efficiency of batch scheduling, for instance by reducing idle cycles (allowing shutdown without any queue-draining period, or reallocating resources to eliminate idle nodes when better-fitting jobs are queued) and by reducing the average queued time (limiting large jobs to off-peak hours without the need to limit the length of such jobs). Each of these potential uses makes BLCR a valuable tool for efficient resource management in Linux clusters.

Citations

Proceedings Article

PLFS: a checkpoint filesystem for parallel applications

John M. Bent, +7 more

TL;DR: PLFS is a virtual parallel log-structured file system that remaps an application's preferred data layout into one optimized for the underlying file system, which can reduce checkpoint time by an order of magnitude.

Proceedings Article

DMTCP: Transparent checkpointing for cluster computations and the desktop

Jason Ansel, +2 more

TL;DR: DMTCP is a transparent user-level checkpointing package for distributed applications. It is used for the runCMS experiment of the Large Hadron Collider at CERN, and it can be incorporated and distributed as a checkpoint-restart module within some larger package.

Journal Article

Toward Exascale Resilience: 2014 Update

Franck Cappello, +5 more

TL;DR: This paper surveys what the community has learned in the past five years and summarizes the research problems still considered critical by the HPC community.

Proceedings Article

Combining batch execution and leasing using virtual machines

Borja Sotomayor, +2 more

TL;DR: This paper describes a scheduling approach in which users request resource leases with either as-soon-as-possible ("best-effort") or reserved start times, and finds that a VM-based approach can provide better performance than a scheduler that does not support task preemption.

Proceedings Article

Detection and correction of silent data corruption for large-scale high-performance computing

David Fiala, +5 more

TL;DR: RedMPI is an MPI library that resides in the profiling layer of any standards-compliant MPI implementation and performs both online detection and correction of soft errors in MPI applications without requiring changes to application source code.

References

Journal Article

The design and implementation of Zap: a system for migrating computing environments

Steven Osman, +3 more

TL;DR: The paper demonstrates that the Linux Zap prototype can provide general-purpose process migration functionality with low overhead, and its results show that pods can be migrated with subsecond checkpoint and restart latencies.

Journal Article

The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing

Sriram Sankaran, +7 more

TL;DR: This work designs and implements a system for providing coordinated checkpointing and rollback recovery for MPI-based parallel applications that integrates the Berkeley Lab BLCR kernel-level process checkpoint system with the LAM implementation of MPI through a defined checkpoint/restart interface.

Report

The design and implementation of Berkeley Lab's Linux checkpoint/restart

Jason Duell

30 Apr 2005

Lawrence Berkeley National Laboratory

TL;DR: BLCR can be used either as a standalone system for checkpointing applications on a single machine, or as a component used by a scheduling system or parallel communication library for checkpointing and restoring parallel jobs running on multiple machines.

CRAK: Linux Checkpoint/Restart As a Kernel Module

Hua Zhong

TL;DR: CRAK is the first system for Unix/Linux that provides transparent checkpoint/restart with the following properties: (1) it does not require any modifications to existing operating system or application code, and (2) it supports migrating network sockets.

Proceedings Article

BProc: the Beowulf distributed process space

Erik Hendriks

TL;DR: Job startup with BProc's process migration mechanism is faster than the traditional method of logging into a node and starting the process with rsh, and the vast majority of MPI applications will experience no performance loss as a result of being managed by BProc.
