Open Access Journal Article

Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters

Paul Hargrove, +1 more
- Vol. 46, Iss: 1, pp 494-499
Abstract
This article describes the motivation, design and implementation of Berkeley Lab Checkpoint/Restart (BLCR), a system-level checkpoint/restart implementation for Linux clusters that targets the space of typical High Performance Computing applications, including MPI. Application-level solutions, including both checkpointing and fault-tolerant algorithms, are recognized as more time and space efficient than system-level checkpoints, which cannot make use of any application-specific knowledge. However, system-level checkpointing allows for preemption, making it suitable for responding to "fault precursors" (for instance, elevated error rates from ECC memory or network CRCs, or elevated temperature from sensors). Preemption can also increase the efficiency of batch scheduling: for instance, it can reduce idle cycles (by allowing for shutdown without any queue-draining period, or reallocation of resources to eliminate idle nodes when better-fitting jobs are queued) and reduce the average queued time (by limiting large jobs to running during off-peak hours, without the need to limit the length of such jobs). Each of these potential uses makes BLCR a valuable tool for efficient resource management in Linux clusters.


Citations
Proceedings Article

PLFS: a checkpoint filesystem for parallel applications

TL;DR: PLFS is a virtual parallel log-structured file system that remaps an application's preferred data layout into one optimized for the underlying file system, which can reduce checkpoint time by an order of magnitude.
Proceedings Article

DMTCP: Transparent checkpointing for cluster computations and the desktop

TL;DR: DMTCP is a transparent user-level checkpointing package for distributed applications. It is used for the runCMS experiment of the Large Hadron Collider at CERN, and it can be incorporated and distributed as a checkpoint-restart module within a larger package.
Journal Article

Toward Exascale Resilience: 2014 Update

TL;DR: This paper surveys what the community has learned in the past five years and summarizes the research problems still considered critical by the HPC community.
Proceedings Article

Combining batch execution and leasing using virtual machines

TL;DR: The paper describes a scheduling approach in which users request resource leases with either as-soon-as-possible ("best-effort") or reserved start times, and reports that a VM-based approach can provide better performance than a scheduler that does not support task preemption.
Proceedings Article

Detection and correction of silent data corruption for large-scale high-performance computing

TL;DR: RedMPI is an MPI library that resides in the profiling layer of any standards-compliant MPI implementation and is capable of both online detection and correction of soft errors in MPI applications, without requiring changes to application source code.
References
Journal Article

The design and implementation of Zap: a system for migrating computing environments

TL;DR: The paper demonstrates that the Linux Zap prototype can provide general-purpose process migration functionality with low overhead, and results for migrating pods show that these pods can be migrated with subsecond checkpoint and restart latencies.
Journal Article

The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing

TL;DR: This work designs and implements a system for providing coordinated checkpointing and rollback recovery for MPI-based parallel applications that integrates the Berkeley Lab BLCR kernel-level process checkpoint system with the LAM implementation of MPI through a defined checkpoint/restart interface.
Report

The design and implementation of Berkeley Lab's Linux checkpoint/restart

TL;DR: BLCR can be used either as a standalone system for checkpointing applications on a single machine, or as a component by a scheduling system or parallel communication library for checkpointing and restoring parallel jobs running on multiple machines.

CRAK: Linux Checkpoint/Restart As a Kernel Module

Hua Zhong
TL;DR: CRAK is the first system for Unix/Linux that provides transparent checkpoint/restart with the following properties: (1) it does not require any modifications of existing operating system or application code and (2) it supports migrating network sockets.
Proceedings Article

BProc: the Beowulf distributed process space

TL;DR: Job startup with BProc's process migration mechanism is faster than the traditional method of logging into a node and starting the process with rsh, so the vast majority of MPI applications will experience no performance loss as a result of being managed by BProc.