Proceedings ArticleDOI

CheCUDA: A Checkpoint/Restart Tool for CUDA Applications

TL;DR: It is demonstrated that a prototype implementation of CheCUDA can correctly checkpoint and restart a CUDA application written with basic APIs, which also indicates that CheCUDA can migrate a process from one PC to another even if the process uses a GPU.
Abstract: In this paper, a tool named CheCUDA is designed to checkpoint CUDA applications that use GPUs as accelerators. As existing checkpoint/restart implementations do not support checkpointing the GPU status, CheCUDA hooks a subset of the basic CUDA driver API calls in order to record the status changes in main memory. At checkpointing, CheCUDA stores the status changes in a file after copying all necessary data from video memory to main memory and then disabling the CUDA runtime. At restarting, CheCUDA reads the file, re-initializes the CUDA runtime, and recovers the resources on GPUs so as to restart from the stored status. This paper demonstrates that a prototype implementation of CheCUDA can correctly checkpoint and restart a CUDA application written with basic APIs. This also indicates that CheCUDA can migrate a process from one PC to another even if the process uses a GPU. Accordingly, CheCUDA is useful not only to enhance the dependability of CUDA applications but also to enable dynamic task scheduling of CUDA applications, which is required especially on heterogeneous GPU cluster systems. This paper also reports the timing overhead of checkpointing.
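The hooking mechanism the abstract describes can be illustrated with a minimal interposition sketch. This is not CheCUDA's actual source: it assumes LD_PRELOAD-style symbol interposition, hooks only one call (cuMemAlloc), and ignores the versioned symbol names (e.g. cuMemAlloc_v2) that the real CUDA headers map these functions to.

```c
/* Minimal sketch of driver-API hooking for checkpointing, assuming
 * LD_PRELOAD-style interposition. Not CheCUDA's actual implementation. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdlib.h>
#include <cuda.h>

typedef struct { CUdeviceptr dptr; size_t size; void *host_copy; } AllocRecord;
static AllocRecord records[1024];
static int nrecords = 0;

CUresult cuMemAlloc(CUdeviceptr *dptr, size_t bytesize)
{
    /* Forward to the real driver entry point. */
    static CUresult (*real_alloc)(CUdeviceptr *, size_t);
    if (!real_alloc)
        real_alloc = (CUresult (*)(CUdeviceptr *, size_t))
                     dlsym(RTLD_NEXT, "cuMemAlloc");
    CUresult rc = real_alloc(dptr, bytesize);

    /* Record the status change in main memory. */
    if (rc == CUDA_SUCCESS && nrecords < 1024) {
        records[nrecords].dptr = *dptr;
        records[nrecords].size = bytesize;
        nrecords++;
    }
    return rc;
}

/* At checkpoint time, copy every recorded device buffer to the host so the
 * process image no longer depends on GPU state. */
void checkpoint_copy_out(void)
{
    for (int i = 0; i < nrecords; i++) {
        records[i].host_copy = malloc(records[i].size);
        cuMemcpyDtoH(records[i].host_copy, records[i].dptr, records[i].size);
    }
}
```

At restart, the recorded allocations would be replayed against a re-initialized CUDA runtime and the saved host copies written back with cuMemcpyHtoD.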
Citations
Proceedings ArticleDOI
16 May 2011
TL;DR: On average, 16-33% of injected faults cause silent data corruption (SDC) errors in the HPC programs executing on the GPU, which is significantly higher than that measured in CPU programs.
Abstract: High performance and relatively low cost of GPU-based platforms provide an attractive alternative for general-purpose high performance computing (HPC). However, the emerging HPC applications usually have stricter output correctness requirements than typical GPU applications (i.e., 3D graphics). This paper first analyzes the error resiliency of GPGPU platforms using a fault injection tool we have developed for commodity GPU devices. On average, 16-33% of injected faults cause silent data corruption (SDC) errors in the HPC programs executing on the GPU. This SDC ratio is significantly higher than that measured in CPU programs.
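To make the fault-injection methodology concrete, here is a hypothetical sketch of injecting a single bit flip into one thread's input value in a CUDA kernel. The authors' actual tool targets commodity GPU hardware state; flip_bit, inject_idx, and inject_bit below are illustrative names only.

```c
/* Hypothetical single-bit fault injection into one thread's input value. */
__device__ float flip_bit(float v, int bit)
{
    unsigned int u = __float_as_uint(v); // reinterpret the float's bits
    u ^= (1u << bit);                    // flip the chosen bit
    return __uint_as_float(u);
}

__global__ void saxpy_with_fault(int n, float a, const float *x, float *y,
                                 int inject_idx, int inject_bit)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float xi = x[i];
    if (i == inject_idx)                 // corrupt exactly one thread's input
        xi = flip_bit(xi, inject_bit);
    y[i] = a * xi + y[i];                // if the run still "succeeds", the
                                         // wrong result is a silent corruption
}
```

Comparing the output against a fault-free run classifies each injection as masked, an SDC, or a crash.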

96 citations

Proceedings ArticleDOI
16 May 2011
TL;DR: A new transparent checkpoint/restart (CPR) tool, named CheCL, for high-performance and dependable GPU computing, that can perform CPR on an OpenCL application program without any modification or recompilation of its code.
Abstract: In this paper, we propose a new transparent checkpoint/restart (CPR) tool, named CheCL, for high-performance and dependable GPU computing. CheCL can perform CPR on an OpenCL application program without any modification or recompilation of its code. A conventional checkpointing system fails to checkpoint a process if the process uses OpenCL. Therefore, in CheCL, every API call is forwarded to another process, called an API proxy, which invokes the actual API function; that is, two processes, an application process and an API proxy, are launched for an OpenCL application. In this case, as the application process is not an OpenCL process but a standard process, it can be safely checkpointed. While CheCL intercepts all API calls, it records the information necessary for restoring OpenCL objects. The application process does not hold any OpenCL handles, but CheCL handles that keep such information; those handles are automatically converted to OpenCL handles and then passed to API functions. Upon restart, OpenCL objects are automatically restored based on the recorded information. This paper demonstrates the feasibility of transparent checkpointing of OpenCL programs, including MPI applications, and quantitatively evaluates the runtime overheads. It also discusses how CheCL can enable process migration of OpenCL applications among distinct nodes, and among different kinds of compute devices such as a CPU and a GPU.
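The handle indirection described above can be sketched as a translation table: the application holds opaque indices, wrappers record the creation arguments, and the table is replayed at restart. The names below (checl_create_buffer, checl_resolve) are hypothetical, not CheCL's API, and restoration of buffer contents is omitted.

```c
/* Sketch of handle indirection for checkpointable OpenCL. Hypothetical names. */
#include <CL/cl.h>

typedef struct {
    cl_mem       real;   /* live OpenCL handle (invalid after restart)    */
    cl_mem_flags flags;  /* creation arguments recorded for restoration   */
    size_t       size;
} BufferRecord;

static BufferRecord table[256];
static int ntable = 0;

/* The application receives the table index, never the raw cl_mem. */
int checl_create_buffer(cl_context ctx, cl_mem_flags flags, size_t size)
{
    cl_int err;
    table[ntable].real  = clCreateBuffer(ctx, flags, size, NULL, &err);
    table[ntable].flags = flags;
    table[ntable].size  = size;
    return (err == CL_SUCCESS) ? ntable++ : -1;
}

/* Conversion applied whenever an opaque handle is passed to an API call. */
cl_mem checl_resolve(int handle) { return table[handle].real; }

/* Upon restart, replay the recorded creation calls against a new context so
 * that every opaque handle becomes valid again (contents restored separately). */
void checl_restore_buffers(cl_context new_ctx)
{
    cl_int err;
    for (int i = 0; i < ntable; i++)
        table[i].real = clCreateBuffer(new_ctx, table[i].flags,
                                       table[i].size, NULL, &err);
}
```

Because the application process never touches a raw OpenCL handle, a conventional checkpointer can save it like any standard process.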

61 citations


Cites background or methods from "CheCUDA: A Checkpoint/Restart Tool ..."

  • ...In our previous work [6], CheCUDA has been developed as a CPR tool for CUDA, which is the de facto standard programming framework for current GPU computing [7]....

  • ...Thus, CheCL remembers all the OpenCL objects that existed before checkpointing, and restores them after restarting....

  • ...Finally, Section VI gives concluding remarks and our future work....

Journal ArticleDOI
TL;DR: It is concluded that OpenCL is a viable platform for developing portable GPU applications but that the more mature CUDA tools continue to provide the best performance.

58 citations

Proceedings ArticleDOI
16 May 2011
TL;DR: This paper presents a checkpoint-restart library for CUDA that first deletes all CUDA resources before checkpointing and then restores them right after checkpointing, and proposes a novel technique that replays memory-related API calls.
Abstract: Today, CUDA is the de facto standard programming framework to exploit the computational power of graphics processing units (GPUs) to accelerate various kinds of applications. For efficient use of a large GPU-accelerated system, one important mechanism is checkpoint-restart, which can be used not only to improve fault tolerance but also to optimize node/slot allocation by suspending a job on one node and migrating the job to another node. Although several checkpoint-restart implementations have been developed so far, they do not support CUDA applications or have some severe limitations for CUDA support. Hence, we present a checkpoint-restart library for CUDA that first deletes all CUDA resources before checkpointing and then restores them right after checkpointing. It is necessary to restore each memory chunk at the same memory address. To this end, we propose a novel technique that replays memory-related API calls. The library supports both the CUDA runtime API and the CUDA driver API. Moreover, the library is transparent to applications; it is not necessary to recompile the applications for checkpointing. This paper demonstrates that the proposed library can achieve checkpoint-restart of various applications at acceptable overheads, and the library also works for MPI applications such as HPL.
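A minimal sketch of the replay idea follows, under the assumption the technique relies on: the CUDA allocator returns the same addresses when the same sequence of memory-related calls is replayed on the same device. This is illustrative, not the library's code.

```c
/* Sketch of replaying memory-related API calls so that each chunk is
 * restored at its original device address. */
#include <stdio.h>
#include <cuda_runtime.h>

typedef struct { void *addr; size_t size; } MallocLog;
static MallocLog log_[1024];
static int nlog = 0;

/* Wrapper used during normal execution: allocate and log. */
cudaError_t logged_malloc(void **p, size_t size)
{
    cudaError_t rc = cudaMalloc(p, size);
    if (rc == cudaSuccess) {
        log_[nlog].addr = *p;
        log_[nlog].size = size;
        nlog++;
    }
    return rc;
}

/* At restart: replay the log in order and verify the addresses match. */
int replay_mallocs(void)
{
    for (int i = 0; i < nlog; i++) {
        void *p;
        if (cudaMalloc(&p, log_[i].size) != cudaSuccess)
            return -1;
        if (p != log_[i].addr) {   /* address mismatch: replay failed */
            fprintf(stderr, "chunk %d moved: %p != %p\n", i, p, log_[i].addr);
            return -1;
        }
        /* Device contents would now be copied back from the checkpoint,
         * and pointers held by the application remain valid unchanged. */
    }
    return 0;
}
```

Restoring chunks at identical addresses is what lets device pointers stored inside application data structures survive the checkpoint without translation.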

51 citations


Cites methods from "CheCUDA: A Checkpoint/Restart Tool ..."

  • ...proposed CheCUDA [7], which uses BLCR to enable CPR for CUDA applications....

Proceedings ArticleDOI
11 Nov 2018
TL;DR: This paper describes a practical methodology to employ instruction duplication for GPUs and identifies implementation challenges that can incur high overheads (69% on average), and explores GPU-specific software optimizations that trade fine-grained recoverability for performance.
Abstract: Application execution on safety-critical and high-performance computer systems must be resilient to transient errors. As GPUs become more pervasive in such systems, they must supplement ECC/parity for major storage structures with reliability techniques that cover more of the GPU hardware logic. Instruction duplication has been explored for CPU resilience; however, it has never been studied in the context of GPUs, and it is unclear whether the performance and design choices it presents make it a feasible GPU solution. This paper describes a practical methodology to employ instruction duplication for GPUs and identifies implementation challenges that can incur high overheads (69% on average). It explores GPU-specific software optimizations that trade fine-grained recoverability for performance. It also proposes simple ISA extensions with limited hardware changes and area costs to further improve performance, cutting the runtime overheads by more than half to an average of 30%.
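A source-level sketch of the duplication idea: compute each result twice in independent registers and compare before committing the store. The paper's methodology duplicates at the instruction/ISA level inside the compiler, so the kernel below only illustrates the detection principle.

```c
/* Sketch of software instruction duplication in a CUDA kernel: each result
 * is computed twice and compared before it is committed. */
__global__ void saxpy_duplicated(int n, float a, const float *x, float *y,
                                 int *error_flag)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float r1 = a * x[i] + y[i];    // original computation
    float r2 = a * x[i] + y[i];    // shadow copy (a compiler would merge
                                   // these; a real scheme forces two issues)
    if (r1 != r2) {
        atomicExch(error_flag, 1); // transient fault detected: flag it
        return;                    // instead of silently storing bad data
    }
    y[i] = r1;                     // commit only after the check passes
}
```

Deferring the comparison (e.g. to the end of the kernel) trades detection latency for performance, which is acceptable under the coarse-grained checkpointing schemes cited in the excerpts below.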

50 citations


Cites background from "CheCUDA: A Checkpoint/Restart Tool ..."

  • ...The increase in detection latency is however not a concern for coarse grain coordinated checkpointing solutions [15], [16], [17], [18]....

  • ...This approach trades off the ability to detect the error until the end of the kernel and diagnose which thread is corrupted, which is not a concern for existing coarse grained checkpointing solutions [15], [16], [17], [18]....

  • ...While this optimization may violate the error containment assumptions of some recovery schemes, it works fine for coarse-grain coordinated checkpoint systems that discard memory values in the event of a detected error to roll back to a previous checkpoint [15], [16], [17], [18]....

References
Journal ArticleDOI
19 Oct 2003
TL;DR: Xen is an x86 virtual machine monitor that allows multiple commodity operating systems to share conventional hardware in a safe and resource-managed fashion without sacrificing either performance or functionality, and it considerably outperforms competing commercial and freely available solutions.
Abstract: Numerous systems have been designed which use virtualization to subdivide the ample resources of a modern computer. Some require specialized hardware, or cannot support commodity operating systems. Some target 100% binary compatibility at the expense of performance. Others sacrifice security or functionality for speed. Few offer resource isolation or performance guarantees; most provide only best-effort provisioning, risking denial of service. This paper presents Xen, an x86 virtual machine monitor which allows multiple commodity operating systems to share conventional hardware in a safe and resource managed fashion, but without sacrificing either performance or functionality. This is achieved by providing an idealized virtual machine abstraction to which operating systems such as Linux, BSD and Windows XP can be ported with minimal effort. Our design is targeted at hosting up to 100 virtual machine instances simultaneously on a modern server. The virtualization approach taken by Xen is extremely efficient: we allow operating systems such as Linux and Windows XP to be hosted simultaneously for a negligible performance overhead, at most a few percent compared with the unvirtualized case. We considerably outperform competing commercial and freely available solutions in a range of microbenchmarks and system-wide tests.

6,326 citations

01 May 2008
TL;DR: The background, hardware, and programming model for GPU computing is described, the state of the art in tools and techniques are summarized, and four GPU computing successes in game physics and computational biophysics that deliver order-of-magnitude performance gains over optimized CPU applications are presented.
Abstract: The graphics processing unit (GPU) has become an integral part of today's mainstream computing systems. Over the past six years, there has been a marked increase in the performance and capabilities of GPUs. The modern GPU is not only a powerful graphics engine but also a highly parallel programmable processor featuring peak arithmetic and memory bandwidth that substantially outpaces its CPU counterpart. The GPU's rapid increase in both programmability and capability has spawned a research community that has successfully mapped a broad range of computationally demanding, complex problems to the GPU. This effort in general-purpose computing on the GPU, also known as GPU computing, has positioned the GPU as a compelling alternative to traditional microprocessors in high-performance computer systems of the future. We describe the background, hardware, and programming model for GPU computing, summarize the state of the art in tools and techniques, and present four GPU computing successes in game physics and computational biophysics that deliver order-of-magnitude performance gains over optimized CPU applications.

1,570 citations


"CheCUDA: A Checkpoint/Restart Tool ..." refers background in this paper

  • ...So far, many researchers have reported that various scientific and engineering applications can significantly be accelerated using GPUs [1]....

Proceedings Article
16 Jan 1995
TL;DR: In this paper, the authors describe a portable checkpointing tool for Unix that implements all applicable performance optimizations which are reported in the literature and also supports the incorporation of user directives into the creation of checkpoints.
Abstract: Checkpointing is a simple technique for rollback recovery: the state of an executing program is periodically saved to a disk file from which it can be recovered after a failure. While recent research has developed a collection of powerful techniques for minimizing the overhead of writing checkpoint files, checkpointing remains unavailable to most application developers. In this paper we describe libckpt, a portable checkpointing tool for Unix that implements all applicable performance optimizations which are reported in the literature. While libckpt can be used in a mode which is almost totally transparent to the programmer, it also supports the incorporation of user directives into the creation of checkpoints. This user-directed checkpointing is an innovation which is unique to our work.
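The user-directed mode can be sketched as follows; the directive names (checkpoint_here, exclude_bytes) follow the paper's description of synchronous checkpointing and memory exclusion, but the exact signatures here are assumptions, not libckpt's documented API.

```c
/* Sketch of user-directed checkpointing in the style of libckpt: the
 * programmer marks dead memory and places synchronous checkpoints.
 * Directive signatures below are assumptions. */
#include <stdlib.h>

void exclude_bytes(void *addr, size_t n);  /* "this region need not be saved" */
void checkpoint_here(void);                /* take a checkpoint right now     */

int main(void)
{
    size_t n = 1 << 20;
    double *scratch = malloc(n * sizeof *scratch); /* recomputable scratch  */
    double *state   = malloc(n * sizeof *state);   /* irreplaceable state   */

    /* The scratch buffer is rebuilt every iteration, so omitting it from
     * the checkpoint shrinks the file without affecting recoverability. */
    exclude_bytes(scratch, n * sizeof *scratch);

    for (int step = 0; step < 100; step++) {
        /* ... compute: fills scratch, updates state ... */
        if (step % 10 == 0)
            checkpoint_here();  /* checkpoint at a known-consistent point */
    }
    free(scratch);
    free(state);
    return 0;
}
```

Placing checkpoints at known-consistent points and excluding dead memory are the two user directives the paper credits with reducing checkpoint size and overhead beyond what transparent optimizations achieve.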

670 citations


Additional excerpts

  • ...One of such unsupported functions is checkpoint/restart (CPR)....

Journal ArticleDOI
01 Sep 2006
TL;DR: The motivation, design and implementation of Berkeley Lab Checkpoint/Restart (BLCR), a system-level checkpoint/restart implementation for Linux clusters that targets the space of typical High Performance Computing applications, including MPI, are described.
Abstract: This article describes the motivation, design and implementation of Berkeley Lab Checkpoint/Restart (BLCR), a system-level checkpoint/restart implementation for Linux clusters that targets the space of typical High Performance Computing applications, including MPI. Application-level solutions, including both checkpointing and fault-tolerant algorithms, are recognized as more time and space efficient than system-level checkpoints, which cannot make use of any application-specific knowledge. However, system-level checkpointing allows for preemption, making it suitable for responding to "fault precursors" (for instance, elevated error rates from ECC memory or network CRCs, or elevated temperature from sensors). Preemption can also increase the efficiency of batch scheduling; for instance, reducing idle cycles (by allowing for shutdown without any queue draining period or reallocation of resources to eliminate idle nodes when better fitting jobs are queued), and reducing the average queued time (by limiting large jobs to running during off-peak hours, without the need to limit the length of such jobs). Each of these potential uses makes BLCR a valuable tool for efficient resource management in Linux clusters.

439 citations


Additional excerpts

  • ...One of such unsupported functions is checkpoint/restart (CPR)....

Proceedings ArticleDOI
23 May 2009
TL;DR: DMTCP (Distributed MultiThreaded CheckPointing) is a transparent user-level checkpointing package for distributed applications; it is used for the CMS experiment of the Large Hadron Collider at CERN and can be incorporated and distributed as a checkpoint-restart module within some larger package.
Abstract: DMTCP (Distributed MultiThreaded CheckPointing) is a transparent user-level checkpointing package for distributed applications. Checkpointing and restart is demonstrated for a wide range of over 20 well known applications, including MATLAB, Python, TightVNC, MPICH2, OpenMPI, and runCMS. RunCMS runs as a 680 MB image in memory that includes 540 dynamic libraries, and is used for the CMS experiment of the Large Hadron Collider at CERN. DMTCP transparently checkpoints general cluster computations consisting of many nodes, processes, and threads, as well as typical desktop applications. On 128 distributed cores (32 nodes), checkpoint and restart times are typically 2 seconds, with negligible run-time overhead. Typical checkpoint times are reduced to 0.2 seconds when using forked checkpointing. Experimental results show that checkpoint time remains nearly constant as the number of nodes increases on a medium-size cluster. DMTCP automatically accounts for fork, exec, ssh, mutexes/semaphores, TCP/IP sockets, UNIX domain sockets, pipes, ptys (pseudo-terminals), terminal modes, ownership of controlling terminals, signal handlers, open file descriptors, shared open file descriptors, I/O (including the readline library), shared memory (via mmap), parent-child process relationships, pid virtualization, and other operating system artifacts. By emphasizing an unprivileged, user-space approach, compatibility is maintained across Linux kernels from 2.6.9 through the current 2.6.28. Since DMTCP is unprivileged and does not require special kernel modules or kernel patches, DMTCP can be incorporated and distributed as a checkpoint-restart module within some larger package.

282 citations