[Book Review] "The Unified Modeling Language User Guide"

Introduction to Mathematical Statistics. By R.V. Hogg and A. T. Craig. Pp. ix, 245. 47s. 1959. (The Macmillan Company, New York)

High-performance computing (HPC) systems are growing more powerful by utilizing more hardware components. As the system mean-time-before-failure correspondingly drops, applications must checkpoint more frequently to make progress. However, as the system memory sizes grow faster than the bandwidth to the parallel file system, the cost of checkpointing begins to dominate application run times. Multi-level checkpointing potentially solves this problem through multiple types of checkpoints with different costs and different levels of resiliency in a single run. This solution employs lightweight checkpoints to handle the most common failure modes and relies on more expensive checkpoints for less common, but more severe failures. This theoretically promising approach has not been fully evaluated in a large-scale, production system context. We have designed the Scalable Checkpoint/Restart (SCR) library, a multi-level checkpoint system that writes checkpoints to RAM, Flash, or disk on the compute nodes in addition to the parallel file system. We present the performance and reliability properties of SCR as well as a probabilistic Markov model that predicts its performance on current and future systems. We show that multi-level checkpointing improves efficiency on existing large-scale systems and that this benefit increases as the system size grows. In particular, we developed low-cost checkpoint schemes that are 100x-1000x faster than the parallel file system and effective against 85% of our system failures. This leads to a gain in machine efficiency of up to 35%, and it reduces the load on the parallel file system by a factor of two on current and future systems.
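To make the cost argument in the abstract above concrete, here is a minimal sketch that is not SCR's Markov model. It swaps in Young's classic first-order approximation for a single checkpoint level, under which the best interval is sqrt(2 * C * MTBF). The function names and every number below (checkpoint costs, MTBF) are hypothetical and chosen only to mirror the "100x faster" figure; the sketch also ignores that node-local checkpoints cover only a subset of failures.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order estimate of the best checkpoint interval (seconds)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def overhead_fraction(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Rough fraction of run time lost to checkpoint writes plus rework.

    Each interval pays one checkpoint write; each failure loses about half
    an interval of work on average. Valid only when the cost is far below
    the MTBF.
    """
    tau = young_interval(checkpoint_cost_s, mtbf_s)
    return checkpoint_cost_s / tau + tau / (2.0 * mtbf_s)


# Hypothetical inputs, not measurements from the paper.
MTBF_S = 86_400.0        # assumed one failure per day
PFS_COST_S = 1_000.0     # assumed parallel file system checkpoint cost
LOCAL_COST_S = 10.0      # assumed node-local checkpoint cost (100x cheaper)

pfs_overhead = overhead_fraction(PFS_COST_S, MTBF_S)
local_overhead = overhead_fraction(LOCAL_COST_S, MTBF_S)
print(f"PFS:   interval {young_interval(PFS_COST_S, MTBF_S):.0f} s, overhead {pfs_overhead:.3f}")
print(f"local: interval {young_interval(LOCAL_COST_S, MTBF_S):.0f} s, overhead {local_overhead:.3f}")
```

Under this approximation the overhead scales with the square root of the checkpoint cost, so a 100x cheaper checkpoint cuts the estimated overhead by about 10x while allowing roughly 10x more frequent checkpoints.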

Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System

We present here a report produced by a workshop on 'Addressing failures in exascale computing' held in Park City, Utah, 4-11 August 2012. The charter of this workshop was to establish a common taxonomy about resilience across all the levels in a computing system, discuss existing knowledge on resilience across the various hardware and software layers of an exascale system, and build on those results, examining potential solutions from both a hardware and software perspective and focusing on a combined approach.

The workshop brought together participants with expertise in applications, system software, and hardware; they came from industry, government, and academia, and their interests ranged from theory to implementation. The combination allowed broad and comprehensive discussions and led to this document, which summarizes and builds on those discussions.


Addressing failures in exascale computing

Large-scale parallel computing is relying increasingly on clusters with thousands of processors. At such large counts of compute nodes, faults are becoming commonplace. Current techniques to tolerate faults focus on reactive schemes to recover from faults and generally rely on a checkpoint/restart mechanism. Yet, in today's systems, node failures can often be anticipated by detecting a deteriorating health status. Instead of a reactive scheme for fault tolerance (FT), we are promoting a proactive one where processes automatically migrate from "unhealthy" nodes to healthy ones. Our approach relies on operating system virtualization techniques exemplified by but not limited to Xen. This paper contributes an automatic and transparent mechanism for proactive FT for arbitrary MPI applications. It leverages virtualization techniques combined with health monitoring and load-based migration. We exploit Xen's live migration mechanism for a guest operating system (OS) to migrate an MPI task from a health-deteriorating node to a healthy one without stopping the MPI task during most of the migration. Our proactive FT daemon orchestrates the tasks of health monitoring, load determination and initiation of guest OS migration. Experimental results demonstrate that live migration hides migration costs and limits the overhead to only a few seconds, making it an attractive approach to realize FT in HPC systems. Overall, our enhancements make proactive FT a valuable asset for long-running MPI applications, complementary to reactive FT using full checkpoint/restart schemes, since checkpoint frequencies can be reduced as fewer unanticipated failures are encountered. In the context of OS virtualization, we believe that this is the first comprehensive study of proactive fault tolerance where live migration is actually triggered by health monitoring.
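The decision step described in the abstract above (monitor health, determine load, then start a migration) can be sketched as a pure planning function. This is an illustrative assumption, not the paper's daemon: the NodeStatus fields, the use of temperature as the health signal, and the threshold are all hypothetical, and no Xen call is made; the function only returns which guest should move where.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeStatus:
    """Hypothetical snapshot of one compute node."""
    name: str
    temperature_c: float  # assumed health indicator
    load: float           # assumed normalized load, 0.0 = idle


def plan_migrations(nodes: list[NodeStatus], temp_limit_c: float) -> list[tuple[str, str]]:
    """Pair each unhealthy node with the least-loaded healthy node.

    Returns (source, target) name pairs. If there are more unhealthy
    nodes than healthy ones, the extra unhealthy nodes stay unpaired.
    """
    unhealthy = [n for n in nodes if n.temperature_c > temp_limit_c]
    healthy = sorted(
        (n for n in nodes if n.temperature_c <= temp_limit_c),
        key=lambda n: n.load,
    )
    # zip stops at the shorter list, so each target is used at most once.
    return [(src.name, dst.name) for src, dst in zip(unhealthy, healthy)]


# Hypothetical readings, not data from the paper.
cluster = [
    NodeStatus("n1", temperature_c=85.0, load=0.9),
    NodeStatus("n2", temperature_c=50.0, load=0.7),
    NodeStatus("n3", temperature_c=45.0, load=0.2),
]
plan = plan_migrations(cluster, temp_limit_c=80.0)
print(plan)
```

Keeping the policy separate from the migration mechanism means the same plan could drive any backend that can move a guest OS between hosts.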


Proactive fault tolerance for HPC with Xen virtualization

The increase in the physical size of high performance computing (HPC) platforms makes system reliability more challenging. In order to minimize the performance loss (rollback and checkpoint overheads) due to unexpected failures or unnecessary overhead of fault-tolerant mechanisms, we present a reliability-aware method for an optimal checkpoint/restart strategy. Our scheme aims at addressing the fault tolerance challenge, especially in a large-scale HPC system, by providing optimal checkpoint placement techniques that are derived from the actual system reliability. Unlike existing checkpoint models, which can only handle Poisson failure and a constant checkpoint interval, our model can deal with a varying checkpoint interval and with different failure distributions. In addition, the approach considers optimality for both checkpoint overhead and rollback time. Our validation results suggest a significant improvement over existing techniques.

An optimal checkpoint/restart model for a large scale high performance computing system

Today's increased computing speeds allow conventional sequential machines to effectively emulate associative computing techniques. We present a parallel programming paradigm called ASC (ASsociative Computing), designed for a wide range of computing engines. Our paradigm has an efficient associative-based, dynamic memory-allocation mechanism that does not use pointers. It incorporates data parallelism at the base level, so that programmers do not have to specify low-level sequential tasks such as sorting, looping and parallelization. Our paradigm supports all of the standard data-parallel and massively parallel computing algorithms. It combines numerical computation (such as convolution, matrix multiplication, and graphics) with nonnumerical computing (such as compilation, graph algorithms, rule-based systems, and language interpreters). This article focuses on the nonnumerical aspects of ASC.


ASC: an associative-computing paradigm

For a full checkpoint on a large-scale HPC system, huge memory contexts must potentially be transferred through the network and saved in reliable storage. As such, the time taken to checkpoint becomes a critical issue that directly impacts the total execution time. Therefore, incremental checkpointing, as a less intrusive method to reduce the wasted time, has been gaining significant attention in the HPC community. In this paper, we built a model that aims to reduce full checkpoint overhead by performing a set of incremental checkpoints between two consecutive full checkpoints. Moreover, a method to find the number of those incremental checkpoints is given. Furthermore, most of the comparison results between the incremental checkpoint model and the full checkpoint model (Liu et al., 2007) on the same failure data set show that the total wasted time in the incremental checkpoint model is significantly smaller than the wasted time in the full checkpoint model.
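The trade-off behind placing incremental checkpoints between two full ones can be shown with simple accounting. The sketch below is not the paper's model or its method for choosing the number of incremental checkpoints; it only compares write and restore costs under assumed, hypothetical per-checkpoint costs, with m incremental checkpoints per cycle.

```python
def cycle_write_cost(full_cost_s: float, incr_cost_s: float, m: int) -> float:
    """Write cost of one cycle: one full checkpoint plus m incremental ones."""
    return full_cost_s + m * incr_cost_s


def full_only_write_cost(full_cost_s: float, m: int) -> float:
    """Write cost if all m + 1 checkpoints in the cycle were full checkpoints."""
    return (m + 1) * full_cost_s


def worst_case_restore_cost(full_restore_s: float, incr_restore_s: float, m: int) -> float:
    """Restore cost when the full checkpoint and all m increments must be replayed."""
    return full_restore_s + m * incr_restore_s


# Hypothetical costs in seconds, not values from the paper.
FULL_S, INCR_S, M = 100.0, 10.0, 4

print("incremental cycle write cost:", cycle_write_cost(FULL_S, INCR_S, M))
print("full-only write cost:        ", full_only_write_cost(FULL_S, M))
print("worst-case restore cost:     ", worst_case_restore_cost(FULL_S, INCR_S, M))
```

Write cost falls as m grows, but the worst-case restore cost rises with it, which is why the number of incremental checkpoints per cycle has to be chosen rather than maximized.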

Reliability-Aware Approach: An Incremental Checkpoint/Restart Model in HPC Environments

Fault tolerance is a major concern to guarantee availability of critical services as well as application execution. Traditional approaches for fault tolerance include checkpoint/restart or duplication. However, it is also possible to anticipate failures and proactively take action before failures occur in order to minimize failure impact on the system and application execution. This document presents a proactive fault tolerance framework. This framework can use different proactive fault tolerance mechanisms, i.e., migration and pause/un-pause. The framework also allows the implementation of new proactive fault tolerance policies thanks to a modular architecture. A first proactive fault tolerance policy has been implemented, and preliminary experiments based on system-level virtualization have been conducted and compared with results obtained by simulation.


A Framework for Proactive Fault Tolerance

Cluster computing has been attracting more and more attention from both industry and academia for its enormous computing power, cost effectiveness, and scalability. Availability is a key system attribute that must be considered at the system design stage and must also reflect the system's actual behavior. System monitoring and logging enable the identification of unplanned events, so that the modeled availability reflects the actual system's availability. This paper proposes a single framework that coordinates event monitoring, filtering, data analysis and dynamic availability modeling. The availability model is abstracted and categorized based on functionality. We describe the proposed architecture and a sample analysis of real-time event logs from a 512-node cluster from Lawrence Livermore National Laboratory.
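As a rough illustration of how logged events can feed an availability figure, the sketch below computes observed availability over a time window from a list of outage intervals, alongside the standard steady-state formula MTTF / (MTTF + MTTR). It is not the paper's framework: the Outage type, the function names, and the sample intervals are hypothetical, and outages are assumed not to overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outage:
    """Hypothetical filtered log event: one unplanned downtime interval."""
    start_s: float
    end_s: float


def observed_availability(outages: list[Outage], window_start_s: float, window_end_s: float) -> float:
    """Fraction of the window during which the system was up.

    Assumes the outages do not overlap each other; each outage is
    clipped to the observation window.
    """
    window = window_end_s - window_start_s
    if window <= 0:
        raise ValueError("window must have positive length")
    down = 0.0
    for outage in outages:
        start = max(outage.start_s, window_start_s)
        end = min(outage.end_s, window_end_s)
        if end > start:
            down += end - start
    return 1.0 - down / window


def steady_state_availability(mttf_s: float, mttr_s: float) -> float:
    """Classic steady-state availability: MTTF / (MTTF + MTTR)."""
    return mttf_s / (mttf_s + mttr_s)


# Hypothetical events, not data from the Lawrence Livermore logs.
events = [Outage(100.0, 200.0), Outage(900.0, 1100.0)]
print(observed_availability(events, 0.0, 1000.0))
print(steady_state_availability(99.0, 1.0))
```

The second outage extends past the window, so only the part inside the window counts toward downtime.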

Chokchai Leangsuksun

Papers

An optimal checkpoint/restart model for a large scale high performance computing system

ASC: an associative-computing paradigm

Reliability-Aware Approach: An Incremental Checkpoint/Restart Model in HPC Environments

A Framework for Proactive Fault Tolerance

Availability modeling and analysis on high performance cluster computing systems