Checkpoint-restart for a network of virtual machines
Citations
Transparent checkpoint-restart over infiniband
Design and Implementation for Checkpointing of Distributed Resources Using Process-Level Virtualization
Checkpointing as a service in heterogeneous cloud environments
HotRestore: a fast restore system for virtual machine cluster
References
Xen and the art of virtualization
IPython: A System for Interactive Scientific Computing
Remus: high availability via asynchronous virtual machine replication
Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters
BTRFS: The Linux B-Tree Filesystem
Frequently Asked Questions (15)
Q2. What is the advantage of using the Btrfs filesystem?
The integration of the Btrfs copy-on-write filesystem with nested copies of KVM/QEMU was used for fast, incremental snapshots of a network of virtual machines.
Q3. How long does it take to restart a virtual machine?
Note that on restart from a checkpoint image, the shadow page tables inside the kernel must be recreated, after which the pages will be faulted back into RAM.
Q4. What is the primary mechanism to extend checkpoint-restart?
DMTCP plugins offer two primary mechanisms to extend checkpoint-restart: a run-time mechanism (wrapper functions around library calls made by the application); and customization of checkpoint/restart to save and restore the state of external objects.
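The two mechanisms can be illustrated with a minimal Python sketch. This is an analogy only, not DMTCP's actual plugin API: the class `ResourceTable`, its methods, and the `real_open` callable are hypothetical names chosen for illustration. The wrapper intercepts a call, records the arguments needed to re-create the external object, and returns a stable virtual id; the restart hook then rebuilds each object behind the same id.

```python
import itertools


class ResourceTable:
    """Hypothetical table of external objects tracked by wrapper calls."""

    def __init__(self, real_open):
        self._real_open = real_open          # the underlying call being wrapped
        self._next_id = itertools.count(1)   # source of stable virtual ids
        self._args = {}                      # virtual id -> args to re-create
        self._real = {}                      # virtual id -> current real handle

    def open(self, *args):
        """Wrapper: call the real function, hand back a virtual id."""
        vid = next(self._next_id)
        self._args[vid] = args
        self._real[vid] = self._real_open(*args)
        return vid

    def real(self, vid):
        """Translate a virtual id to the current real handle."""
        return self._real[vid]

    def restart(self):
        """Restart hook: re-create every external object under its old id."""
        for vid, args in self._args.items():
            self._real[vid] = self._real_open(*args)
```

The application only ever holds the virtual id, so the real handle can change across a restart without the application noticing.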
Q5. What is the common method of checkpointing distributed computations?
Checkpointing of distributed computations is primarily handled by one of two mechanisms today: checkpoint-restart services for MPI; and transparent checkpoint of arbitrary distributed computations.
Q6. How does the DMTCP plugin perform checkpoints?
At the time of checkpoint, DMTCP “drains the network”: (a) by stopping the user threads of all processes in the computation; (b) by receiving from each socket until all network data “in flight” has been collected; and (c) by then writing a checkpoint image.
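The drain step can be sketched in Python on a local socket pair. This is a simplified illustration under stated assumptions, not DMTCP's implementation: the marker `COOKIE` and the helpers `drain` and `demo` are hypothetical, and the sketch assumes the sender's user threads are already stopped, so nothing follows the marker on the wire.

```python
import socket

# Hypothetical end-of-data marker; not DMTCP's real wire format.
COOKIE = b"<<DRAIN-COOKIE>>"


def drain(sock, cookie=COOKIE):
    """Receive from sock until the cookie arrives; return the data before it."""
    buf = bytearray()
    while not buf.endswith(cookie):
        chunk = sock.recv(4096)
        if not chunk:  # peer closed before the cookie arrived
            raise ConnectionError("socket closed before cookie was received")
        buf += chunk
    return bytes(buf[:-len(cookie)])


def demo():
    sender, receiver = socket.socketpair()
    try:
        # Application data that is still "in flight" at checkpoint time.
        sender.sendall(b"in-flight application data")
        # (a) user threads are assumed stopped, so no new data is sent;
        # the sender marks the end of its in-flight data with the cookie.
        sender.sendall(COOKIE)
        # (b) collect everything that arrived before the cookie.
        saved = drain(receiver)
        # (c) a real checkpoint would now write `saved` into the image.
        return saved
    finally:
        sender.close()
        receiver.close()
```

On restart, the saved bytes would have to be handed back to the receiving side before the application resumes, so that no in-flight data is lost.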
Q7. What is the common use of BLCR?
In addition to BLCR, two other commonly used packages for single-host checkpointing are CryoPid2 [23] and OpenVZ [24] (based on CRIU [25]).
Q8. What is the effect of the VM size on checkpoint-restart?
For larger guest VMs (512 MB to 1024 MB), checkpoint time grows in proportion to the size of the allocated memory.
Q9. What is the advantage of using Btrfs?
Like BlobSeer, Btrfs exposes the raw checkpoint image to the host, making it compatible with the use of DMTCP from outside both the VM and the VM kernel driver.
Q10. What is the widely used example of a transparent user-space checkpoint-restart?
DMTCP [6] was the first transparent user-space checkpoint-restart for distributed computations, and remains the most widely used example of this.
Q11. What is the corresponding parameter of the pre-checkpoint QEMU virtual machine?
The DMTCP plugin makes calls to the KVM kernel module to reset the KVM parameters so that they correspond to those of the pre-checkpoint QEMU virtual machine.
Q12. What is the alternative to draining the network?
Two alternative approaches to draining the network are: (a) to send a broadcast packet that plays the role of the DMTCP cookie; and (b) to wait for a specified time sufficient for all network packets to arrive.
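Alternative (b) can be sketched as a timeout-based drain. This is an illustrative assumption about how such a wait might look, not code from the paper: the function `drain_with_timeout` and its `quiet_seconds` parameter are hypothetical.

```python
import socket


def drain_with_timeout(sock, quiet_seconds=0.2):
    """Read until no data has arrived for quiet_seconds; return what was read."""
    sock.settimeout(quiet_seconds)
    buf = bytearray()
    while True:
        try:
            chunk = sock.recv(4096)
        except socket.timeout:
            break  # the line has been quiet long enough; assume it is drained
        if not chunk:
            break  # peer closed the connection
        buf += chunk
    return bytes(buf)
```

The trade-off is visible in the single parameter: a timeout that is too short may miss packets that are still in transit, while a long one adds directly to checkpoint time.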
Q13. What is the time between a checkpoint and a restart?
Tables I, II, and III show that restart time increases slowly with the number of VMs, while checkpoint time is close to constant.
Q14. What is the architecture of a guest virtual machine?
DMTCP then writes to a checkpoint image the memory of the QEMU virtual machine, which consists of the user-space memory of the process of the host operating system that is running QEMU.
Q15. What is the purpose of the experiment?
The experimental results are split into four subsections concerning: a network of virtual machines; the use of Btrfs for filesystem snapshots; DMTCP optimizations; and performance on a commodity computer.