A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems
Citations
103 citations
Cites background from "A survey of fault tolerance mechani..."
...Such concepts are elaborated in Elnozahy et al. (2002), Sancho et al. (2005), Chen et al. (2015), and Egwutuoha et al. (2013)....
59 citations
Cites background from "A survey of fault tolerance mechani..."
...Checkpointing a process involves halting its execution, allowing it to be restarted at a later stage, and enabling migration, see [32], [33]....
47 citations
Cites methods from "A survey of fault tolerance mechani..."
...Most graph processing systems use checkpointing and rollback mechanisms (Egwutuoha et al. 2013) for failure recovery, such as Pregel and Pregel-like systems like Giraph....
References
8,381 citations
6,804 citations
3,186 citations
"A survey of fault tolerance mechani..." refers to methods in this paper
...Stop-and-copy and live migration of VMs are the commonly used techniques [16]....
3,181 citations
2,738 citations
"A survey of fault tolerance mechani..." refers to background in this paper
...A number of checkpoint protocols have been proposed to ensure global coordination: a nonblocking checkpointing coordination protocol was proposed [11] to ensure that applications that would make coordinated checkpointing inconsistent are prevented from running....