Proceedings Article

Adaptive checkpointing for fault tolerance in an autonomous mobile computing grid

TL;DR: The results of simulation for the presented scheme verify that the introduction of redundancy with the checkpointing procedure vastly increases the likelihood of successful recovery of a failed node in a MoG.
Abstract: The widespread availability and increasing processing power of mobile devices have led to a focus on the development of autonomous mobile computing grids (MoGs). Such mobile grids allow the successful execution of distributed applications without access to any static nodes or wired networks. However, the implementation of a fault tolerance technique is essential to fully utilize the mobile devices as viable computing resources. Checkpointing is a well-explored fault tolerance technique for mobile computing systems. The paper presents an adaptive checkpointing technique for failure recovery of mobile nodes in a MoG. The presented protocol relies on cooperative checkpointing by the constituent nodes in the system. A node uses the stable storage of other nodes in the system to save its checkpoint data when the requisite stable storage is not available at the node itself. Further, depending on the availability of resources in the MoG, the scheme replicates a node's checkpoint data at multiple nodes. The simulation results for the presented scheme verify that introducing redundancy into the checkpointing procedure vastly increases the likelihood of successful recovery of a failed node in a MoG.
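The cooperative, replicated checkpointing idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's protocol: the `Node` class, the `choose_checkpoint_holders` and `can_recover` functions, the "most free storage first" selection rule, and the default replica bounds are all assumptions made for illustration, and a failed node's local copy is pessimistically treated as lost.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical view of a MoG node; the fields are illustrative assumptions."""
    name: str
    free_storage: int  # units of stable storage currently available
    alive: bool = True


def choose_checkpoint_holders(owner, peers, size, min_replicas=1, max_replicas=3):
    """Pick the nodes that will store copies of the owner's checkpoint.

    The owner keeps a local copy when it has enough stable storage. Remaining
    copies go to live peers with the most free storage first (an assumed
    heuristic), up to max_replicas copies in total.
    """
    holders = []
    if owner.free_storage >= size:
        holders.append(owner)
    candidates = sorted(
        (p for p in peers if p.alive and p.free_storage >= size),
        key=lambda p: p.free_storage,
        reverse=True,
    )
    for peer in candidates:
        if len(holders) >= max_replicas:
            break
        holders.append(peer)
    if len(holders) < min_replicas:
        raise RuntimeError("not enough stable storage in the grid for a checkpoint")
    return holders


def can_recover(holders):
    """Recovery is possible while at least one checkpoint holder is still alive."""
    return any(h.alive for h in holders)
```

With more than one holder, the checkpoint stays recoverable after some holders fail, which is the effect the simulation results in the abstract refer to.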
Citations
Journal Article
TL;DR: A comprehensive overview of fault tolerance-related issues in cloud computing is presented, emphasizing upon the significant concepts, architectural details, and the state-of-art techniques and methods.

84 citations


Cites background from "Adaptive checkpointing for fault to..."

  • ...An effective execution of the distributed applications in mobile grid computing (MoG) is desirable if the faults/failure of the mobile devices are handled properly (Jaggi and Singh, 2014)....

    [...]

Journal Article
TL;DR: This work proposes a quantum-inspired Newtonian approach of attraction based on gravitational search algorithm for scheduling the jobs on mobile computational grid to harness the true potential of the grid.
Abstract: Owing to advancements in low-power processors and high-capacity storage in small batteries, the cost of handheld mobile devices, e.g., mobiles, tabs, or personal digital assistants, has fallen to a great extent. This has enabled people in general to own at least one smartphone, with this number increasing exponentially. However, the increasing use of these mobile devices brings an equal increase in their underused processing capacity. This encourages research that aims to use this processing power by forming a mobile computational grid. Because of the inherent limitations of bandwidth, battery, and computational power, job scheduling on these devices demands an efficient scheduling approach to harness the true potential of the grid. The problem becomes even more challenging considering the dynamic nature of these mobile devices. Because job scheduling is nondeterministic polynomial time-complete (NP-complete), it lends itself to evolutionary approaches that explore and exploit the search space efficiently. The exploration is boosted further by the use of quantum-computing concepts. This work proposes a quantum-inspired Newtonian approach of attraction based on the gravitational search algorithm for scheduling jobs on a mobile computational grid. A simulation study has been performed to evaluate the performance of the model over various dimensions. A comparative study has been performed with a quantum-genetic algorithm. The simulation results establish the effectiveness of the model under various test conditions.

8 citations

Journal Article
TL;DR: A proxy-based coordinated checkpointing scheme for the mobile-to-Grid middleware, Mobile Access to Grid Infrastructure (MAGi), which makes it efficient to roll back to the latest consistent global snapshot without direct involvement of the mobile hosts, resulting in less processing and storage overhead on the mobile device than existing schemes.
Abstract: Mobile Grid is an emerging and prospering field of distributed computing where mobile devices are enjoying the benefits of Grid. Challenges faced by mobile Grid are unpredictable network quality, lower trust, limited resources (battery power, network bandwidth, storage, processing power, etc.) and extended periods of disconnection, which may result in loss of the work done by these devices. We, therefore, need a proper fault tolerance scheme for these mobile hosts. A major issue is the appropriate handling of failures with minimal processing and storage overhead on mobile hosts. To meet these goals, we propose a proxy-based coordinated checkpointing scheme for our mobile-to-Grid middleware, Mobile Access to Grid Infrastructure (MAGi). In this scheme, mobile hosts seamlessly store checkpoints on their respective proxies running on the middleware. Together with the central coordinator component, these proxies act as a centralized checkpointing store. This approach makes it efficient to roll back to the latest consistent global snapshot without direct involvement of the mobile hosts, which results in less processing and storage overhead on the mobile device than existing schemes.

5 citations

Proceedings Article
04 Nov 2022
TL;DR: In this paper, a communication-induced adaptive checkpointing fault-tolerant mechanism (CIAC-FTM) is proposed to address software faults at application level in IoT. This mechanism places checkpoints at required nodes depending on the type of fault detected, in turn reducing checkpoints and storage overheads.
Abstract: High-performance computations in IoT systems process huge volumes of data and suffer from various types of faults due to hardware or software faults, malicious attacks, network congestion, missed deadlines, and server overloads. Fault tolerance in such systems is mandatory to maintain the performance of long-running applications. The Communication Induced Adaptive Checkpointing Fault Tolerance Mechanism (CIAC-FTM) is used to address such faults. The research focuses on software faults, specifically transient faults at the application level. CIAC-FTM places checkpoints at the required nodes depending on the type of fault detected, which in turn reduces checkpoint and storage overheads. The collaboration of sensors, huge data, Machine to Machine (M2M) communication, and IoT gives us a promising formula for a near-perfect system. Current technologies are able to work hand in hand to create a bigger system. IoT ensures a more connected world. The data procured by the sensor nodes are pushed into a network and processing units, and the processed data are used to decide how to act on actuators as designed. Fault-free data at every transaction is of the utmost priority, so Communication Induced Checkpoints (CIC) are introduced to ensure that the fault-free task restarts, and hence CIAC-FTM ensures that transient faults are rectified.
References
Journal Article
TL;DR: This survey covers rollback-recovery techniques that do not require special language constructs and classifies them into checkpoint-based protocols, which rely solely on checkpointing for system state restoration, and log-based protocols, which combine checkpointing with logging of nondeterministic events.
Abstract: This survey covers rollback-recovery techniques that do not require special language constructs. In the first part of the survey we classify rollback-recovery protocols into checkpoint-based and log-based. Checkpoint-based protocols rely solely on checkpointing for system state restoration. Checkpointing can be coordinated, uncoordinated, or communication-induced. Log-based protocols combine checkpointing with logging of nondeterministic events, encoded in tuples called determinants. Depending on how determinants are logged, log-based protocols can be pessimistic, optimistic, or causal. Throughout the survey, we highlight the research issues that are at the core of rollback-recovery and present the solutions that currently address them. We also compare the performance of different rollback-recovery protocols with respect to a series of desirable properties and discuss the issues that arise in the practical implementations of these protocols.

1,772 citations

Journal Article
TL;DR: A synchronous snapshot collection algorithm for mobile systems that neither forces every node to take a local snapshot, nor blocks the underlying computation during snapshot collection, and a minimal rollback/recovery algorithm in which the computation at a node is rolled back only if it depends on operations that have been undone due to the failure of node(s).
Abstract: A mobile computing system consists of mobile and stationary nodes, connected to each other by a communication network. The presence of mobile nodes in the system places constraints on the permissible energy consumption and available communication bandwidth. To minimize the lost computation during recovery from node failures, periodic collection of a consistent snapshot of the system (checkpoint) is required. Locating mobile nodes contributes to the checkpointing and recovery costs. Synchronous snapshot collection algorithms, designed for static networks, either force every node in the system to take a new local snapshot, or block the underlying computation during snapshot collection. Hence, they are not suitable for mobile computing systems. If nodes take their local checkpoints independently in an uncoordinated manner, each node may have to store multiple local checkpoints in stable storage. This is not suitable for mobile nodes as they have small memory. This paper presents a synchronous snapshot collection algorithm for mobile systems that neither forces every node to take a local snapshot, nor blocks the underlying computation during snapshot collection. If a node initiates snapshot collection, local snapshots of only those nodes that have directly or transitively affected the initiator since their last snapshots need to be taken. We prove that the global snapshot collection terminates within a finite time of its invocation and the collected global snapshot is consistent. We also propose a minimal rollback/recovery algorithm in which the computation at a node is rolled back only if it depends on operations that have been undone due to the failure of node(s). Both the algorithms have low communication and storage overheads and meet the low energy consumption and low bandwidth constraints of mobile computing systems.
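The rule that only nodes which have "directly or transitively affected the initiator" need to take a snapshot amounts to a reachability computation over message dependencies. The sketch below is a centralized illustration of that set only, not the distributed, non-blocking algorithm the abstract describes; the function name `snapshot_participants` and the `affected_by` bookkeeping structure are hypothetical.

```python
from collections import deque


def snapshot_participants(initiator, affected_by):
    """Return the set of nodes that must take a local snapshot.

    affected_by[n] is the set of nodes whose messages n has received since
    n's last snapshot (a hypothetical bookkeeping structure). A breadth-first
    search from the initiator collects every direct or transitive sender.
    """
    participants = {initiator}
    queue = deque([initiator])
    while queue:
        node = queue.popleft()
        for sender in affected_by.get(node, ()):
            if sender not in participants:
                participants.add(sender)
                queue.append(sender)
    return participants
```

Nodes outside the returned set have not influenced the initiator's state since their last snapshots, so they can keep computing without taking a new checkpoint.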

213 citations

Journal Article
TL;DR: Several heuristics are introduced that dynamically adapt checkpointing intervals and replica numbers based on information on grid status to provide high job throughput in the presence of failure while reducing the system overhead.
Abstract: A grid is a distributed computational and storage environment often composed of heterogeneous autonomously managed subsystems. As a result, varying resource availability becomes commonplace, often resulting in loss and delay of executing jobs. To ensure good grid performance, fault tolerance should be taken into account. Commonly utilized techniques for providing fault tolerance in distributed systems are periodic job checkpointing and replication. While very robust, both techniques can delay job execution if inappropriate checkpointing intervals and replica numbers are chosen. This paper introduces several heuristics that dynamically adapt the above mentioned parameters based on information on grid status to provide high job throughput in the presence of failure while reducing the system overhead. Furthermore, a novel fault-tolerant algorithm combining checkpointing and replication is presented. The proposed methods are evaluated in a newly developed grid simulation environment dynamic scheduling in distributed environments (DSiDE), which allows for easy modeling of dynamic system and job behavior. Simulations are run employing workload and system parameters derived from logs that were collected from several large-scale parallel production systems. Experiments have shown that adaptive approaches can considerably improve system performance, while the preference for one of the solutions depends on particular system characteristics, such as load, job submission patterns, and failure frequency.

102 citations


"Adaptive checkpointing for fault to..." refers methods in this paper

  • ...A fault tolerant algorithm that combines checkpointing and replication is presented in [10]....

    [...]

  • ...We use representative values from [10] for the minimum and maximum number of required replicas as 1 and 3 respectively; though a node may increase the degree of redundancy if it identifies that it is executing a high priority job....

    [...]

Journal Article
TL;DR: This paper presents a survey of the current state of wireless grid computing, including a discussion of the cooperation between wired and wireless grids and the ways in which wireless grids extend the capabilities of existing wired grids.
Abstract: Wireless Grid computing extends the traditional Grid computing paradigm to include a diverse collection of mobile devices enabled to communicate using radio frequency, infrared, optical and other wireless mechanisms. Among the devices coming into use in wireless grid implementations are tiny sensors, Radio Frequency Identification (RFID) tags, Personal Digital Assistants (PDAs) and paging devices, cellular phones, hand-held or wearable computers, laptop computers and special purpose computers embedded into many modern appliances [8, 26, 29]. Though many of these devices were initially developed to serve a specific, autonomous purpose, their potential for cooperation through the sharing of resources and capabilities, and the massive amounts of resources available due to their numbers, is quickly leading to applications resembling traditional Grid computing. This paper presents a survey of the current state of wireless grid computing. This includes a discussion of the cooperation between wired and wireless grids, including ways in which wireless grids extend the capabilities of existing wired grids. It also discusses many of the new capabilities and resources available to wireless grid devices and a sampling of several applications of these new resources. It provides a sampling of many current research endeavors in the wireless grid arena and an examination of a number of the potential challenges resulting from the unique characteristics of wireless grid devices.

76 citations


"Adaptive checkpointing for fault to..." refers background in this paper

  • ...Grid computing enables large-scale resource sharing amongst distributed, loosely coordinated systems for solving complex and challenging problems in science and engineering areas [1, 2]....

    [...]

Journal Article
TL;DR: This review of emerging grids sets out to develop a comprehensive classification of both traditional and emerging grid systems, with an aim to motivate further research and to help establish a solid foundation in this rapidly developing field.
Abstract: Advances in grid computing are stimulating the emergence of novel types of grids, such as accessible, manageable, interactive, and personal grids. More and more researchers are realizing emerging grids' potential to bridge the gap between grid technologies and users. This review of emerging grids sets out to develop a comprehensive classification of both traditional and emerging grid systems, with an aim to motivate further research and to help establish a solid foundation in this rapidly developing field.

63 citations


"Adaptive checkpointing for fault to..." refers background in this paper

  • ...Grid computing enables large-scale resource sharing amongst distributed, loosely coordinated systems for solving complex and challenging problems in science and engineering areas [1, 2]....

    [...]