Journal ArticleDOI

Toward a Smart Cloud: A Review of Fault-Tolerance Methods in Cloud Systems

01 Mar 2021 - IEEE Transactions on Services Computing (IEEE Computer Society) - Vol. 14, Iss. 2, pp. 589-605
TL;DR: A comprehensive survey of the state-of-the-art work on fault tolerance methods proposed for cloud computing is presented and current issues and challenges in cloud fault tolerance are discussed to identify promising areas for future research.
Abstract: This paper presents a comprehensive survey of the state-of-the-art work on fault tolerance methods proposed for cloud computing. The survey classifies fault-tolerance methods into three categories: 1) ReActive Methods (RAMs); 2) PRoactive Methods (PRMs); and 3) ReSilient Methods (RSMs). RAMs allow the system to enter into a fault status and then try to recover the system. PRMs tend to prevent the system from entering a fault status by implementing mechanisms that enable them to avoid errors before they affect the system. On the other hand, recently emerging RSMs aim to minimize the amount of time it takes for a system to recover from a fault. Machine Learning and Artificial Intelligence have played an active role in the RSM domain in such a way that the recovery time is mapped to a function to be optimized (i.e., by converging the recovery time to a fraction of milliseconds). As the system learns to deal with new faults, the recovery time will become shorter. In addition, current issues and challenges in cloud fault tolerance are also discussed to identify promising areas for future research.
Citations
Posted Content
TL;DR: This work discusses deep reinforcement learning in an overview style, focusing on contemporary work, and in historical contexts, with background of artificial intelligence, machine learning, deep learning, and reinforcement learning (RL), with resources.
Abstract: We discuss deep reinforcement learning in an overview style. We draw a big picture, filled with details. We discuss six core elements, six important mechanisms, and twelve applications, focusing on contemporary work, and in historical contexts. We start with background of artificial intelligence, machine learning, deep learning, and reinforcement learning (RL), with resources. Next we discuss RL core elements, including value function, policy, reward, model, exploration vs. exploitation, and representation. Then we discuss important mechanisms for RL, including attention and memory, unsupervised learning, hierarchical RL, multi-agent RL, relational RL, and learning to learn. After that, we discuss RL applications, including games, robotics, natural language processing (NLP), computer vision, finance, business management, healthcare, education, energy, transportation, computer systems, and science, engineering, and art. Finally we summarize briefly, discuss challenges and opportunities, and close with an epilogue.

239 citations

Journal ArticleDOI
TL;DR: A comprehensive overview of fault tolerance-related issues in cloud computing is presented, emphasizing the significant concepts, architectural details, and the state-of-the-art techniques and methods.

84 citations


Cites background from "Toward a Smart Cloud: A Review of F..."

  • ...Fault tolerance (FT) is an essential concern in cloud computing platform since it enables the system to provide the required services with good performance in presence of the one or more failures of the system components (Gokhroo et al., 2017; Valle et al., 2008; Mukwevho and Celik, 2018)....

    [...]

  • ..., it avoids recovery from faults and errors (Charity and Hua, 2016; Valle et al., 2008; Mukwevho and Celik, 2018; Engelmann et al., 2009, 2009)....

    [...]

Journal ArticleDOI
TL;DR: There is a need to protect digital documents from authorized users who try to redistribute them illegally.
Abstract: Nowadays, the use of digital content or digital media is increasing day by day. Therefore, there is a need to protect the digital document from both unauthorized users and authorized users. The digital document should be protected from authorized users who try to redistribute it illegally. Digital watermarking techniques along with cryptography are insufficient to ensure an adequate level of security of digital media. The security of the transferring digital data in the modern world is also a big challenge because there is a high risk of security breaches. In this article, a secure technique of image fusion using hybrid domains (spatial and frequency) for privacy preserving and copyright protection is proposed. The proposed method provides a secure technique for the digital content in cloud environment. Two cloud services are used to develop this work, which eliminates the role of a trusted third party (TTP). First is the design of an infrastructure as a service (IaaS) to store different images with encryption processes to speed up the image fusion process and save storage space. Second, a Platform as a Service (PaaS) is used to enable the digital content to improve computation power and to increase the bandwidth. The prime objective of the proposed scheme is to transfer the digital media between a service provider and customer in a secure way using a hybrid domain along with cloud storage. Imperceptibility and robustness measures are used to calculate the performance of the proposed approach.

67 citations

Journal ArticleDOI
TL;DR: In this paper, 129 research papers published through February 2022 were surveyed and classified; the authors critically reviewed techniques to tolerate faults in cloud computing systems and discussed the taxonomy of errors, faults, and failures.

47 citations

Journal ArticleDOI
TL;DR: The research paper identifies the need for an FT efficiency metric in LB algorithms, which is one of the main concerns in cloud environments, and proposes a novel algorithm that employs FT metrics in LB.
Abstract: The past few years have witnessed the emergence of a novel paradigm called cloud computing. CC aims to provide computation and resources over the internet via dynamic provisioning of services. There are several challenges and issues associated with implementation of CC. This research paper deliberates on one of CC's main problems, i.e., load balancing (LB). The goal of LB is equilibrating the computation on the cloud servers such that no host is under- or overloaded. Several LB algorithms have been implemented in the literature to provide effective administration, to satisfy customer requests for appropriate cloud nodes, to improve the overall efficiency of cloud services, and to provide the end user with more satisfaction. An efficient LB algorithm improves efficiency and asset usage by effectively spreading the workload across the system's different nodes. The objective of this review paper is to present a critical study of existing LB techniques and to discuss various LB parameters, i.e., throughput, performance, migration time, response time, overhead, resource usage, scalability, fault tolerance, power savings, etc. The research paper also discusses the problems of LB in the CC environment and identifies the need for a novel LB algorithm that employs FT metrics. It has been found that traditional LB algorithms are not good enough and they do not consider FT efficiency metrics for their operation. Hence, the research paper identifies the need for an FT efficiency metric in LB algorithms, which is one of the main concerns in cloud environments. A novel algorithm that employs FT in LB is therefore proposed.

36 citations


Cites background or methods from "Toward a Smart Cloud: A Review of F..."

  • ...reactive methods, proactive methods & resilient methods [68]....

    [...]

  • ...On an ongoing app, it is used to evaluate a set of status variables [68]....

    [...]

  • ...1) Checkpointing/Restarting: The approach works by consistently storing system status, begin the task from the most current state in case of failure [68]....

    [...]

  • ...Upcoming guidance on cloud FT moves towards smart and resilient methods [68]....

    [...]

  • ...3) Retry: The retry approach works by easily retrieving a rejected query several times over the same asset [68] 4) Custom Exception Handling: Includes methods where programmers inject code into the app so that during debugging they can handle different errors [68]....

    [...]
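
The retry and checkpointing/restarting excerpts above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the surveyed or citing papers: the function names `retry` and `run_with_checkpoints`, the JSON checkpoint file, and the toy summation task are all hypothetical choices made for the example.

```python
import json
import os
import tempfile


def retry(operation, attempts=3):
    """Reactive retry: call operation() up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as exc:  # illustrative; real code would catch narrower errors
            last_error = exc
    raise last_error  # every attempt failed, so surface the last error


def run_with_checkpoints(items, path):
    """Sum `items`, storing progress in `path` after each item (hypothetical task)."""
    state = {"next": 0, "total": 0}
    if os.path.exists(path):
        with open(path) as fh:
            state = json.load(fh)  # restart from the most recent saved state
    for i in range(state["next"], len(items)):
        state["total"] += items[i]
        state["next"] = i + 1
        with open(path, "w") as fh:
            json.dump(state, fh)  # checkpoint after every completed step
    return state["total"]


if __name__ == "__main__":
    checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
    print(retry(lambda: run_with_checkpoints([1, 2, 3, 4], checkpoint)))
```

If the process dies partway through, calling `run_with_checkpoints` again with the same path resumes from the last saved index rather than from the beginning, which is the behavior the checkpointing excerpt describes.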

References
Book
01 Jan 1988
TL;DR: This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.
Abstract: Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives when interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the key ideas and algorithms of reinforcement learning. Their discussion ranges from the history of the field's intellectual foundations to the most recent developments and applications. The only necessary mathematical background is familiarity with elementary concepts of probability. The book is divided into three parts. Part I defines the reinforcement learning problem in terms of Markov decision processes. Part II provides basic solution methods: dynamic programming, Monte Carlo methods, and temporal-difference learning. Part III presents a unified view of the solution methods and incorporates artificial neural networks, eligibility traces, and planning; the two final chapters present case studies and consider the future of reinforcement learning.

37,989 citations


"Toward a Smart Cloud: A Review of F..." refers background in this paper

  • ...Another strength of model-free algorithms is such that they are scalable, they grow linearly with the number of features that are representing the environment [93]....

    [...]

  • ...with the environment and therefore are usually seen as yielding better performance, this is known as data efficiency [93]....

    [...]

  • ..., model-based and model-free approaches [93]....

    [...]

Journal ArticleDOI
TL;DR: An attempt to clear the clouds away from the true potential and obstacles posed by this computing capability.
Abstract: Clearing the clouds away from the true potential and obstacles posed by this computing capability.

9,282 citations


"Toward a Smart Cloud: A Review of F..." refers background in this paper

  • ...The use of cloud computing [1] has witnessed a significant amount of growth over the past decade....

    [...]

Journal ArticleDOI
TL;DR: A parallel message-passing implementation of a molecular dynamics program that is useful for bio(macro)molecules in aqueous environment is described and can handle rectangular periodic boundary conditions with temperature and pressure scaling.

8,195 citations

Journal ArticleDOI
Jeffrey O. Kephart, David M. Chess
TL;DR: A 2001 IBM manifesto noted the almost impossible difficulty of managing current and planned computing systems, which require integrating several heterogeneous environments into corporate-wide computing systems that extend into the Internet.
Abstract: A 2001 IBM manifesto observed that a looming software complexity crisis, caused by applications and environments that number into the tens of millions of lines of code, threatened to halt progress in computing. The manifesto noted the almost impossible difficulty of managing current and planned computing systems, which require integrating several heterogeneous environments into corporate-wide computing systems that extend into the Internet. Autonomic computing, perhaps the most attractive approach to solving this problem, creates systems that can manage themselves when given high-level objectives from administrators. Systems manage themselves according to an administrator's goals. New components integrate as effortlessly as a new cell establishes itself in the human body. These ideas are not science fiction, but elements of the grand challenge to create self-managing computing systems.

6,527 citations


"Toward a Smart Cloud: A Review of F..." refers background in this paper

  • ...Such systems are made of multiple components that are deployed on multiple VMs [65], [66], [67], [68]....

    [...]


Journal ArticleDOI
TL;DR: Tapping into the "folk knowledge" needed to advance machine learning applications is a natural next step in the development of artificial intelligence systems.
Abstract: Machine learning algorithms can figure out how to perform important tasks by generalizing from examples. This is often feasible and cost-effective where manual programming is not. As more data becomes available, more ambitious problems can be tackled. As a result, machine learning is widely used in computer science and other fields. However, developing successful machine learning applications requires a substantial amount of “black art” that is hard to find in textbooks. This article summarizes twelve key lessons that machine learning researchers and practitioners have learned. These include pitfalls to avoid, important issues to focus on, and answers to common questions.

2,482 citations


"Toward a Smart Cloud: A Review of F..." refers methods in this paper

  • ...Cross-validation can be used to combat the problem of overfitting [98]....

    [...]