Characterizing cloud computing hardware reliability

doi:10.1145/1807128.1807161

Proceedings ArticleDOI

Kashi Venkatesh Vishwanath, +1 more

- pp 193-204

TLDR

This paper is the first attempt to study server failures and hardware repairs for large datacenters and presents a detailed analysis of failure characteristics as well as a preliminary analysis on failure predictors.

Abstract:

Modern-day datacenters host hundreds of thousands of servers that coordinate tasks in order to deliver highly available cloud computing services. These servers consist of multiple hard disks, memory modules, network cards, processors, etc., each of which, while carefully engineered, is capable of failing. While the probability of seeing any such failure in the lifetime (typically 3-5 years in industry) of a server can be somewhat small, these numbers get magnified across all devices hosted in a datacenter. At such a large scale, hardware component failure is the norm rather than an exception. Hardware failure can lead to a degradation in performance to end-users and can result in losses to the business. A sound understanding of the numbers as well as the causes behind these failures helps improve operational experience by not only allowing us to be better equipped to tolerate failures but also to bring down the hardware cost through engineering, directly leading to a saving for the company. To the best of our knowledge, this paper is the first attempt to study server failures and hardware repairs for large datacenters. We present a detailed analysis of failure characteristics as well as a preliminary analysis on failure predictors. We hope that the results presented in this paper will serve as motivation to foster further research in this area.
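
The abstract's point that a small per-server failure probability is magnified at datacenter scale can be illustrated with a minimal sketch. It assumes failures are independent across servers, which the abstract does not state, and the probability and fleet size used are hypothetical placeholders, not figures from the paper.

```python
def prob_any_failure(p_server: float, n_servers: int) -> float:
    """Probability that at least one of n servers fails.

    Assumes independent failures (a simplification for illustration).
    """
    if not 0.0 <= p_server <= 1.0:
        raise ValueError("p_server must be a probability in [0, 1]")
    if n_servers < 0:
        raise ValueError("n_servers must be non-negative")
    return 1.0 - (1.0 - p_server) ** n_servers


def expected_failures(p_server: float, n_servers: int) -> float:
    """Expected number of failing servers (holds even without independence)."""
    return p_server * n_servers


if __name__ == "__main__":
    # Hypothetical values chosen only to show the scaling effect.
    p = 0.01          # assumed per-server lifetime failure probability
    fleet = 100_000   # assumed number of servers
    print("P(at least one failure):", prob_any_failure(p, fleet))
    print("Expected failures:", expected_failures(p, fleet))
```

Under these assumed inputs, a failure somewhere in the fleet is close to certain even though each individual server is unlikely to fail, which is the sense in which failure becomes the norm at scale.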

Citations

Journal ArticleDOI

Network Function Virtualization: State-of-the-Art and Research Challenges

Rashid Mijumbi, +5 more

- 21 Jan 2016 -

IEEE Communications Surveys and Tutorial...

TL;DR: In this article, the authors survey the state-of-the-art in NFV and identify promising research directions in this area, and also overview key NFV projects, standardization efforts, early implementations, use cases, and commercial products.

Proceedings ArticleDOI

Understanding network failures in data centers: measurement, analysis, and implications

Phillipa Gill, +2 more

TL;DR: The first large-scale analysis of failures in a data center network is presented, finding that data center networks show high reliability, commodity switches such as ToRs and AggS are highly reliable, and network redundancy is only 40% effective in reducing the median impact of failure.

Proceedings Article

Maglev: a fast and reliable software network load balancer

Daniel Eugene Eisenbud, +9 more

TL;DR: Maglev is Google's network load balancer, a large distributed software system that runs on commodity Linux servers and is specifically optimized for packet processing performance.

Journal ArticleDOI

A Manifesto for Future Generation Cloud Computing: Research Directions for the Next Decade

Rajkumar Buyya, +24 more

- 19 Nov 2018 -

ACM Computing Surveys

TL;DR: The proposed manifesto identifies the major open challenges, emerging trends, and impact areas in Cloud computing, and offers research directions for the next decade, thus helping in the realisation of Future Generation Cloud Computing.

Journal ArticleDOI

Fault Tolerance Management in Cloud Computing: A System-Level Perspective

Ravi Jhawar, +2 more

- 01 Jun 2013 -

IEEE Systems Journal

TL;DR: An innovative, system-level, modular perspective on creating and managing fault tolerance in Clouds is introduced and a comprehensive high-level approach to shading the implementation details of the fault tolerance techniques to application developers and users by means of a dedicated service layer is proposed.

References

Journal ArticleDOI

A scalable, commodity data center network architecture

Mohammad Al-Fares, +2 more

TL;DR: This paper shows how to leverage largely commodity Ethernet switches to support the full aggregate bandwidth of clusters consisting of tens of thousands of elements and argues that appropriately architected and interconnected commodity switches may deliver more performance at less cost than available from today's higher-end solutions.

Proceedings ArticleDOI

VL2: a scalable and flexible data center network

Albert Greenberg, +8 more

TL;DR: VL2 is a practical network architecture that scales to support huge data centers with uniform high capacity between servers, performance isolation between services, and Ethernet layer-2 semantics, and is built on a working prototype.

Book

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines

Luiz Andre Barroso, +1 more

TL;DR: The architecture of WSCs, the main factors influencing their design, operation, and cost structure, and the characteristics of their software base are described.

Proceedings ArticleDOI

PortLand: a scalable fault-tolerant layer 2 data center network fabric

Radhika Niranjan Mysore, +7 more

TL;DR: Through the design and implementation of PortLand, a scalable, fault tolerant layer 2 routing and forwarding protocol for data center environments, it is shown that PortLand holds promise for supporting a "plug-and-play" large-scale, data center network.

Journal ArticleDOI

Web search for a planet: The Google cluster architecture

Luiz Andre Barroso, +2 more

- 01 Mar 2003 -

IEEE Micro

TL;DR: Google's architecture features clusters of more than 15,000 commodity-class PCs with fault tolerant software that achieves superior performance at a fraction of the cost of a system built from fewer, but more expensive, high-end servers.
