scispace - formally typeset
Proceedings ArticleDOI

Characterizing cloud computing hardware reliability

Reads0
Chats0
TLDR
This paper is the first attempt to study server failures and hardware repairs for large datacenters and presents a detailed analysis of failure characteristics as well as a preliminary analysis on failure predictors.
Abstract
Modern day datacenters host hundreds of thousands of servers that coordinate tasks in order to deliver highly available cloud computing services. These servers consist of multiple hard disks, memory modules, network cards, processors etc., each of which while carefully engineered are capable of failing. While the probability of seeing any such failure in the lifetime (typically 3-5 years in industry) of a server can be somewhat small, these numbers get magnified across all devices hosted in a datacenter. At such a large scale, hardware component failure is the norm rather than an exception.Hardware failure can lead to a degradation in performance to end-users and can result in losses to the business. A sound understanding of the numbers as well as the causes behind these failures helps improve operational experience by not only allowing us to be better equipped to tolerate failures but also to bring down the hardware cost through engineering, directly leading to a saving for the company. To the best of our knowledge, this paper is the first attempt to study server failures and hardware repairs for large datacenters. We present a detailed analysis of failure characteristics as well as a preliminary analysis on failure predictors. We hope that the results presented in this paper will serve as motivation to foster further research in this area.

read more

Content maybe subject to copyright    Report

Citations
More filters
Journal ArticleDOI

Network Function Virtualization: State-of-the-Art and Research Challenges

TL;DR: In this article, the authors survey the state-of-the-art in NFV and identify promising research directions in this area, and also overview key NFV projects, standardization efforts, early implementations, use cases, and commercial products.
Proceedings ArticleDOI

Understanding network failures in data centers: measurement, analysis, and implications

TL;DR: The first large-scale analysis of failures in a data center network is presented, finding that data center networks show high reliability, commodity switches such as ToRs and AggS are highly reliable, and network redundancy is only 40% effective in reducing the median impact of failure.
Proceedings Article

Maglev: a fast and reliable software network load balancer

TL;DR: Maglev is Google's network load balancer, a large distributed software system that runs on commodity Linux servers that is specifically optimized for packet processing performance.
Journal ArticleDOI

Fault Tolerance Management in Cloud Computing: A System-Level Perspective

TL;DR: An innovative, system-level, modular perspective on creating and managing fault tolerance in Clouds is introduced and a comprehensive high-level approach to shading the implementation details of the fault tolerance techniques to application developers and users by means of a dedicated service layer is proposed.
References
More filters
Journal ArticleDOI

A scalable, commodity data center network architecture

TL;DR: This paper shows how to leverage largely commodity Ethernet switches to support the full aggregate bandwidth of clusters consisting of tens of thousands of elements and argues that appropriately architected and interconnected commodity switches may deliver more performance at less cost than available from today's higher-end solutions.
Proceedings ArticleDOI

VL2: a scalable and flexible data center network

TL;DR: VL2 is a practical network architecture that scales to support huge data centers with uniform high capacity between servers, performance isolation between services, and Ethernet layer-2 semantics, and is built on a working prototype.
Book

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines

TL;DR: The architecture of WSCs is described, the main factors influencing their design, operation, and cost structure, and the characteristics of their software base are described.
Proceedings ArticleDOI

PortLand: a scalable fault-tolerant layer 2 data center network fabric

TL;DR: Through the design and implementation of PortLand, a scalable, fault tolerant layer 2 routing and forwarding protocol for data center environments, it is shown that PortLand holds promise for supporting a ``plug-and-play" large-scale, data center network.
Journal ArticleDOI

Web search for a planet: The Google cluster architecture

TL;DR: Googless architecture features clusters of more than 15,000 commodity-class PCs with fault tolerant software that achieves superior performance at a fraction of the cost of a system built from fewer, but more expensive, high-end servers.
Related Papers (5)