Understanding network failures in data centers: measurement, analysis, and implications

doi:10.1145/2018436.2018477

Proceedings Article

Phillipa Gill, +2 more

Vol. 41, Iss. 4, pp. 350-361

TL;DR: The first large-scale analysis of failures in a data center network is presented, finding that data center networks show high reliability, commodity switches such as ToRs and AggS are highly reliable, and network redundancy is only 40% effective in reducing the median impact of failure.

Abstract:

We present the first large-scale analysis of failures in a data center network. Through our analysis, we seek to answer several fundamental questions: which devices/links are most unreliable, what causes failures, how do failures impact network traffic, and how effective is network redundancy? We answer these questions using multiple data sources commonly collected by network operators. The key findings of our study are that (1) data center networks show high reliability, (2) commodity switches such as ToRs and AggS are highly reliable, (3) load balancers dominate in terms of failure occurrences with many short-lived software-related faults, (4) failures have the potential to cause loss of many small packets such as keep-alive messages and ACKs, and (5) network redundancy is only 40% effective in reducing the median impact of failure.

Citations

Proceedings Article

SIMPLE-fying middlebox policy enforcement using SDN

Zafar Ayyub Qazi, +5 more

TL;DR: SIMPLE, an SDN-based policy enforcement layer for efficient middlebox-specific "traffic steering", is presented, a significant step toward addressing industry concerns surrounding the ability of SDN to integrate with existing infrastructure and support L4-L7 capabilities.

Proceedings Article

CONGA: distributed congestion-aware load balancing for datacenters

Mohammad Alizadeh, +9 more

TL;DR: It is argued that datacenter fabric load balancing is best done in the network, and requires global schemes such as CONGA to handle asymmetry, and CONGA is nearly as effective as a centralized scheduler while being able to react to congestion in microseconds.

Journal Article

Exascale computing and big data

Daniel A. Reed, +1 more

25 Jun 2015, Communications of the ACM

TL;DR: This work unifies traditionally separated high-performance computing and big data analytics in one place to accelerate scientific discovery and engineering innovation and foster new ideas in science and engineering.

Proceedings Article

Integrating scale out and fault tolerance in stream processing using operator state management

Raul Fernandez, +3 more

TL;DR: The key idea is to expose internal operator state explicitly to the stream processing system (SPS) through a set of state management primitives; the resulting system can scale automatically to a load factor of L=350 with 50 VMs, while recovering quickly from failures.

Proceedings Article

Dynamic scheduling of network updates

Xin Jin, +7 more

TL;DR: Dionysus encodes as a graph the consistency-related dependencies among updates at individual switches, and it then dynamically schedules these updates based on runtime differences in the update speeds of different switches, which increases the system's speed.

References

Journal Article

OpenFlow: enabling innovation in campus networks

Nick McKeown, +7 more

TL;DR: This whitepaper proposes OpenFlow: a way for researchers to run experimental protocols in the networks they use every day, based on an Ethernet switch, with an internal flow-table, and a standardized interface to add and remove flow entries.

Journal Article

A scalable, commodity data center network architecture

Mohammad Al-Fares, +2 more

TL;DR: This paper shows how to leverage largely commodity Ethernet switches to support the full aggregate bandwidth of clusters consisting of tens of thousands of elements and argues that appropriately architected and interconnected commodity switches may deliver more performance at less cost than available from today's higher-end solutions.

Proceedings Article

VL2: a scalable and flexible data center network

Albert Greenberg, +8 more

TL;DR: VL2 is a practical network architecture that scales to support huge data centers with uniform high capacity between servers, performance isolation between services, and Ethernet layer-2 semantics, and is built on a working prototype.

Proceedings Article

Network traffic characteristics of data centers in the wild

Theophilus Benson, +2 more

TL;DR: An empirical study of the network traffic in 10 data centers belonging to three different categories, including university, enterprise campus, and cloud data centers, which includes not only data centers employed by large online service providers offering Internet-facing applications but also data centers used to host data-intensive (MapReduce style) applications.

Proceedings Article

Data center TCP (DCTCP)

Mohammad Alizadeh, +7 more

TL;DR: DCTCP enables the applications to handle 10X the current background traffic, without impacting foreground traffic, thus largely eliminating incast problems, and delivers the same or better throughput than TCP, while using 90% less buffer space.

Related Papers (5)

A scalable, commodity data center network architecture

VL2: a scalable and flexible data center network

BCube: a high performance, server-centric network architecture for modular data centers

OpenFlow: enabling innovation in campus networks

PortLand: a scalable fault-tolerant layer 2 data center network fabric