Proceedings ArticleDOI

Understanding network failures in data centers: measurement, analysis, and implications

TLDR
The first large-scale analysis of failures in a data center network is presented, finding that data center networks show high reliability, commodity switches such as ToRs and AggS are highly reliable, and network redundancy is only 40% effective in reducing the median impact of failure.
Abstract
We present the first large-scale analysis of failures in a data center network. Through our analysis, we seek to answer several fundamental questions: which devices/links are most unreliable, what causes failures, how do failures impact network traffic, and how effective is network redundancy? We answer these questions using multiple data sources commonly collected by network operators. The key findings of our study are that (1) data center networks show high reliability, (2) commodity switches such as ToRs and AggS are highly reliable, (3) load balancers dominate in terms of failure occurrences, with many short-lived software-related faults, (4) failures have the potential to cause loss of many small packets such as keep-alive messages and ACKs, and (5) network redundancy is only 40% effective in reducing the median impact of failure.


Citations
Proceedings ArticleDOI

SIMPLE-fying middlebox policy enforcement using SDN

TL;DR: SIMPLE, an SDN-based policy enforcement layer for efficient middlebox-specific "traffic steering", is presented, a significant step toward addressing industry concerns surrounding the ability of SDN to integrate with existing infrastructure and support L4-L7 capabilities.
Proceedings ArticleDOI

CONGA: distributed congestion-aware load balancing for datacenters

TL;DR: It is argued that datacenter fabric load balancing is best done in the network, and requires global schemes such as CONGA to handle asymmetry, and CONGA is nearly as effective as a centralized scheduler while being able to react to congestion in microseconds.
Journal ArticleDOI

Exascale computing and big data

TL;DR: This work unifies traditionally separated high-performance computing and big data analytics in one place to accelerate scientific discovery and engineering innovation and foster new ideas in science and engineering.
Proceedings ArticleDOI

Integrating scale out and fault tolerance in stream processing using operator state management

TL;DR: The key idea is to expose internal operator state explicitly to the SPS through a set of state management primitives that can scale automatically to a load factor of L=350 with 50 VMs, while recovering quickly from failures.
Proceedings ArticleDOI

Dynamic scheduling of network updates

TL;DR: Dionysus encodes as a graph the consistency-related dependencies among updates at individual switches, and it then dynamically schedules these updates based on runtime differences in the update speeds of different switches, which increases the system's speed.
References
Journal ArticleDOI

OpenFlow: enabling innovation in campus networks

TL;DR: This whitepaper proposes OpenFlow: a way for researchers to run experimental protocols in the networks they use every day, based on an Ethernet switch, with an internal flow-table, and a standardized interface to add and remove flow entries.
Journal ArticleDOI

A scalable, commodity data center network architecture

TL;DR: This paper shows how to leverage largely commodity Ethernet switches to support the full aggregate bandwidth of clusters consisting of tens of thousands of elements and argues that appropriately architected and interconnected commodity switches may deliver more performance at less cost than available from today's higher-end solutions.
Proceedings ArticleDOI

VL2: a scalable and flexible data center network

TL;DR: VL2 is a practical network architecture that scales to support huge data centers with uniform high capacity between servers, performance isolation between services, and Ethernet layer-2 semantics, and is built on a working prototype.
Proceedings ArticleDOI

Network traffic characteristics of data centers in the wild

TL;DR: An empirical study of the network traffic in 10 data centers belonging to three different categories, including university, enterprise campus, and cloud data centers, which includes not only data centers employed by large online service providers offering Internet-facing applications but also data centers used to host data-intensive (MapReduce style) applications.
Proceedings ArticleDOI

Data center TCP (DCTCP)

TL;DR: DCTCP enables the applications to handle 10X the current background traffic, without impacting foreground traffic, thus largely eliminating incast problems, and delivers the same or better throughput than TCP, while using 90% less buffer space.