A Survey of Fast Recovery Mechanisms in the Data Plane
30 Jun 2020
TL;DR: This survey presents a systematic, tutorial-like overview of packet-based fast-recovery mechanisms in the data plane, focusing on concepts but structured around different networking technologies, from traditional link-layer and IP-based mechanisms, through BGP and MPLS, to emerging software-defined networks and programmable data planes.
Abstract: In order to meet their stringent dependability requirements, most modern communication networks support fast-recovery mechanisms in the data plane. While reactions to failures in the data plane can be significantly faster than control-plane mechanisms, implementing fast recovery in the data plane is challenging, and has recently received much attention in the literature. This survey presents a systematic, tutorial-like overview of packet-based fast-recovery mechanisms in the data plane, focusing on concepts but structured around different networking technologies, from traditional link-layer and IP-based mechanisms, through BGP and MPLS, to emerging software-defined networks and programmable data planes. We examine the evolution of fast-recovery standards and mechanisms over time, and identify and discuss the fundamental principles and algorithms underlying different mechanisms. We then present a taxonomy of the state of the art and compile open research questions.
Citations
TL;DR: In this article, a Fast Re-Routing (FRR) primitive for programmable data planes, PURR, is proposed, which provides low failover latency and high switch throughput, by avoiding packet recirculation.
Abstract: Highly dependable communication networks usually rely on some kind of Fast Re-Route (FRR) mechanism which allows traffic to be quickly re-routed upon failures, entirely in the data plane. This paper studies the design of FRR mechanisms for emerging reconfigurable switches. Our main contribution is an FRR primitive for programmable data planes, PURR, which provides low failover latency and high switch throughput by avoiding packet recirculation. PURR tolerates multiple concurrent failures and comes with minimal memory requirements, ensuring compact forwarding tables, by unveiling an intriguing connection to classic "string theory" (i.e., stringology), and in particular, the shortest common supersequence problem. PURR is well-suited for high-speed match-action forwarding architectures (e.g., PISA) and supports the implementation of a broad variety of FRR mechanisms. Our simulations and prototype implementation (on an FPGA and a Tofino switch) show that PURR improves TCAM memory occupancy by a factor of 1.5×–10.8× compared to a naive encoding when implementing state-of-the-art FRR mechanisms. PURR also improves the latency and throughput of datacenter traffic by up to 2.8×–5.5× and 1.2×–2×, respectively, compared to approaches based on recirculating packets.
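The encoding trick behind this abstract can be illustrated in isolation: per-destination sequences of backup ports can be packed into one shared port sequence by solving the shortest common supersequence (SCS) problem, so that every preference sequence appears in order inside the shared one. The sketch below is only an illustrative two-sequence dynamic program, not PURR's actual encoding pipeline:

```python
def scs(a, b):
    """Shortest common supersequence (SCS) of two sequences, via the
    classic dynamic program over suffixes."""
    m, n = len(a), len(b)
    # dp[i][j] = length of the SCS of a[i:] and b[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m, -1, -1):
        for j in range(n, -1, -1):
            if i == m:
                dp[i][j] = n - j          # only b[j:] remains
            elif j == n:
                dp[i][j] = m - i          # only a[i:] remains
            elif a[i] == b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = 1 + min(dp[i + 1][j], dp[i][j + 1])
    # Walk the DP table to reconstruct one optimal supersequence.
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif dp[i + 1][j] <= dp[i][j + 1]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out += a[i:] + b[j:]
    return out

def is_subsequence(sub, seq):
    """True iff `sub` occurs in `seq` in order (not necessarily contiguously)."""
    it = iter(seq)
    return all(x in it for x in sub)
```

For example, the backup-port sequences [1, 2, 3] and [2, 1, 3] can share a single supersequence of length 4 (e.g. [1, 2, 1, 3]) instead of storing all 6 entries separately; the compression factor grows with the number of sequences.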
12 citations
TL;DR: The paper proves that it is impossible to achieve perfect resilience on any non-planar graph, and shows that graph families closed under the subdivision of links admit simple and efficient failover algorithms which simply skip failed links.
Abstract: In order to provide high resilience and to react quickly to link failures, modern computer networks support fully decentralized flow rerouting, also known as local fast failover. In a nutshell, the task of a local fast failover algorithm is to pre-define fast failover rules for each node using locally available information only. These rules determine, for each incoming link on which a packet may arrive and each set of local link failures (i.e., the failed links incident to a node), the outgoing link on which the packet should be forwarded. Ideally, such a local fast failover algorithm provides perfect resilience deterministically: a packet emitted from any source can reach any target, as long as the underlying network remains connected. Feigenbaum et al. showed that it is not always possible to provide perfect resilience, but also showed how to tolerate a single failure in any network. Interestingly, not much more is currently known about the feasibility of perfect resilience.
This paper revisits perfect resilience with local fast failover, both in a model where the source can be used for forwarding decisions and in one where it cannot. We first derive several fairly general impossibility results: by establishing a connection between graph minors and resilience, we prove that it is impossible to achieve perfect resilience on any non-planar graph; furthermore, while planarity is necessary, it is not sufficient for perfect resilience.
On the positive side, we show that graph families which are closed under the subdivision of links admit simple and efficient failover algorithms which simply skip failed links. We demonstrate this technique by deriving perfect resilience for outerplanar graphs and related scenarios, as well as for scenarios where the source and target are topologically close after failures.
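The "skip failed links" idea is easiest to see on a ring, the simplest outerplanar graph. The sketch below is illustrative only (the wrap rule and hop bound are assumptions, not the paper's construction): a packet travels clockwise and reverses direction whenever its next link has failed, which delivers it as long as the ring remains connected.

```python
def deliver_on_ring(n, failed_links, src, dst):
    """Simulate local fast failover on an n-node ring (nodes 0..n-1).
    Local rule at every node: keep travelling in the current direction,
    and reverse ("wrap") whenever the next link has failed.
    failed_links: set of frozenset({u, v}) for failed adjacent pairs.
    Returns True iff the packet reaches dst."""
    direction = 1          # +1 = clockwise, -1 = counter-clockwise
    cur, hops = src, 0
    max_hops = 2 * n + 2 * len(failed_links)   # generous loop guard
    while cur != dst and hops <= max_hops:
        nxt = (cur + direction) % n
        if frozenset({cur, nxt}) in failed_links:
            direction = -direction             # wrap around the failure
            nxt = (cur + direction) % n
            if frozenset({cur, nxt}) in failed_links:
                return False                   # node completely cut off
        cur = nxt
        hops += 1
    return cur == dst
```

With one failed link the ring stays connected and every packet is delivered; with two failed links the ring splits, and packets whose destination lies in the other component loop until the guard expires, which is exactly the kind of case the paper's impossibility results formalize.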
10 citations
Cites background from "A Survey of Fast Recovery Mechanism..."
...[15], the design of local fast failover algorithms has already been studied intensively, see the recent survey [17]....
[...]
10 May 2021
TL;DR: In this article, the authors present several fast rerouting algorithms which are not limited by spanning trees, but rather extend and combine multiple spanning arborescences to improve resilience.
Abstract: To provide high availability and to be able to quickly react to link failures, most communication networks feature fast rerouting (FRR) mechanisms in the data plane. However, configuring these mechanisms to provide high resilience against multiple failures is algorithmically challenging, as rerouting rules can only depend on local failure information and need to be pre-defined. This paper is motivated by the observation that the common approach to designing fast rerouting algorithms, based on spanning trees and covering arborescences, comes at the cost of reduced resilience, as it does not fully exploit the available links in heterogeneous topologies. We present several novel fast rerouting algorithms which are not limited by spanning trees, but rather extend and combine ("graft") multiple spanning arborescences to improve resilience. We compare our algorithms analytically and empirically, and show that they can significantly improve not only the resilience, but also accelerate the preprocessing needed to generate the local fast failover rules.
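The arborescence-based approach this line of work builds on can be sketched on a toy example: decompose the network into link-disjoint trees oriented toward the destination, route along the first, and switch trees upon hitting a failed link. The graph, the two hard-coded trees, and the header-bit mechanism below are illustrative assumptions, not the paper's grafting algorithm:

```python
# Two link-disjoint spanning trees of the complete graph K4 (nodes 0..3),
# both oriented toward destination 0 and stored as next-hop maps.
NEXT_HOP = [
    {1: 0, 2: 1, 3: 2},   # tree T1, using links 0-1, 1-2, 2-3
    {2: 0, 3: 0, 1: 3},   # tree T2, using links 0-2, 0-3, 1-3
]

def route(src, failed_links, max_hops=16):
    """Route a packet toward node 0. The index of the current tree travels
    with the packet (think: a bit in the header); upon a failed next-hop
    link, the packet switches to the next tree and never switches back."""
    cur, tree = src, 0
    for _ in range(max_hops):
        if cur == 0:
            return True
        nxt = NEXT_HOP[tree][cur]
        while frozenset({cur, nxt}) in failed_links:
            tree += 1
            if tree == len(NEXT_HOP):
                return False       # all arborescences exhausted
            nxt = NEXT_HOP[tree][cur]
        cur = nxt
    return False
```

Because the two trees share no links, any single link failure leaves at least one tree intact, so every packet is delivered; the paper's point is that rigid tree decompositions like this one can waste links, which grafting arborescences avoids.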
8 citations
TL;DR: The paper presents the proposal of the new Enhanced Bit Repair (EB-REP) IP FRR mechanism, which offers significant improvements over its predecessor, the B-REP mechanism, and is an advanced contribution to solving IP FRR-related problems.
Abstract: The massive development of virtualized infrastructures, the Internet of Things (IoT), and Wireless Sensor Networks (WSN) in recent years has led to an increase in quality requirements for the management and reliability of underlay communication networks. Existing converged networks must therefore guarantee specific quantitative and qualitative parameters of different network communication services to meet customer requirements. However, the quality of the services operated is very negatively affected by an unpredictable failure of a communication link or a network node. In such situations, communication is typically interrupted for a period that is difficult to predict, which can lead to significant financial losses and other negative effects. Internet Protocol Fast Reroute (IP FRR) technology was developed for these reasons. The paper presents the proposal of the new Enhanced Bit Repair (EB-REP) IP FRR mechanism, which offers significant improvements over its predecessor, the B-REP mechanism. B-REP offers protection against a single failure and only for selected critical IP flows. EB-REP provides advanced protection against multiple failures in a protected network domain, and the protection can be provided for all network flows. EB-REP calculates alternative paths in advance based on link metrics, but also allows the construction of alternative paths independently of them. The construction of alternative FRR paths uses a standardized tunneling approach via a unique Bit-String field. Thanks to these features, EB-REP is an advanced contribution to solving IP FRR-related problems, which enables its use in many network deployments, but especially in network solutions that require reliable data transmission.
2 citations
Cites background from "A Survey of Fast Recovery Mechanism..."
...Provider topologies are diverse, and there may be situations where line metrics do not meet the requirements of a specific FRR mechanism to calculate a new next-hop router [60,61]....
[...]
...Scientific articles present various results of measurements of repair coverage of specific FRR mechanisms [60,62,63]....
[...]
16 Nov 2020
TL;DR: In this paper, a flow-based model of QoE fast rerouting is proposed. The model is based on single-path or multipath routing and on flow-conservation conditions introduced for the routing variables that govern the construction of both primary and backup paths.
Abstract: In this work, a flow-based model of QoE fast rerouting is proposed. The model is based on single-path or multipath routing and on flow-conservation conditions, which are introduced for the routing variables that govern the construction of both primary and backup paths. In addition, restrictions have been introduced to prevent the overloading of communication links with packet flows, which effectively provides bandwidth protection. The model is supplemented by conditions for protecting structural network elements (node, link, and route); the peculiarity of these conditions is that they account for possible packet losses due to congestion of router interfaces. In order to obtain analytical expressions for calculating the R-factor for each of the VoIP flows, a tensor generalization of the mathematical routing model has been performed. Based on the tensor description of the network, expressions were obtained for calculating the average multipath end-to-end delay and packet loss probability, which made it possible to formulate, in analytical form, the R-factor calculation expressions for each of the VoIP flows. The novelty of the proposed model is the formulation of the QoE fast rerouting problem in optimization form, where the optimality criterion is the maximum of an additive function: the sum of the R-factor values of the VoIP flows, weighted according to the IP priorities of their packets. The results of the study of the proposed model confirmed its efficiency and adequacy, which was especially evident under complex network topologies, high network congestion, and flow differentiation by packet IP priority.
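For context, the R-factor mentioned above can be computed from end-to-end delay and loss with a simplified version of the ITU-T G.107 E-model. The sketch below assumes a G.711 codec (equipment impairment Ie = 0, loss robustness Bpl = 4.3) and ignores the echo and advantage terms, so it only approximates the full model and is not the paper's tensor-based derivation:

```python
def r_factor(delay_ms, loss_pct):
    """Simplified ITU-T G.107 E-model transmission rating (R-factor):
    R = Ro - Id - Ie_eff with Ro = 93.2, for a G.711 call with random
    packet loss. Echo and advantage factors are ignored."""
    # Delay impairment Id: linear term plus a penalty beyond ~177.3 ms
    i_d = 0.024 * delay_ms
    if delay_ms > 177.3:
        i_d += 0.11 * (delay_ms - 177.3)
    # Effective equipment impairment Ie_eff from packet loss (in percent)
    i_e_eff = 95.0 * loss_pct / (loss_pct + 4.3)
    return 93.2 - i_d - i_e_eff
```

An R-factor above roughly 80 is generally considered satisfactory VoIP quality, which is why rerouting models like the one above optimize it directly rather than raw delay or loss.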
2 citations
References
TL;DR: The aim is to explicate a set of general concepts, of relevance across a wide range of situations and, therefore, helping communication and cooperation among a number of scientific and technical communities, including ones that are concentrating on particular types of system, of system failures, or of causes of system failures.
Abstract: This paper gives the main definitions relating to dependability, a generic concept including a special case of such attributes as reliability, availability, safety, integrity, maintainability, etc. Security brings in concerns for confidentiality, in addition to availability and integrity. Basic definitions are given first. They are then commented upon, and supplemented by additional definitions, which address the threats to dependability and security (faults, errors, failures), their attributes, and the means for their achievement (fault prevention, fault tolerance, fault removal, fault forecasting). The aim is to explicate a set of general concepts, of relevance across a wide range of situations and, therefore, helping communication and cooperation among a number of scientific and technical communities, including ones that are concentrating on particular types of system, of system failures, or of causes of system failures.
4,695 citations
01 Jul 1989
TL;DR: This fourth edition of Introduction to Optimum Design has been reorganized, rewritten in parts, and enhanced with new material, making the book even more appealing to instructors regardless of course level.
Abstract: Introduction to Optimum Design, Fourth Edition, carries on the tradition of the most widely used textbook in engineering optimization and optimum design courses. It is intended for use in a first course on engineering design and optimization at the undergraduate or graduate level in engineering departments of all disciplines, with a primary focus on mechanical, aerospace, and civil engineering courses. Through a basic and organized approach, the text describes engineering design optimization in a rigorous, yet simplified manner, illustrates various concepts and procedures with simple examples, and demonstrates their applicability to engineering design problems. Formulation of a design problem as an optimization problem is emphasized and illustrated throughout the text using Excel and MATLAB as learning and teaching aids. This fourth edition has been reorganized, rewritten in parts, and enhanced with new material, making the book even more appealing to instructors regardless of course level.
* Includes basic concepts of optimality conditions and numerical methods, described with simple and practical examples that make the material highly teachable and learnable
* Presents applications of optimization methods for structural, mechanical, aerospace, and industrial engineering problems
* Provides practical design examples that introduce students to the use of optimization methods early in the book
* Contains a chapter on several advanced optimum design topics that serve the needs of instructors who teach more advanced courses
2,595 citations
28 Jul 2014
TL;DR: This paper proposes P4 as a strawman proposal for how OpenFlow should evolve in the future, and describes how to use P4 to configure a switch to add a new hierarchical label.
Abstract: P4 is a high-level language for programming protocol-independent packet processors. P4 works in conjunction with SDN control protocols like OpenFlow. In its current form, OpenFlow explicitly specifies protocol headers on which it operates. This set has grown from 12 to 41 fields in a few years, increasing the complexity of the specification while still not providing the flexibility to add new headers. In this paper we propose P4 as a strawman proposal for how OpenFlow should evolve in the future. We have three goals: (1) Reconfigurability in the field: Programmers should be able to change the way switches process packets once they are deployed. (2) Protocol independence: Switches should not be tied to any specific network protocols. (3) Target independence: Programmers should be able to describe packet-processing functionality independently of the specifics of the underlying hardware. As an example, we describe how to use P4 to configure a switch to add a new hierarchical label.
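The match-action abstraction and the hierarchical-label example from this abstract can be mimicked in a few lines of Python. This is not P4 syntax; the field names, the label layout, and the table API are all illustrative assumptions, chosen only to show how a table populated at run time keeps the forwarding logic protocol-agnostic:

```python
# Toy, protocol-independent match-action stage (NOT P4 syntax).

def push_label(pkt, route):
    """Action: push a hierarchical label (up1, up2, down1, down2) onto
    the packet's label stack and pick the first uplink as egress port."""
    pkt.setdefault("label_stack", []).insert(0, route)
    pkt["egress_port"] = route[0]

def apply_label_table(pkt, entries):
    """Match-action table keyed on the parsed 'dst' field.
    On a hit, run push_label with the configured route; default: drop."""
    route = entries.get(pkt.get("dst"))
    if route is not None:
        push_label(pkt, route)
    else:
        pkt["egress_port"] = None   # default action: drop

# The control plane populates table entries at run time; the data-plane
# logic above hard-codes no protocol, only field names.
entries = {"10.0.1.5": (2, 7, 1, 4)}
pkt = {"dst": "10.0.1.5"}
apply_label_table(pkt, entries)
```

The point of the sketch is the separation of concerns the paper argues for: the parse/match/action pipeline is fixed, while the headers it operates on and the entries it matches are configuration, not code.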
2,214 citations
06 Oct 2005
TL;DR: This work advocates a complete refactoring of the functionality and proposes three key principles--network-level objectives, network-wide views, and direct control--that the authors believe should underlie a new architecture, called 4D after the architecture's four planes: decision, dissemination, discovery, and data.
Abstract: Today's data networks are surprisingly fragile and difficult to manage. We argue that the root of these problems lies in the complexity of the control and management planes--the software and protocols coordinating network elements--and particularly the way the decision logic and the distributed-systems issues are inexorably intertwined. We advocate a complete refactoring of the functionality and propose three key principles--network-level objectives, network-wide views, and direct control--that we believe should underlie a new architecture. Following these principles, we identify an extreme design point that we call "4D," after the architecture's four planes: decision, dissemination, discovery, and data. The 4D architecture completely separates an AS's decision logic from protocols that govern the interaction among network elements. The AS-level objectives are specified in the decision plane, and enforced through direct configuration of the state that drives how the data plane forwards packets. In the 4D architecture, the routers and switches simply forward packets at the behest of the decision plane, and collect measurement data to aid the decision plane in controlling the network. Although 4D would involve substantial changes to today's control and management planes, the format of data packets does not need to change; this eases the deployment path for the 4D architecture, while still enabling substantial innovation in network control and management. We hope that exploring an extreme design point will help focus the attention of the research and industrial communities on this crucially important and intellectually challenging area.
805 citations
15 Aug 2011
TL;DR: This paper presents the first large-scale analysis of failures in a data center network, finding that data center networks show high reliability, that commodity switches such as ToRs and AggS are highly reliable, and that network redundancy is only 40% effective in reducing the median impact of failure.
Abstract: We present the first large-scale analysis of failures in a data center network. Through our analysis, we seek to answer several fundamental questions: which devices/links are most unreliable, what causes failures, how do failures impact network traffic, and how effective is network redundancy? We answer these questions using multiple data sources commonly collected by network operators. The key findings of our study are that (1) data center networks show high reliability, (2) commodity switches such as ToRs and AggS are highly reliable, (3) load balancers dominate in terms of failure occurrences, with many short-lived software-related faults, (4) failures have the potential to cause loss of many small packets such as keep-alive messages and ACKs, and (5) network redundancy is only 40% effective in reducing the median impact of failure.
703 citations