Proceedings ArticleDOI

Troubleshooting blackbox SDN control software with minimal causal sequences

TL;DR: This paper presents a technique for automatically identifying a minimal sequence of inputs responsible for triggering a given bug, without making assumptions about the language or instrumentation of the software under test.
Abstract: Software bugs are inevitable in software-defined networking control software, and troubleshooting is a tedious, time-consuming task. In this paper we discuss how to improve control software troubleshooting by presenting a technique for automatically identifying a minimal sequence of inputs responsible for triggering a given bug, without making assumptions about the language or instrumentation of the software under test. We apply our technique to five open source SDN control platforms---Floodlight, NOX, POX, Pyretic, ONOS---and illustrate how the minimal causal sequences our system found aided the troubleshooting process.

Summary (7 min read)

1. INTRODUCTION

  • Software-defined networking (SDN) proposes to simplify network management by providing a simple logically-centralized API upon which network management programs can be written.
  • All complicated distributed systems are prone to bugs, and from their first-hand familiarity with five open source controllers and three major commercial controllers the authors can attest that SDN is no exception.
  • This act of "troubleshooting" (which precedes the act of debugging the code) is highly time-consuming, as developers spend hours poring over multigigabyte execution traces.
  • The authors therefore need to carefully control the interleaving of events in the face of asynchrony, concurrency and non-determinism in order to reproduce bugs throughout the minimization process.
  • After the bug has been fixed, the MCS can serve as a test case to prevent regression, and can help identify redundant bug reports where the MCSes are the same.

2. BACKGROUND

  • Network operating systems, the key component of SDN software infrastructure, consist of control software running on a replicated set of servers, each running a controller instance.
  • Controllers coordinate between themselves, and receive input events (e.g. link failure notifications) and statistics from switches (either physical or virtual), policy changes via a management interface, and possibly dataplane packets.
  • Invariants can be violated because the system was improperly configured (e.g. the management system [2] or a human improperly specified their goals), or because there is a bug within the SDN control plane itself.
  • The QA engineers exercise automated test scenarios that involve sequences of external events such as failures on large (software emulated or hardware) network testbeds.
  • If they detect an invariant violation, they hand the resulting trace to a developer for analysis.

3. PROBLEM DEFINITION

  • A replay of log L involves replaying the external events EL, possibly taking into account the occurrence of internal events IL as observed by the orchestrator.
  • The goal of their work is, when given a log L that exhibited an invariant violation, to find a small, replayable sequence of events that reproduces that invariant violation.
  • Note that an MCS is not necessarily globally minimal, in that there could be smaller subsequences of EL that reproduce this violation, but are not a subsequence of this MCS.
  • The authors find approximate MCSes by deciding which external events to eliminate and, more importantly, when to inject external events.
  • The authors describe this process in the next section.

4. MINIMIZING TRACES

  • Given a log L generated from testing infrastructure, their goal is to find an approximate MCS, so that a human can examine the MCS rather than the full log.
  • This involves two tasks: searching through subsequences of EL, and deciding when to inject external events for each subsequence so that, whenever possible, the invariant violation is retriggered.

4.1 Searching for Subsequences

  • Checking random subsequences of EL would be one viable but inefficient approach to achieving their first task.
  • The input subsequences chosen by delta debugging are not always valid.
  • Of the possible input sequences the authors generate (shown in Table 2), it is not sensible to replay a recovery event without a preceding failure event, nor to replay a host migration event without modifying its starting position when a preceding host migration event has been pruned.
  • These two heuristics (pruning failure/recovery pairs as a single unit, and updating initial host locations when migration events are pruned) account for the validity of all network events shown in Table 2; a rough sketch follows this list.
  • The authors do not yet support network policy changes as events, which have more complex semantic dependencies.
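A rough illustration of those two validity heuristics, assuming a hypothetical dict-based event representation (this is a sketch, not STS's actual data model):

```python
# Sketch of the validity heuristics: failure/recovery pairs are pruned or kept
# as one unit, and host starting locations are patched up when migrations are
# pruned. Event dicts and field names here are hypothetical.

def group_atomic_units(events):
    """Group each failure event with its matching recovery event so that
    delta debugging treats the pair as a single prunable unit."""
    units, pending = [], {}
    for e in events:
        if e["type"].endswith("_failure"):
            unit = [e]
            pending[(e["type"], e["target"])] = unit
            units.append(unit)
        elif e["type"].endswith("_recovery"):
            key = (e["type"].replace("_recovery", "_failure"), e["target"])
            if key in pending:
                pending.pop(key).append(e)   # attach recovery to its failure
            # a recovery with no preceding failure is dropped as invalid
        else:
            units.append([e])
    return units


def fixup_host_locations(subsequence, initial_locations):
    """After pruning, update each host migration's starting position so hosts
    do not appear to jump from a location they never occupied."""
    locations = dict(initial_locations)
    for e in subsequence:
        if e["type"] == "host_migration":
            e["old_switch_port"] = locations[e["host"]]
            locations[e["host"]] = e["new_switch_port"]
    return subsequence
```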

Internal Message

  • Table 1 lists internal messages and their masked values: OpenFlow messages (xac id, cookie, buffer id, stats), packet_out/in payloads (all values except src, dst, data), and log statements (varargs parameters to printf).
  • Previous best-effort execution minimization techniques [14, 53] also allow alternate code paths, but do not systematically consider concurrency and asynchrony.
  • It optionally obtains partial visibility into (b) by instrumenting controller software with a simple interposition layer (to be described in §5.2).
  • Internal events may differ syntactically (e.g. sequence numbers of control packets may all differ) when replaying a subsequence of the original log.
  • The authors apply this observation by defining masks over semantically extraneous fields of internal events.

Input Type Implementation

  • The authors then consider an internal event i′ observed in replay equivalent (in the sense of inheriting all of its happens-before relations) to an internal event i from the original log if and only if all unmasked fields have the same value and i occurs between i′'s preceding and succeeding inputs in the happens-before relation.
  • Some internal events from the original log that "happen before" some external input may be absent when replaying a subsequence.
  • If the authors prune a link failure, the corresponding notification message will not arise.
  • The authors' heuristic is to proceed normally if there are new internal events, always injecting the next input when its last expected predecessor either occurs or times out (sketched below).
  • This ensures that the authors always find state transition suffixes that contain a subsequence of the original internal events, but leaves open the possibility of finding divergent suffixes that lead to the invariant violation.
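One way to picture that injection rule is the sketch below; the `expected_predecessors` map, the `.fingerprint` attribute, and the `inject` callback are hypothetical stand-ins for STS's internals, and the 25 ms default mirrors the timeout value the authors discuss in §6.8.

```python
import queue
import time

def inject_when_ready(inputs, expected_predecessors, observed, inject, timeout=0.025):
    """Inject each external input only once its last expected (functionally
    equivalent) internal predecessor has been observed, or after a timeout.
    `observed` is a queue.Queue of internal events carrying a .fingerprint;
    new, unexpected internal events are simply let through as they arrive."""
    for inp in inputs:
        pending = set(expected_predecessors.get(inp, ()))
        deadline = time.time() + timeout
        while pending:
            remaining = deadline - time.time()
            if remaining <= 0:
                break                              # time out on absent predecessors
            try:
                event = observed.get(timeout=remaining)
            except queue.Empty:
                break
            pending.discard(event.fingerprint)     # cross off matched predecessors
        inject(inp)
```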

4.3 Complexity

  • The delta debugging algorithm terminates after Ω(log n) invocations of replay in the best case, and O(n) in the worst case, where n is the number of inputs in the original trace [58].
  • Each invocation of replay takes O(n) time (one iteration for PEEK and one iteration for the replay itself), for an overall runtime of Ω(n log n) best case and O(n²) worst case replayed inputs.
  • The runtime can be decreased by parallelizing delta debugging: speculatively replaying subsequences in parallel, and joining the results.
  • Storing periodic checkpoints of the system state throughout testing can also reduce runtime, as it allows us to replay starting from a recent checkpoint rather than the beginning of the trace.

5. SYSTEMS CHALLENGES

  • Thus far the authors have assumed that they are given a faulty execution trace.
  • The authors now provide an overview of how they obtain traces, and then describe their system for minimizing them.
  • The mock network manages the execution of events from a single location, which allows it to record a serial event ordering.
  • STS also optionally makes use of Open vSwitch [46] as an interposition point between controllers.
  • In designing STS the authors aimed to make it possible for engineering organizations to implement the technology within their existing QA test infrastructure.

5.2 Mitigating Non-Determinism

  • When non-determinism is acute, one might seek to prevent it altogether.
  • Short of ensuring full determinism, the authors place STS in a position to record and replay all network events in serial order, and ensure that all data structures within STS are unaffected by randomness.
  • The authors also optionally interpose on the controller software itself.
  • STS may need visibility into the control software's internal state transitions to properly maintain happens-before relations during replay.
  • Such coarse-grained visibility into internal state transitions does not handle all cases, but the authors find it suffices in practice.

5.3 Checkpointing

  • To efficiently implement the PEEK algorithm depicted in Figure 2, the authors assume the ability to record checkpoints of the state of the system under test.
  • The authors currently implement checkpointing for the POX controller by telling it to fork itself and suspend its child, transparently cloning the sockets of the parent (which constitute shared state between the parent and child processes), and later resuming the child; a rough sketch follows this list.
  • This simple mechanism does not work for controllers that use other shared state such as disk.
  • Alternatively, they can avoid PEEK and solely use the event scheduling heuristics described in §5.
  • By shortening the replay time, checkpointing coincidentally helps cope with the effects of nondeterminism, as there is less opportunity for divergence in timing.
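A bare-bones sketch of that fork-and-suspend mechanism (Unix only; the socket cloning that STS performs is omitted, and the function names are illustrative):

```python
import os
import signal

def checkpoint_controller():
    """Fork the running controller and immediately suspend the child; the
    suspended child preserves a snapshot of the parent's in-memory state.
    (STS additionally clones the parent's sockets, which is omitted here.)"""
    pid = os.fork()
    if pid == 0:
        os.kill(os.getpid(), signal.SIGSTOP)   # child: freeze until restored
        return None                            # child resumes here after SIGCONT
    return pid                                 # parent: handle to the checkpoint

def restore_checkpoint(child_pid):
    """Resume the suspended child so execution continues from the checkpointed
    state; the caller is expected to tear down the now-diverged parent."""
    os.kill(child_pid, signal.SIGCONT)
```

As the surrounding text notes, this simple mechanism does not cover shared state outside the process, such as data on disk.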

5.4 Timing Heuristics

  • The authors have found three heuristics useful for ensuring that invariant violations are consistently reproduced.
  • The authors find that keeping the wall-clock spacing between replay events close to the recorded timing helps (but does not alone suffice) to ensure that invariant violations are consistently reproduced.
  • Upon further examination the authors found in these cases that LLDP and OpenFlow echo packets periodically sent by the control software were staying in STS's buffers too long during replay, such that the control software would time out on them.
  • To avoid these differences, the authors added an option to always pass through keepalive messages (sketched below).
  • Dataplane forward/drop events constitute a substantial portion of overall events.
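A bare-bones sketch of such a pass-through filter (the message type names and the forward/buffer callbacks are illustrative, not STS's actual interfaces):

```python
# Message kinds treated as keepalives and never held in the replay buffers.
KEEPALIVE_TYPES = {"OFPT_ECHO_REQUEST", "OFPT_ECHO_REPLY", "LLDP"}

def on_intercepted_message(msg_type, msg, forward, buffer_queue):
    """Let keepalive traffic through immediately so the control software does
    not time out on its switches during slowed-down replay; hold everything
    else until the orchestrator decides to release it."""
    if msg_type in KEEPALIVE_TYPES:
        forward(msg)
    else:
        buffer_queue.append(msg)
```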

5.5 Root Causing Tools

  • Throughout their experimentation with STS, the authors often found that MCSes alone were insufficient to pinpoint the root causes of bugs.
  • The authors therefore implemented a number of complementary root causing tools, which they use along with Unix utilities to finish the debugging process.
  • STS supports an interactive replay mode similar to OFRewind [56] that allows troubleshooters to query the network state, filter events, check additional invariants, and even induce new events that were not part of the original event trace.
  • The OpenFlow commands sent by controller software are often redundant, e.g. they may override routing entries, allow them to expire, or periodically flush and later repopulate them.
  • The authors often found it informative to visualize the ordering of message deliveries and internal state transitions.

5.6 Limitations

  • Having detailed the specifics of their approach the authors now clarify the scope of their technique's use.
  • The authors' event scheduling algorithm assumes that it has visibility into the occurrence of relevant internal events.
  • For some software this may require substantial instrumentation beyond preexisting log statements, though as the authors show in §6, most bugs they encountered can be minimized without perfect visibility.
  • When non-determinism is present STS (i) replays multiple times per subsequence, and (ii) employs software techniques for mitigating non-determinism, but it may nonetheless output a non-minimal MCS.
  • In the worst case STS leaves the developer where they started: an unpruned log.

Lack of Guarantees.

  • Due to partial visibility and nondeterminism, the authors do not provide guarantees on MCS minimality.
  • The authors' goal is not to find the root cause of individual component failures in the system (e.g. misbehaving routers, link failures).
  • Performance overhead from interposing on messages may prevent STS from minimizing bugs triggered by high message rates.
  • Similarly, STS's design may prevent it from minimizing extremely large traces, as the authors evaluate in §6.
  • The authors are primarily focused on correctness bugs, not performance bugs.

6. EVALUATION

  • The authors first demonstrate STS's viability in troubleshooting real bugs.
  • Second, the authors demonstrate the boundaries of where STS works well and where it does not by finding MCSes for previously known and synthetic bugs that span a range of bug types encountered in practice.
  • The authors ultimate goal is to reduce effort spent on troubleshooting bugs.
  • Interactive visualizations and replayable event traces for all of these case studies are publicly available at ucb-sts.github.com/experiments.

6.1 New Bugs

  • The authors discovered a loop when fuzzing Pyretic's hub module, whose purpose is to flood packets along a minimum spanning tree.
  • The loop seemed to persist until Pyretic periodically flushed all flow entries.
  • During this window, a PacketIn (LLDP packet) was forwarded to POX's discovery module, which in turn raised a LinkEvent to l2_multi, which then failed because it expected SwitchUp to occur first.
  • The authors noticed after examining POX's code that there might be some corner cases related to host migrations.
  • The authors instead used the console output from the shortest subsequence that did produce the bug (21 inputs, 3 more than the MCS) to debug this trace.

6.2 Known bugs

  • The authors were able to reproduce a known problem [17] in Floodlight's distributed controller failover logic with STS.
  • The authors were able to successfully isolate the two-event MCS: the controller crash and the link failure.
  • They make this decision by electing the controller with the higher ID as the master for that link.
  • As a result, POX began randomly load balancing each subsequent packet for a given flow over the servers, causing session state to be lost.
  • The authors were able to minimize the MCS for this bug to 24 elements (there were two preexisting flow entries in each routing table, so 24 additional flows made the 26 (N+1) entries needed to overflow the table).

6.3 Synthetic bugs

  • The authors injected a crash on a code path that was highly dependent on internal timers firing within POX.
  • The authors were able to trigger the code path during fuzzing, but were unable to reproduce the bug during replay after five attempts.
  • The authors modified POX's reactive routing module to create a loop upon receiving a particular sequence of dataplane packets.
  • The authors found that the 7 event MCS was inflated by at least two events: a link failure and a link recovery that they did not believe were relevant to triggering the bug.
  • The authors created a case that would take STS very long to minimize: a memory leak that eventually caused a crash in POX.

6.4 Overall Results & Discussion

  • The authors note that with the exception of Delicate Timer Interleaving and ONOS Database Locking, STS was able to significantly reduce input traces.
  • The MCS WI column, showing the MCS sizes the authors produced when ignoring internal events entirely, indicates that their techniques for interleaving events are often crucial.
  • In this case the authors found better results by simply turning off interposition on internal events.
  • This requires many re-iterations through the code and logs using standard debugging tools (e.g. source level debuggers), and is highly tedious on human timescales.
  • Bugs that depend on fine-grained thread-interleaving or timers inside of the controller are the worst-case for STS.

6.5 Coping with Non-determinism

  • Recall that STS optionally replays each subsequence multiple times to mitigate the effects of non-determinism.
  • The authors evaluate the effectiveness of this approach by varying the maximum number of replays per subsequence while minimizing a synthetic nondeterministic loop created by Floodlight.
  • Figure 5 demonstrates that the size of the resulting MCS decreases with the maximum number of replays, at the cost of additional runtime; 10 replays per subsequence took 12.8 total hours, versus 6.1 hours without retries.

6.6 Instrumentation Complexity

  • For POX and Floodlight, the authors added shim layers to the control software to redirect gettimeofday, interpose on logging statements, and demultiplex sockets.
  • For Floodlight the authors needed 722 lines of Java, and for POX they needed 415 lines of Python.

6.7 Scalability

  • Mocking the network in a single process potentially prevents STS from triggering bugs that only appear at large scale.
  • At that point, the machine started thrashing, but this limitation could easily be removed by running on a machine with >6GB of memory.
  • Note that STS is not designed for high-throughput dataplane traffic; the authors only forward what is necessary to exercise the controller software.
  • In proactive SDN setups, dataplane events are not relevant for the control software, except perhaps for host discovery.
  • Figure 6 shows the runtime for bootstrapping FatTree networks, cutting 5% of links, and processing the controller's response.

6.8 Parameters

  • The authors found throughout their experimentation that STS leaves open several parameters that need to be set properly.
  • Setting fuzzing parameters remains an important part of experiment setup.
  • This delay implies that invariant violations such as loops or blackholes can appear before the controller(s) have time to correct the network configuration.
  • In many cases such transient invariant violations are not of interest to developers.
  • The authors found that the number of events they timed out on while isolating the MCS became stable for values above 25 milliseconds.

7. DISCUSSION

  • Based on conversations with engineers and their own industrial experience, two facts seem to hold.
  • Second, the larger the trace, the more effort is spent on debugging, since humans can only keep a small number of facts in working memory [41] .
  • As one developer puts it, "Automatically shrinking test cases to the minimal case is immensely helpful" [52] .
  • The authors are currently evaluating their technique on other distributed systems, and believe it to be generally applicable.
  • Finally, without care, a single input event may appear multiple times in the distributed logs.

9. CONCLUSION

  • SDN aims to make networks easier to manage.
  • SDN does this, however, by pushing complexity into SDN control software itself.
  • Just as sophisticated compilers are hard to write, but make programming easy, SDN control software makes network management easier, but only by forcing the developers of SDN control software to confront the challenges of asynchrony, partial failure, and other notoriously hard problems inherent to all distributed systems.
  • Current techniques for troubleshooting SDN control software are primitive; they essentially involve manual inspection of logs in the hope of identifying the triggering inputs.
  • Here the authors developed a technique for automatically identifying a minimal sequence of inputs responsible for triggering a given bug, without making assumptions about the language or instrumentation of the software under test.


Troubleshooting Blackbox SDN Control Software with Minimal Causal Sequences

Colin Scott, Andreas Wundsam, Barath Raghavan, Aurojit Panda, Andrew Or, Jefferson Lai, Eugene Huang, Zhi Liu, Ahmed El-Hassany, Sam Whitlock, H.B. Acharya, Kyriakos Zarifis, Scott Shenker

UC Berkeley · Big Switch Networks · ICSI · Tsinghua University · EPFL · USC
ABSTRACT
Software bugs are inevitable in software-defined networking con-
trol software, and troubleshooting is a tedious, time-consuming
task. In this paper we discuss how to improve control software
troubleshooting by presenting a technique for automatically iden-
tifying a minimal sequence of inputs responsible for triggering a
given bug, without making assumptions about the language or in-
strumentation of the software under test. We apply our technique to
five open source SDN control platforms—Floodlight, NOX, POX,
Pyretic, ONOS—and illustrate how the minimal causal sequences
our system found aided the troubleshooting process.
Categories and Subject Descriptors
C.2.4 [Computer-Communication Networks]: Distributed Sys-
tems—Network operating systems; D.2.5 [Software Engineering]:
Testing and Debugging—Debugging aids
Keywords
Test case minimization; Troubleshooting; SDN control software
1. INTRODUCTION
Software-defined networking (SDN) proposes to simplify net-
work management by providing a simple logically-centralized API
upon which network management programs can be written. How-
ever, the software used to support this API is anything but sim-
ple: the SDN control plane (consisting of the network operat-
ing system and higher layers) is a complicated distributed system
that must react quickly and correctly to failures, host migrations,
policy-configuration changes and other events. All complicated
distributed systems are prone to bugs, and from our first-hand fa-
miliarity with five open source controllers and three major com-
mercial controllers we can attest that SDN is no exception.
When faced with symptoms of a network problem (e.g. a persis-
tent loop) that suggest the presence of a bug in the control plane
software, software developers need to identify which events are
triggering this apparent bug before they can begin to isolate and
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
SIGCOMM’14, August 17–22, 2014, Chicago, Illinois, USA.
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-2836-4/14/08 ...$15.00.
http://dx.doi.org/10.1145/2619239.2626304.
fix it. This act of “troubleshooting” (which precedes the act of de-
bugging the code) is highly time-consuming, as developers spend
hours poring over multigigabyte execution traces.¹
Our aim is to re-
duce effort spent on troubleshooting distributed systems like SDN
control software, by automatically eliminating events from buggy
traces that are not causally related to the bug, producing a “minimal
causal sequence” (MCS) of triggering events.
Our goal of minimizing traces is in the spirit of delta debug-
ging [58], but our problem is complicated by the distributed nature
of control software: our input is not a single file fed to a single point
of execution, but an ongoing sequence of events involving multiple
actors. We therefore need to carefully control the interleaving of
events in the face of asynchrony, concurrency and non-determinism
in order to reproduce bugs throughout the minimization process.
Crucially, we aim to minimize traces without making assumptions
about the language or instrumentation of the control software.
We have built a troubleshooting system that, as far as we know,
is the first to meet these challenges (as we discuss further in §8).
Once it reduces a given execution trace to an MCS (or an approxi-
mation thereof), the developer embarks on the debugging process.
We claim that the greatly reduced size of the trace makes it easier
for the developer to figure out which code path contains the under-
lying bug, allowing them to focus their effort on the task of fixing
the problematic code itself. After the bug has been fixed, the MCS
can serve as a test case to prevent regression, and can help identify
redundant bug reports where the MCSes are the same.
Our troubleshooting system, which we call STS (SDN Trou-
bleshooting System), consists of 23,000 lines of Python, and is de-
signed so that organizations can implement the technology within
their existing QA infrastructure (discussed in §5); over the last year
we have worked with a commercial SDN company to integrate
STS. We evaluate STS in two ways. First and most significantly,
we use STS to troubleshoot seven previously unknown bugs—
involving concurrent events, faulty failover logic, broken state ma-
chines, and deadlock in a distributed database—that we found by
fuzz testing five controllers (Floodlight [16], NOX [23], POX [39],
Pyretic [19], ONOS [43]) written in three different languages (Java,
C++, Python). Second, we demonstrate the boundaries of where
STS works well by finding MCSes for previously known and syn-
thetic bugs that span a range of bug types. In our evaluation, we
quantitatively show that STS is able to minimize (non-synthetic)
bug traces by up to 98%, and we anecdotally found that reducing
traces to MCSes made it easy to understand their root causes.
¹ Software developers in general spend roughly half (49% according to one study [21]) of their time troubleshooting and debugging, and spend considerable time troubleshooting bugs that are difficult to trigger (the same study found that 70% of the reported concurrency bugs take days to months to fix).

2. BACKGROUND
Network operating systems, the key component of SDN soft-
ware infrastructure, consist of control software running on a repli-
cated set of servers, each running a controller instance. Controllers
coordinate between themselves, and receive input events (e.g. link
failure notifications) and statistics from switches (either physical or
virtual), policy changes via a management interface, and possibly
dataplane packets. In response, the controllers issue forwarding
instructions to switches. All input events are asynchronous, and
individual controllers may fail at any time. The controllers either
communicate with each other over the dataplane network, or use a
separate dedicated network, and may become partitioned.
The goal of the network control plane is to configure the switch
forwarding entries so as to enforce one or more invariants, such as
connectivity (i.e. ensuring that a route exists between every end-
point pair), isolation and access control (i.e. various limitations on
connectivity), and virtualization (i.e. ensuring that packets are han-
dled in a manner consistent with the specified virtual network). A
bug causes an invariant to be violated. Invariants can be violated
because the system was improperly configured (e.g. the manage-
ment system [2] or a human improperly specified their goals), or
because there is a bug within the SDN control plane itself. In this
paper we focus on troubleshooting bugs in the SDN control plane
after it has been given a policy configuration.²
In commercial SDN development, software developers work
with a team of QA engineers whose job is to find bugs. The QA
engineers exercise automated test scenarios that involve sequences
of external (input) events such as failures on large (software em-
ulated or hardware) network testbeds. If they detect an invariant
violation, they hand the resulting trace to a developer for analysis.
The space of possible bugs is enormous, and it is difficult and
time consuming to link the symptom of a bug (e.g. a routing loop)
to the sequence of events in the QA trace (which includes both
external events and internal monitoring data), since QA traces con-
tain a wealth of extraneous events. Consider that an hour long QA
test emulating event rates observed in production could contain 8.5
network error events per minute [22] and 500 VM migrations per
hour [49], for a total of 8.5 · 60 + 500 ≈ 1000 inputs.
3. PROBLEM DEFINITION
We represent the forwarding state of the network at a particular time as a configuration c, which contains all the forwarding entries in the network as well as the liveness of the various network elements. The control software is a system consisting of one or more controller processes that takes a sequence of external network events E = e_1 e_2 ··· e_m (e.g. link failures) as inputs, and produces a sequence of network configurations C = c_1, c_2, . . . , c_n.
An invariant is a predicate P over forwarding state (a safety condition, e.g. loop-freedom). We say that configuration c violates the invariant if P(c) is false, denoted P̄(c).
We are given a log L generated by a centralized QA test orchestrator.³ The log L contains a sequence of events τ_L = e_1 i_1 i_2 e_2 ··· e_m ··· i_p, which includes external events E_L = e_1, e_2, ··· , e_m injected by the orchestrator, and internal events I_L = i_1, i_2, ··· , i_p triggered by the control software (e.g. OpenFlow messages). The events E_L include timestamps {(e_k, t_k)} from the orchestrator's clock.
² This does not preclude us from troubleshooting misspecified policies so long as test invariants [31] are specified separately.
³ We discuss how these logs are generated in §5.
A replay of log L involves replaying the external events E_L, possibly taking into account the occurrence of internal events I_L as observed by the orchestrator. We denote a replay attempt by replay(τ). The output of replay is a sequence of configurations C_R = ĉ_1, ĉ_2, . . . , ĉ_n. Ideally replay(τ_L) reproduces the original configuration sequence, but this does not always hold.
If the configuration sequence C_L = c_1, c_2, . . . , c_n associated with the log L violated predicate P (i.e. ∃ c_i ∈ C_L . P̄(c_i)), then we say replay(·) = C_R reproduces that violation if C_R contains an equivalent faulty configuration (i.e. ∃ ĉ_i ∈ C_R . P̄(ĉ_i)).
The goal of our work is, when given a log L that exhibited an invariant violation,³ to find a small, replayable sequence of events that reproduces that invariant violation. Formally, we define a minimal causal sequence (MCS) to be a sequence τ_M where the external events E_M ⊆ τ_M are a subsequence of E_L such that replay(τ_M) reproduces the invariant violation, but for all proper subsequences E_N of E_M there is no sequence τ_N such that replay(τ_N) reproduces the violation. Note that an MCS is not necessarily globally minimal, in that there could be smaller subsequences of E_L that reproduce this violation, but are not a subsequence of this MCS.
We find approximate MCSes by deciding which external events to eliminate and, more importantly, when to inject external events. We describe this process in the next section.
4. MINIMIZING TRACES
Given a log L generated from testing infrastructure,³ our goal is to find an approximate MCS, so that a human can examine the MCS rather than the full log. This involves two tasks: searching through subsequences of E_L, and deciding when to inject external events for each subsequence so that, whenever possible, the invariant violation is retriggered.
4.1 Searching for Subsequences
Checking random subsequences of E_L would be one viable but inefficient approach to achieving our first task. We do better by employing the delta debugging algorithm [58], a divide-and-conquer algorithm for isolating fault-inducing inputs. We use delta debugging to iteratively select subsequences of E_L and replay each subsequence with some timing T. If the bug persists for a given subsequence, delta debugging ignores the other inputs, and proceeds with the search for an MCS within this subsequence. The delta debugging algorithm we implement is shown in Figure 1.
The input subsequences chosen by delta debugging are not always valid. Of the possible input sequences we generate (shown in Table 2), it is not sensible to replay a recovery event without a preceding failure event, nor to replay a host migration event without modifying its starting position when a preceding host migration event has been pruned. Our implementation of delta debugging therefore prunes failure/recovery event pairs as a single unit, and updates initial host locations whenever host migration events are pruned so that hosts do not magically appear at new locations.⁴ These two heuristics account for the validity of all network events shown in Table 2. We do not yet support network policy changes as events, which have more complex semantic dependencies.⁵
⁴ Handling invalid inputs is crucial for ensuring that the delta debugging algorithm finds a minimal causal subsequence. The algorithm we employ [58] makes three assumptions about inputs: monotonicity, unambiguity, and consistency. An event trace that violates monotonicity may contain events that "undo" the invariant violation triggered by the MCS, and may therefore exhibit slightly inflated MCSes. An event trace that violates unambiguity may exhibit multiple MCSes; delta debugging will return one of them. The most important assumption is consistency, which requires that the test outcome can always be determined. We guarantee neither monotonicity nor unambiguity, but we guarantee consistency by ensuring that subsequences are always semantically valid by applying the two heuristics described above. Zeller wrote a follow-on paper [59] that removes the need for these assumptions, but incurs an additional factor of n in complexity in doing so.
4.2 Searching for Timings
Simply exploring subsequences E_S of E_L is insufficient for finding MCSes: the timing of when we inject the external events during replay is crucial for reproducing violations.
Existing Approaches. The most natural approach to scheduling
external events is to maintain the original wall-clock timing inter-
vals between them. If this is able to find all minimization oppor-
tunities, i.e. reproduce the violation for all subsequences that are
a supersequence of some MCS, we say that the inputs are isolated.
The original applications of delta debugging [6,47,58,59] make this
assumption (where a single input is fed to a single program), as well
as QuickCheck’s input “shrinking” [12] when applied to blackbox
systems like synchronous telecommunications protocols [4].
We tried this approach, but were rarely able to reproduce invari-
ant violations. As our case studies demonstrate (§6), this is largely
due to the concurrent, asynchronous nature of distributed systems;
consider that the network can reorder or delay messages, or that
controllers may process multiple inputs simultaneously. Inputs in-
jected according to wall-clock time are not guaranteed to coincide
correctly with the current state of the control software.
We must therefore consider the control software’s internal
events. To deterministically reproduce bugs, we would need visibil-
ity into every I/O request and response (e.g. clock values or socket
reads), as well as all thread scheduling decisions for each controller.
This information is the starting point for techniques that seek to
minimize thread interleavings leading up to race conditions. These
approaches involve iteratively feeding a single input (the thread
schedule) to a single entity (a deterministic scheduler) [11, 13, 28],
or statically analyzing feasible thread schedules [26].
A crucial constraint of these approaches is that they must keep
the inputs fixed; that is, behavior must depend uniquely on the
thread schedule. Otherwise, the controllers may take a divergent
code path. If this occurs some processes might issue a previously
unobserved I/O request, and the replayer will not have a recorded
response; worse yet, a divergent process might deschedule itself at
a different point than it did originally, so that the remainder of the
recorded thread schedule is unusable to the replayer.
Because they keep the inputs fixed, these approaches strive for a
subtly different goal than ours: minimizing thread context switches
rather than input events. At best, these approaches can indirectly
minimize input events by truncating individual thread executions.
With additional information obtained by program flow analy-
sis [27, 34, 50] however, the inputs no longer need to be fixed.
The internal events considered by these program flow reduction
techniques are individual instructions executed by the programs
(obtained by instrumenting the language runtime), in addition to
I/O responses and the thread schedule. With this information they
can compute program flow dependencies, and thereby remove in-
put events from anywhere in the trace as long as they can prove that
doing so cannot possibly cause the faulty execution path to diverge.
While program flow reduction is able to minimize inputs, these techniques are not able to explore alternate code paths that still trigger the invariant violation. They are also overly conservative in removing inputs (e.g. EFF takes the transitive closure of all possible dependencies [34]) causing them to miss opportunities to remove dependencies that actually semantically commute.
⁵ If codifying the semantic dependencies of policy changes turns out to be difficult, one could just employ the more expensive version of delta debugging to account for inconsistency [59].

Internal Message        | Masked Values
OpenFlow messages       | xac id, cookie, buffer id, stats
packet_out/in payload   | all values except src, dst, data
Log statements          | varargs parameters to printf
Table 1: Internal messages and their masked values.
Allowing Divergence. Our approach is to allow processes to pro-
ceed along divergent paths rather than recording all low-level I/O
and thread scheduling decisions. This has several advantages. Un-
like the other approaches, we can find shorter alternate code paths
that still trigger the invariant violation. Previous best-effort exe-
cution minimization techniques [14, 53] also allow alternate code
paths, but do not systematically consider concurrency and asyn-
chrony.
6
We also avoid the performance overhead of recording
all I/O requests and later replaying them (e.g. EFF incurs ~10x
slowdown during replay [34]). Lastly, we avoid the extensive ef-
fort required to instrument the control software’s language runtime,
needed by the other approaches to implement a deterministic thread
scheduler, interpose on syscalls, or perform program flow analysis.
By avoiding assumptions about the language of the control soft-
ware, we were able to easily apply our system to five different con-
trol platforms written in three different languages.
Accounting for Interleavings. To reproduce the invariant violation (whenever E_S is a supersequence of an MCS) we need to inject each input event e only after all other events, including internal events, that precede it in the happens-before relation [33] from the original execution ({i | i → e}) have occurred [51].
The internal events we consider are (a) message delivery events,
either between controllers (e.g. database synchronization mes-
sages) or between controllers and switches (e.g. OpenFlow mes-
sages), and (b) state transitions within controllers (e.g. a backup
node deciding to become master). Our replay orchestrator obtains
visibility into (a) by interposing on all messages within the test en-
vironment (to be described in §5). It optionally obtains partial vis-
ibility into (b) by instrumenting controller software with a simple
interposition layer (to be described in §5.2).
Given a subsequence E_S, our goal is to find an execution that obeys the original happens-before relation. We do not control the occurrence of internal events, but we can manipulate when they are delivered through our interposition layer,⁷ and we also decide when to inject the external events E_S. The key challenges in choosing a
schedule stem from the fact that the original execution has been
modified: internal events may differ syntactically, some expected
internal events may no longer occur, and new internal events may
occur that were not observed at all in the original execution.
Functional Equivalence. Internal events may differ syntactically
(e.g. sequence numbers of control packets may all differ) when re-
playing a subsequence of the original log. We observe that many
internal events are functionally equivalent, in the sense that they
have the same effect on the state of the system with respect to trig-
gering the invariant violation. For example, flow_mod messages
may cause switches to make the same change to their forwarding
behavior even if their transaction ids differ.
We apply this observation by defining masks over semantically extraneous fields of internal events.⁸ We show the fields we mask in Table 1. Note that these masks only need to be specified once, and can later be applied programmatically.
⁶ PRES explores alternate code paths in best-effort replay of multithreaded executions, but does not minimize executions [45].
⁷ In this way we totally order messages. Without interposition on process scheduling however, the system may still be concurrent.
⁸ One consequence of applying masks is that bugs involving masked fields are outside the purview of our approach.

Input: T✗ s.t. T✗ is a trace and test(T✗) = ✗.
Output: T′✗ = ddmin(T✗) s.t. T′✗ ⊆ T✗, test(T′✗) = ✗, and T′✗ is minimal.

ddmin(T✗) = ddmin₂(T✗, ∅) where

ddmin₂(T′✗, R) =
    T′✗                                        if |T′✗| = 1 ("base case")
    ddmin₂(T₁, R)                              else if test(T₁ ∪ R) = ✗ ("in T₁")
    ddmin₂(T₂, R)                              else if test(T₂ ∪ R) = ✗ ("in T₂")
    ddmin₂(T₁, T₂ ∪ R) ∪ ddmin₂(T₂, T₁ ∪ R)    otherwise ("interference")

where test(T) denotes the state of the system after executing the trace T, ✗ denotes an invariant violation, T₁ ⊆ T′✗, T₂ ⊆ T′✗, T₁ ∪ T₂ = T′✗, T₁ ∩ T₂ = ∅, and |T₁| ≈ |T₂| ≈ |T′✗|/2 hold.

Figure 1: Automated Delta Debugging Algorithm from [58]. ⊆ and ⊂ denote subsequence relations.
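For readers who prefer executable form, the following is a minimal Python sketch of the ddmin recursion in Figure 1; `test(events)` is assumed to replay a candidate subsequence and return True when the invariant violation is reproduced (an illustration, not STS's implementation):

```python
def ddmin(trace, test):
    """Delta debugging sketch (Figure 1, [58]): `trace` is a list of input
    events; assumes test(trace) is True, i.e. the full trace triggers the bug."""
    return _ddmin2(trace, [], test)


def _ddmin2(trace, remainder, test):
    if len(trace) == 1:                            # "base case"
        return trace
    half = len(trace) // 2
    t1, t2 = trace[:half], trace[half:]
    # NOTE: a real implementation would merge the candidate events with
    # `remainder` in original-log order rather than simply concatenating.
    if test(t1 + remainder):                       # violation is "in T1"
        return _ddmin2(t1, remainder, test)
    if test(t2 + remainder):                       # violation is "in T2"
        return _ddmin2(t2, remainder, test)
    return (_ddmin2(t1, t2 + remainder, test) +    # "interference"
            _ddmin2(t2, t1 + remainder, test))
```

For example, `ddmin(external_events, replay_and_check)` would return an approximate MCS, provided `replay_and_check` can consistently reproduce the violation.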
Input Type                  | Implementation
Switch failure/recovery     | TCP teardown
Controller failure/recovery | SIGKILL
Link failure/recovery       | ofp_port_status
Controller partition        | iptables
Dataplane packet injection  | Network namespaces
Dataplane packet drop       | Dataplane interposition
Dataplane packet delay      | Dataplane interposition
Host migration              | ofp_port_status
Control message delay       | Controlplane interposition
Non-deterministic TCAMs     | Modified switches
Table 2: Input types currently supported by STS.
procedure PEEK(input subsequence)
    inferred ← [ ]
    for e_i in subsequence
        checkpoint system
        inject e_i
        Δ ← |e_{i+1}.time − e_i.time| + ε
        record events for Δ seconds
        matched ← original events & recorded events
        inferred ← inferred + [e_i] + matched
        restore checkpoint
    return inferred

Figure 2: PEEK determines which internal events from the original sequence occur for a given subsequence.
We then consider an internal event i′ observed in replay equivalent (in the sense of inheriting all of its happens-before relations) to an internal event i from the original log if and only if all unmasked fields have the same value and i occurs between i′'s preceding and succeeding inputs in the happens-before relation.
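A small illustration of how such masks might be applied when fingerprinting events for this equivalence check; the field names are illustrative, a flat dict of primitive values is assumed, and the happens-before positioning condition described above is omitted:

```python
# Masked fields for OpenFlow messages, following Table 1 (names illustrative).
MASKED_OPENFLOW_FIELDS = {"xac_id", "cookie", "buffer_id", "stats"}

def fingerprint(event_fields):
    """Drop semantically extraneous fields so that syntactically different but
    functionally equivalent events (e.g. messages whose transaction ids
    differ) compare equal."""
    kept = {k: v for k, v in event_fields.items()
            if k not in MASKED_OPENFLOW_FIELDS}
    return tuple(sorted(kept.items()))

def same_unmasked_fields(replayed_fields, original_fields):
    return fingerprint(replayed_fields) == fingerprint(original_fields)
```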
Handling Absent Internal Events. Some internal events from the
original log that “happen before” some external input may be ab-
sent when replaying a subsequence. For instance, if we prune a link
failure, the corresponding notification message will not arise.
To avoid waiting forever we infer the presence of internal
events before we replay each subsequence. Our algorithm (called
PEEK()) for inferring the presence of internal events is depicted in
Figure 2. The algorithm injects each input, records a checkpoint⁹ of the network and the control software's state, allows the system to proceed up until the following input (plus a small time ε), records the observed events, and matches the recorded events with the functionally equivalent internal events observed in the original trace.¹⁰
⁹ We discuss the implementation details of checkpointing in §5.3.
¹⁰ In the case that, due to non-determinism, an internal event occurs during PEEK() but does not occur during replay, we time out on internal events after ε seconds of their expected occurrence.
Handling New Internal Events. The last possible induced change
is the occurrence of new internal events that were not observed in
the original log. New events present multiple possibilities for where
we should inject the next input. Consider the following case: if i₂ and i₃ are internal events observed during replay that are both in the same equivalence class as a single event i₁ from the original run, we could inject the next input after i₂ or after i₃.
In the general case it is always possible to construct two state
machines that lead to differing outcomes: one that only leads to the
invariant violation when we inject the next input before a new in-
ternal event, and another only when we inject after a new internal
event. In other words, to be guaranteed to traverse any state transi-
tion suffix that leads to the violation, we must recursively branch,
trying both possibilities for every new internal event. This implies
an exponential worst case number of possibilities to be explored.
Exponential search over these possibilities is not a practical op-
tion. Our heuristic is to proceed normally if there are new internal
events, always injecting the next input when its last expected prede-
cessor either occurs or times out. This ensures that we always find
state transition suffixes that contain a subsequence of the (equiv-
alent) original internal events, but leaves open the possibility of
finding divergent suffixes that lead to the invariant violation.
Recap. We combine these heuristics to replay each subsequence
chosen by delta debugging: we compute functional equivalency for
all internal events intercepted by our test orchestrator’s interposi-
tion layer (§5), we invoke PEEK() to infer absent internal events,
and with these inferred causal dependencies we replay the input
subsequence, waiting to inject each input until each of its (func-
tionally equivalent) predecessors have occurred while allowing new
internal events through the interposition layer immediately.
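To make this recap concrete, here is one way the pieces could fit together; the `peek`, `replay_with_waits`, and `invariant_violated` helpers are hypothetical, and `ddmin` is in the spirit of Figure 1:

```python
def find_approximate_mcs(external_events, peek, replay_with_waits,
                         invariant_violated, ddmin):
    """Sketch of the overall minimization loop: delta debugging drives the
    search over input subsequences; each candidate is first PEEKed to infer
    which internal events to expect, then replayed while waiting on
    functionally equivalent predecessors before injecting each input."""
    def test(subsequence):
        expected = peek(subsequence)                 # infer surviving internal events
        final_config = replay_with_waits(subsequence, expected)
        return invariant_violated(final_config)      # True iff violation reproduced
    return ddmin(external_events, test)
```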
4.3 Complexity
The delta debugging algorithm terminates after Ω(log n) invoca-
tions of replay in the best case, and O(n) in the worst case, where
n is the number of inputs in the original trace [58]. Each invocation
of replay takes O(n) time (one iteration for PEEK() and one itera-
tion for the replay itself), for an overall runtime of Ω(n log n) best
case and O(n²) worst case replayed inputs. The runtime can be de-
creased by parallelizing delta debugging: speculatively replaying
subsequences in parallel, and joining the results. Storing periodic
checkpoints of the system state throughout testing can also reduce
runtime, as it allows us to replay starting from a recent checkpoint
rather than the beginning of the trace.
5. SYSTEMS CHALLENGES
Thus far we have assumed that we are given a faulty execution
trace. We now provide an overview of how we obtain traces, and
then describe our system for minimizing them.
Figure 3: STS runs mock network devices, and interposes on all communication channels.

Obtaining Traces. All three of the commercial SDN companies that we know of employ a team of QA engineers to fuzz test their
control software on network testbeds. This fuzz testing infrastruc-
ture consists of the control software under test, the network testbed
(which may be software or hardware), and a centralized test or-
chestrator that chooses input sequences, drives the behavior of the
testbed, and periodically checks invariants.
We do not have access to such a QA testbed, and instead built our
own. Our testbed mocks out the control plane behavior of network
devices in lightweight software switches and hosts (with support
for minimal dataplane forwarding). We then run the control soft-
ware on top of this mock network and connect the switches to the
controller(s). The mock network manages the execution of events
from a single location, which allows it to record a serial event order-
ing. This design is similar to production software QA testbeds, and
is depicted in Figure 3. One distinguishing feature of our design is
that the mock network interposes on all communication channels,
allowing it to delay or drop messages to induce failure modes that
might be seen in real, asynchronous networks.
We use our mock network to find bugs in control software. Most
commonly we generate random input sequences based on event
probabilities that we assign (cf. §6.8), and periodically check in-
variants on the network state.¹¹ We also run the mock network in-
teractively so that we can examine the state of the network and
manually induce event orderings that we believe may trigger bugs.
Performing Minimization. After discovering an invariant viola-
tion, we invoke delta debugging to minimize the recorded trace.
We use the testing infrastructure itself to replay each intermedi-
ate subsequence. During replay the mock network enforces event
orderings as needed to maintain the original happens-before rela-
tion, by using its interposition on message channels to manage the
order (functionally equivalent) messages are let through, and wait-
ing until the appropriate time to inject inputs. For example, if the
original trace included a link failure preceded by the arrival of a
heartbeat message, during replay the mock network waits until it
observes a functionally equivalent ping probe to arrive, allows the
probe through, then tells the switch to fail its link.
STS is our realization of this system, implemented in more than
23,000 lines of Python in addition to the Hassel network invari-
ant checking library [31]. STS also optionally makes use of Open
vSwitch [46] as an interposition point between controllers. We have
made the code for STS publicly available at ucb-sts.github.com/sts.
Integration With Existing Testbeds. In designing STS we aimed
¹¹ We currently support the following invariants: (a) all-to-all
reachability, (b) loop freeness, (c) blackhole freeness, (d) controller
liveness, and (e) POX ACL compliance.
to make it possible for engineering organizations to implement the
technology within their existing QA test infrastructure. Organiza-
tions can add delta debugging to their test orchestrator, and option-
ally add interposition points throughout the testbed to control event
ordering during replay. In this way they can continue running large
scale networks with the switches, middleboxes, hosts, and routing
protocols they had already chosen to include in their QA testbed.
We avoid making assumptions about the language or instrumen-
tation of the software under test in order to facilitate integration
with preexisting software. Many of the heuristics we describe be-
low are approximations that might be made more precise if we had
more visibility and control over the system, e.g. if we could deter-
ministically specify the thread schedule of each controller.
5.1 Coping with Non-Determinism
Non-determinism in concurrent executions stems from differ-
ences in system call return values, process scheduling decisions
(which can even affect the result of individual instructions, such
as x86’s interruptible block memory instructions [15]), and asyn-
chronous signal delivery. These sources of non-determinism can
affect whether STS is able to reproduce violations during replay.
The QA testing frameworks we are trying to improve do not
mitigate non-determinism. STS’s main approach to coping with
non-determinism is to replay each subsequence multiple times.
If the non-deterministic bug occurs with probability p, we can model¹² the probability¹³ that we will observe it within r replays as 1 − (1 − p)^r. This exponential works strongly in our favor; for example, even if the original bug is triggered in only 20% of replays, the probability that we will not trigger it during an intermediate replay is approximately 1% if we replay 20 times per subsequence.
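As a quick check of those numbers:

```latex
P[\text{miss in all } r \text{ replays}] = (1-p)^{r} = (1-0.2)^{20} = 0.8^{20} \approx 0.012,
\qquad
P[\text{observe at least once}] = 1-(1-p)^{r} \approx 98.8\%.
```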
5.2 Mitigating Non-Determinism
When non-determinism is acute, one might seek to prevent it al-
together. However, as discussed in §4.2, deterministic replay tech-
niques [15, 20] force the minimization process to stay on the origi-
nal code path, and incur substantial performance overhead.
Short of ensuring full determinism, we place STS in a position
to record and replay all network events in serial order, and ensure
that all data structures within STS are unaffected by randomness.
For example, we avoid using hashmaps that hash keys according to
their memory address, and sort all list return values.
We also optionally interpose on the controller software itself.
Routing the gettimeofday() syscall through STS helps ensure
timer accuracy.¹⁴ ¹⁵ When sending data over multiple sockets, the
operating system exhibits non-determinism in the order it sched-
ules I/O operations. STS optionally ensures a deterministic order
of messages by multiplexing all sockets onto a single true socket.
On the controller side STS currently adds a shim layer atop the
control software's socket library,¹⁶ although this could be achieved
transparently with a libc shim layer [20].
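A rough illustration of the timing side of this interposition; this is not STS's actual shim, and the fallback below is a crude stand-in for the interpolation described in footnote 14:

```python
import time

class ReplayClock:
    """Serve gettimeofday()-style queries from timestamps recorded during the
    original run; if the altered execution asks for more values than were
    recorded, extrapolate from the last recorded interval, and fall back to
    real time when nothing was recorded at all."""

    def __init__(self, recorded_timestamps):
        self.recorded = list(recorded_timestamps)
        self.index = 0

    def gettimeofday(self):
        if self.index < len(self.recorded):
            value = self.recorded[self.index]
        elif len(self.recorded) >= 2:
            step = self.recorded[-1] - self.recorded[-2]
            value = self.recorded[-1] + step * (self.index - len(self.recorded) + 1)
        else:
            value = time.time()
        self.index += 1
        return value
```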
STS may need visibility into the control software’s internal state
transitions to properly maintain happens-before relations during
replay. We gain visibility by making a small change to the control
¹² See §6.5 for an experimental evaluation of this model.
¹³ This probability could be improved by guiding the thread schedule towards known error-prone interleavings [44, 45].
¹⁴ When the pruned trace differs from the original, we make a best-effort guess at what the return values of these calls should be. For example, if the altered execution invokes gettimeofday() more times than we recorded in the initial run, we interpolate the timestamps of neighboring events.
¹⁵ Only supported for POX and Floodlight at the moment.
¹⁶ Only supported for POX at the moment.

Citations
Proceedings ArticleDOI
22 Jun 2015
TL;DR: This paper addresses one serious SDN-specific attack, i.e., data-to-control plane saturation attack, which overloads the infrastructure of SDN networks and introduces an efficient, lightweight and protocol-independent defense framework forSDN networks.
Abstract: This paper addresses one serious SDN-specific attack, i.e., data-to-control plane saturation attack, which overloads the infrastructure of SDN networks. In this attack, an attacker can produce a large amount of table-miss packet_in messages to consume resources in both control plane and data plane. To mitigate this security threat, we introduce an efficient, lightweight and protocol-independent defense framework for SDN networks. Our solution, called FloodGuard, contains two new techniques/modules: proactive flow rule analyzer and packet migration. To preserve network policy enforcement, proactive flow rule analyzer dynamically derives proactive flow rules by reasoning the runtime logic of the SDN/OpenFlow controller and its applications. To protect the controller from being overloaded, packet migration temporarily caches the flooding packets and submits them to the OpenFlow controller using rate limit and round-robin scheduling. We evaluate FloodGuard through a prototype implementation tested in both software and hardware environments. The results show that FloodGuard is effective with adding only minor overhead into the entire SDN/OpenFlow infrastructure.

306 citations

Journal ArticleDOI
TL;DR: This paper seeks to identify some of the many challenges where new and current researchers can still contribute to the advancement of SDN and further hasten its broadening adoption by network operators.
Abstract: Having gained momentum from its promise of centralized control over distributed network architectures at bargain costs, software-defined Networking (SDN) is an ever-increasing topic of research. SDN offers a simplified means to dynamically control multiple simple switches via a single controller program, which contrasts with current network infrastructures where individual network operators manage network devices individually. Already, SDN has realized some extraordinary use cases outside of academia with companies, such as Google, AT&T, Microsoft, and many others. However, SDN still presents many research and operational challenges for government, industry, and campus networks. Because of these challenges, many SDN solutions have developed in an ad hoc manner that are not easily adopted by other organizations. Hence, this paper seeks to identify some of the many challenges where new and current researchers can still contribute to the advancement of SDN and further hasten its broadening adoption by network operators.

185 citations

Proceedings ArticleDOI
17 Jun 2015
TL;DR: Ravana is introduced, a fault-tolerant SDN controller platform that processes the control messages transactionally and exactly once (at both the controllers and the switches), and maintains these guarantees in the face of both controller and switch crashes.
Abstract: Software-defined networking (SDN) offers greater flexibility than traditional distributed architectures, at the risk of the controller being a single point-of-failure. Unfortunately, existing fault-tolerance techniques, such as replicated state machine, are insufficient to ensure correct network behavior under controller failures. The challenge is that, in addition to the application state of the controllers, the switches maintain hard state that must be handled consistently. Thus, it is necessary to incorporate switch state into the system model to correctly offer a "logically centralized" controller. We introduce Ravana, a fault-tolerant SDN controller platform that processes the control messages transactionally and exactly once (at both the controllers and the switches). Ravana maintains these guarantees in the face of both controller and switch crashes. The key insight in Ravana is that replicated state machines can be extended with lightweight switch-side mechanisms to guarantee correctness, without involving the switches in an elaborate consensus protocol. Our prototype implementation of Ravana enables unmodified controller applications to execute in a fault-tolerant fashion. Experiments show that Ravana achieves high throughput with reasonable overhead, compared to a single controller, with a failover time under 100ms.

145 citations

Proceedings ArticleDOI
14 Jan 2015
TL;DR: The coalgebraic theory of NetKAT is developed, including a specialized version of the Brzozowski derivative, and a new efficient algorithm for deciding the equational theory using bisimulation is presented.
Abstract: NetKAT is a domain-specific language and logic for specifying and verifying network packet-processing functions. It consists of Kleene algebra with tests (KAT) augmented with primitives for testing and modifying packet headers and encoding network topologies. Previous work developed the design of the language and its standard semantics, proved the soundness and completeness of the logic, defined a PSPACE algorithm for deciding equivalence, and presented several practical applications. This paper develops the coalgebraic theory of NetKAT, including a specialized version of the Brzozowski derivative, and presents a new efficient algorithm for deciding the equational theory using bisimulation. The coalgebraic structure admits an efficient sparse representation that results in a significant reduction in the size of the state space. We discuss the details of our implementation and optimizations that exploit NetKAT's equational axioms and coalgebraic structure to yield significantly improved performance. We present results from experiments demonstrating that our tool is competitive with state-of-the-art tools on several benchmarks including all-pairs connectivity, loop-freedom, and translation validation.

102 citations

Journal ArticleDOI
TL;DR: An overview of fault management in SDN is presented, showing how different fault management threat vectors are introduced by each layer, as well as by the interface between layers.
Abstract: Software-defined networking (SDN) is an emerging paradigm that has become increasingly popular in recent years. The core idea is to separate the control and data planes, allowing the construction of network applications using high-level abstractions that are translated to network devices through a southbound interface. SDN architecture is composed of three layers: 1) infrastructure layer, responsible exclusively for data forwarding; 2) control layer, which maintains the network view and provides core network abstractions; and 3) application layer, which uses abstractions provided by the control layer to implement network applications. SDN provides features, such as flexibility and programmability, that are key enablers to meet current network requirements (e.g., multi-tenant cloud networks and elastic optical networks). However, along with its benefits, SDN also brings new issues. In this survey we focus on issues related to fault management. Different fault management threat vectors are introduced by each layer, as well as by the interface between layers. Nevertheless, besides addressing fault management issues of its architecture, SDN also must handle the same problems faced by legacy networks. However, programmability and centralized management might be used to provide flexibility to deal with those issues. This paper presents an overview of fault management in SDN. The major contributions of this paper are as follows: 1) identification of the main fault management issues in SDN and classification according to the affected layers; 2) survey of efforts that address those issues and classification according to the affected planes, issues concerned, general approaches, and features; and 3) discussion about trade-offs of different approaches and their suitability for different scenarios.

96 citations

References
Journal Article
TL;DR: The theory of information as discussed by the authors provides a yardstick for calibrating our stimulus materials and for measuring the performance of our subjects and provides a quantitative way of getting at some of these questions.
Abstract: First, the span of absolute judgment and the span of immediate memory impose severe limitations on the amount of information that we are able to receive, process, and remember. By organizing the stimulus input simultaneously into several dimensions and successively into a sequence or chunks, we manage to break (or at least stretch) this informational bottleneck. Second, the process of recoding is a very important one in human psychology and deserves much more explicit attention than it has received. In particular, the kind of linguistic recoding that people do seems to me to be the very lifeblood of the thought processes. Recoding procedures are a constant concern to clinicians, social psychologists, linguists, and anthropologists and yet, probably because recoding is less accessible to experimental manipulation than nonsense syllables or T mazes, the traditional experimental psychologist has contributed little or nothing to their analysis. Nevertheless, experimental techniques can be used, methods of recoding can be specified, behavioral indicants can be found. And I anticipate that we will find a very orderly set of relations describing what now seems an uncharted wilderness of individual differences. Third, the concepts and measures provided by the theory of information provide a quantitative way of getting at some of these questions. The theory provides us with a yardstick for calibrating our stimulus materials and for measuring the performance of our subjects. In the interests of communication I have suppressed the technical details of information measurement and have tried to express the ideas in more familiar terms; I hope this paraphrase will not lead you to think they are not useful in research. Informational concepts have already proved valuable in the study of discrimination and of language; they promise a great deal in the study of learning and memory; and it has even been proposed that they can be useful in the study of concept formation. A lot of questions that seemed fruitless twenty or thirty years ago may now be worth another look. In fact, I feel that my story here must stop just as it begins to get really interesting. And finally, what about the magical number seven? What about the seven wonders of the world, the seven seas, the seven deadly sins, the seven daughters of Atlas in the Pleiades, the seven ages of man, the seven levels of hell, the seven primary colors, the seven notes of the musical scale, and the seven days of the week? What about the seven-point rating scale, the seven categories for absolute judgment, the seven objects in the span of attention, and the seven digits in the span of immediate memory? For the present I propose to withhold judgment. Perhaps there is something deep and profound behind all these sevens, something just calling out for us to discover it. But I suspect that it is only a pernicious, Pythagorean coincidence.

19,835 citations

Journal ArticleDOI
31 Mar 2008
TL;DR: This whitepaper proposes OpenFlow: a way for researchers to run experimental protocols in the networks they use every day, based on an Ethernet switch, with an internal flow-table, and a standardized interface to add and remove flow entries.
Abstract: This whitepaper proposes OpenFlow: a way for researchers to run experimental protocols in the networks they use every day. OpenFlow is based on an Ethernet switch, with an internal flow-table, and a standardized interface to add and remove flow entries. Our goal is to encourage networking vendors to add OpenFlow to their switch products for deployment in college campus backbones and wiring closets. We believe that OpenFlow is a pragmatic compromise: on one hand, it allows researchers to run experiments on heterogeneous switches in a uniform way at line-rate and with high port-density; while on the other hand, vendors do not need to expose the internal workings of their switches. In addition to allowing researchers to evaluate their ideas in real-world traffic settings, OpenFlow could serve as a useful campus component in proposed large-scale testbeds like GENI. Two buildings at Stanford University will soon run OpenFlow networks, using commercial Ethernet switches and routers. We will work to encourage deployment at other schools; and We encourage you to consider deploying OpenFlow in your university network too

9,138 citations

Book ChapterDOI
Leslie Lamport
TL;DR: In this paper, the concept of one event happening before another in a distributed system is examined, and a distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events.
Abstract: The concept of one event happening before another in a distributed system is examined, and is shown to define a partial ordering of the events. A distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events. The use of the total ordering is illustrated with a method for solving synchronization problems. The algorithm is then specialized for synchronizing physical clocks, and a bound is derived on how far out of synchrony the clocks can become.

8,381 citations

Frequently Asked Questions (12)
Q1. What are the contributions in "Troubleshooting blackbox SDN control software with minimal causal sequences"?

In this paper the authors discuss how to improve control software troubleshooting by presenting a technique for automatically identifying a minimal sequence of inputs responsible for triggering a given bug, without making assumptions about the language or instrumentation of the software under test. 

By adding a timer before installing entries to allow for links to be discovered, the developers were able to verify that the loop no longer appeared. 

The most robust way to avoid redundant input events would be to employ perfect failure detectors [8], which log a failure iff the failure actually occurred. 

Their goal of minimizing traces is in the spirit of delta debugging [58], but their problem is complicated by the distributed nature of control software: their input is not a single file fed to a single point of execution, but an ongoing sequence of events involving multiple actors. 
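
For background, delta debugging shrinks a failure-inducing input by repeatedly replaying subsets and their complements. The following is a generic sketch of the classic ddmin loop over a list of events, with an assumed reproduces() oracle that replays a candidate subsequence and reports whether the invariant violation reappears; it is illustrative background, not the paper's implementation.

def ddmin(events, reproduces):
    # Generic delta debugging sketch: repeatedly try dropping chunks of the
    # event sequence, keeping any complement that still reproduces the bug.
    n = 2
    while len(events) >= 2:
        chunk = len(events) // n
        subsets = [events[i:i + chunk] for i in range(0, len(events), chunk)]
        reduced = False
        for i in range(len(subsets)):
            complement = [e for j, s in enumerate(subsets) if j != i for e in s]
            if reproduces(complement):
                events = complement        # the violation survives without subset i
                n = max(n - 1, 2)
                reduced = True
                break
        if not reduced:
            if n >= len(events):
                break                      # already at single-event granularity
            n = min(n * 2, len(events))    # refine the partition
    return events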

If developers do not choose to employ checkpointing, they can use their implementation of PEEK() that replays inputs from the beginning rather than from a checkpoint, thereby increasing replay runtime by a factor of n.

The authors were able to minimize the MCS for this bug to 24 elements (there were two preexisting flow entries in each routing table, so 24 additional flows made the 26 (N+1) entries needed to overflow the table). 

The authors artificially set the memory leak to happen quickly after allocating 30 (M) objects created upon switch handshakes, and interspersed 691 other input events throughout switch reconnect events. 

These two heuristics account for the validity of all network events. Handling invalid inputs is crucial for ensuring that the delta debugging algorithm finds a minimal causal subsequence.

If the non-deterministic bug occurs with probability p, the authors can model the probability that they will observe it within r replays as 1 − (1 − p)^r.
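
As a quick illustration of this model (our own example; the 99% target confidence is an assumption, not a number from the paper), the smallest number of replays r satisfying 1 − (1 − p)^r ≥ confidence can be computed as follows.

import math

def replays_needed(p, confidence=0.99):
    # Solve 1 - (1 - p)**r >= confidence for the smallest integer r.
    if not 0 < p < 1:
        raise ValueError("p must be in (0, 1)")
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))

# For example, a bug that reproduces on 20% of replays needs about 21
# replays to be observed at least once with 99% probability.
print(replays_needed(0.2))  # -> 21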

The runtime can be decreased by parallelizing delta debugging: speculatively replaying subsequences in parallel, and joining the results. 
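
A minimal sketch of that parallelization, assuming a replay(subsequence) oracle that returns True when the violation reappears and that each worker has its own isolated test environment (the function name and worker count are illustrative, not taken from the system):

from concurrent.futures import ThreadPoolExecutor

def first_reproducing_candidate(candidates, replay, workers=4):
    # Speculatively replay several candidate subsequences in parallel and
    # return the first one (in candidate order) that still reproduces the
    # invariant violation; the caller then continues minimizing from it.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [(pool.submit(replay, c), c) for c in candidates]
        for future, candidate in futures:
            if future.result():
                return candidate
    return None

Waiting in candidate order keeps the result deterministic, at the cost of occasionally blocking on a slower worker; iterating with concurrent.futures.as_completed instead would trade that determinism for lower latency.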

Their replay orchestrator obtains visibility into (a) by interposing on all messages within the test environment (to be described in §5). 

The authors characterize the other troubleshooting approaches as (i) instrumentation (tracing), (ii) bug detection (invariant checking), (iii) replay, and (iv) root cause analysis (of network device failures).