Proceedings ArticleDOI

Troubleshooting blackbox SDN control software with minimal causal sequences

TL;DR: This paper presents a technique for automatically identifying a minimal sequence of inputs responsible for triggering a given bug, without making assumptions about the language or instrumentation of the software under test.
Abstract: Software bugs are inevitable in software-defined networking control software, and troubleshooting is a tedious, time-consuming task. In this paper we discuss how to improve control software troubleshooting by presenting a technique for automatically identifying a minimal sequence of inputs responsible for triggering a given bug, without making assumptions about the language or instrumentation of the software under test. We apply our technique to five open source SDN control platforms---Floodlight, NOX, POX, Pyretic, ONOS---and illustrate how the minimal causal sequences our system found aided the troubleshooting process.

Summary (7 min read)

1. INTRODUCTION

  • Software-defined networking (SDN) proposes to simplify network management by providing a simple logically-centralized API upon which network management programs can be written.
  • All complicated distributed systems are prone to bugs, and from their first-hand familiarity with five open source controllers and three major commercial controllers the authors can attest that SDN is no exception.
  • This act of "troubleshooting" (which precedes the act of debugging the code) is highly time-consuming, as developers spend hours poring over multigigabyte execution traces.
  • The authors therefore need to carefully control the interleaving of events in the face of asynchrony, concurrency and non-determinism in order to reproduce bugs throughout the minimization process.
  • After the bug has been fixed, the MCS can serve as a test case to prevent regression, and can help identify redundant bug reports where the MCSes are the same.

2. BACKGROUND

  • Network operating systems, the key component of SDN software infrastructure, consist of control software running on a replicated set of servers, each running a controller instance.
  • Controllers coordinate between themselves, and receive input events (e.g. link failure notifications) and statistics from switches (either physical or virtual), policy changes via a management interface, and possibly dataplane packets.
  • Invariants can be violated because the system was improperly configured (e.g. the management system [2] or a human improperly specified their goals), or because there is a bug within the SDN control plane itself.
  • The QA engineers exercise automated test scenarios that involve sequences of external events such as failures on large (software emulated or hardware) network testbeds.
  • If they detect an invariant violation, they hand the resulting trace to a developer for analysis.

3. PROBLEM DEFINITION

  • A replay of log L involves replaying the external events EL, possibly taking into account the occurrence of internal events IL as observed by the orchestrator.
  • The goal of their work is, when given a log L that exhibited an invariant violation, to find a small, replayable sequence of events that reproduces that invariant violation.
  • Note that an MCS is not necessarily globally minimal, in that there could be smaller subsequences of EL that reproduce this violation, but are not a subsequence of this MCS.
  • The authors find approximate MCSes by deciding which external events to eliminate and, more importantly, when to inject external events.
  • The authors describe this process in the next section.

4. MINIMIZING TRACES

  • Given a log L generated from testing infrastructure, their goal is to find an approximate MCS, so that a human can examine the MCS rather than the full log.
  • This involves two tasks: searching through subsequences of EL, and deciding when to inject external events for each subsequence so that, whenever possible, the invariant violation is retriggered.

4.1 Searching for Subsequences

  • Checking random subsequences of EL would be one viable but inefficient approach to achieving their first task.
  • The input subsequences chosen by delta debugging are not always valid.
  • Of the possible input sequences the authors generate (shown in Table 2), it is not sensible to replay a recovery event without a preceding failure event, nor to replay a host migration event without modifying its starting position when a preceding host migration event has been pruned.
  • These two heuristics (pruning failure/recovery pairs as a single unit, and updating initial host locations when migration events are pruned) account for the validity of all network events shown in Table 2; a rough sketch follows this list.
  • The authors do not yet support network policy changes as events, which have more complex semantic dependencies.
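A rough illustration of those two validity heuristics, assuming a hypothetical dict-based event representation (this is a sketch, not STS's actual data model):

```python
# Sketch of the validity heuristics: failure/recovery pairs are pruned or kept
# as one unit, and host starting locations are patched up when migrations are
# pruned. Event dicts and field names here are hypothetical.

def group_atomic_units(events):
    """Group each failure event with its matching recovery event so that
    delta debugging treats the pair as a single prunable unit."""
    units, pending = [], {}
    for e in events:
        if e["type"].endswith("_failure"):
            unit = [e]
            pending[(e["type"], e["target"])] = unit
            units.append(unit)
        elif e["type"].endswith("_recovery"):
            key = (e["type"].replace("_recovery", "_failure"), e["target"])
            if key in pending:
                pending.pop(key).append(e)   # attach recovery to its failure
            # a recovery with no preceding failure is dropped as invalid
        else:
            units.append([e])
    return units


def fixup_host_locations(subsequence, initial_locations):
    """After pruning, update each host migration's starting position so hosts
    do not appear to jump from a location they never occupied."""
    locations = dict(initial_locations)
    for e in subsequence:
        if e["type"] == "host_migration":
            e["old_switch_port"] = locations[e["host"]]
            locations[e["host"]] = e["new_switch_port"]
    return subsequence
```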

Internal Message

  • Table 1 lists internal messages and their masked values: OpenFlow messages (xac id, cookie, buffer id, stats), packet_out/in payloads (all values except src, dst, data), and log statements (varargs parameters to printf).
  • Previous best-effort execution minimization techniques [14, 53] also allow alternate code paths, but do not systematically consider concurrency and asynchrony.
  • It optionally obtains partial visibility into (b) by instrumenting controller software with a simple interposition layer (to be described in §5.2).
  • Internal events may differ syntactically (e.g. sequence numbers of control packets may all differ) when replaying a subsequence of the original log.
  • The authors apply this observation by defining masks over semantically extraneous fields of internal events.

Input Type Implementation

  • The authors then consider an internal event i′ observed in replay equivalent (in the sense of inheriting all of its happens-before relations) to an internal event i from the original log if and only if all unmasked fields have the same value and i occurs between i′'s preceding and succeeding inputs in the happens-before relation.
  • Some internal events from the original log that "happen before" some external input may be absent when replaying a subsequence.
  • If the authors prune a link failure, the corresponding notification message will not arise.
  • The authors' heuristic is to proceed normally if there are new internal events, always injecting the next input when its last expected predecessor either occurs or times out (sketched below).
  • This ensures that the authors always find state transition suffixes that contain a subsequence of the original internal events, but leaves open the possibility of finding divergent suffixes that lead to the invariant violation.
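One way to picture that injection rule is the sketch below; the `expected_predecessors` map, the `.fingerprint` attribute, and the `inject` callback are hypothetical stand-ins for STS's internals, and the 25 ms default mirrors the timeout value the authors discuss in §6.8.

```python
import queue
import time

def inject_when_ready(inputs, expected_predecessors, observed, inject, timeout=0.025):
    """Inject each external input only once its last expected (functionally
    equivalent) internal predecessor has been observed, or after a timeout.
    `observed` is a queue.Queue of internal events carrying a .fingerprint;
    new, unexpected internal events are simply let through as they arrive."""
    for inp in inputs:
        pending = set(expected_predecessors.get(inp, ()))
        deadline = time.time() + timeout
        while pending:
            remaining = deadline - time.time()
            if remaining <= 0:
                break                              # time out on absent predecessors
            try:
                event = observed.get(timeout=remaining)
            except queue.Empty:
                break
            pending.discard(event.fingerprint)     # cross off matched predecessors
        inject(inp)
```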

4.3 Complexity

  • The delta debugging algorithm terminates after Ω(log n) invocations of replay in the best case, and O(n) in the worst case, where n is the number of inputs in the original trace [58].
  • Each invocation of replay takes O(n) time (one iteration for PEEK and one iteration for the replay itself), for an overall runtime of Ω(n log n) best case and O(n²) worst case replayed inputs.
  • The runtime can be decreased by parallelizing delta debugging: speculatively replaying subsequences in parallel, and joining the results.
  • Storing periodic checkpoints of the system state throughout testing can also reduce runtime, as it allows us to replay starting from a recent checkpoint rather than the beginning of the trace.

5. SYSTEMS CHALLENGES

  • Thus far the authors have assumed that they are given a faulty execution trace.
  • The authors now provide an overview of how they obtain traces, and then describe their system for minimizing them.
  • The mock network manages the execution of events from a single location, which allows it to record a serial event ordering.
  • STS also optionally makes use of Open vSwitch [46] as an interposition point between controllers.
  • In designing STS the authors aimed to make it possible for engineering organizations to implement the technology within their existing QA test infrastructure.

5.2 Mitigating Non-Determinism

  • When non-determinism is acute, one might seek to prevent it altogether.
  • Short of ensuring full determinism, the authors place STS in a position to record and replay all network events in serial order, and ensure that all data structures within STS are unaffected by randomness.
  • The authors also optionally interpose on the controller software itself.
  • STS may need visibility into the control software's internal state transitions to properly maintain happens-before relations during replay.
  • Such coarse-grained visibility into internal state transitions does not handle all cases, but the authors find it suffices in practice.

5.3 Checkpointing

  • To efficiently implement the PEEK algorithm depicted in Figure 2, the authors assume the ability to record checkpoints of the state of the system under test.
  • The authors currently implement checkpointing for the POX controller by telling it to fork itself and suspend its child, transparently cloning the sockets of the parent (which constitute shared state between the parent and child processes), and later resuming the child; a rough sketch follows this list.
  • This simple mechanism does not work for controllers that use other shared state such as disk.
  • Alternatively, they can avoid PEEK and solely use the event scheduling heuristics described in §5.
  • By shortening the replay time, checkpointing coincidentally helps cope with the effects of nondeterminism, as there is less opportunity for divergence in timing.
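A bare-bones sketch of that fork-and-suspend mechanism (Unix only; the socket cloning that STS performs is omitted, and the function names are illustrative):

```python
import os
import signal

def checkpoint_controller():
    """Fork the running controller and immediately suspend the child; the
    suspended child preserves a snapshot of the parent's in-memory state.
    (STS additionally clones the parent's sockets, which is omitted here.)"""
    pid = os.fork()
    if pid == 0:
        os.kill(os.getpid(), signal.SIGSTOP)   # child: freeze until restored
        return None                            # child resumes here after SIGCONT
    return pid                                 # parent: handle to the checkpoint

def restore_checkpoint(child_pid):
    """Resume the suspended child so execution continues from the checkpointed
    state; the caller is expected to tear down the now-diverged parent."""
    os.kill(child_pid, signal.SIGCONT)
```

As the surrounding text notes, this simple mechanism does not cover shared state outside the process, such as data on disk.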

5.4 Timing Heuristics

  • The authors have found three heuristics useful for ensuring that invariant violations are consistently reproduced.
  • The authors find that keeping the wall-clock spacing between replay events close to the recorded timing helps (but does not alone suffice) to ensure that invariant violations are consistently reproduced.
  • Upon further examination the authors found in these cases that LLDP and OpenFlow echo packets periodically sent by the control software were staying in STS's buffers too long during replay, such that the control software would time out on them.
  • To avoid these differences, the authors added an option to always pass through keepalive messages (sketched below).
  • Dataplane forward/drop events constitute a substantial portion of overall events.
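A bare-bones sketch of such a pass-through filter (the message type names and the forward/buffer callbacks are illustrative, not STS's actual interfaces):

```python
# Message kinds treated as keepalives and never held in the replay buffers.
KEEPALIVE_TYPES = {"OFPT_ECHO_REQUEST", "OFPT_ECHO_REPLY", "LLDP"}

def on_intercepted_message(msg_type, msg, forward, buffer_queue):
    """Let keepalive traffic through immediately so the control software does
    not time out on its switches during slowed-down replay; hold everything
    else until the orchestrator decides to release it."""
    if msg_type in KEEPALIVE_TYPES:
        forward(msg)
    else:
        buffer_queue.append(msg)
```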

5.5 Root Causing Tools

  • Throughout their experimentation with STS, the authors often found that MCSes alone were insufficient to pinpoint the root causes of bugs.
  • The authors therefore implemented a number of complementary root causing tools, which they use along with Unix utilities to finish the debugging process.
  • STS supports an interactive replay mode similar to OFRewind [56] that allows troubleshooters to query the network state, filter events, check additional invariants, and even induce new events that were not part of the original event trace.
  • The OpenFlow commands sent by controller software are often redundant, e.g. they may override routing entries, allow them to expire, or periodically flush and later repopulate them.
  • The authors often found it informative to visualize the ordering of message deliveries and internal state transitions.

5.6 Limitations

  • Having detailed the specifics of their approach the authors now clarify the scope of their technique's use.
  • The authors' event scheduling algorithm assumes that it has visibility into the occurrence of relevant internal events.
  • For some software this may require substantial instrumentation beyond preexisting log statements, though as the authors show in §6, most bugs they encountered can be minimized without perfect visibility.
  • When non-determinism is present STS (i) replays multiple times per subsequence, and (ii) employs software techniques for mitigating non-determinism, but it may nonetheless output a non-minimal MCS.
  • In the worst case STS leaves the developer where they started: an unpruned log.

Lack of Guarantees.

  • Due to partial visibility and nondeterminism, the authors do not provide guarantees on MCS minimality.
  • The authors' goal is not to find the root cause of individual component failures in the system (e.g. misbehaving routers, link failures).
  • Performance overhead from interposing on messages may prevent STS from minimizing bugs triggered by high message rates.
  • Similarly, STS's design may prevent it from minimizing extremely large traces, as the authors evaluate in §6.
  • The authors are primarily focused on correctness bugs, not performance bugs.

6. EVALUATION

  • The authors first demonstrate STS's viability in troubleshooting real bugs.
  • Second, the authors demonstrate the boundaries of where STS works well and where it does not by finding MCSes for previously known and synthetic bugs that span a range of bug types encountered in practice.
  • The authors ultimate goal is to reduce effort spent on troubleshooting bugs.
  • Interactive visualizations and replayable event traces for all of these case studies are publicly available at ucb-sts.github.com/experiments.

6.1 New Bugs

  • The authors discovered a loop when fuzzing Pyretic's hub module, whose purpose is to flood packets along a minimum spanning tree.
  • The loop seemed to persist until Pyretic periodically flushed all flow entries.
  • During this window, a PacketIn (LLDP packet) was forwarded to POX's discovery module, which in turn raised a LinkEvent to l2_multi, which then failed because it expected SwitchUp to occur first.
  • The authors noticed after examining POX's code that there might be some corner cases related to host migrations.
  • The authors instead used the console output from the shortest subsequence that did produce the bug (21 inputs, 3 more than the MCS) to debug this trace.

6.2 Known bugs

  • The authors were able to reproduce a known problem [17] in Floodlight's distributed controller failover logic with STS.
  • The authors were able to successfully isolate the two-event MCS: the controller crash and the link failure.
  • They make this decision by electing the controller with the higher ID as the master for that link.
  • As a result, POX began randomly load balancing each subsequent packet for a given flow over the servers, causing session state to be lost.
  • The authors were able to minimize the MCS for this bug to 24 elements (there were two preexisting flow entries in each routing table, so 24 additional flows made the 26 (N+1) entries needed to overflow the table).

6.3 Synthetic bugs

  • The authors injected a crash on a code path that was highly dependent on internal timers firing within POX.
  • The authors were able to trigger the code path during fuzzing, but were unable to reproduce the bug during replay after five attempts.
  • The authors modified POX's reactive routing module to create a loop upon receiving a particular sequence of dataplane packets.
  • The authors found that the 7 event MCS was inflated by at least two events: a link failure and a link recovery that they did not believe were relevant to triggering the bug.
  • The authors created a case that would take STS very long to minimize: a memory leak that eventually caused a crash in POX.

6.4 Overall Results & Discussion

  • The authors note that with the exception of Delicate Timer Interleaving and ONOS Database Locking, STS was able to significantly reduce input traces.
  • The MCS WI column, showing the MCS sizes the authors produced when ignoring internal events entirely, indicates that their techniques for interleaving events are often crucial.
  • In this case the authors found better results by simply turning off interposition on internal events.
  • This requires many re-iterations through the code and logs using standard debugging tools (e.g. source level debuggers), and is highly tedious on human timescales.
  • Bugs that depend on fine-grained thread-interleaving or timers inside of the controller are the worst-case for STS.

6.5 Coping with Non-determinism

  • Recall that STS optionally replays each subsequence multiple times to mitigate the effects of non-determinism.
  • The authors evaluate the effectiveness of this approach by varying the maximum number of replays per subsequence while minimizing a synthetic nondeterministic loop created by Floodlight.
  • Figure 5 demonstrates that the size of the resulting MCS decreases with the maximum number of replays, at the cost of additional runtime; 10 replays per subsequence took 12.8 total hours, versus 6.1 hours without retries.

6.6 Instrumentation Complexity

  • For POX and Floodlight, the authors added shim layers to the control software to redirect gettimeofday, interpose on logging statements, and demultiplex sockets.
  • For Floodlight the authors needed 722 lines of Java, and for POX they needed 415 lines of Python.

6.7 Scalability

  • Mocking the network in a single process potentially prevents STS from triggering bugs that only appear at large scale.
  • At that point, the machine started thrashing, but this limitation could easily be removed by running on a machine with >6GB of memory.
  • Note that STS is not designed for high-throughput dataplane traffic; the authors only forward what is necessary to exercise the controller software.
  • In proactive SDN setups, dataplane events are not relevant for the control software, except perhaps for host discovery.
  • Figure 6 shows the runtime for bootstrapping FatTree networks, cutting 5% of links, and processing the controller's response.

6.8 Parameters

  • The authors found throughout their experimentation that STS leaves open several parameters that need to be set properly.
  • Setting fuzzing parameters remains an important part of experiment setup.
  • This delay implies that invariant violations such as loops or blackholes can appear before the controller(s) have time to correct the network configuration.
  • In many cases such transient invariant violations are not of interest to developers.
  • The authors found that the number of events they timed out on while isolating the MCS became stable for values above 25 milliseconds.

7. DISCUSSION

  • Based on conversations with engineers and their own industrial experience, two facts seem to hold.
  • Second, the larger the trace, the more effort is spent on debugging, since humans can only keep a small number of facts in working memory [41] .
  • As one developer puts it, "Automatically shrinking test cases to the minimal case is immensely helpful" [52] .
  • The authors are currently evaluating their technique on other distributed systems, and believe it to be generally applicable.
  • Finally, without care, a single input event may appear multiple times in the distributed logs.

9. CONCLUSION

  • SDN aims to make networks easier to manage.
  • SDN does this, however, by pushing complexity into SDN control software itself.
  • Just as sophisticated compilers are hard to write, but make programming easy, SDN control software makes network management easier, but only by forcing the developers of SDN control software to confront the challenges of asynchrony, partial failure, and other notoriously hard problems inherent to all distributed systems.
  • Current techniques for troubleshooting SDN control software are primitive; they essentially involve manual inspection of logs in the hope of identifying the triggering inputs.
  • Here the authors developed a technique for automatically identifying a minimal sequence of inputs responsible for triggering a given bug, without making assumptions about the language or instrumentation of the software under test.


Troubleshooting Blackbox SDN Control Software with Minimal Causal Sequences

Colin Scott, Andreas Wundsam, Barath Raghavan, Aurojit Panda, Andrew Or, Jefferson Lai, Eugene Huang, Zhi Liu, Ahmed El-Hassany, Sam Whitlock, H.B. Acharya, Kyriakos Zarifis, Scott Shenker

UC Berkeley · Big Switch Networks · ICSI · Tsinghua University · EPFL · USC
ABSTRACT
Software bugs are inevitable in software-defined networking con-
trol software, and troubleshooting is a tedious, time-consuming
task. In this paper we discuss how to improve control software
troubleshooting by presenting a technique for automatically iden-
tifying a minimal sequence of inputs responsible for triggering a
given bug, without making assumptions about the language or in-
strumentation of the software under test. We apply our technique to
five open source SDN control platforms—Floodlight, NOX, POX,
Pyretic, ONOS—and illustrate how the minimal causal sequences
our system found aided the troubleshooting process.
Categories and Subject Descriptors
C.2.4 [Computer-Communication Networks]: Distributed Sys-
tems—Network operating systems; D.2.5 [Software Engineering]:
Testing and Debugging—Debugging aids
Keywords
Test case minimization; Troubleshooting; SDN control software
1. INTRODUCTION
Software-defined networking (SDN) proposes to simplify net-
work management by providing a simple logically-centralized API
upon which network management programs can be written. How-
ever, the software used to support this API is anything but sim-
ple: the SDN control plane (consisting of the network operat-
ing system and higher layers) is a complicated distributed system
that must react quickly and correctly to failures, host migrations,
policy-configuration changes and other events. All complicated
distributed systems are prone to bugs, and from our first-hand fa-
miliarity with five open source controllers and three major com-
mercial controllers we can attest that SDN is no exception.
When faced with symptoms of a network problem (e.g. a persis-
tent loop) that suggest the presence of a bug in the control plane
software, software developers need to identify which events are
triggering this apparent bug before they can begin to isolate and
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
SIGCOMM’14, August 17–22, 2014, Chicago, Illinois, USA.
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-2836-4/14/08 ...$15.00.
http://dx.doi.org/10.1145/2619239.2626304.
fix it. This act of “troubleshooting” (which precedes the act of de-
bugging the code) is highly time-consuming, as developers spend
hours poring over multigigabyte execution traces.¹
Our aim is to re-
duce effort spent on troubleshooting distributed systems like SDN
control software, by automatically eliminating events from buggy
traces that are not causally related to the bug, producing a “minimal
causal sequence” (MCS) of triggering events.
Our goal of minimizing traces is in the spirit of delta debug-
ging [58], but our problem is complicated by the distributed nature
of control software: our input is not a single file fed to a single point
of execution, but an ongoing sequence of events involving multiple
actors. We therefore need to carefully control the interleaving of
events in the face of asynchrony, concurrency and non-determinism
in order to reproduce bugs throughout the minimization process.
Crucially, we aim to minimize traces without making assumptions
about the language or instrumentation of the control software.
We have built a troubleshooting system that, as far as we know,
is the first to meet these challenges (as we discuss further in §8).
Once it reduces a given execution trace to an MCS (or an approxi-
mation thereof), the developer embarks on the debugging process.
We claim that the greatly reduced size of the trace makes it easier
for the developer to figure out which code path contains the under-
lying bug, allowing them to focus their effort on the task of fixing
the problematic code itself. After the bug has been fixed, the MCS
can serve as a test case to prevent regression, and can help identify
redundant bug reports where the MCSes are the same.
Our troubleshooting system, which we call STS (SDN Trou-
bleshooting System), consists of 23,000 lines of Python, and is de-
signed so that organizations can implement the technology within
their existing QA infrastructure (discussed in §5); over the last year
we have worked with a commercial SDN company to integrate
STS. We evaluate STS in two ways. First and most significantly,
we use STS to troubleshoot seven previously unknown bugs—
involving concurrent events, faulty failover logic, broken state ma-
chines, and deadlock in a distributed database—that we found by
fuzz testing five controllers (Floodlight [16], NOX [23], POX [39],
Pyretic [19], ONOS [43]) written in three different languages (Java,
C++, Python). Second, we demonstrate the boundaries of where
STS works well by finding MCSes for previously known and syn-
thetic bugs that span a range of bug types. In our evaluation, we
quantitatively show that STS is able to minimize (non-synthetic)
bug traces by up to 98%, and we anecdotally found that reducing
traces to MCSes made it easy to understand their root causes.
¹ Software developers in general spend roughly half (49% according to one study [21]) of their time troubleshooting and debugging, and spend considerable time troubleshooting bugs that are difficult to trigger (the same study found that 70% of the reported concurrency bugs take days to months to fix).

2. BACKGROUND
Network operating systems, the key component of SDN soft-
ware infrastructure, consist of control software running on a repli-
cated set of servers, each running a controller instance. Controllers
coordinate between themselves, and receive input events (e.g. link
failure notifications) and statistics from switches (either physical or
virtual), policy changes via a management interface, and possibly
dataplane packets. In response, the controllers issue forwarding
instructions to switches. All input events are asynchronous, and
individual controllers may fail at any time. The controllers either
communicate with each other over the dataplane network, or use a
separate dedicated network, and may become partitioned.
The goal of the network control plane is to configure the switch
forwarding entries so as to enforce one or more invariants, such as
connectivity (i.e. ensuring that a route exists between every end-
point pair), isolation and access control (i.e. various limitations on
connectivity), and virtualization (i.e. ensuring that packets are han-
dled in a manner consistent with the specified virtual network). A
bug causes an invariant to be violated. Invariants can be violated
because the system was improperly configured (e.g. the manage-
ment system [2] or a human improperly specified their goals), or
because there is a bug within the SDN control plane itself. In this
paper we focus on troubleshooting bugs in the SDN control plane
after it has been given a policy configuration.²
In commercial SDN development, software developers work
with a team of QA engineers whose job is to find bugs. The QA
engineers exercise automated test scenarios that involve sequences
of external (input) events such as failures on large (software em-
ulated or hardware) network testbeds. If they detect an invariant
violation, they hand the resulting trace to a developer for analysis.
The space of possible bugs is enormous, and it is difficult and
time consuming to link the symptom of a bug (e.g. a routing loop)
to the sequence of events in the QA trace (which includes both
external events and internal monitoring data), since QA traces con-
tain a wealth of extraneous events. Consider that an hour long QA
test emulating event rates observed in production could contain 8.5
network error events per minute [22] and 500 VM migrations per
hour [49], for a total of 8.5 · 60 + 500 ≈ 1000 inputs.
3. PROBLEM DEFINITION
We represent the forwarding state of the network at a particular time as a configuration c, which contains all the forwarding entries in the network as well as the liveness of the various network elements. The control software is a system consisting of one or more controller processes that takes a sequence of external network events E = e_1 e_2 ··· e_m (e.g. link failures) as inputs, and produces a sequence of network configurations C = c_1, c_2, . . . , c_n.
An invariant is a predicate P over forwarding state (a safety condition, e.g. loop-freedom). We say that configuration c violates the invariant if P(c) is false, denoted P̄(c).
We are given a log L generated by a centralized QA test orchestrator.³ The log L contains a sequence of events τ_L = e_1 i_1 i_2 e_2 ··· e_m ··· i_p, which includes external events E_L = e_1, e_2, ··· , e_m injected by the orchestrator, and internal events I_L = i_1, i_2, ··· , i_p triggered by the control software (e.g. OpenFlow messages). The events E_L include timestamps {(e_k, t_k)} from the orchestrator's clock.
² This does not preclude us from troubleshooting misspecified policies so long as test invariants [31] are specified separately.
³ We discuss how these logs are generated in §5.
A replay of log L involves replaying the external events E_L, possibly taking into account the occurrence of internal events I_L as observed by the orchestrator. We denote a replay attempt by replay(τ). The output of replay is a sequence of configurations C_R = ĉ_1, ĉ_2, . . . , ĉ_n. Ideally replay(τ_L) reproduces the original configuration sequence, but this does not always hold.
If the configuration sequence C_L = c_1, c_2, . . . , c_n associated with the log L violated predicate P (i.e. ∃ c_i ∈ C_L . P̄(c_i)), then we say replay(·) = C_R reproduces that violation if C_R contains an equivalent faulty configuration (i.e. ∃ ĉ_i ∈ C_R . P̄(ĉ_i)).
The goal of our work is, when given a log L that exhibited an invariant violation,³ to find a small, replayable sequence of events that reproduces that invariant violation. Formally, we define a minimal causal sequence (MCS) to be a sequence τ_M where the external events E_M ⊆ τ_M are a subsequence of E_L such that replay(τ_M) reproduces the invariant violation, but for all proper subsequences E_N of E_M there is no sequence τ_N such that replay(τ_N) reproduces the violation. Note that an MCS is not necessarily globally minimal, in that there could be smaller subsequences of E_L that reproduce this violation, but are not a subsequence of this MCS.
We find approximate MCSes by deciding which external events to eliminate and, more importantly, when to inject external events. We describe this process in the next section.
4. MINIMIZING TRACES
Given a log L generated from testing infrastructure,³ our goal is to find an approximate MCS, so that a human can examine the MCS rather than the full log. This involves two tasks: searching through subsequences of E_L, and deciding when to inject external events for each subsequence so that, whenever possible, the invariant violation is retriggered.
4.1 Searching for Subsequences
Checking random subsequences of E_L would be one viable but inefficient approach to achieving our first task. We do better by employing the delta debugging algorithm [58], a divide-and-conquer algorithm for isolating fault-inducing inputs. We use delta debugging to iteratively select subsequences of E_L and replay each subsequence with some timing T. If the bug persists for a given subsequence, delta debugging ignores the other inputs, and proceeds with the search for an MCS within this subsequence. The delta debugging algorithm we implement is shown in Figure 1.
The input subsequences chosen by delta debugging are not always valid. Of the possible input sequences we generate (shown in Table 2), it is not sensible to replay a recovery event without a preceding failure event, nor to replay a host migration event without modifying its starting position when a preceding host migration event has been pruned. Our implementation of delta debugging therefore prunes failure/recovery event pairs as a single unit, and updates initial host locations whenever host migration events are pruned so that hosts do not magically appear at new locations.⁴ These two heuristics account for the validity of all network events shown in Table 2. We do not yet support network policy changes as events, which have more complex semantic dependencies.⁵
⁴ Handling invalid inputs is crucial for ensuring that the delta debugging algorithm finds a minimal causal subsequence. The algorithm we employ [58] makes three assumptions about inputs: monotonicity, unambiguity, and consistency. An event trace that violates monotonicity may contain events that "undo" the invariant violation triggered by the MCS, and may therefore exhibit slightly inflated MCSes. An event trace that violates unambiguity may exhibit multiple MCSes; delta debugging will return one of them. The most important assumption is consistency, which requires that the test outcome can always be determined. We guarantee neither monotonicity nor unambiguity, but we guarantee consistency by ensuring that subsequences are always semantically valid by applying the two heuristics described above. Zeller wrote a follow-on paper [59] that removes the need for these assumptions, but incurs an additional factor of n in complexity in doing so.
4.2 Searching for Timings
Simply exploring subsequences E_S of E_L is insufficient for finding MCSes: the timing of when we inject the external events during replay is crucial for reproducing violations.
Existing Approaches. The most natural approach to scheduling
external events is to maintain the original wall-clock timing inter-
vals between them. If this is able to find all minimization oppor-
tunities, i.e. reproduce the violation for all subsequences that are
a supersequence of some MCS, we say that the inputs are isolated.
The original applications of delta debugging [6,47,58,59] make this
assumption (where a single input is fed to a single program), as well
as QuickCheck’s input “shrinking” [12] when applied to blackbox
systems like synchronous telecommunications protocols [4].
We tried this approach, but were rarely able to reproduce invari-
ant violations. As our case studies demonstrate (§6), this is largely
due to the concurrent, asynchronous nature of distributed systems;
consider that the network can reorder or delay messages, or that
controllers may process multiple inputs simultaneously. Inputs in-
jected according to wall-clock time are not guaranteed to coincide
correctly with the current state of the control software.
We must therefore consider the control software’s internal
events. To deterministically reproduce bugs, we would need visibil-
ity into every I/O request and response (e.g. clock values or socket
reads), as well as all thread scheduling decisions for each controller.
This information is the starting point for techniques that seek to
minimize thread interleavings leading up to race conditions. These
approaches involve iteratively feeding a single input (the thread
schedule) to a single entity (a deterministic scheduler) [11, 13, 28],
or statically analyzing feasible thread schedules [26].
A crucial constraint of these approaches is that they must keep
the inputs fixed; that is, behavior must depend uniquely on the
thread schedule. Otherwise, the controllers may take a divergent
code path. If this occurs some processes might issue a previously
unobserved I/O request, and the replayer will not have a recorded
response; worse yet, a divergent process might deschedule itself at
a different point than it did originally, so that the remainder of the
recorded thread schedule is unusable to the replayer.
Because they keep the inputs fixed, these approaches strive for a
subtly different goal than ours: minimizing thread context switches
rather than input events. At best, these approaches can indirectly
minimize input events by truncating individual thread executions.
With additional information obtained by program flow analy-
sis [27, 34, 50] however, the inputs no longer need to be fixed.
The internal events considered by these program flow reduction
techniques are individual instructions executed by the programs
(obtained by instrumenting the language runtime), in addition to
I/O responses and the thread schedule. With this information they
can compute program flow dependencies, and thereby remove in-
put events from anywhere in the trace as long as they can prove that
doing so cannot possibly cause the faulty execution path to diverge.
While program flow reduction is able to minimize inputs, these techniques are not able to explore alternate code paths that still trigger the invariant violation. They are also overly conservative in removing inputs (e.g. EFF takes the transitive closure of all possible dependencies [34]) causing them to miss opportunities to remove dependencies that actually semantically commute.
⁵ If codifying the semantic dependencies of policy changes turns out to be difficult, one could just employ the more expensive version of delta debugging to account for inconsistency [59].

Internal Message        | Masked Values
OpenFlow messages       | xac id, cookie, buffer id, stats
packet_out/in payload   | all values except src, dst, data
Log statements          | varargs parameters to printf
Table 1: Internal messages and their masked values.
Allowing Divergence. Our approach is to allow processes to pro-
ceed along divergent paths rather than recording all low-level I/O
and thread scheduling decisions. This has several advantages. Un-
like the other approaches, we can find shorter alternate code paths
that still trigger the invariant violation. Previous best-effort exe-
cution minimization techniques [14, 53] also allow alternate code
paths, but do not systematically consider concurrency and asyn-
chrony.
6
We also avoid the performance overhead of recording
all I/O requests and later replaying them (e.g. EFF incurs ~10x
slowdown during replay [34]). Lastly, we avoid the extensive ef-
fort required to instrument the control software’s language runtime,
needed by the other approaches to implement a deterministic thread
scheduler, interpose on syscalls, or perform program flow analysis.
By avoiding assumptions about the language of the control soft-
ware, we were able to easily apply our system to five different con-
trol platforms written in three different languages.
Accounting for Interleavings. To reproduce the invariant violation (whenever E_S is a supersequence of an MCS) we need to inject each input event e only after all other events, including internal events, that precede it in the happens-before relation [33] from the original execution ({i | i → e}) have occurred [51].
The internal events we consider are (a) message delivery events,
either between controllers (e.g. database synchronization mes-
sages) or between controllers and switches (e.g. OpenFlow mes-
sages), and (b) state transitions within controllers (e.g. a backup
node deciding to become master). Our replay orchestrator obtains
visibility into (a) by interposing on all messages within the test en-
vironment (to be described in §5). It optionally obtains partial vis-
ibility into (b) by instrumenting controller software with a simple
interposition layer (to be described in §5.2).
Given a subsequence E_S, our goal is to find an execution that obeys the original happens-before relation. We do not control the occurrence of internal events, but we can manipulate when they are delivered through our interposition layer,⁷ and we also decide when to inject the external events E_S. The key challenges in choosing a
schedule stem from the fact that the original execution has been
modified: internal events may differ syntactically, some expected
internal events may no longer occur, and new internal events may
occur that were not observed at all in the original execution.
Functional Equivalence. Internal events may differ syntactically
(e.g. sequence numbers of control packets may all differ) when re-
playing a subsequence of the original log. We observe that many
internal events are functionally equivalent, in the sense that they
have the same effect on the state of the system with respect to trig-
gering the invariant violation. For example, flow_mod messages
may cause switches to make the same change to their forwarding
behavior even if their transaction ids differ.
We apply this observation by defining masks over semantically extraneous fields of internal events.⁸ We show the fields we mask in Table 1. Note that these masks only need to be specified once, and can later be applied programmatically.
⁶ PRES explores alternate code paths in best-effort replay of multithreaded executions, but does not minimize executions [45].
⁷ In this way we totally order messages. Without interposition on process scheduling however, the system may still be concurrent.
⁸ One consequence of applying masks is that bugs involving masked fields are outside the purview of our approach.

Input: T✗ s.t. T✗ is a trace and test(T✗) = ✗.
Output: T′✗ = ddmin(T✗) s.t. T′✗ ⊆ T✗, test(T′✗) = ✗, and T′✗ is minimal.

ddmin(T✗) = ddmin₂(T✗, ∅) where

ddmin₂(T′✗, R) =
    T′✗                                        if |T′✗| = 1 ("base case")
    ddmin₂(T₁, R)                              else if test(T₁ ∪ R) = ✗ ("in T₁")
    ddmin₂(T₂, R)                              else if test(T₂ ∪ R) = ✗ ("in T₂")
    ddmin₂(T₁, T₂ ∪ R) ∪ ddmin₂(T₂, T₁ ∪ R)    otherwise ("interference")

where test(T) denotes the state of the system after executing the trace T, ✗ denotes an invariant violation, T₁ ⊆ T′✗, T₂ ⊆ T′✗, T₁ ∪ T₂ = T′✗, T₁ ∩ T₂ = ∅, and |T₁| ≈ |T₂| ≈ |T′✗|/2 hold.

Figure 1: Automated Delta Debugging Algorithm from [58]. ⊆ and ⊂ denote subsequence relations.
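For readers who prefer executable form, the following is a minimal Python sketch of the ddmin recursion in Figure 1; `test(events)` is assumed to replay a candidate subsequence and return True when the invariant violation is reproduced (an illustration, not STS's implementation):

```python
def ddmin(trace, test):
    """Delta debugging sketch (Figure 1, [58]): `trace` is a list of input
    events; assumes test(trace) is True, i.e. the full trace triggers the bug."""
    return _ddmin2(trace, [], test)


def _ddmin2(trace, remainder, test):
    if len(trace) == 1:                            # "base case"
        return trace
    half = len(trace) // 2
    t1, t2 = trace[:half], trace[half:]
    # NOTE: a real implementation would merge the candidate events with
    # `remainder` in original-log order rather than simply concatenating.
    if test(t1 + remainder):                       # violation is "in T1"
        return _ddmin2(t1, remainder, test)
    if test(t2 + remainder):                       # violation is "in T2"
        return _ddmin2(t2, remainder, test)
    return (_ddmin2(t1, t2 + remainder, test) +    # "interference"
            _ddmin2(t2, t1 + remainder, test))
```

For example, `ddmin(external_events, replay_and_check)` would return an approximate MCS, provided `replay_and_check` can consistently reproduce the violation.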
Input Type                  | Implementation
Switch failure/recovery     | TCP teardown
Controller failure/recovery | SIGKILL
Link failure/recovery       | ofp_port_status
Controller partition        | iptables
Dataplane packet injection  | Network namespaces
Dataplane packet drop       | Dataplane interposition
Dataplane packet delay      | Dataplane interposition
Host migration              | ofp_port_status
Control message delay       | Controlplane interposition
Non-deterministic TCAMs     | Modified switches
Table 2: Input types currently supported by STS.
procedure PEEK(input subsequence)
    inferred ← [ ]
    for e_i in subsequence
        checkpoint system
        inject e_i
        Δ ← |e_{i+1}.time − e_i.time| + ε
        record events for Δ seconds
        matched ← original events & recorded events
        inferred ← inferred + [e_i] + matched
        restore checkpoint
    return inferred

Figure 2: PEEK determines which internal events from the original sequence occur for a given subsequence.
We then consider an internal event i′ observed in replay equivalent (in the sense of inheriting all of its happens-before relations) to an internal event i from the original log if and only if all unmasked fields have the same value and i occurs between i′'s preceding and succeeding inputs in the happens-before relation.
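A small illustration of how such masks might be applied when fingerprinting events for this equivalence check; the field names are illustrative, a flat dict of primitive values is assumed, and the happens-before positioning condition described above is omitted:

```python
# Masked fields for OpenFlow messages, following Table 1 (names illustrative).
MASKED_OPENFLOW_FIELDS = {"xac_id", "cookie", "buffer_id", "stats"}

def fingerprint(event_fields):
    """Drop semantically extraneous fields so that syntactically different but
    functionally equivalent events (e.g. messages whose transaction ids
    differ) compare equal."""
    kept = {k: v for k, v in event_fields.items()
            if k not in MASKED_OPENFLOW_FIELDS}
    return tuple(sorted(kept.items()))

def same_unmasked_fields(replayed_fields, original_fields):
    return fingerprint(replayed_fields) == fingerprint(original_fields)
```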
Handling Absent Internal Events. Some internal events from the
original log that “happen before” some external input may be ab-
sent when replaying a subsequence. For instance, if we prune a link
failure, the corresponding notification message will not arise.
To avoid waiting forever we infer the presence of internal
events before we replay each subsequence. Our algorithm (called
PEEK()) for inferring the presence of internal events is depicted in
Figure 2. The algorithm injects each input, records a checkpoint⁹ of the network and the control software's state, allows the system to proceed up until the following input (plus a small time ε), records the observed events, and matches the recorded events with the functionally equivalent internal events observed in the original trace.¹⁰
⁹ We discuss the implementation details of checkpointing in §5.3.
¹⁰ In the case that, due to non-determinism, an internal event occurs during PEEK() but does not occur during replay, we time out on internal events after ε seconds of their expected occurrence.
Handling New Internal Events. The last possible induced change
is the occurrence of new internal events that were not observed in
the original log. New events present multiple possibilities for where
we should inject the next input. Consider the following case: if i₂ and i₃ are internal events observed during replay that are both in the same equivalence class as a single event i₁ from the original run, we could inject the next input after i₂ or after i₃.
In the general case it is always possible to construct two state
machines that lead to differing outcomes: one that only leads to the
invariant violation when we inject the next input before a new in-
ternal event, and another only when we inject after a new internal
event. In other words, to be guaranteed to traverse any state transi-
tion suffix that leads to the violation, we must recursively branch,
trying both possibilities for every new internal event. This implies
an exponential worst case number of possibilities to be explored.
Exponential search over these possibilities is not a practical op-
tion. Our heuristic is to proceed normally if there are new internal
events, always injecting the next input when its last expected prede-
cessor either occurs or times out. This ensures that we always find
state transition suffixes that contain a subsequence of the (equiv-
alent) original internal events, but leaves open the possibility of
finding divergent suffixes that lead to the invariant violation.
Recap. We combine these heuristics to replay each subsequence
chosen by delta debugging: we compute functional equivalency for
all internal events intercepted by our test orchestrator’s interposi-
tion layer (§5), we invoke PEEK() to infer absent internal events,
and with these inferred causal dependencies we replay the input
subsequence, waiting to inject each input until each of its (func-
tionally equivalent) predecessors have occurred while allowing new
internal events through the interposition layer immediately.
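To make this recap concrete, here is one way the pieces could fit together; the `peek`, `replay_with_waits`, and `invariant_violated` helpers are hypothetical, and `ddmin` is in the spirit of Figure 1:

```python
def find_approximate_mcs(external_events, peek, replay_with_waits,
                         invariant_violated, ddmin):
    """Sketch of the overall minimization loop: delta debugging drives the
    search over input subsequences; each candidate is first PEEKed to infer
    which internal events to expect, then replayed while waiting on
    functionally equivalent predecessors before injecting each input."""
    def test(subsequence):
        expected = peek(subsequence)                 # infer surviving internal events
        final_config = replay_with_waits(subsequence, expected)
        return invariant_violated(final_config)      # True iff violation reproduced
    return ddmin(external_events, test)
```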
4.3 Complexity
The delta debugging algorithm terminates after Ω(log n) invoca-
tions of replay in the best case, and O(n) in the worst case, where
n is the number of inputs in the original trace [58]. Each invocation
of replay takes O(n) time (one iteration for PEEK() and one itera-
tion for the replay itself), for an overall runtime of Ω(n log n) best
case and O(n²) worst case replayed inputs. The runtime can be de-
creased by parallelizing delta debugging: speculatively replaying
subsequences in parallel, and joining the results. Storing periodic
checkpoints of the system state throughout testing can also reduce
runtime, as it allows us to replay starting from a recent checkpoint
rather than the beginning of the trace.
5. SYSTEMS CHALLENGES
Thus far we have assumed that we are given a faulty execution
trace. We now provide an overview of how we obtain traces, and
then describe our system for minimizing them.
Figure 3: STS runs mock network devices, and interposes on all communication channels.

Obtaining Traces. All three of the commercial SDN companies that we know of employ a team of QA engineers to fuzz test their
control software on network testbeds. This fuzz testing infrastruc-
ture consists of the control software under test, the network testbed
(which may be software or hardware), and a centralized test or-
chestrator that chooses input sequences, drives the behavior of the
testbed, and periodically checks invariants.
We do not have access to such a QA testbed, and instead built our
own. Our testbed mocks out the control plane behavior of network
devices in lightweight software switches and hosts (with support
for minimal dataplane forwarding). We then run the control soft-
ware on top of this mock network and connect the switches to the
controller(s). The mock network manages the execution of events
from a single location, which allows it to record a serial event order-
ing. This design is similar to production software QA testbeds, and
is depicted in Figure 3. One distinguishing feature of our design is
that the mock network interposes on all communication channels,
allowing it to delay or drop messages to induce failure modes that
might be seen in real, asynchronous networks.
We use our mock network to find bugs in control software. Most
commonly we generate random input sequences based on event
probabilities that we assign (cf. §6.8), and periodically check in-
variants on the network state.¹¹ We also run the mock network in-
teractively so that we can examine the state of the network and
manually induce event orderings that we believe may trigger bugs.
Performing Minimization. After discovering an invariant viola-
tion, we invoke delta debugging to minimize the recorded trace.
We use the testing infrastructure itself to replay each intermedi-
ate subsequence. During replay the mock network enforces event
orderings as needed to maintain the original happens-before rela-
tion, by using its interposition on message channels to manage the
order (functionally equivalent) messages are let through, and wait-
ing until the appropriate time to inject inputs. For example, if the
original trace included a link failure preceded by the arrival of a
heartbeat message, during replay the mock network waits until it
observes a functionally equivalent ping probe to arrive, allows the
probe through, then tells the switch to fail its link.
STS is our realization of this system, implemented in more than
23,000 lines of Python in addition to the Hassel network invari-
ant checking library [31]. STS also optionally makes use of Open
vSwitch [46] as an interposition point between controllers. We have
made the code for STS publicly available at ucb-sts.github.com/sts.
Integration With Existing Testbeds. In designing STS we aimed
¹¹ We currently support the following invariants: (a) all-to-all
reachability, (b) loop freeness, (c) blackhole freeness, (d) controller
liveness, and (e) POX ACL compliance.
to make it possible for engineering organizations to implement the
technology within their existing QA test infrastructure. Organiza-
tions can add delta debugging to their test orchestrator, and option-
ally add interposition points throughout the testbed to control event
ordering during replay. In this way they can continue running large
scale networks with the switches, middleboxes, hosts, and routing
protocols they had already chosen to include in their QA testbed.
We avoid making assumptions about the language or instrumen-
tation of the software under test in order to facilitate integration
with preexisting software. Many of the heuristics we describe be-
low are approximations that might be made more precise if we had
more visibility and control over the system, e.g. if we could deter-
ministically specify the thread schedule of each controller.
5.1 Coping with Non-Determinism
Non-determinism in concurrent executions stems from differ-
ences in system call return values, process scheduling decisions
(which can even affect the result of individual instructions, such
as x86’s interruptible block memory instructions [15]), and asyn-
chronous signal delivery. These sources of non-determinism can
affect whether STS is able to reproduce violations during replay.
The QA testing frameworks we are trying to improve do not
mitigate non-determinism. STS’s main approach to coping with
non-determinism is to replay each subsequence multiple times.
If the non-deterministic bug occurs with probability p, we can model¹² the probability¹³ that we will observe it within r replays as 1 − (1 − p)^r. This exponential works strongly in our favor; for example, even if the original bug is triggered in only 20% of replays, the probability that we will not trigger it during an intermediate replay is approximately 1% if we replay 20 times per subsequence.
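As a quick check of those numbers:

```latex
P[\text{miss in all } r \text{ replays}] = (1-p)^{r} = (1-0.2)^{20} = 0.8^{20} \approx 0.012,
\qquad
P[\text{observe at least once}] = 1-(1-p)^{r} \approx 98.8\%.
```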
5.2 Mitigating Non-Determinism
When non-determinism is acute, one might seek to prevent it al-
together. However, as discussed in §4.2, deterministic replay tech-
niques [15, 20] force the minimization process to stay on the origi-
nal code path, and incur substantial performance overhead.
Short of ensuring full determinism, we place STS in a position
to record and replay all network events in serial order, and ensure
that all data structures within STS are unaffected by randomness.
For example, we avoid using hashmaps that hash keys according to
their memory address, and sort all list return values.
We also optionally interpose on the controller software itself.
Routing the gettimeofday() syscall through STS helps ensure
timer accuracy.¹⁴ ¹⁵ When sending data over multiple sockets, the
operating system exhibits non-determinism in the order it sched-
ules I/O operations. STS optionally ensures a deterministic order
of messages by multiplexing all sockets onto a single true socket.
On the controller side STS currently adds a shim layer atop the
control software's socket library,¹⁶ although this could be achieved
transparently with a libc shim layer [20].
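A rough illustration of the timing side of this interposition; this is not STS's actual shim, and the fallback below is a crude stand-in for the interpolation described in footnote 14:

```python
import time

class ReplayClock:
    """Serve gettimeofday()-style queries from timestamps recorded during the
    original run; if the altered execution asks for more values than were
    recorded, extrapolate from the last recorded interval, and fall back to
    real time when nothing was recorded at all."""

    def __init__(self, recorded_timestamps):
        self.recorded = list(recorded_timestamps)
        self.index = 0

    def gettimeofday(self):
        if self.index < len(self.recorded):
            value = self.recorded[self.index]
        elif len(self.recorded) >= 2:
            step = self.recorded[-1] - self.recorded[-2]
            value = self.recorded[-1] + step * (self.index - len(self.recorded) + 1)
        else:
            value = time.time()
        self.index += 1
        return value
```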
STS may need visibility into the control software’s internal state
transitions to properly maintain happens-before relations during
replay. We gain visibility by making a small change to the control
¹² See §6.5 for an experimental evaluation of this model.
¹³ This probability could be improved by guiding the thread schedule towards known error-prone interleavings [44, 45].
¹⁴ When the pruned trace differs from the original, we make a best-effort guess at what the return values of these calls should be. For example, if the altered execution invokes gettimeofday() more times than we recorded in the initial run, we interpolate the timestamps of neighboring events.
¹⁵ Only supported for POX and Floodlight at the moment.
¹⁶ Only supported for POX at the moment.

Citations
Proceedings ArticleDOI
22 Jun 2015
TL;DR: This paper addresses one serious SDN-specific attack, i.e., data-to-control plane saturation attack, which overloads the infrastructure of SDN networks and introduces an efficient, lightweight and protocol-independent defense framework forSDN networks.
Abstract: This paper addresses one serious SDN-specific attack, i.e., data-to-control plane saturation attack, which overloads the infrastructure of SDN networks. In this attack, an attacker can produce a large amount of table-miss packet_in messages to consume resources in both control plane and data plane. To mitigate this security threat, we introduce an efficient, lightweight and protocol-independent defense framework for SDN networks. Our solution, called FloodGuard, contains two new techniques/modules: proactive flow rule analyzer and packet migration. To preserve network policy enforcement, proactive flow rule analyzer dynamically derives proactive flow rules by reasoning the runtime logic of the SDN/OpenFlow controller and its applications. To protect the controller from being overloaded, packet migration temporarily caches the flooding packets and submits them to the OpenFlow controller using rate limit and round-robin scheduling. We evaluate FloodGuard through a prototype implementation tested in both software and hardware environments. The results show that FloodGuard is effective with adding only minor overhead into the entire SDN/OpenFlow infrastructure.

306 citations

Journal ArticleDOI
TL;DR: This paper seeks to identify some of the many challenges where new and current researchers can still contribute to the advancement of SDN and further hasten its broadening adoption by network operators.
Abstract: Having gained momentum from its promise of centralized control over distributed network architectures at bargain costs, software-defined Networking (SDN) is an ever-increasing topic of research. SDN offers a simplified means to dynamically control multiple simple switches via a single controller program, which contrasts with current network infrastructures where individual network operators manage network devices individually. Already, SDN has realized some extraordinary use cases outside of academia with companies, such as Google, AT&T, Microsoft, and many others. However, SDN still presents many research and operational challenges for government, industry, and campus networks. Because of these challenges, many SDN solutions have developed in an ad hoc manner that are not easily adopted by other organizations. Hence, this paper seeks to identify some of the many challenges where new and current researchers can still contribute to the advancement of SDN and further hasten its broadening adoption by network operators.

185 citations

Proceedings ArticleDOI
17 Jun 2015
TL;DR: Ravana is introduced, a fault-tolerant SDN controller platform that processes the control messages transactionally and exactly once (at both the controllers and the switches), and maintains these guarantees in the face of both controller and switch crashes.
Abstract: Software-defined networking (SDN) offers greater flexibility than traditional distributed architectures, at the risk of the controller being a single point-of-failure. Unfortunately, existing fault-tolerance techniques, such as replicated state machine, are insufficient to ensure correct network behavior under controller failures. The challenge is that, in addition to the application state of the controllers, the switches maintain hard state that must be handled consistently. Thus, it is necessary to incorporate switch state into the system model to correctly offer a "logically centralized" controller. We introduce Ravana, a fault-tolerant SDN controller platform that processes the control messages transactionally and exactly once (at both the controllers and the switches). Ravana maintains these guarantees in the face of both controller and switch crashes. The key insight in Ravana is that replicated state machines can be extended with lightweight switch-side mechanisms to guarantee correctness, without involving the switches in an elaborate consensus protocol. Our prototype implementation of Ravana enables unmodified controller applications to execute in a fault-tolerant fashion. Experiments show that Ravana achieves high throughput with reasonable overhead, compared to a single controller, with a failover time under 100ms.

145 citations

Proceedings ArticleDOI
14 Jan 2015
TL;DR: The coalgebraic theory of NetKAT is developed, including a specialized version of the Brzozowski derivative, and a new efficient algorithm for deciding the equational theory using bisimulation is presented.
Abstract: NetKAT is a domain-specific language and logic for specifying and verifying network packet-processing functions. It consists of Kleene algebra with tests (KAT) augmented with primitives for testing and modifying packet headers and encoding network topologies. Previous work developed the design of the language and its standard semantics, proved the soundness and completeness of the logic, defined a PSPACE algorithm for deciding equivalence, and presented several practical applications. This paper develops the coalgebraic theory of NetKAT, including a specialized version of the Brzozowski derivative, and presents a new efficient algorithm for deciding the equational theory using bisimulation. The coalgebraic structure admits an efficient sparse representation that results in a significant reduction in the size of the state space. We discuss the details of our implementation and optimizations that exploit NetKAT's equational axioms and coalgebraic structure to yield significantly improved performance. We present results from experiments demonstrating that our tool is competitive with state-of-the-art tools on several benchmarks including all-pairs connectivity, loop-freedom, and translation validation.

102 citations

Journal ArticleDOI
TL;DR: An overview of fault management in SDN is presented, showing how different fault management threat vectors are introduced by each layer, as well as by the interface between layers.
Abstract: Software-defined networking (SDN) is an emerging paradigm that has become increasingly popular in recent years. The core idea is to separate the control and data planes, allowing the construction of network applications using high-level abstractions that are translated to network devices through a southbound interface. SDN architecture is composed of three layers: 1) infrastructure layer, responsible exclusively for data forwarding; 2) control layer, which maintains the network view and provides core network abstractions; and 3) application layer, which uses abstractions provided by the control layer to implement network applications. SDN provides features, such as flexibility and programmability, that are key enablers to meet current network requirements (e.g., multi-tenant cloud networks and elastic optical networks). However, along with its benefits, SDN also brings new issues. In this survey we focus on issues related to fault management. Different fault management threat vectors are introduced by each layer, as well as by the interface between layers. Nevertheless, besides addressing fault management issues of its architecture, SDN also must handle the same problems faced by legacy networks. However, programmability and centralized management might be used to provide flexibility to deal with those issues. This paper presents an overview of fault management in SDN. The major contributions of this paper are as follows: 1) identification of the main fault management issues in SDN and classification according to the affected layers; 2) survey of efforts that address those issues and classification according to the affected planes, issues concerned, general approaches, and features; and 3) discussion about trade-offs of different approaches and their suitability for different scenarios.

96 citations

References
Journal Article
TL;DR: The theory of information as discussed by the authors provides a yardstick for calibrating our stimulus materials and for measuring the performance of our subjects and provides a quantitative way of getting at some of these questions.
Abstract: First, the span of absolute judgment and the span of immediate memory impose severe limitations on the amount of information that we are able to receive, process, and remember. By organizing the stimulus input simultaneously into several dimensions and successively into a sequence or chunks, we manage to break (or at least stretch) this informational bottleneck. Second, the process of recoding is a very important one in human psychology and deserves much more explicit attention than it has received. In particular, the kind of linguistic recoding that people do seems to me to be the very lifeblood of the thought processes. Recoding procedures are a constant concern to clinicians, social psychologists, linguists, and anthropologists and yet, probably because recoding is less accessible to experimental manipulation than nonsense syllables or T mazes, the traditional experimental psychologist has contributed little or nothing to their analysis. Nevertheless, experimental techniques can be used, methods of recoding can be specified, behavioral indicants can be found. And I anticipate that we will find a very orderly set of relations describing what now seems an uncharted wilderness of individual differences. Third, the concepts and measures provided by the theory of information provide a quantitative way of getting at some of these questions. The theory provides us with a yardstick for calibrating our stimulus materials and for measuring the performance of our subjects. In the interests of communication I have suppressed the technical details of information measurement and have tried to express the ideas in more familiar terms; I hope this paraphrase will not lead you to think they are not useful in research. Informational concepts have already proved valuable in the study of discrimination and of language; they promise a great deal in the study of learning and memory; and it has even been proposed that they can be useful in the study of concept formation. A lot of questions that seemed fruitless twenty or thirty years ago may now be worth another look. In fact, I feel that my story here must stop just as it begins to get really interesting. And finally, what about the magical number seven? What about the seven wonders of the world, the seven seas, the seven deadly sins, the seven daughters of Atlas in the Pleiades, the seven ages of man, the seven levels of hell, the seven primary colors, the seven notes of the musical scale, and the seven days of the week? What about the seven-point rating scale, the seven categories for absolute judgment, the seven objects in the span of attention, and the seven digits in the span of immediate memory? For the present I propose to withhold judgment. Perhaps there is something deep and profound behind all these sevens, something just calling out for us to discover it. But I suspect that it is only a pernicious, Pythagorean coincidence.

19,835 citations

Journal ArticleDOI
31 Mar 2008
TL;DR: This whitepaper proposes OpenFlow: a way for researchers to run experimental protocols in the networks they use every day, based on an Ethernet switch, with an internal flow-table, and a standardized interface to add and remove flow entries.
Abstract: This whitepaper proposes OpenFlow: a way for researchers to run experimental protocols in the networks they use every day. OpenFlow is based on an Ethernet switch, with an internal flow-table, and a standardized interface to add and remove flow entries. Our goal is to encourage networking vendors to add OpenFlow to their switch products for deployment in college campus backbones and wiring closets. We believe that OpenFlow is a pragmatic compromise: on one hand, it allows researchers to run experiments on heterogeneous switches in a uniform way at line-rate and with high port-density; while on the other hand, vendors do not need to expose the internal workings of their switches. In addition to allowing researchers to evaluate their ideas in real-world traffic settings, OpenFlow could serve as a useful campus component in proposed large-scale testbeds like GENI. Two buildings at Stanford University will soon run OpenFlow networks, using commercial Ethernet switches and routers. We will work to encourage deployment at other schools; and We encourage you to consider deploying OpenFlow in your university network too

9,138 citations

Book ChapterDOI
Leslie Lamport
TL;DR: In this paper, the concept of one event happening before another in a distributed system is examined, and a distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events.
Abstract: The concept of one event happening before another in a distributed system is examined, and is shown to define a partial ordering of the events. A distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events. The use of the total ordering is illustrated with a method for solving synchronization problems. The algorithm is then specialized for synchronizing physical clocks, and a bound is derived on how far out of synchrony the clocks can become.

8,381 citations

Frequently Asked Questions (12)
Q1. What are the contributions in "Troubleshooting blackbox SDN control software with minimal causal sequences"?

In this paper the authors discuss how to improve control software troubleshooting by presenting a technique for automatically identifying a minimal sequence of inputs responsible for triggering a given bug, without making assumptions about the language or instrumentation of the software under test. 

By adding a timer before installing entries to allow for links to be discovered, the developers were able to verify that the loop no longer appeared. 

The most robust way to avoid redundant input events would be to employ perfect failure detectors [8], which log a failure iff the failure actually occurred. 

Their goal of minimizing traces is in the spirit of delta debugging [58], but their problem is complicated by the distributed nature of control software: their input is not a single file fed to a single point of execution, but an ongoing sequence of events involving multiple actors. 
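
For background, delta debugging shrinks a failure-inducing input by repeatedly replaying subsets and their complements. The following is a generic sketch of the classic ddmin loop over a list of events, with an assumed reproduces() oracle that replays a candidate subsequence and reports whether the invariant violation reappears; it is illustrative background, not the paper's implementation.

def ddmin(events, reproduces):
    # Generic delta debugging sketch: repeatedly try dropping chunks of the
    # event sequence, keeping any complement that still reproduces the bug.
    n = 2
    while len(events) >= 2:
        chunk = len(events) // n
        subsets = [events[i:i + chunk] for i in range(0, len(events), chunk)]
        reduced = False
        for i in range(len(subsets)):
            complement = [e for j, s in enumerate(subsets) if j != i for e in s]
            if reproduces(complement):
                events = complement        # the violation survives without subset i
                n = max(n - 1, 2)
                reduced = True
                break
        if not reduced:
            if n >= len(events):
                break                      # already at single-event granularity
            n = min(n * 2, len(events))    # refine the partition
    return events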

If developers do not choose to employ checkpointing, they can use their implementation of PEEK() that replays inputs from the beginning rather than from a checkpoint, thereby increasing replay runtime by a factor of n.

The authors were able to minimize the MCS for this bug to 24 elements (there were two preexisting flow entries in each routing table, so 24 additional flows made the 26 (N+1) entries needed to overflow the table). 

The authors artificially set the memory leak to happen quickly after allocating 30 (M) objects created upon switch handshakes, and interspersed 691 other input events throughout switch reconnect events. 

These two heuristics account for the validity of all network events. Handling invalid inputs is crucial for ensuring that the delta debugging algorithm finds a minimal causal subsequence.

If the non-deterministic bug occurs with probability p, the authors can model the probability that they will observe it within r replays as 1 − (1 − p)^r.
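
As a quick illustration of this model (our own example; the 99% target confidence is an assumption, not a number from the paper), the smallest number of replays r satisfying 1 − (1 − p)^r ≥ confidence can be computed as follows.

import math

def replays_needed(p, confidence=0.99):
    # Solve 1 - (1 - p)**r >= confidence for the smallest integer r.
    if not 0 < p < 1:
        raise ValueError("p must be in (0, 1)")
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))

# For example, a bug that reproduces on 20% of replays needs about 21
# replays to be observed at least once with 99% probability.
print(replays_needed(0.2))  # -> 21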

The runtime can be decreased by parallelizing delta debugging: speculatively replaying subsequences in parallel, and joining the results. 
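
A minimal sketch of that parallelization, assuming a replay(subsequence) oracle that returns True when the violation reappears and that each worker has its own isolated test environment (the function name and worker count are illustrative, not taken from the system):

from concurrent.futures import ThreadPoolExecutor

def first_reproducing_candidate(candidates, replay, workers=4):
    # Speculatively replay several candidate subsequences in parallel and
    # return the first one (in candidate order) that still reproduces the
    # invariant violation; the caller then continues minimizing from it.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [(pool.submit(replay, c), c) for c in candidates]
        for future, candidate in futures:
            if future.result():
                return candidate
    return None

Waiting in candidate order keeps the result deterministic, at the cost of occasionally blocking on a slower worker; iterating with concurrent.futures.as_completed instead would trade that determinism for lower latency.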

Their replay orchestrator obtains visibility into (a) by interposing on all messages within the test environment (to be described in §5). 

The authors characterize the other troubleshooting approaches as (i) instrumentation (tracing), (ii) bug detection (invariant checking), (iii) replay, and (iv) root cause analysis (of network device failures).