
Giving Applications Access to Gb/s Networking
Jonathan M. Smith and C. Brendan S. Traw
Distributed Systems Laboratory, University of Pennsylvania
200 South 33rd St., Philadelphia, PA 19104-6389
ABSTRACT
Network fabrics with Gigabit per second (Gbps) bandwidths are available today, but these
bandwidths are not yet available to applications. The difficulties lie in the hardware and software
architecture through which application data travels between the network and host memory. The
hardware portion of the architecture is often called a host interface and the remainder of the pro-
tocol stack is implemented in host software.
In this paper, we outline a variety of approaches to the architecture of such systems, exam-
ine several design points, and study one example in detail. The detailed example, an ATM Host
Interface and Operating System support built at the University of Pennsylvania, illustrates design
tradeoffs for hardware and software, and some of the implications of these tradeoffs on applica-
tions performance.
1. Introduction
The past several years have seen a profusion of efforts to design and implement very-high speed networks which
deliver this speed ‘‘end-to-end’’. A good example is the AURORA Gigabit Testbed [6], one of several such testbeds
[2]. In AURORA, much of the focus has been on the development of technologies needed to deliver this performance
to workstation-class machines, rather than supercomputers. We believe these machines will be the majority of end-
points in future Gbps networks.
The difficulty posed by the choice of workstations is the mismatch between the performance of the machines
and the bandwidth provided by the network infrastructure such as switches and transmission lines. Specifically, the
network bandwidths are within an order of magnitude of the memory bandwidths of most workstations, so the burden
on a host’s memory architecture must be minimized for maximum performance; as Clark and Tennenhouse [5] point
out, this forces careful design of protocol processing architectures.
Efficiency can be achieved through many design features, but the main options [29] are: optimizing the pro-
cessing functions in the protocol architecture, optimizing the operating system support for data transport, and careful
placement of the hardware required for network attachment. In this paper, we will focus on operating system and
architectural issues, as we feel that high-performance protocol architecture features such as ordering, errors, dupli-
cates, coordination, and format conversion have been well-covered by others; see for example Feldmeier [16].
The remainder of Section 1 outlines approaches to host interface hardware and supporting software. Section
2 describes the design choices made in an implementation of an ATM host interface for the IBM RISC System/6000
workstation [27]. Section 3 analyzes the performance implications of the design choices for applications, and Sec-
tion 4 concludes the paper.
1.1. Host Interfaces
The design and implementation of host interfaces has been of interest since the earliest network implementations†.
Each succeeding generation has dealt with different types of hosts, networks, protocol architectures and networked
applications. Goals have included low cost, high throughput and low delay. Implementations have been optimized
towards achieving one or more of these goals in their operational environments. Some of the key implementation
decisions have been: (1) the portion of protocol architecture functions performed by the interface; (2) signaling
between host and interface; and (3) the placement of the interface in the host computer’s architecture. Much of the
migration towards hardware is intended to obtain an implementation-specific performance advantage - as Watson
and Mamrak [29] point out, performance is often due as much to implementation techniques as to careful protocol
design. The key question may be the selection of functions to optimize by placement in hardware.

† Detailed information on some of these interfaces and supporting software is available in a Special Issue of the
IEEE Journal on Selected Areas in Communications [23].

Preprint, July 1993 IEEE Network Special Issue, pp. 44-52
1.2. Host Interface Attachment
[Figure 1: General Host Interface Architecture. The Host Interface attaches to the System Bus alongside the
Processor and Memory, translating between memory objects (e.g., words) and network objects (e.g., cells).]
Figure 1 illustrates a general architecture for a host interface and will allow us to discuss several options for host
attachment. Davie [11, 13] and Ahlgren [15] have also discussed such options. We assess their performance poten-
tial for receiving data; sending data is similar.
1. The Host Interface is capable of Direct-Memory Access (DMA), which means that it can communicate with
the system memory directly, without processor intervention. The typical system (e.g., UNIX) uses the Host
Interface DMA capability to copy the data from the network into a buffer managed by the operating system.
The data is then copied by the CPU from the system buffer to a buffer owned by the application, which also
resides in the system memory. Thus, a given piece of data travels over the System Bus three times: Host
Interface to Memory, Memory to CPU, and CPU to Memory.
2. The Host Interface is capable of DMA, and the operating system is able to arrange for data to be transferred
directly to the application address space. A variant of this scheme would allow the host interface to directly
allocate and free host memory, in effect turning management of the host’s memory into a peer-peer model
rather than a master-slave model. Thus, a given piece of data travels over the System Bus once, from Host
Interface to Memory. One potential problem with this approach is that transport protocols may wish to defer
data transfer to applications until header processing is complete.
3. The Host Interface has a processor-addressable memory area, which the operating system manages. When
data arrives in the Host Interface’s buffers, the operating system copies this data into the user address space.
In this model, the data must travel the bus twice, once from Host Interface to CPU, and then from the CPU to
Memory.
4. The Host Interface has a processor-addressable memory area, where the application buffers are located. This
means that data never traverses the system memory bus, or rather, traversal is deferred until it is referenced by
the processor. This makes the host interface ‘‘look like’’ a piece of system memory allocated to the applica-
tion.
5. The Host Interface can be connected directly to the CPU [6], as in augmenting the processing unit with a co-
processor. As in Scheme 4, there is no memory bus traversal, and further, the connection is to a system
component which typically operates at speeds higher than memory bandwidths.
Each of these schemes is affected by a number of other considerations.
First, most modern architectures include a cache, which decreases the access latency of frequently used data,
but must be kept in a state consistent with system memory. The cache is typically architecturally ‘‘close’’ to the pro-
cessing unit, so it either must be kept consistent or flushed when new data arrive. Maintaining consistency is con-
siderably easier when the data passes through the processor - Scheme 2 must flush the cache for areas affected by
the DMA, and Scheme 4 must flush the cache, either under control of the CPU or the Host Interface. This problem
can be avoided by putting the data in non-cached memory, but this may have other negative performance conse-
quences. Schemes 1 and 3 should be able to obtain up-to-date cache copies when the data is copied into the user
address space.
Second, host interfaces may also be used to support applications which require specialized peripherals, such
as video conferencing. Thus it is important to keep a good balance between I/O and memory accessibility. The
DMA based schemes do this, but the memory-on-interface schemes (Schemes 3 and 4) might present difficulties in
I/O operations to and from other devices.
Finally, Schemes 1 and 3 involve the processor in copying data across address-space boundaries. Thus, the
processor must reduce its processing capacity by the amount of time spent copying data. Scheme 5’s co-processor
approach likewise shares processing-unit capacity between computational load and network traffic.
Any evaluation of a host attachment strategy is subject to the specifics of the host, I/O bus, interface technolo-
gies and software, as well as the target applications. For these specifics, there are other constraints such as econom-
ics, portability, etc. Thus, there is no ‘‘best’’ approach - they can only be ranked relative to these constraints.
1.3. Interaction with Software
Operating system software plays a key role in the achievement of high end-to-end networking performance. The
abstraction provided by the host interface is that of a device which can transfer data between a network and the
computing system’s memory. The software builds an application communications model over this abstraction. A
significant constraint on such software is its embedding in the framework of an operating system which satisfies
other (possibly conflicting) requirements. Particular application needs include the transfer of data into application-
private address spaces, connection management, high throughput, low latency, and the ability to support a variety of
traffic types. Traffic types include traditional bursty data communications traffic (such as transaction-style traffic),
bulk data transfer, and the sustained bandwidth requirements of applications supporting continuous media. We
believe that approaches optimized towards a particular traffic type, such as low-latency transactional traffic [24],
will suffer if the traffic mix varies considerably.
The software operating on the host is usually partitioned functionally into a series of layers defined by protec-
tion boundaries. Typically, each software layer contains several protocol layers. The user’s applications are typi-
cally executable programs, or groups of such programs cooperating on a task. Applications which require network
access obtain it via abstract service primitives such as read(), write(), and sendto(). These service primitives pro-
vide access to an implementation of some layers of the network protocol, as in the UNIX system’s access to TCP/IP
through the socket abstraction. The protocol is often designed to mask the behavior of the network and the hardware
connecting the computer to the network. Its implementation can usually be split into device-independent and
device-dependent portions.
Significant portions of protocol implementations may be embedded in the operating system of the host, where
the service primitives are system entry points. The device-dependent portion is implemented as a ‘‘device driver.’’
This is not strictly necessary as demonstrated in Mach 3.0, which moves both protocols and much of the device
driver code out of the kernel. Device drivers have a rigidly specified programmer interface, mainly so that the
device-independent portions of system software can form a reasonable abstraction of their behavior. Placement of
the protocol functions within the operating system is dictated by two factors, policies and performance. The key
policies which an operating system can enforce through its scheduling are fairness (e.g., in multiplexing packet
streams) and the prevention of starvation. High performance may require the ability to control timing and task
scheduling, the ability to manipulate virtual memory directly, the ability to fully control peripheral devices, and the
ability to communicate efficiently (e.g., with a shared address space). All of these requirements can be met by
embedding the protocol functions in the host operating system. In practice, the main freedoms for the host interface
software designer lie in the design of the device driver, since it forms the boundary between the host’s device-
independent software and the functions performed by the device.
The software architect is presented with the following choices as to detailed implementation strategy:
1. Based on the capabilities of the interface (e.g., its provision for programmed I/O, DMA, or bus mastered
transfers†), what is the partitioning of functionality between the host software and the host interface
hardware? For example, use of DMA or bus-master transfers removes the need for a copying loop in the dev-
ice driver to process programmed I/O, but may require a variety of locks and scheduling mechanisms to sup-
port the concurrent activities of copying and processing. Poor partitioning of functions can force the host
software to implement a complex protocol for communicating with the interface, and thereby reduce perfor-
mance.
2. Should existing protocol implementations be supported? On the one hand, many applications are immediately
available when an existing implementation is supported, e.g., TCP/IP or XNS. On the other, performance for
some applications might be gained by customizing stacks [5] using a new programmer interface. Multiple pro-
tocol stacks could be supported, at some cost in effort; this allows both older applications and new applica-
tions with greater bandwidth requirements to coexist. Methods such as the x-Kernel [18, 21] may provide a
method for customized stacks to be built on top of operating system support such as we describe in this paper.
3. How are services provided to applications? One key example is the support for paced data delivery, used for
multimedia applications. As the host interface software is a component in timely end-to-end delivery, it must
support real-time data delivery. This implies provision for process control, timers, etc. in the driver software.
4. How do design choices affect the remainder of the system? The host interface software may be assigned a
high priority, causing delays or losses elsewhere in the system. Use of polling for real-time service may affect
other interrupt service latencies. The correct choices for tradeoffs here are entirely a function of the worksta-
tion user’s desire for, and use of, network services. While any tradeoffs should not preclude interaction with
other components of the system, e.g., storage devices or frame buffers, increasing demand for network ser-
vices may bias decisions towards delivering network subsystem performance.
Given the cost of interrupts and their effect on processor performance, strategies which reduce the number of inter-
rupts per data transfer can be employed [19]. An example would be using an interrupt to signal grosser events, such
as half-full queues in the interface.
1.4. Communicating State Changes between Host and Interface
One of the key issues in the design of operating system features which support interactions with external events
(such as arriving data) is the state exchange protocol. There are three common approaches used:
1. Pure ‘‘busy-waiting’’, where the external event can be detected by a change in, e.g., an addressable status
register. The processor continuously examines the stateword until the change occurs, and then resumes pro-
cessing with the newly-arrived data. ‘‘Busy-waiting’’ is rarely if ever used in multitasking systems, since it
effectively precludes use of the processor until the event arrives. It is more commonly used by dedicated con-
trollers. ‘‘Busy-waiting’’ can be used with priorities to enforce some degree of isolation among activities on
the processor.
2. Interrupts are an artifact of the desire to timeshare processors among activities. The basic idea is that the
event arrival (most likely detected by a low-level busy-waiting scheme in the external device) causes the pro-
cessor to be interrupted, that is, to cease its current flow of control and to begin a new flow of control dictated
by data arrival. Typically, this involves transferring the data to a processor storage location where the data can
be processed later, using a normal flow of control. When interrupt service is complete, the processor resumes
the interrupted flow of control. The two difficulties with interrupts are their asynchronous arrival and cost.
The asynchronous arrival forces concurrency control techniques to be employed, and the interrupt service
time improves much more slowly than microprocessor speeds.
3. Clocked interrupts try to achieve a somewhat different balance of goals. A periodic software timer is used to
interrupt the flow of control of the processor as with any other interrupt. Interrupt service then consists of
examining changed statewords, as in the ‘‘busy-waiting’’ scheme. The tradeoffs here are closely tied to the
implementation environment, but an illustrative example is given by the UNIX [25] callout table design, used
for operating system management of pools of teletypewriter lines. Clocked interrupts represent a dynamic
midpoint between polling and data-driven interrupts; appropriate clocking rates can make the scheme resemble
either of the other two.

† With programmed I/O, the CPU controls the transfer; with DMA a third party controls the transfer, and with bus
mastered operation the peripheral controls the transfer [9].

Using clocked interrupts is an engineering decision based on factors such as costs and traffic characteristics.
A simple calculation shows the tradeoff. Consider a system with an interrupt service overhead of C seconds, and k
active channels, each with events arriving at an average rate of λ events per second. Independent of interrupt ser-
vice, each event costs α seconds to service, e.g., to transfer the data from the device. The offered traffic is λ·k, and
in a system based on an interrupt-per-event, the total overhead will be λ·k·(C+α). Since the maximum number of
events serviced per second will be 1/(C+α), the relationship between parameters is that 1 > λ·k·(C+α). Assuming
that C and α are for the most part fixed, we can increase the number of active channels and reduce the arrival rate
on each, or we can increase the arrival rate and decrease the number of active channels.

However, for clocked interrupts delivered at a rate β per second, the capacity limit is 1 > β·C + λ·k·α. Since α is
very small for small units such as characters, and C is very large, it makes sense to use clocked interrupts, especially
when a reasonable value of β can be employed. In the case of modern workstations, C is about 10⁻³ second. Note
that as the traffic level rises, more work is done on each clock ‘‘tick’’, so that the data transfer rate λ·k·α asymptoti-
cally bounds the system performance, rather than the interrupt service rate. To be fair, one should note that tradi-
tional interrupt service schemes can be improved, e.g., by aggregating traffic into larger packets (this reduces λ sig-
nificantly, while typically causing a slight increase in α), or by using an interrupt on one channel to prompt scanning
of other channels.

Box 1: Clocked Interrupts
1.5. Example Host Interface Architectures
Table I summarizes some high-level design features for several host interface architectures presented in the litera-
ture.

Feature      NAB    CAB    Bellcore       Penn/HP  Penn  Cambridge      Fore
Event Flag   I      Mbox   I              I        CI    I              I
Event        PDU*   PDU*   PDU*           PDU      PDU   Cell or PDU    Cell
Processor?   Y      Y      Y              N        N     N              N
Bus Arch.?   VME    VME    TURBOChannel   SGC      MCA   TURBOChannel   SBus

Table I: Signaling, Units, Intelligence and Attachment
Legend:
PDU - Protocol Data Unit, an object size dictated by the protocol
I - Interrupt
CI - Clocked Interrupt
* - Processor can signal arbitrary events, e.g., Cell, PDU, timeout, etc.
Several interfaces have attempted to accelerate transport protocol processing. The VMP Network Adapter Board
(NAB) [19] implementation accelerates processing of Cheriton’s Versatile Message Transaction Protocol (VMTP).
The goals were to reduce the latency required in ‘‘request-reply’’ communications, while delivering high
throughput for data-intensive applications. The NAB separated these two classes of traffic to optimize its perfor-
mance. The NAB included an on-board microcontroller.
The Nectar Communications Accelerator Board (CAB) [24] includes a microcontroller with a complete mul-
tithreaded operating system. The host-CAB interaction is via messages sent over a VME bus, synchronized using a
mailbox scheme. The programmability can be used by applications to customize protocol processing. Cooper, et al.
[8], report that TCP/IP and a number of Nectar-specific protocols have been implemented on the CAB.
It remains unclear whether the entire transport protocol processing function needs to migrate to the interface;
Clark, et al. [4] argue that in the case of TCP/IP the actual protocol processing is of low cost and requires very few

References (partial)

- Architectural considerations for a new generation of protocols
- The x-Kernel: an architecture for implementing network protocols
- An analysis of TCP processing overhead
- A dynamic network architecture
- The VMP network adapter board (NAB): high-performance network communication for multiprocessors