
Giving Applications Access to Gb/s Networking
Jonathan M. Smith and C. Brendan S. Traw
Distributed Systems Laboratory, University of Pennsylvania
200 South 33rd St., Philadelphia, PA 19104-6389
ABSTRACT
Network fabrics with Gigabit per second (Gbps) bandwidths are available today, but these
bandwidths are not yet available to applications. The difficulties lie in the hardware and software
architecture through which application data travels between the network and host memory. The
hardware portion of the architecture is often called a host interface and the remainder of the pro-
tocol stack is implemented in host software.
In this paper, we outline a variety of approaches to the architecture of such systems, exam-
ine several design points, and study one example in detail. The detailed example, an ATM Host
Interface and Operating System support built at the University of Pennsylvania, illustrates design
tradeoffs for hardware and software, and some of the implications of these tradeoffs on applica-
tions performance.
1. Introduction
The past several years have seen a profusion of efforts to design and implement very-high speed networks which
deliver this speed ‘‘end-to-end’’. A good example is the AURORA Gigabit Testbed [6], one of several such testbeds
[2]. In AURORA, much of the focus has been on the development of technologies needed to deliver this performance
to workstation-class machines, rather than supercomputers. We believe these machines will be the majority of end-
points in future Gbps networks.
The difficulty posed by the choice of workstations is the mismatch between the performance of the machines
and the bandwidth provided by the network infrastructure such as switches and transmission lines. Specifically, the
network bandwidths are within an order of magnitude of the memory bandwidths of most workstations, so the burden
on a host’s memory architecture must be minimized for maximum performance; as Clark and Tennenhouse [5] point
out, this forces careful design of protocol processing architectures.
Efficiency can be achieved through many design features, but the main options [29] are: optimizing the pro-
cessing functions in the protocol architecture, optimizing the operating system support for data transport, and careful
placement of the hardware required for network attachment. In this paper, we will focus on operating system and
architectural issues, as we feel that high-performance protocol architecture features such as ordering, errors, dupli-
cates, coordination, and format conversion have been well-covered by others; see for example Feldmeier [16].
The remainder of Section 1 outlines approaches to host interface hardware and supporting software. Section
2 describes the design choices made in an implementation of an ATM host interface for the IBM RISC System/6000
workstation [27]. Section 3 analyzes the performance implications of the design choices for applications, and Sec-
tion 4 concludes the paper.
1.1. Host Interfaces
The design and implementation of host interfaces has been of interest since the earliest network implementations†.
Each succeeding generation has dealt with different types of hosts, networks, protocol architectures and networked
applications. Goals have included low cost, high throughput and low delay. Implementations have been optimized
towards achieving one or more of these goals in their operational environments. Some of the key implementation
decisions have been: (1) the portion of protocol architecture functions performed by the interface; (2) signaling
between host and interface; and (3) the placement of the interface in the host computer’s architecture. Much of the
migration towards hardware is intended to obtain an implementation-specific performance advantage - as Watson
and Mamrak [29] point out, performance is often due as much to implementation techniques as to careful protocol
design. The key question may be the selection of functions to optimize by placement in hardware.

† Detailed information on some of these interfaces and supporting software is available in a Special Issue of the
IEEE Journal on Selected Areas in Communications [23].

Preprint, July 1993 IEEE Network Special Issue, pp. 44-52
1.2. Host Interface Attachment
[Figure 1: General Host Interface Architecture. The Host Interface attaches to the System Bus alongside the
Processor and Memory, translating between memory objects (e.g., words) and network objects (e.g., cells).]
Figure 1 illustrates a general architecture for a host interface and will allow us to discuss several options for host
attachment. Davie [11, 13] and Ahlgren [15] have also discussed such options. We assess their performance poten-
tial for receiving data; sending data is similar.
1. The Host Interface is capable of Direct-Memory Access (DMA), which means that it can communicate with
the system memory directly, without processor intervention. The typical system (e.g., UNIX) uses the Host
Interface DMA capability to copy the data from the network into a buffer managed by the operating system.
The data is then copied by the CPU from the system buffer to a buffer owned by the application, which also
resides in the system memory. Thus, a given piece of data travels over the System Bus three times: Host
Interface to Memory, Memory to CPU, and CPU to Memory.
2. The Host Interface is capable of DMA, and the operating system is able to arrange for data to be transferred
directly to the application address space. A variant of this scheme would allow the host interface to directly
allocate and free host memory, in effect turning management of the host’s memory into a peer-peer model
rather than a master-slave model. Thus, a given piece of data travels over the System Bus once, from Host
Interface to Memory. One potential problem with this approach is that transport protocols may wish to defer
data transfer to applications until header processing is complete.
3. The Host Interface has a processor-addressable memory area, which the operating system manages. When
data arrives in the Host Interface’s buffers, the operating system copies this data into the user address space.
In this model, the data must travel the bus twice, once from Host Interface to CPU, and then from the CPU to
Memory.
4. The Host Interface has a processor-addressable memory area, where the application buffers are located. This
means that data never traverses the system memory bus, or rather, traversal is deferred until it is referenced by
the processor. This makes the host interface ‘‘look like’’ a piece of system memory allocated to the applica-
tion.
5. The Host Interface can be connected directly to the CPU [6], as in augmenting the processing unit with a co-
processor. As in Scheme 4, there is no memory bus traversal, and further, the connection is to a system
component which typically operates at speeds higher than memory bandwidths.
Each of these schemes is affected by a number of other considerations.
First, most modern architectures include a cache, which decreases the access latency of frequently used data,
but must be kept in a state consistent with system memory. The cache is typically architecturally ‘‘close’’ to the pro-
cessing unit, so it either must be kept consistent or flushed when new data arrive. Maintaining consistency is con-
siderably easier when the data passes through the processor - Scheme 2 must flush the cache for areas affected by
the DMA, and Scheme 4 must flush the cache, either under control of the CPU or the Host Interface. This problem
can be avoided by putting the data in non-cached memory, but this may have other negative performance conse-
quences. Schemes 1 and 3 should be able to obtain up-to-date cache copies when the data is copied into the user
address space.
Second, host interfaces may also be used to support applications which require specialized peripherals, such
as video conferencing. Thus it is important to keep a good balance between I/O and memory accessibility. The
DMA based schemes do this, but the memory-on-interface schemes (Schemes 3 and 4) might present difficulties in
I/O operations to and from other devices.
Finally, Schemes 1 and 3 involve the processor in copying data across address-space boundaries. Thus, the
processor must reduce its processing capacity by the amount of time spent copying data. Scheme 5’s co-processor
approach likewise shares processing-unit capacity between computational load and network traffic.
Any evaluation of a host attachment strategy is subject to the specifics of the host, I/O bus, interface technolo-
gies and software, as well as the target applications. For these specifics, there are other constraints such as econom-
ics, portability, etc. Thus, there is no ‘‘best’’ approach - they can only be ranked relative to these constraints.
1.3. Interaction with Software
Operating system software plays a key role in the achievement of high end-to-end networking performance. The
abstraction provided by the host interface is that of a device which can transfer data between a network and the
computing system’s memory. The software builds an application communications model over this abstraction. A
significant constraint on such software is its embedding in the framework of an operating system which satisfies
other (possibly conflicting) requirements. Particular application needs include the transfer of data into application-
private address spaces, connection management, high throughput, low latency, and the ability to support a variety of
traffic types. Traffic types include traditional bursty data communications traffic (such as transaction-style traffic),
bulk data transfer, and the sustained bandwidth requirements of applications supporting continuous media. We
believe that approaches optimized towards a particular traffic type, such as low-latency transactional traffic [24],
will suffer if the traffic mix varies considerably.
The software operating on the host is usually partitioned functionally into a series of layers defined by protec-
tion boundaries. Typically, each software layer contains several protocol layers. The user’s applications are typi-
cally executable programs, or groups of such programs cooperating on a task. Applications which require network
access obtain it via abstract service primitives such as read(), write(), and sendto(). These service primitives pro-
vide access to an implementation of some layers of the network protocol, as in the UNIX system’s access to TCP/IP
through the socket abstraction. The protocol is often designed to mask the behavior of the network and the hardware
connecting the computer to the network. Its implementation can usually be split into device-independent and
device-dependent portions.
Significant portions of protocol implementations may be embedded in the operating system of the host, where
the service primitives are system entry points. The device-dependent portion is implemented as a ‘‘device driver.’’
This is not strictly necessary as demonstrated in Mach 3.0, which moves both protocols and much of the device
driver code out of the kernel. Device drivers have a rigidly specified programmer interface, mainly so that the
device-independent portions of system software can form a reasonable abstraction of their behavior. Placement of
the protocol functions within the operating system is dictated by two factors, policies and performance. The key
policies which an operating system can enforce through its scheduling are fairness (e.g., in multiplexing packet
streams) and the prevention of starvation. High performance may require the ability to control timing and task
scheduling, the ability to manipulate virtual memory directly, the ability to fully control peripheral devices, and the
ability to communicate efficiently (e.g., with a shared address space). All of these requirements can be met by
embedding the protocol functions in the host operating system. In practice, the main freedoms for the host interface
software designer lie in the design of the device driver, since it forms the boundary between the host’s device-
independent software and the functions performed by the device.
The software architect is presented with the following choices as to detailed implementation strategy:
1. Based on the capabilities of the interface (e.g., its provision for programmed I/O, DMA, or bus mastered
transfers†), what is the partitioning of functionality between the host software and the host interface
hardware? For example, use of DMA or bus-master transfers removes the need for a copying loop in the dev-
ice driver to process programmed I/O, but may require a variety of locks and scheduling mechanisms to sup-
port the concurrent activities of copying and processing. Poor partitioning of functions can force the host
software to implement a complex protocol for communicating with the interface, and thereby reduce perfor-
mance.
2. Should existing protocol implementations be supported? On the one hand, many applications are immediately
available when an existing implementation is supported, e.g., TCP/IP or XNS. On the other, performance for
some applications might be gained by customizing stacks [5] using a new programmer interface. Multiple pro-
tocol stacks could be supported, at some cost in effort; this allows both older applications and new applica-
tions with greater bandwidth requirements to coexist. Methods such as the x-Kernel [18, 21] may provide a
method for customized stacks to be built on top of operating system support such as we describe in this paper.
3. How are services provided to applications? One key example is the support for paced data delivery, used for
multimedia applications. As the host interface software is a component in timely end-to-end delivery, it must
support real-time data delivery. This implies provision for process control, timers, etc. in the driver software.
4. How do design choices affect the remainder of the system? The host interface software may be assigned a
high priority, causing delays or losses elsewhere in the system. Use of polling for real-time service may affect
other interrupt service latencies. The correct choices for tradeoffs here are entirely a function of the worksta-
tion user’s desire for, and use of, network services. While any tradeoffs should not preclude interaction with
other components of the system, e.g., storage devices or frame buffers, increasing demand for network ser-
vices may bias decisions towards delivering network subsystem performance.
Given the cost of interrupts and their effect on processor performance, strategies which reduce the number of inter-
rupts per data transfer can be employed [19]. An example would be using an interrupt to signal grosser events, such
as half-full queues in the interface.
1.4. Communicating State Changes between Host and Interface
One of the key issues in the design of operating system features which support interactions with external events
(such as arriving data) is the state exchange protocol. There are three common approaches used:
1. Pure ‘‘busy-waiting’’, where the external event can be detected by a change in, e.g., an addressable status
register. The processor continuously examines the stateword until the change occurs, and then resumes pro-
cessing with the newly-arrived data. ‘‘Busy-waiting’’ is rarely if ever used in multitasking systems, since it
effectively precludes use of the processor until the event arrives. It is more commonly used by dedicated con-
trollers. ‘‘Busy-waiting’’ can be used with priorities to enforce some degree of isolation among activities on
the processor.
2. Interrupts are an artifact of the desire to timeshare processors among activities. The basic idea is that the
event arrival (most likely detected by a low-level busy-waiting scheme in the external device) causes the pro-
cessor to be interrupted, that is, to cease its current flow of control and to begin a new flow of control dictated
by data arrival. Typically, this involves transferring the data to a processor storage location where the data can
be processed later, using a normal flow of control. When interrupt service is complete, the processor resumes
the interrupted flow of control. The two difficulties with interrupts are their asynchronous arrival and cost.
The asynchronous arrival forces concurrency control techniques to be employed, and the interrupt service
time improves much more slowly than microprocessor speeds.
3. Clocked interrupts try to achieve a somewhat different balance of goals. A periodic software timer is used to
interrupt the flow of control of the processor as with any other interrupt. Interrupt service then consists of
examining changed statewords, as in the ‘‘busy-waiting’’ scheme. The tradeoffs here are closely tied to the
implementation environment, but an illustrative example is given by the UNIX [25] callout table design, used
for operating system management of pools of teletypewriter lines. Clocked interrupts represent a dynamic
midpoint between polling and data-driven interrupts; appropriate clocking rates can make the scheme resemble
either of the other two.

† With programmed I/O, the CPU controls the transfer; with DMA a third party controls the transfer, and with bus
mastered operation the peripheral controls the transfer [9].

Using clocked interrupts is an engineering decision based on factors such as costs and traffic characteristics.
A simple calculation shows the tradeoff. Consider a system with an interrupt service overhead of C seconds, and k
active channels, each with events arriving at an average rate of λ events per second. Independent of interrupt ser-
vice, each event costs α seconds to service, e.g., to transfer the data from the device. The offered traffic is λ·k, and
in a system based on an interrupt-per-event, the total overhead will be λ·k·(C+α). Since the maximum number of
events serviced per second will be 1/(C+α), the relationship between parameters is that 1 > λ·k·(C+α). Assuming
that C and α are for the most part fixed, we can increase the number of active channels and reduce the arrival rate
on each, or we can increase the arrival rate and decrease the number of active channels.

However, for clocked interrupts delivered at a rate β per second, the capacity limit is 1 > β·C + λ·k·α. Since α is
very small for small units such as characters, and C is very large, it makes sense to use clocked interrupts, especially
when a reasonable value of β can be employed. In the case of modern workstations, C is about 10⁻³ second. Note
that as the traffic level rises, more work is done on each clock ‘‘tick’’, so that the data transfer rate λ·k·α asymptoti-
cally bounds the system performance, rather than the interrupt service rate. To be fair, one should note that tradi-
tional interrupt service schemes can be improved, e.g., by aggregating traffic into larger packets (this reduces λ sig-
nificantly, while typically causing a slight increase in α), or by using an interrupt on one channel to prompt scanning
of other channels.

Box 1: Clocked Interrupts
1.5. Example Host Interface Architectures
Table I summarizes some high-level design features for several host interface architectures presented in the litera-
ture.

Feature      NAB    CAB    Bellcore       Penn/HP  Penn  Cambridge      Fore
Event Flag   I      Mbox   I              I        CI    I              I
Event        PDU*   PDU*   PDU*           PDU      PDU   Cell or PDU    Cell
Processor?   Y      Y      Y              N        N     N              N
Bus Arch.?   VME    VME    TURBOChannel   SGC      MCA   TURBOChannel   SBus

Table I: Signaling, Units, Intelligence and Attachment
Legend:
PDU - Protocol Data Unit, an object size dictated by the protocol
I - Interrupt
CI - Clocked Interrupt
* - Processor can signal arbitrary events, e.g., Cell, PDU, timeout, etc.
Several interfaces have attempted to accelerate transport protocol processing. The VMP Network Adapter Board
(NAB) [19] implementation accelerates processing of Cheriton’s Versatile Message Transaction Protocol (VMTP).
The goals were to reduce the latency required in ‘‘request-reply’’ communications, while delivering high
throughput for data-intensive applications. The NAB separated these two classes of traffic to optimize its perfor-
mance. The NAB included an on-board microcontroller.
The Nectar Communications Accelerator Board (CAB) [24] includes a microcontroller with a complete mul-
tithreaded operating system. The host-CAB interaction is via messages sent over a VME bus, synchronized using a
mailbox scheme. The programmability can be used by applications to customize protocol processing. Cooper, et al.
[8], report that TCP/IP and a number of Nectar-specific protocols have been implemented on the CAB.
It remains unclear whether the entire transport protocol processing function needs to migrate to the interface;
Clark, et al. [4] argue that in the case of TCP/IP the actual protocol processing is of low cost and requires very few

References (partial)

- Architectural considerations for a new generation of protocols
- The x-Kernel: an architecture for implementing network protocols
- An analysis of TCP processing overhead
- A dynamic network architecture
- The VMP network adapter board (NAB): high-performance network communication for multiprocessors