packet are shown at the bottom of the figure. There are three
triples of bit sequences. Each triple is used by one of the
processors that are traversed. Note that the number of valid
triples may change with different routes. Also, the triples
are processed from right to left. Within a triple, the first bit
indicates if an application is to operate on the packet. If so,
the second bit sequence indicates the application identifier.
The last bit sequence indicates the routing according to the
directions shown in the lower right of the figure.
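The decoding of one such control triple could be sketched as follows. The exact field widths and the encoding of the routing directions are not given in the text, so the layout below (1 valid bit, 4-bit application identifier, 2-bit direction) is an illustrative assumption:

```c
#include <stdint.h>

/* Hypothetical triple layout; widths and direction codes are assumptions. */
typedef struct {
    uint8_t app_valid;   /* first bit: should an application operate on the packet? */
    uint8_t app_id;      /* second bit sequence: application identifier */
    uint8_t direction;   /* last bit sequence: routing direction */
} triple_t;

/* Decode one triple from the low bits of a tag word. */
static triple_t decode_triple(uint32_t bits)
{
    triple_t t;
    t.app_valid = bits & 0x1;          /* 1 bit  */
    t.app_id    = (bits >> 1) & 0xF;   /* 4 bits */
    t.direction = (bits >> 5) & 0x3;   /* 2 bits */
    return t;
}
```

Since the triples are processed from right to left, a processor would decode the lowest-order triple and shift the tag before forwarding.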
To set up (or change) the route of a flow or its processing
steps, the runtime system of the network processor simply
rewrites the control information in the tag table. This approach
allows for very easy control of the system without the need
to communicate with individual packet processing units.
Identification of flows is achieved through lookup operations
on a flow table stored in the classification unit. Thus, by altering
entries of the flow table, a flow can be directed to any service
inside the processing grid. In addition, the bypass path of each
PPU is isolated from the processing path to avoid blocking of
bypass packet transmission. Thus, the flow routing mechanism
allows for significant flexibility in the utilization of the pro-
cessing grid. For example, all PPUs can be chained together to
form a pipeline, or they can be logically parallelized (i.e., each
flow can only be served by exactly one PPU). More details
about application mapping on PPUs and the flow routing
algorithm can be found in [25].
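The key property of this mechanism is that rerouting a flow touches only a table entry, not the processing units themselves. A minimal sketch of such an update, assuming a per-flow array of control words (the table layout and the function name are hypothetical):

```c
#include <stdint.h>

#define MAX_FLOWS 256

/* One control word (the triples described above) per flow.
 * The layout and size are assumptions for illustration. */
static uint32_t tag_table[MAX_FLOWS];

/* Reroute a flow or change its processing steps: the runtime
 * only rewrites the flow's control word in the tag table; no
 * packet processing unit needs to be contacted individually. */
static void set_flow_route(uint16_t flow_id, uint32_t control_word)
{
    tag_table[flow_id] = control_word;
}
```

The actual system additionally updates flow-table entries in the classification unit so that packets are matched to the right tag-table entry; the details are given in [25].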
C. Simplified Programming Abstraction
As discussed in [3], [4], one of the goals of our design
is to simplify code development for the network service
processing platform. To achieve the desired simplicity, the
packet processor is able to directly access on-chip memories,
in which instructions (program code for multiple services),
data, and packets are stored. As shown in Figure 2,
the packet processor has one interface for reading program
instructions and data memory and another interface for access
to packet memory. In the instruction memory, the code for
running a particular service is placed at a fixed, well-known
offset. In the data memory we have placed the stack and global
pointers at well-known offsets as well. With this design, packet
processing and code development for packet processing are
simplified. Packet data can be accessed by referencing data
memory at the (fixed) packet offset. Moreover, the program
code is placed in a fixed location in the instruction memory
and thus can be accessed easily by the processor.
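Taken together, these fixed, well-known offsets amount to a small memory map. The base addresses below are assumptions (only the packet-memory base is implied by the address 0x1000001E used for the TTL field in Fig. 6):

```c
/* Hypothetical memory map; only PKT_MEM_BASE is implied by Fig. 6. */
#define INSTR_MEM_BASE  0x00000000u  /* service code at a fixed offset */
#define DATA_MEM_BASE   0x00800000u  /* stack and global pointers */
#define PKT_MEM_BASE    0x10000000u  /* packet buffer currently in use */

/* A header field is then simply a fixed offset into packet memory. */
#define PKT_FIELD(off)  (*(volatile unsigned char *)(PKT_MEM_BASE + (off)))
#define IP_TTL_OFF      0x1Eu        /* consistent with 0x1000001E in Fig. 6 */
```

The hardware maps accesses through PKT_MEM_BASE to the physical buffer of the packet currently being processed, so application code never computes physical addresses.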
An example of C code that accesses packet memory is shown
in Figure 6. The code reads the time-to-live (TTL) field in
the IP header and decrements it. Since the
context is automatically mapped, the IP header can simply be
accessed by a static reference. The hardware of the system
ensures that this memory access is mapped to the correct
physical address in the packet buffer that is currently in
use. Similarly, data memory (and instruction memory) can be
accessed. For example, to count the number of packets handled
by an application, a simple counter can be declared:
static int packet_count;
This counter can be incremented once per packet:
#define IP_TTL 0x1000001E

#define pkt_get8(addr, data) \
    data = *((volatile unsigned char *) addr)
#define pkt_put8(addr, data) \
    *((volatile unsigned char *) addr) = data

typedef unsigned char _u8;

_u8 ip_ttl;
pkt_get8(IP_TTL, ip_ttl);
if (ip_ttl != 0) {
    ip_ttl--;                /* decrement TTL */
    pkt_put8(IP_TTL, ip_ttl);
} else {
    ...handle TTL expiration...
}

Fig. 6. Simple C program for accessing and decrementing the time-to-live
field in the IP header.
packet_count++;
The automated context handling ensures that the memory state
is maintained for the application across packets, and thus
correct counting is possible.
On other network processors, in contrast, a programmer has
to specify the exact memory offset and memory bank (e.g.,
SRAM vs. DRAM) every time a data structure is accessed.
Compared to this complex method of referencing memory,
our programming model is considerably easier to use.
For our prototype implementation, we have implemented
two specific applications:
• IP forwarding: This application implements IP forward-
ing [26] using a simple destination IP lookup algorithm.
• IPsec encryption: This application implements the cryp-
tographic processing to encrypt IP headers and payload
for VPN transmission [27].
These two applications represent two extremes in the spec-
trum of processing complexity. IP forwarding implements the
minimum amount of processing that is necessary to forward
a packet. IPsec is extremely processing-intensive since each
byte of the packet has to be processed and since cryptographic
processing is very compute-intensive.
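The text only states that IP forwarding uses a "simple destination IP lookup algorithm," so the following linear longest-prefix match over a static route table is an illustrative assumption, not the prototype's actual implementation:

```c
#include <stdint.h>

/* A route entry: network prefix, netmask, and output port.
 * The table contents below are made-up examples. */
typedef struct {
    uint32_t prefix;    /* network prefix, host byte order */
    uint32_t mask;      /* contiguous netmask */
    uint8_t  out_port;  /* output port for matching packets */
} route_t;

static const route_t routes[] = {
    { 0x0A000000u, 0xFF000000u, 1 },  /* 10.0.0.0/8  -> port 1 */
    { 0x0A010000u, 0xFFFF0000u, 2 },  /* 10.1.0.0/16 -> port 2 */
};

/* Return the output port of the longest matching prefix, or -1. */
static int lookup(uint32_t dst)
{
    int best = -1;
    uint32_t best_mask = 0;
    for (unsigned i = 0; i < sizeof routes / sizeof routes[0]; i++) {
        /* A longer prefix has a numerically larger contiguous mask. */
        if ((dst & routes[i].mask) == routes[i].prefix &&
            routes[i].mask >= best_mask) {
            best = routes[i].out_port;
            best_mask = routes[i].mask;
        }
    }
    return best;
}
```

For example, a packet to 10.1.2.3 matches both entries, and the /16 route wins as the longer prefix.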
V. EVALUATION
In this section, we discuss performance results obtained
from our prototype system. These results focus on functional-
ity, throughput performance, and scalability.
A. Experimental Setup and Correctness
Using three of the Ethernet ports on the NetFPGA system,
we connect the network processor to three workstation com-
puters for traffic generation and trace collection. The routing
and processing steps for flows on the network processor are set
up statically for each experiment. The IP forwarding and IPsec
applications are instantiated as necessary on the processing
units.
The first important result is that the system operates cor-
rectly. Using network monitoring on the workstation comput-
ers, we can verify that IP forwarding is implemented correctly