packet are shown at the bottom of the figure. There are three
triples of bit sequences. Each triple is used by one of the
processors that are traversed. Note that the number of valid
triples may change with different routes. Also, the triples
are processed from right to left. Within a triple, the first bit
indicates if an application is to operate on the packet. If so,
the second bit sequence indicates the application identifier.
The last bit sequence indicates the routing according to the
directions shown in the lower right of the figure.
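The decoding of one such control triple could be sketched as follows. The exact field widths and the encoding of the routing directions are not given in the text, so the layout below (1 valid bit, 4-bit application identifier, 2-bit direction) is an illustrative assumption:

```c
#include <stdint.h>

/* Hypothetical triple layout; widths and direction codes are assumptions. */
typedef struct {
    uint8_t app_valid;   /* first bit: should an application operate on the packet? */
    uint8_t app_id;      /* second bit sequence: application identifier */
    uint8_t direction;   /* last bit sequence: routing direction */
} triple_t;

/* Decode one triple from the low bits of a tag word. */
static triple_t decode_triple(uint32_t bits)
{
    triple_t t;
    t.app_valid = bits & 0x1;          /* 1 bit  */
    t.app_id    = (bits >> 1) & 0xF;   /* 4 bits */
    t.direction = (bits >> 5) & 0x3;   /* 2 bits */
    return t;
}
```

Since the triples are processed from right to left, a processor would decode the lowest-order triple and shift the tag before forwarding.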
To set up (or change) the route of a flow or its processing
steps, the runtime system of the network processor simply
rewrites the control information in the tag table. This approach
allows for very easy control of the system without the need
to communicate with individual packet processing units.
Identification of flows is achieved through lookup operations
on a flow table stored in the classification unit. Thus, by altering
entries of the flow table, a flow can be directed to any service
inside the processing grid. In addition, the bypass path of each
PPU is isolated from the processing path to avoid blocking of
bypass packet transmission. Thus, the flow routing mechanism
allows for significant flexibility in the utilization of the pro-
cessing grid. For example, all PPUs can be chained together to
form a pipeline, or they can be logically parallelized (i.e., each
flow can only be served by exactly one PPU). More details
about application mapping on PPUs and the flow routing
algorithm can be found in [25].
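The key property of this mechanism is that rerouting a flow touches only a table entry, not the processing units themselves. A minimal sketch of such an update, assuming a per-flow array of control words (the table layout and the function name are hypothetical):

```c
#include <stdint.h>

#define MAX_FLOWS 256

/* One control word (the triples described above) per flow.
 * The layout and size are assumptions for illustration. */
static uint32_t tag_table[MAX_FLOWS];

/* Reroute a flow or change its processing steps: the runtime
 * only rewrites the flow's control word in the tag table; no
 * packet processing unit needs to be contacted individually. */
static void set_flow_route(uint16_t flow_id, uint32_t control_word)
{
    tag_table[flow_id] = control_word;
}
```

The actual system additionally updates flow-table entries in the classification unit so that packets are matched to the right tag-table entry; the details are given in [25].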
C. Simplified Programming Abstraction
As discussed in [3], [4], one of the goals of our design
is to simplify code development for the network service
processing platform. To achieve the desired simplicity, the
packet processor is able to directly access on-chip memories,
in which instructions (program code for multiple services),
data, and packets are stored. As shown in Figure 2,
the packet processor has one interface for reading program
instructions and data memory and another interface for access
to packet memory. In the instruction memory, the code for
running a particular service is placed at a fixed, well-known
offset. In the data memory we have placed the stack and global
pointers at well-known offsets as well. With this design, packet
processing and code development for packet processing are
simplified. Packet data can be accessed by referencing data
memory at the (fixed) packet offset. Moreover, the program
code is placed in a fixed location in the instruction memory
and thus can be accessed easily by the processor.
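Taken together, these fixed, well-known offsets amount to a small memory map. The base addresses below are assumptions (only the packet-memory base is implied by the address 0x1000001E used for the TTL field in Fig. 6):

```c
/* Hypothetical memory map; only PKT_MEM_BASE is implied by Fig. 6. */
#define INSTR_MEM_BASE  0x00000000u  /* service code at a fixed offset */
#define DATA_MEM_BASE   0x00800000u  /* stack and global pointers */
#define PKT_MEM_BASE    0x10000000u  /* packet buffer currently in use */

/* A header field is then simply a fixed offset into packet memory. */
#define PKT_FIELD(off)  (*(volatile unsigned char *)(PKT_MEM_BASE + (off)))
#define IP_TTL_OFF      0x1Eu        /* consistent with 0x1000001E in Fig. 6 */
```

The hardware maps accesses through PKT_MEM_BASE to the physical buffer of the packet currently being processed, so application code never computes physical addresses.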
An example of C code that accesses packet memory is shown
in Figure 6. The code reads the time-to-live (TTL) field in
the IP header and decrements it. Since the
context is automatically mapped, the IP header can simply be
accessed by a static reference. The hardware of the system
ensures that this memory access is mapped to the correct
physical address in the packet buffer that is currently in
use. Similarly, data memory (and instruction memory) can be
accessed. For example, to count the number of packets handled
by an application, a simple counter can be declared:
static int packet_count;
This counter can be incremented once per packet:
#define IP_TTL 0x1000001E

#define pkt_get8(addr, data) \
    data = *((volatile unsigned char *) addr)
#define pkt_put8(addr, data) \
    *((volatile unsigned char *) addr) = data

typedef unsigned char _u8;

_u8 ip_ttl;
pkt_get8(IP_TTL, ip_ttl);
if (ip_ttl != 0) {
    ip_ttl--;                /* decrement TTL */
    pkt_put8(IP_TTL, ip_ttl);
} else {
    ...handle TTL expiration...
}

Fig. 6. Simple C program for accessing and decrementing the time-to-live
field in the IP header.
packet_count++;
The automated context handling ensures that the memory state
is maintained for the application across packets, and thus
correct counting is possible.
On other network processors, in contrast, a programmer has
to specify the exact memory offset and memory bank (e.g.,
SRAM vs. DRAM) every time a data structure is accessed.
Compared to this complex method of referencing memory,
our programming model is considerably easier to use.
For our prototype implementation, we have implemented
two specific applications:
• IP forwarding: This application implements IP forward-
ing [26] using a simple destination IP lookup algorithm.
• IPsec encryption: This application implements the cryp-
tographic processing to encrypt IP headers and payload
for VPN transmission [27].
These two applications represent two extremes in the spec-
trum of processing complexity. IP forwarding implements the
minimum amount of processing that is necessary to forward
a packet. IPsec is extremely processing-intensive since each
byte of the packet has to be processed and since cryptographic
processing is very compute-intensive.
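The text only states that IP forwarding uses a "simple destination IP lookup algorithm," so the following linear longest-prefix match over a static route table is an illustrative assumption, not the prototype's actual implementation:

```c
#include <stdint.h>

/* A route entry: network prefix, netmask, and output port.
 * The table contents below are made-up examples. */
typedef struct {
    uint32_t prefix;    /* network prefix, host byte order */
    uint32_t mask;      /* contiguous netmask */
    uint8_t  out_port;  /* output port for matching packets */
} route_t;

static const route_t routes[] = {
    { 0x0A000000u, 0xFF000000u, 1 },  /* 10.0.0.0/8  -> port 1 */
    { 0x0A010000u, 0xFFFF0000u, 2 },  /* 10.1.0.0/16 -> port 2 */
};

/* Return the output port of the longest matching prefix, or -1. */
static int lookup(uint32_t dst)
{
    int best = -1;
    uint32_t best_mask = 0;
    for (unsigned i = 0; i < sizeof routes / sizeof routes[0]; i++) {
        /* A longer prefix has a numerically larger contiguous mask. */
        if ((dst & routes[i].mask) == routes[i].prefix &&
            routes[i].mask >= best_mask) {
            best = routes[i].out_port;
            best_mask = routes[i].mask;
        }
    }
    return best;
}
```

For example, a packet to 10.1.2.3 matches both entries, and the /16 route wins as the longer prefix.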
V. EVALUATION
In this section, we discuss performance results obtained
from our prototype system. These results focus on functional-
ity, throughput performance, and scalability.
A. Experimental Setup and Correctness
Using three of the Ethernet ports on the NetFPGA system,
we connect the network processor to three workstation com-
puters for traffic generation and trace collection. The routing
and processing steps for flows on the network processor are set
up statically for each experiment. The IP forwarding and IPsec
applications are instantiated as necessary on the processing
units.
The first important result is that the system operates cor-
rectly. Using network monitoring on the workstation comput-
ers, we can verify that IP forwarding is implemented correctly