J. R. Allen, Jr.
B. M. Bass
C. Basso
R. H. Boivie
J. L. Calvignac
G. T. Davis
L. Frelechoux
M. Heddes
A. Herkersdorf
A. Kind
J. F. Logan
M. Peyravian
M. A. Rinaldi
R. K. Sabhikhi
M. S. Siegel
M. Waldvogel
IBM PowerNP network processor: Hardware, software, and applications
Deep packet processing is migrating to the edges of service
provider networks to simplify and speed up core functions. On
the other hand, the cores of such networks are migrating to the
switching of high-speed traffic aggregates. As a result, more
services will have to be performed at the edges, on behalf of
both the core and the end users. Associated network equipment
will therefore require high flexibility to support evolving high-
level services as well as extraordinary performance to deal with
the high packet rates. Whereas, in the past, network equipment
was based either on general-purpose processors (GPPs) or
application-specific integrated circuits (ASICs), favoring
flexibility over speed or vice versa, the network processor
approach achieves both flexibility and performance. The key
advantage of network processors is that hardware-level
performance is complemented by flexible software architecture.
This paper provides an overview of the IBM PowerNP™ NP4GS3 network processor and how it addresses these issues.
Its hardware and software design characteristics and its
comprehensive base operating software make it well suited
for a wide range of networking applications.
Introduction
The convergence of telecommunications and computer
networking into next-generation networks poses
challenging demands for high performance and flexibility.
Because of the ever-increasing number of connected
end users and end devices, link speeds in the core will
probably exceed 40 Gb/s in the next few years. At the
same time, forwarding intelligence will migrate to the
edges of service provider networks to simplify and speed
up core functions.¹ Since high-speed traffic aggregates will
be switched in the core, more services will be required at
the edge. In addition, more sophisticated end user services
lead to further demands on edge devices, calling for high
flexibility to support evolving high-level services as well as
performance to deal with associated high packet rates.
Whereas, in the past, network products were based either
on GPPs or ASICs, favoring flexibility over speed or vice
versa, the network processor approach achieves both
flexibility and performance.
Current rapid developments in network protocols and
applications push the demands for routers and other
network devices far beyond doing destination address
lookups to determine the output port to which the packet
should be sent. Network devices must inspect deeper into
the packet to achieve content-based forwarding; perform
protocol termination and gateway functionality for server
offloading and load balancing; and require support for
higher-layer protocols. Traditional hardware design, in
which ASICs are used to perform the bulk of processing
load, is not suited for the complex operations required
and the new and evolving protocols that must be
processed. Offloading the entire packet processing to a
GPP, not designed for packet handling, causes additional
difficulties. Recently, field-programmable gate arrays
(FPGAs) have been used. They allow processing to be
offloaded to dedicated hardware without having to
undergo the expensive and lengthy design cycles
commonly associated with ASICs. While FPGAs are
now large enough to accommodate the gates needed
for handling simple protocols, multiple and complex
protocols are still out of reach.
¹The term edge denotes the point at which traffic from multiple customer premises enters the service provider network to begin its journey toward the network core. Core devices aggregate and move traffic from many edge devices.
Copyright 2003 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each
reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this
paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of
this paper must be obtained from the Editor.
0018-8646/03/$5.00 © 2003 IBM

This is further intensified by their relatively slow clock speeds and long on-chip routing delays, which rule out FPGAs for complex applications.
Typical network processors have a set of programmable
processors designed to efficiently execute an instruction set specifically designed for packet processing and forwarding. Overall performance is further enhanced with the inclusion of specialized coprocessors (e.g., for table lookup or checksum computation) and enhancements to the data flow supporting necessary packet modifications.
However, not only is the instruction set customized for
packet processing and forwarding; the entire design
of the network processor, including execution
environment, memory, hardware accelerators, and
bus architecture, is optimized for high-performance
packet handling.
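As an illustration of the kind of per-packet work that such coprocessors absorb, the following C sketch computes the standard Internet checksum (RFC 1071) over a header in software. It is not PowerNP code; it simply shows the fixed, repetitive halfword arithmetic that a checksum assist performs at line rate instead of the programmable processors.

    #include <stddef.h>
    #include <stdint.h>

    /* RFC 1071 Internet checksum over a header (checksum field assumed zeroed).
     * A software baseline for the work that a checksum coprocessor offloads. */
    static uint16_t internet_checksum(const uint8_t *hdr, size_t len)
    {
        uint32_t sum = 0;

        for (size_t i = 0; i + 1 < len; i += 2)           /* sum 16-bit words  */
            sum += (uint32_t)((hdr[i] << 8) | hdr[i + 1]);
        if (len & 1)                                      /* odd trailing byte */
            sum += (uint32_t)hdr[len - 1] << 8;

        while (sum >> 16)                                 /* fold the carries  */
            sum = (sum & 0xFFFF) + (sum >> 16);
        return (uint16_t)~sum;                            /* one's complement  */
    }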
The key advantage of network processors is that
hardware-level performance is complemented by
horizontally layered software architecture. On the lowest
layer, the forwarding instruction set together with the
overall system architecture determines the programming
model. At that layer, compilation tools may help to
abstract some of the specifics of the hardware layer by
providing support for high-level programming language
syntax and packet handling libraries [1]. The interface to
the next layer is typically implemented by an interprocess-
communication protocol so that control path functionality
can be executed on a control point (CP), which provides
extended and high-level control functions through a
traditional GPP. With a defined application programming
interface (API) at this layer, a traditional software
engineering approach for the implementation of network
services can be followed. By providing an additional
software layer and API which spans more than one
network node, a highly programmable and flexible
network can be implemented. These layers are shown
in Figure 1 as hardware, software, and applications,
respectively, and are supported by tools and a reference
implementation.
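As a rough sketch of this layering (all names below are hypothetical and not the PowerNP APIs), control-path software on the CP calls a service-level function, which marshals the request into a message and hands it to the interprocess-communication channel toward the network processor, where the data-plane code applies the update:

    #include <stdint.h>

    /* Hypothetical message carried over the CP-to-network-processor channel. */
    struct ipc_msg {
        uint16_t opcode;        /* e.g., ADD_ROUTE                 */
        uint32_t prefix, mask;  /* IPv4 route being installed      */
        uint16_t out_port;      /* output port resolved by the CP  */
    };

    /* Stand-in transport; a real system would send this over the CP link. */
    static int ipc_send(const struct ipc_msg *m) { (void)m; return 0; }

    /* Service-layer API on the control point: hides IPC details from
     * traditional control-plane software such as routing daemons. */
    int np_add_route(uint32_t prefix, uint32_t mask, uint16_t out_port)
    {
        struct ipc_msg m = { .opcode = 1 /* ADD_ROUTE */,
                             .prefix = prefix, .mask = mask,
                             .out_port = out_port };
        return ipc_send(&m);    /* data-plane code applies the update */
    }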
Flexibility through ease of programmability at line speed
is demanded by continuing increases in the number of
approaches to networking [2–4]:
● Scalability for traffic engineering, quality of service (QoS), and the integration of wireless networks in a unified packet-based next-generation network requires traffic differentiation and aggregation. These functions are based on information in packet headers at various protocol layers. The higher up the protocol stack the information originates, the higher the semantic content, and the more challenging is the demand for flexibility and performance in the data path.
● The adoption of the Internet by businesses, governments, and other institutions has increased the importance of security functions (e.g., encryption, authentication, firewalling, and intrusion detection).
● Large investments in legacy networks have forced network providers to require a seamless migration strategy from existing circuit-switched networks to next-generation networks. Infrastructures must be capable of incremental modification of functionalities.
● Networking equipment should be easy to adapt to emerging standards, since the pace of the introduction of new standards is accelerating.
● Network equipment vendors see the need of service providers for flexible service differentiation and increased time-to-market pressure.
This paper provides an overview of the IBM PowerNP* NP4GS3² network processor platform, containing the components of Figure 1, and how it addresses those needs. The specific hardware and software design characteristics and the comprehensive base operating software of this network processor make it a complete solution for a wide range of applications. Because of its associated advanced development and testing tools combined with extensive software and reference implementations, rapid prototyping and development of new high-performance applications are significantly easier than with either GPPs or ASICs.
System architecture
From a system architecture viewpoint, network processors
can be divided into two general models: the run-to-completion (RTC) and pipeline models, as shown in Figure 2.

²In this paper the abbreviated term PowerNP is used to designate the IBM PowerNP NP4GS3, which is a high-end member of the IBM network processor family.
[Figure 1. Components of a network processor platform: network processor applications (e.g., virtual private network, load balancing, firewall); network processor software (e.g., management services, transport services, protocol services, traffic engineering services); network processor hardware (e.g., processors, coprocessors, flow control, packet alteration, classification); network processor tools (e.g., assembler, debugger, simulator); and a network processor reference implementation.]

The RTC model provides a simple programming approach
which allows the programmer to see a single thread that
can access the entire instruction memory space and all
of the shared resources such as control memory, tables,
policers, and counters. The model is based on the
symmetric multiprocessor (SMP) architecture, in which
multiple CPUs share the same memory [5]. The CPUs
are used as a pool of processing resources, all executing
simultaneously, either processing data or in an idle mode
waiting for work. The PowerNP architecture is based on
the RTC model.
In the pipeline model, each pipeline CPU is optimized
to handle a certain category of tasks and instructions. The
application program is partitioned among pipeline stages
[6]. A weakness in the pipeline model is the necessity
of evenly distributing the work at each segment of the
pipeline. When the work is not properly distributed,
the flow of work through the pipeline is disrupted. For
example, if one segment is over-allocated, that segment
of the pipeline stalls preceding segments and starves
successive segments.
Even when processing is identical for every packet, the
code path must be partitioned according to the number of
pipeline stages required. Of course, code cannot always be
partitioned ideally, leading to unused processor cycles in
some pipeline stages. Additional processor cycles are
required to pass packet context from one stage to the
next. Perhaps a more significant challenge of a pipelined
programming model is in dealing with changes, since a
relatively minor code change may require a programmer
to start from scratch with code partitioning. The RTC
programming model avoids the problems associated
with pipelined designs by allowing the complete
functionality to reside within a single contiguous
program flow.
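A small back-of-the-envelope comparison, using made-up cycle counts purely for illustration, shows why this matters: a pipeline is paced by its slowest stage, whereas a pool of run-to-completion processors is paced only by the total path length divided across the pool.

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical per-stage cycle counts for one packet (illustrative). */
        const int stage_cycles[4] = { 50, 120, 60, 70 };   /* total = 300     */
        const int stages = 4;

        int total = 0, slowest = 0;
        for (int i = 0; i < stages; i++) {
            total += stage_cycles[i];
            if (stage_cycles[i] > slowest)
                slowest = stage_cycles[i];
        }

        /* Pipeline: one packet completes every 'slowest' cycles.            */
        /* RTC pool of four processors: four packets complete every 'total'. */
        printf("pipeline: 1 packet per %d cycles\n", slowest);      /* 120   */
        printf("RTC pool: 1 packet per %.1f cycles\n",
               (double)total / stages);                             /* 75.0  */
        return 0;
    }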
Figure 3 shows the high-level architecture of the PowerNP, a high-end member of the IBM network processor family, which integrates medium-access controls (MACs), switch interface, processors, search engines, traffic management, and an embedded IBM PowerPC* processor which provides design flexibility for applications. The PowerNP has the following main components: embedded processor complex (EPC), data flow (DF), scheduler, MACs, and coprocessors.
The EPC processors work with coprocessors to provide
high-performance execution of the application software
and the PowerNP-related management software. The
coprocessors provide hardware-assist functions for
performing common operations such as table searches and
packet alterations. To provide for additional processing
capabilities, there is an interface for attachment of
external coprocessors such as content-addressable
memories (CAMs). The DF serves as the primary data
path for receiving and transmitting network traffic. It provides an interface to multiple large data memories for buffering data traffic as it flows through the network processor. The scheduler enhances the QoS functions provided by the PowerNP. It allows traffic flows to be
scheduled individually per their assigned QoS class for
differentiated services. The MACs provide network
interfaces for Ethernet and packet over SONET (POS).
[Figure 2. Network processor architectural models: (a) run-to-completion (RTC) model; (b) pipeline model.]
[Figure 3. PowerNP high-level architecture: embedded processor complex (picoprocessors and embedded PowerPC), data flow, traffic management (scheduler), internal and external coprocessors, MACs, network interface ports, and the switch fabric interface.]

Functional blocks
Figure 4 shows the main functional blocks that make up
the PowerNP architecture. In the following sections we
discuss each functional block within the PowerNP.
Physical MAC multiplexer
The physical MAC multiplexer (PMM) moves data
between physical layer devices and the PowerNP. The
PMM interfaces with the external ports of the network
processor in the ingress PMM and egress PMM directions.
The PMM includes four data mover units (DMUs),
labeled A, B, C, and D. Each of the four DMUs can be
independently configured as an Ethernet MAC or a POS
interface. The PMM keeps a set of performance statistics
on a per-port basis in either mode. Each DMU moves
data at 1 Gb/s in both the ingress and the egress
directions. There is also an internal wrap link that
enables trafc generated by the egress side of the
PowerNP to move to the ingress side without going out
of the chip.
When a DMU is configured for Ethernet, it can support either one port of 1 Gigabit Ethernet or ten ports of Fast Ethernet (10/100 Mb/s). To support 1 Gigabit Ethernet, a DMU can be configured as either a gigabit media-independent interface (GMII) or a ten-bit interface (TBI). To support Fast Ethernet, a DMU can be configured as a serial media-independent interface (SMII) supporting ten
Ethernet ports. Operation at 10 or 100 Mb/s is determined
by the PowerNP independently for each port.
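A hedged sketch of this per-DMU choice follows; the types and names are invented for illustration and are not the PowerNP configuration interface. Each DMU is set to one mode, and the mode determines how many external ports it exposes.

    #include <stdio.h>

    /* Hypothetical DMU interface modes taken from the text. */
    enum dmu_mode { DMU_GMII, DMU_TBI, DMU_SMII, DMU_POS };

    /* Ports exposed per DMU: one gigabit port for GMII/TBI, ten 10/100 Mb/s
     * ports for SMII, and one framer attachment for POS (which may itself
     * be channelized). */
    static int dmu_port_count(enum dmu_mode m)
    {
        switch (m) {
        case DMU_GMII:
        case DMU_TBI:  return 1;
        case DMU_SMII: return 10;
        case DMU_POS:  return 1;
        }
        return 0;
    }

    int main(void)
    {
        enum dmu_mode dmus[4] = { DMU_SMII, DMU_SMII, DMU_GMII, DMU_POS };
        int total = 0;
        for (int i = 0; i < 4; i++)
            total += dmu_port_count(dmus[i]);
        printf("external ports in this configuration: %d\n", total);  /* 22 */
        return 0;
    }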
When a DMU is configured for POS mode, it can
support both clear-channel and channelized optical carrier
(OC) interfaces. A DMU supports the following types and
[Figure 4. PowerNP functional block diagram: ingress and egress PMMs (DMU-A through DMU-D), ingress and egress DFs (EDS queue interfaces with internal ingress and external egress data stores), ingress and egress SWIs with SDMs, the egress scheduler, and the EPC (DPPUs with TSEs, hardware classifier, dispatch unit, completion unit, control store arbiter, internal instruction memory, policy manager, counter manager, semaphore manager, LuDefTable, CompTable, free queues, CAB arbiter, debug and single-step control, interrupts and timers, embedded PowerPC 405, mailbox and PCI macros, ingress and egress DS interfaces and arbiters, internal wrap paths, and interfaces to internal and external memories).]

speeds of POS framers: OC-3c, OC-12, OC-12c, OC-48, and OC-48c.³ To provide an OC-48 link, all four DMUs are attached to a single framer, with each DMU providing four OC-3c channels or one OC-12c channel to the framer. To provide an OC-48 clear channel (OC-48c) link, DMU A is configured to attach to a 32-bit framer and the other three DMUs are disabled, providing only interface pins for the data path.
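As a quick consistency check using the OC-n rate given in footnote 3, four DMUs each carrying four OC-3c channels provide 16 × 3 × 51.84 Mb/s = 2488.32 Mb/s, which matches the OC-48 line rate of 48 × 51.84 Mb/s.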
Switch interface
The switch interface (SWI) supports two high-speed data-
aligned synchronous link (DASL)⁴ interfaces, labeled A
and B, supporting standalone operation (wrap), dual-mode
operation (two PowerNPs interconnected), or connection
to an external switch fabric. Each DASL link provides up
to 4 Gb/s of bandwidth. The DASL links A and B can be
used in parallel, with one acting as the primary switch
interface and the other as an alternate switch interface
for increased system availability. The DASL interface
is frequency-synchronous, which removes the need for
asynchronous interfaces that introduce additional interface
latency. The ingress SWI side sends data to the switch
fabric, and the egress SWI side receives data from the
switch fabric. The DASL interface enables up to 64
network processors to be interconnected using an external
switch fabric.
The ingress switch data mover (SDM) is the logical
interface between the ingress enqueuer/dequeuer/
scheduler (EDS) packet data flow, also designated as the ingress DF, and the switch fabric cell data flow. The
ingress SDM segments the packets into 64-byte switch
cells and passes the cells to the ingress SWI. The egress
SDM is the logical interface between the switch fabric cell
data flow and the packet data flow of the egress EDS, also designated as the egress DF. The egress DF reassembles the switch fabric cells back into packets. There is also an internal wrap link which enables traffic generated by the
ingress side of the PowerNP to move to the egress side
without going out of the chip.
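A minimal sketch of the segmentation arithmetic (not the PowerNP hardware algorithm): the text specifies 64-byte cells, but how much of each cell is payload versus cell header is not given here, so it is left as a parameter.

    #include <stddef.h>

    /* Number of fixed-size switch cells needed for one packet, given the
     * payload bytes each cell can carry (cell size minus any cell header). */
    static size_t cells_for_packet(size_t packet_len, size_t payload_per_cell)
    {
        return (packet_len + payload_per_cell - 1) / payload_per_cell;  /* ceil */
    }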
Data flow and traffic management
The ingress DF interfaces with the ingress PMM, the
EPC, and the SWI. Packets that have been received on
the ingress PMM are passed to the ingress DF. The
ingress DF collects the packet data in its internal data
store (DS) memory. When it has received sufficient data
(i.e., the packet header), the ingress DF enqueues the
data to the EPC for processing. Once the EPC processes
the packet, it provides forwarding and QoS information to
the ingress DF. The ingress DF then invokes a hardware-configured flow-control mechanism and then either
discards the packet or places it in a queue to await
transmission. The ingress DF schedules all packets that
cross the ingress SWI. After it selects a packet, the ingress
DF passes the packet to the ingress SWI.
The ingress DF invokes flow control when packet data enters the network processor. When the ingress DS is sufficiently congested, the flow-control actions discard packets. The traffic-management software uses the information about the congestion state of the DF, the rate at which packets arrive, the current status of the DS, and the current status of target blades to compute transmit probabilities for various flows. The ingress DF has hardware-assisted flow control which uses the software-
computed transmit probabilities along with tail drop
congestion indicators to determine whether a forwarding
or discard action should be taken.
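A minimal sketch of the forward-or-discard decision just described, with hypothetical names and a software random source standing in for the hardware: tail drop applies when the data store is exhausted, and otherwise the software-computed transmit probability drives a probabilistic choice.

    #include <stdbool.h>
    #include <stdlib.h>

    /* transmit_prob is the software-computed probability (0.0 to 1.0) for this
     * flow; ds_full is the tail-drop congestion indicator for the data store. */
    static bool forward_packet(double transmit_prob, bool ds_full)
    {
        if (ds_full)                               /* tail drop                 */
            return false;
        double r = (double)rand() / RAND_MAX;      /* stand-in for hardware RNG */
        return r <= transmit_prob;                 /* forward with probability p */
    }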
The egress DF interfaces with the egress SWI, the EPC,
and the egress PMM. Packets that have been received on
the egress SWI are passed to the egress DF. The egress
DF collects the packet data in its external DS memory.
The egress DF enqueues the packet to the EPC for
processing. Once the EPC processes the packet, it
provides forwarding and QoS information to the egress
DF. The egress DF then enqueues the packet either to the
egress scheduler, when enabled, or to a target port queue
for transmission to the egress PMM. The egress DF
invokes a hardware-assisted flow-control mechanism, like the ingress DF, when packet data enters the network processor. When the egress DS is sufficiently congested, the flow-control actions discard packets.
The egress scheduler provides traffic-shaping functions
for the network processor on the egress side. It addresses
functions that enable QoS mechanisms required by
applications such as the Internet protocol (IP)-differentiated
services (DiffServ), multiprotocol label switching (MPLS),
traffic engineering, and virtual private networks (VPNs).
The scheduler manages bandwidth on a per-packet basis
by determining the bandwidth required by a packet (i.e.,
the number of bytes to be transmitted) and comparing this
against the bandwidth permitted by the configuration of the packet flow queue. The bandwidth used by a first packet determines when the scheduler will permit the transmission of a subsequent packet of a flow queue. The scheduler supports traffic shaping for 2K flow queues.
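A minimal sketch of this per-flow shaping rule (hypothetical types; not the scheduler's internal representation): the bytes sent by one packet push out the time at which the next packet of the same flow queue becomes eligible.

    #include <stdint.h>

    struct flow_queue {
        uint64_t next_eligible_ns;   /* earliest time the next packet may go  */
        uint64_t rate_bps;           /* configured (shaped) rate of the queue */
    };

    /* Charge one transmitted packet against the flow queue's configured rate. */
    static void account_transmission(struct flow_queue *fq,
                                     uint64_t now_ns, uint32_t packet_bytes)
    {
        uint64_t start = now_ns > fq->next_eligible_ns ? now_ns
                                                       : fq->next_eligible_ns;
        /* serialization time of this packet at the configured rate, in ns */
        uint64_t tx_ns = (uint64_t)packet_bytes * 8u * 1000000000ull
                         / fq->rate_bps;
        fq->next_eligible_ns = start + tx_ns;
    }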
Embedded processor complex
The embedded processor complex (EPC) performs all
processing functions for the PowerNP. It provides and
controls the programmability of the network processor.
In general, the EPC accepts data for processing from
both the ingress and egress DFs. The EPC, under
³The transmission rate of OC-n is n × 51.84 Mb/s. For example, OC-12 runs at 622.08 Mb/s.
⁴Other switch interfaces, such as CSIX, can currently be supported via an interposer chip. On-chip support for CSIX will be provided in a future version of the network processor.

References
● IP Mobility Support.
● Algorithms for packet classification.
● J. Heinanen et al., A Two Rate Three Color Marker.
● J. Heinanen et al., A Single Rate Three Color Marker.
● Building a robust software-based router using network processors.