J. R. Allen, Jr.
B. M. Bass
C. Basso
R. H. Boivie
J. L. Calvignac
G. T. Davis
L. Frelechoux
M. Heddes
A. Herkersdorf
A. Kind
J. F. Logan
M. Peyravian
M. A. Rinaldi
R. K. Sabhikhi
M. S. Siegel
M. Waldvogel
IBM PowerNP network processor: Hardware, software, and applications
Deep packet processing is migrating to the edges of service
provider networks to simplify and speed up core functions. On
the other hand, the cores of such networks are migrating to the
switching of high-speed traffic aggregates. As a result, more
services will have to be performed at the edges, on behalf of
both the core and the end users. Associated network equipment
will therefore require high flexibility to support evolving high-
level services as well as extraordinary performance to deal with
the high packet rates. Whereas, in the past, network equipment
was based either on general-purpose processors (GPPs) or
application-specific integrated circuits (ASICs), favoring
flexibility over speed or vice versa, the network processor
approach achieves both flexibility and performance. The key
advantage of network processors is that hardware-level
performance is complemented by flexible software architecture.
This paper provides an overview of the IBM PowerNP™ NP4GS3 network processor and how it addresses these issues.
Its hardware and software design characteristics and its
comprehensive base operating software make it well suited
for a wide range of networking applications.
Introduction
The convergence of telecommunications and computer
networking into next-generation networks poses
challenging demands for high performance and flexibility.
Because of the ever-increasing number of connected
end users and end devices, link speeds in the core will
probably exceed 40 Gb/s in the next few years. At the
same time, forwarding intelligence will migrate to the
edges of service provider networks to simplify and speed
up core functions.¹ Since high-speed traffic aggregates will
be switched in the core, more services will be required at
the edge. In addition, more sophisticated end user services
lead to further demands on edge devices, calling for high
flexibility to support evolving high-level services as well as
performance to deal with associated high packet rates.
Whereas, in the past, network products were based either
on GPPs or ASICs, favoring flexibility over speed or vice
versa, the network processor approach achieves both
flexibility and performance.
Current rapid developments in network protocols and
applications push the demands for routers and other
network devices far beyond doing destination address
lookups to determine the output port to which the packet
should be sent. Network devices must inspect deeper into
the packet to achieve content-based forwarding; perform
protocol termination and gateway functionality for server
offloading and load balancing; and require support for
higher-layer protocols. Traditional hardware design, in
which ASICs are used to perform the bulk of processing
load, is not suited for the complex operations required
and the new and evolving protocols that must be
processed. Offloading the entire packet processing to a
GPP, not designed for packet handling, causes additional
difficulties. Recently, field-programmable gate arrays
(FPGAs) have been used. They allow processing to be
offloaded to dedicated hardware without having to
undergo the expensive and lengthy design cycles
commonly associated with ASICs. While FPGAs are
now large enough to accommodate the gates needed
for handling simple protocols, multiple and complex
protocols are still out of reach.
¹The term edge denotes the point at which traffic from multiple customer premises enters the service provider network to begin its journey toward the network core. Core devices aggregate and move traffic from many edge devices.
Copyright 2003 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each
reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this
paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of
this paper must be obtained from the Editor.
0018-8646/03/$5.00 © 2003 IBM

This is further intensified by their relatively slow clock speeds and long on-chip routing delays, which rule out FPGAs for complex applications.
Typical network processors have a set of programmable
processors designed to efficiently execute an instruction set specifically designed for packet processing and forwarding. Overall performance is further enhanced with the inclusion of specialized coprocessors (e.g., for table lookup or checksum computation) and enhancements to the data flow supporting necessary packet modifications.
However, not only is the instruction set customized for
packet processing and forwarding; the entire design
of the network processor, including execution
environment, memory, hardware accelerators, and
bus architecture, is optimized for high-performance
packet handling.
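As an illustration of the kind of per-packet work that such coprocessors absorb, the following C sketch computes the standard Internet checksum (RFC 1071) over a header in software. It is not PowerNP code; it simply shows the fixed, repetitive halfword arithmetic that a checksum assist performs at line rate instead of the programmable processors.

    #include <stddef.h>
    #include <stdint.h>

    /* RFC 1071 Internet checksum over a header (checksum field assumed zeroed).
     * A software baseline for the work that a checksum coprocessor offloads. */
    static uint16_t internet_checksum(const uint8_t *hdr, size_t len)
    {
        uint32_t sum = 0;

        for (size_t i = 0; i + 1 < len; i += 2)           /* sum 16-bit words  */
            sum += (uint32_t)((hdr[i] << 8) | hdr[i + 1]);
        if (len & 1)                                      /* odd trailing byte */
            sum += (uint32_t)hdr[len - 1] << 8;

        while (sum >> 16)                                 /* fold the carries  */
            sum = (sum & 0xFFFF) + (sum >> 16);
        return (uint16_t)~sum;                            /* one's complement  */
    }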
The key advantage of network processors is that
hardware-level performance is complemented by
horizontally layered software architecture. On the lowest
layer, the forwarding instruction set together with the
overall system architecture determines the programming
model. At that layer, compilation tools may help to
abstract some of the specifics of the hardware layer by
providing support for high-level programming language
syntax and packet handling libraries [1]. The interface to
the next layer is typically implemented by an interprocess-
communication protocol so that control path functionality
can be executed on a control point (CP), which provides
extended and high-level control functions through a
traditional GPP. With a defined application programming
interface (API) at this layer, a traditional software
engineering approach for the implementation of network
services can be followed. By providing an additional
software layer and API which spans more than one
network node, a highly programmable and flexible
network can be implemented. These layers are shown
in Figure 1 as hardware, software, and applications,
respectively, and are supported by tools and a reference
implementation.
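As a rough sketch of this layering (all names below are hypothetical and not the PowerNP APIs), control-path software on the CP calls a service-level function, which marshals the request into a message and hands it to the interprocess-communication channel toward the network processor, where the data-plane code applies the update:

    #include <stdint.h>

    /* Hypothetical message carried over the CP-to-network-processor channel. */
    struct ipc_msg {
        uint16_t opcode;        /* e.g., ADD_ROUTE                 */
        uint32_t prefix, mask;  /* IPv4 route being installed      */
        uint16_t out_port;      /* output port resolved by the CP  */
    };

    /* Stand-in transport; a real system would send this over the CP link. */
    static int ipc_send(const struct ipc_msg *m) { (void)m; return 0; }

    /* Service-layer API on the control point: hides IPC details from
     * traditional control-plane software such as routing daemons. */
    int np_add_route(uint32_t prefix, uint32_t mask, uint16_t out_port)
    {
        struct ipc_msg m = { .opcode = 1 /* ADD_ROUTE */,
                             .prefix = prefix, .mask = mask,
                             .out_port = out_port };
        return ipc_send(&m);    /* data-plane code applies the update */
    }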
Flexibility through ease of programmability at line speed
is demanded by continuing increases in the number of
approaches to networking [2–4]:
● Scalability for traffic engineering, quality of service (QoS), and the integration of wireless networks in a unified packet-based next-generation network requires traffic differentiation and aggregation. These functions are based on information in packet headers at various protocol layers. The higher up the protocol stack the information originates, the higher the semantic content, and the more challenging is the demand for flexibility and performance in the data path.
● The adoption of the Internet by businesses, governments, and other institutions has increased the importance of security functions (e.g., encryption, authentication, firewalling, and intrusion detection).
● Large investments in legacy networks have forced network providers to require a seamless migration strategy from existing circuit-switched networks to next-generation networks. Infrastructures must be capable of incremental modification of functionalities.
● Networking equipment should be easy to adapt to emerging standards, since the pace of the introduction of new standards is accelerating.
● Network equipment vendors see the need of service providers for flexible service differentiation and increased time-to-market pressure.
This paper provides an overview of the IBM PowerNP* NP4GS3² network processor platform, containing the components of Figure 1, and how it addresses those needs. The specific hardware and software design characteristics and the comprehensive base operating software of this network processor make it a complete solution for a wide range of applications. Because of its associated advanced development and testing tools combined with extensive software and reference implementations, rapid prototyping and development of new high-performance applications are significantly easier than with either GPPs or ASICs.
System architecture
From a system architecture viewpoint, network processors
can be divided into two general models: the run-to-completion (RTC) and pipeline models, as shown in Figure 2.

²In this paper the abbreviated term PowerNP is used to designate the IBM PowerNP NP4GS3, which is a high-end member of the IBM network processor family.
[Figure 1. Components of a network processor platform: network processor applications (e.g., virtual private network, load balancing, firewall); network processor software (e.g., management services, transport services, protocol services, traffic engineering services); network processor hardware (e.g., processors, coprocessors, flow control, packet alteration, classification); network processor tools (e.g., assembler, debugger, simulator); and a network processor reference implementation.]

The RTC model provides a simple programming approach
which allows the programmer to see a single thread that
can access the entire instruction memory space and all
of the shared resources such as control memory, tables,
policers, and counters. The model is based on the
symmetric multiprocessor (SMP) architecture, in which
multiple CPUs share the same memory [5]. The CPUs
are used as a pool of processing resources, all executing
simultaneously, either processing data or in an idle mode
waiting for work. The PowerNP architecture is based on
the RTC model.
In the pipeline model, each pipeline CPU is optimized
to handle a certain category of tasks and instructions. The
application program is partitioned among pipeline stages
[6]. A weakness in the pipeline model is the necessity
of evenly distributing the work at each segment of the
pipeline. When the work is not properly distributed,
the flow of work through the pipeline is disrupted. For
example, if one segment is over-allocated, that segment
of the pipeline stalls preceding segments and starves
successive segments.
Even when processing is identical for every packet, the
code path must be partitioned according to the number of
pipeline stages required. Of course, code cannot always be
partitioned ideally, leading to unused processor cycles in
some pipeline stages. Additional processor cycles are
required to pass packet context from one stage to the
next. Perhaps a more significant challenge of a pipelined
programming model is in dealing with changes, since a
relatively minor code change may require a programmer
to start from scratch with code partitioning. The RTC
programming model avoids the problems associated
with pipelined designs by allowing the complete
functionality to reside within a single contiguous
program flow.
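A small back-of-the-envelope comparison, using made-up cycle counts purely for illustration, shows why this matters: a pipeline is paced by its slowest stage, whereas a pool of run-to-completion processors is paced only by the total path length divided across the pool.

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical per-stage cycle counts for one packet (illustrative). */
        const int stage_cycles[4] = { 50, 120, 60, 70 };   /* total = 300     */
        const int stages = 4;

        int total = 0, slowest = 0;
        for (int i = 0; i < stages; i++) {
            total += stage_cycles[i];
            if (stage_cycles[i] > slowest)
                slowest = stage_cycles[i];
        }

        /* Pipeline: one packet completes every 'slowest' cycles.            */
        /* RTC pool of four processors: four packets complete every 'total'. */
        printf("pipeline: 1 packet per %d cycles\n", slowest);      /* 120   */
        printf("RTC pool: 1 packet per %.1f cycles\n",
               (double)total / stages);                             /* 75.0  */
        return 0;
    }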
Figure 3 shows the high-level architecture of the PowerNP, a high-end member of the IBM network processor family, which integrates medium-access controls (MACs), switch interface, processors, search engines, traffic management, and an embedded IBM PowerPC* processor which provides design flexibility for applications. The PowerNP has the following main components: embedded processor complex (EPC), data flow (DF), scheduler, MACs, and coprocessors.
The EPC processors work with coprocessors to provide
high-performance execution of the application software
and the PowerNP-related management software. The
coprocessors provide hardware-assist functions for
performing common operations such as table searches and
packet alterations. To provide for additional processing
capabilities, there is an interface for attachment of
external coprocessors such as content-addressable
memories (CAMs). The DF serves as the primary data
path for receiving and transmitting network traffic. It provides an interface to multiple large data memories for buffering data traffic as it flows through the network processor. The scheduler enhances the QoS functions provided by the PowerNP. It allows traffic flows to be
scheduled individually per their assigned QoS class for
differentiated services. The MACs provide network
interfaces for Ethernet and packet over SONET (POS).
[Figure 2. Network processor architectural models: (a) run-to-completion (RTC) model; (b) pipeline model.]
[Figure 3. PowerNP high-level architecture: embedded processor complex (picoprocessors and embedded PowerPC), data flow, traffic management (scheduler), internal and external coprocessors, MACs, network interface ports, and the switch fabric interface.]

Functional blocks
Figure 4 shows the main functional blocks that make up
the PowerNP architecture. In the following sections we
discuss each functional block within the PowerNP.
Physical MAC multiplexer
The physical MAC multiplexer (PMM) moves data
between physical layer devices and the PowerNP. The
PMM interfaces with the external ports of the network
processor in the ingress PMM and egress PMM directions.
The PMM includes four data mover units (DMUs),
labeled A, B, C, and D. Each of the four DMUs can be
independently configured as an Ethernet MAC or a POS
interface. The PMM keeps a set of performance statistics
on a per-port basis in either mode. Each DMU moves
data at 1 Gb/s in both the ingress and the egress
directions. There is also an internal wrap link that
enables trafc generated by the egress side of the
PowerNP to move to the ingress side without going out
of the chip.
When a DMU is configured for Ethernet, it can support either one port of 1 Gigabit Ethernet or ten ports of Fast Ethernet (10/100 Mb/s). To support 1 Gigabit Ethernet, a DMU can be configured as either a gigabit media-independent interface (GMII) or a ten-bit interface (TBI). To support Fast Ethernet, a DMU can be configured as a serial media-independent interface (SMII) supporting ten
Ethernet ports. Operation at 10 or 100 Mb/s is determined
by the PowerNP independently for each port.
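A hedged sketch of this per-DMU choice follows; the types and names are invented for illustration and are not the PowerNP configuration interface. Each DMU is set to one mode, and the mode determines how many external ports it exposes.

    #include <stdio.h>

    /* Hypothetical DMU interface modes taken from the text. */
    enum dmu_mode { DMU_GMII, DMU_TBI, DMU_SMII, DMU_POS };

    /* Ports exposed per DMU: one gigabit port for GMII/TBI, ten 10/100 Mb/s
     * ports for SMII, and one framer attachment for POS (which may itself
     * be channelized). */
    static int dmu_port_count(enum dmu_mode m)
    {
        switch (m) {
        case DMU_GMII:
        case DMU_TBI:  return 1;
        case DMU_SMII: return 10;
        case DMU_POS:  return 1;
        }
        return 0;
    }

    int main(void)
    {
        enum dmu_mode dmus[4] = { DMU_SMII, DMU_SMII, DMU_GMII, DMU_POS };
        int total = 0;
        for (int i = 0; i < 4; i++)
            total += dmu_port_count(dmus[i]);
        printf("external ports in this configuration: %d\n", total);  /* 22 */
        return 0;
    }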
When a DMU is configured for POS mode, it can
support both clear-channel and channelized optical carrier
(OC) interfaces. A DMU supports the following types and
[Figure 4. PowerNP functional block diagram: ingress and egress PMMs (DMU-A through DMU-D), ingress and egress DFs (EDS queue interfaces with internal ingress and external egress data stores), ingress and egress SWIs with SDMs, the egress scheduler, and the EPC (DPPUs with TSEs, hardware classifier, dispatch unit, completion unit, control store arbiter, internal instruction memory, policy manager, counter manager, semaphore manager, LuDefTable, CompTable, free queues, CAB arbiter, debug and single-step control, interrupts and timers, embedded PowerPC 405, mailbox and PCI macros, ingress and egress DS interfaces and arbiters, internal wrap paths, and interfaces to internal and external memories).]

speeds of POS framers: OC-3c, OC-12, OC-12c, OC-48, and OC-48c.³ To provide an OC-48 link, all four DMUs are attached to a single framer, with each DMU providing four OC-3c channels or one OC-12c channel to the framer. To provide an OC-48 clear channel (OC-48c) link, DMU A is configured to attach to a 32-bit framer and the other three DMUs are disabled, providing only interface pins for the data path.
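As a quick consistency check using the OC-n rate given in footnote 3, four DMUs each carrying four OC-3c channels provide 16 × 3 × 51.84 Mb/s = 2488.32 Mb/s, which matches the OC-48 line rate of 48 × 51.84 Mb/s.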
Switch interface
The switch interface (SWI) supports two high-speed data-
aligned synchronous link (DASL)⁴ interfaces, labeled A
and B, supporting standalone operation (wrap), dual-mode
operation (two PowerNPs interconnected), or connection
to an external switch fabric. Each DASL link provides up
to 4 Gb/s of bandwidth. The DASL links A and B can be
used in parallel, with one acting as the primary switch
interface and the other as an alternate switch interface
for increased system availability. The DASL interface
is frequency-synchronous, which removes the need for
asynchronous interfaces that introduce additional interface
latency. The ingress SWI side sends data to the switch
fabric, and the egress SWI side receives data from the
switch fabric. The DASL interface enables up to 64
network processors to be interconnected using an external
switch fabric.
The ingress switch data mover (SDM) is the logical
interface between the ingress enqueuer/dequeuer/
scheduler (EDS) packet data flow, also designated as the ingress DF, and the switch fabric cell data flow. The
ingress SDM segments the packets into 64-byte switch
cells and passes the cells to the ingress SWI. The egress
SDM is the logical interface between the switch fabric cell
data flow and the packet data flow of the egress EDS, also designated as the egress DF. The egress DF reassembles the switch fabric cells back into packets. There is also an internal wrap link which enables traffic generated by the
ingress side of the PowerNP to move to the egress side
without going out of the chip.
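A minimal sketch of the segmentation arithmetic (not the PowerNP hardware algorithm): the text specifies 64-byte cells, but how much of each cell is payload versus cell header is not given here, so it is left as a parameter.

    #include <stddef.h>

    /* Number of fixed-size switch cells needed for one packet, given the
     * payload bytes each cell can carry (cell size minus any cell header). */
    static size_t cells_for_packet(size_t packet_len, size_t payload_per_cell)
    {
        return (packet_len + payload_per_cell - 1) / payload_per_cell;  /* ceil */
    }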
Data flow and traffic management
The ingress DF interfaces with the ingress PMM, the
EPC, and the SWI. Packets that have been received on
the ingress PMM are passed to the ingress DF. The
ingress DF collects the packet data in its internal data
store (DS) memory. When it has received sufficient data
(i.e., the packet header), the ingress DF enqueues the
data to the EPC for processing. Once the EPC processes
the packet, it provides forwarding and QoS information to
the ingress DF. The ingress DF then invokes a hardware-configured flow-control mechanism and then either
discards the packet or places it in a queue to await
transmission. The ingress DF schedules all packets that
cross the ingress SWI. After it selects a packet, the ingress
DF passes the packet to the ingress SWI.
The ingress DF invokes flow control when packet data enters the network processor. When the ingress DS is sufficiently congested, the flow-control actions discard packets. The traffic-management software uses the information about the congestion state of the DF, the rate at which packets arrive, the current status of the DS, and the current status of target blades to compute transmit probabilities for various flows. The ingress DF has hardware-assisted flow control which uses the software-
computed transmit probabilities along with tail drop
congestion indicators to determine whether a forwarding
or discard action should be taken.
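A minimal sketch of the forward-or-discard decision just described, with hypothetical names and a software random source standing in for the hardware: tail drop applies when the data store is exhausted, and otherwise the software-computed transmit probability drives a probabilistic choice.

    #include <stdbool.h>
    #include <stdlib.h>

    /* transmit_prob is the software-computed probability (0.0 to 1.0) for this
     * flow; ds_full is the tail-drop congestion indicator for the data store. */
    static bool forward_packet(double transmit_prob, bool ds_full)
    {
        if (ds_full)                               /* tail drop                 */
            return false;
        double r = (double)rand() / RAND_MAX;      /* stand-in for hardware RNG */
        return r <= transmit_prob;                 /* forward with probability p */
    }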
The egress DF interfaces with the egress SWI, the EPC,
and the egress PMM. Packets that have been received on
the egress SWI are passed to the egress DF. The egress
DF collects the packet data in its external DS memory.
The egress DF enqueues the packet to the EPC for
processing. Once the EPC processes the packet, it
provides forwarding and QoS information to the egress
DF. The egress DF then enqueues the packet either to the
egress scheduler, when enabled, or to a target port queue
for transmission to the egress PMM. The egress DF
invokes a hardware-assisted flow-control mechanism, like the ingress DF, when packet data enters the network processor. When the egress DS is sufficiently congested, the flow-control actions discard packets.
The egress scheduler provides traffic-shaping functions
for the network processor on the egress side. It addresses
functions that enable QoS mechanisms required by
applications such as the Internet protocol (IP)-differentiated
services (DiffServ), multiprotocol label switching (MPLS),
traffic engineering, and virtual private networks (VPNs).
The scheduler manages bandwidth on a per-packet basis
by determining the bandwidth required by a packet (i.e.,
the number of bytes to be transmitted) and comparing this
against the bandwidth permitted by the configuration of the packet flow queue. The bandwidth used by a first packet determines when the scheduler will permit the transmission of a subsequent packet of a flow queue. The scheduler supports traffic shaping for 2K flow queues.
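A minimal sketch of this per-flow shaping rule (hypothetical types; not the scheduler's internal representation): the bytes sent by one packet push out the time at which the next packet of the same flow queue becomes eligible.

    #include <stdint.h>

    struct flow_queue {
        uint64_t next_eligible_ns;   /* earliest time the next packet may go  */
        uint64_t rate_bps;           /* configured (shaped) rate of the queue */
    };

    /* Charge one transmitted packet against the flow queue's configured rate. */
    static void account_transmission(struct flow_queue *fq,
                                     uint64_t now_ns, uint32_t packet_bytes)
    {
        uint64_t start = now_ns > fq->next_eligible_ns ? now_ns
                                                       : fq->next_eligible_ns;
        /* serialization time of this packet at the configured rate, in ns */
        uint64_t tx_ns = (uint64_t)packet_bytes * 8u * 1000000000ull
                         / fq->rate_bps;
        fq->next_eligible_ns = start + tx_ns;
    }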
Embedded processor complex
The embedded processor complex (EPC) performs all
processing functions for the PowerNP. It provides and
controls the programmability of the network processor.
In general, the EPC accepts data for processing from
both the ingress and egress DFs. The EPC, under
³The transmission rate of OC-n is n × 51.84 Mb/s. For example, OC-12 runs at 622.08 Mb/s.
⁴Other switch interfaces, such as CSIX, can currently be supported via an interposer chip. On-chip support for CSIX will be provided in a future version of the network processor.

References
● IP Mobility Support.
● Algorithms for packet classification.
● J. Heinanen et al., A Two Rate Three Color Marker.
● J. Heinanen et al., A Single Rate Three Color Marker.
● Building a robust software-based router using network processors.