
IBM PowerNP network processor: Hardware, software, and applications

TL;DR: An overview of the IBM PowerNP™ NP4GS3 network processor is provided; its hardware and software design characteristics and its comprehensive base operating software make it well suited for a wide range of networking applications.
Abstract: Deep packet processing is migrating to the edges of service provider networks to simplify and speed up core functions. On the other hand, the cores of such networks are migrating to the switching of high-speed traffic aggregates. As a result, more services will have to be performed at the edges, on behalf of both the core and the end users. Associated network equipment will therefore require high flexibility to support evolving high-level services as well as extraordinary performance to deal with the high packet rates. Whereas, in the past, network equipment was based either on general-purpose processors (GPPs) or application-specific integrated circuits (ASICs), favoring flexibility over speed or vice versa, the network processor approach achieves both flexibility and performance. The key advantage of network processors is that hardware-level performance is complemented by flexible software architecture. This paper provides an overview of the IBM PowerNP™ NP4GS3 network processor and how it addresses these issues. Its hardware and software design characteristics and its comprehensive base operating software make it well suited for a wide range of networking applications.

Summary (5 min read)

Introduction

  • The convergence of telecommunications and computer networking into next-generation networks poses challenging demands for high performance and flexibility.
  • In addition, more sophisticated end user services lead to further demands on edge devices, calling for high flexibility to support evolving high-level services as well as performance to deal with associated high packet rates.
  • Traditional hardware design, in which ASICs are used to perform the bulk of processing load, is not suited for the complex operations required and the new and evolving protocols that must be processed.
  • Not only is the instruction set customized for packet processing and forwarding; the entire design of the network processor, including execution environment, memory, hardware accelerators, and bus architecture, is optimized for high-performance packet handling.

System architecture

  • From a system architecture viewpoint, network processors can be divided into two general models: the run-to-completion (RTC) model and the pipeline model; the PowerNP follows the RTC model.
  • The PowerNP NP4GS3 is a high-end member of the IBM network processor family.
  • The model is based on the symmetric multiprocessor (SMP) architecture, in which multiple CPUs share the same memory [5].
  • Even when processing is identical for every packet, the code path must be partitioned according to the number of pipeline stages required.
  • It provides an interface to multiple large data memories for buffering data traffic as it flows through the network processor.

Functional blocks

  • Figure 4 shows the main functional blocks that make up the PowerNP architecture.
  • In the following sections the authors discuss each functional block within the PowerNP.

Physical MAC multiplexer

  • The physical MAC multiplexer (PMM) moves data between physical layer devices and the PowerNP.
  • The PMM interfaces with the external ports of the network processor in the ingress PMM and egress PMM directions.
  • When a DMU is configured for Ethernet, it can support either one port of 1 Gigabit Ethernet or ten ports of Fast Ethernet (10/100 Mb/s).
  • To provide an OC-48 clear channel (OC-48c) link, DMU A is configured to attach to a 32-bit framer and the other three DMUs are disabled, providing only interface pins for the data path.

Switch interface

  • The switch interface (SWI) supports two high-speed dataaligned synchronous link (DASL)4 interfaces, labeled A and B, supporting standalone operation (wrap), dual-mode operation (two PowerNPs interconnected), or connection to an external switch fabric.
  • The DASL links A and B can be used in parallel, with one acting as the primary switch interface and the other as an alternate switch interface for increased system availability.
  • The ingress SWI side sends data to the switch fabric, and the egress SWI side receives data from the switch fabric.
  • The egress SDM is the logical interface between the switch fabric cell data flow and the packet data flow of the egress EDS, also designated as the egress DF.
  • There is also an “internal wrap” link which enables traffic generated by the ingress side of the PowerNP to move to the egress side without going out of the chip.

Data flow and traffic management

  • The ingress DF interfaces with the ingress PMM, the EPC, and the SWI.
  • After it selects a packet, the ingress DF passes the packet to the ingress SWI.
  • When the ingress DS is sufficiently congested, the flow-control actions discard packets.
  • The egress DF enqueues the packet to the EPC for processing.
  • The scheduler manages bandwidth on a per-packet basis by determining the bandwidth required by a packet (i.e., the number of bytes to be transmitted) and comparing this against the bandwidth permitted by the configuration of the packet flow queue.

Embedded processor complex

  • The embedded processor complex (EPC) performs all processing functions for the PowerNP.
  • Within the EPC, eight dyadic protocol processor units containing processors, coprocessors, and hardware accelerators support functions such as packet parsing and classification, high-speed pattern search, and internal chip management.
  • The GPH-Resp thread processes responses from the embedded PowerPC.
  • The TSE coprocessor provides hardware search operations for full match (FM) trees, longest prefix match (LPM) trees, and software-managed trees (SMTs); a software sketch of an LPM lookup follows this list.
  • Each thread has read and write access to the ingress and egress DS through a DS coprocessor.
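
A small software analogue may help make the LPM trees concrete. The C sketch below is purely illustrative: it assumes a plain binary trie and invented names (trie_node, lpm_lookup), and implies nothing about the TSE's actual tree formats, node layout, or memory organization.

```c
#include <stdint.h>

/* Hypothetical binary-trie node: one child per bit of an IPv4 address. */
struct trie_node {
    struct trie_node *child[2];
    int      has_entry;     /* a prefix terminates at this node          */
    uint32_t next_hop;      /* forwarding result stored with that prefix */
};

/* Walk the trie bit by bit, remembering the last node that carried an
 * entry; that node holds the longest matching prefix for the address.  */
static int lpm_lookup(const struct trie_node *root, uint32_t dst_addr,
                      uint32_t *next_hop)
{
    const struct trie_node *node = root;
    int found = 0;

    for (int bit = 31; bit >= 0 && node != NULL; bit--) {
        if (node->has_entry) {              /* longest match so far */
            *next_hop = node->next_hop;
            found = 1;
        }
        node = node->child[(dst_addr >> bit) & 1u];
    }
    if (node != NULL && node->has_entry) {  /* possible /32 (host) entry */
        *next_hop = node->next_hop;
        found = 1;
    }
    return found;
}
```

By contrast, a full-match search returns a result only when the key matches exactly, rather than keeping the best prefix seen along the path.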

Ingress side

  • The ingress PMM receives a packet from an external physical layer device and forwards it to the ingress DF, which enqueues the packet to the EPC (a simplified sketch of this ingress path follows the list).
  • The code examines the information from the HC and may examine the data further; it assembles search keys and launches the TSE.
  • Packet data moves into the memory buffer of the DS coprocessor.
  • Forwarding and packet alteration information is identified by the results of the search.
  • With the help of the ingress DF, the ingress switch data mover (I-SDM) segments the packets from the switch interface queues into 64-byte cells and inserts cell header and packet header bytes as they are transmitted to the SWI.
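
As a rough summary of this ingress path in code form, the C sketch below strings together dispatch hints from the hardware classifier, search-key assembly, a table search, and hand-off toward the switch interface. Every type and function name here (packet, hc_hints, tse_search, enqueue_to_swi, and so on) is a hypothetical stand-in, not a PowerNP interface.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for objects the hardware provides. */
struct packet   { const uint8_t *data; uint32_t len; };
struct hc_hints { uint16_t l3_offset; uint8_t is_ipv4; };   /* from the hardware classifier */
struct fwd_info { uint16_t target_blade; uint16_t target_port; uint8_t qos_class; };

int  tse_search(const uint8_t *key, size_t key_len, struct fwd_info *out); /* TSE stand-in        */
void enqueue_to_swi(const struct packet *p, const struct fwd_info *fwd);   /* ingress DF stand-in */

/* One run-to-completion pass over a dispatched packet. */
void ingress_thread(const struct packet *p, const struct hc_hints *hc)
{
    if (!hc->is_ipv4)
        return;                                   /* non-IP handling omitted in this sketch */

    uint8_t key[4];                               /* assemble a search key from fields the  */
    memcpy(key, p->data + hc->l3_offset + 16, 4); /* classifier located (IPv4 destination)  */

    struct fwd_info fwd;
    if (tse_search(key, sizeof key, &fwd))        /* launch the table search                */
        enqueue_to_swi(p, &fwd);                  /* forwarding info found: hand off toward */
                                                  /* the ingress DF/SWI for segmentation    */
    /* else: discard or redirect to the control point; omitted here */
}
```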

Egress side

  • The egress SWI receives a packet from a switch fabric, from another PowerNP processor, or from the ingress SWI of the device.
  • The code examines the information from the HC and may examine the data further; it assembles search keys and launches the TSE.
  • In flexible packet alteration, the code allocates additional buffers, and the DS coprocessor places data in these buffers.
  • The enqueue coprocessor develops the necessary information to enqueue the packet to the egress DF and provides it to the CU, which guarantees the packet order as the data moves from the 32 threads of the DPPUs to the egress DF queues (a software analogue of this ordering is sketched after this list).
  • The egress DF selects packets for transmission from the target port queue and moves their data to the egress PMM.
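
The ordering guarantee mentioned in the enqueue-coprocessor item above can be mimicked in software with a sequence-numbered reorder window, sketched below. This is only an analogue of the idea; it is not how the completion unit is implemented, and the names, window size, and missing locking are deliberate simplifications.

```c
#include <stdint.h>

#define REORDER_SLOTS 64u          /* illustrative window size, not a PowerNP parameter */

struct completed { int valid; uint32_t packet_id; };

static struct completed ring[REORDER_SLOTS];
static uint32_t next_to_release;   /* sequence number assigned at dispatch, expected next */

/* Called when a worker thread finishes the packet dispatched with sequence
 * number 'seq'. Packets are released to the egress queues strictly in
 * dispatch order, so one that finishes early waits for slower predecessors.
 * (The locking a real multi-threaded version needs is omitted.)            */
void complete_packet(uint32_t seq, uint32_t packet_id,
                     void (*release)(uint32_t packet_id))
{
    ring[seq % REORDER_SLOTS].valid = 1;
    ring[seq % REORDER_SLOTS].packet_id = packet_id;

    while (ring[next_to_release % REORDER_SLOTS].valid) {   /* drain in order */
        release(ring[next_to_release % REORDER_SLOTS].packet_id);
        ring[next_to_release % REORDER_SLOTS].valid = 0;
        next_to_release++;
    }
}
```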

System software architecture

  • The PowerNP system software architecture is defined around the concept of partitioning control and data planes, as shown in Figure 6.
  • This is consistent with industry and standards directions, for example, the Network Processing Forum (NPF) and the ForCES working group of the Internet Engineering Task Force (IETF).
  • The non-performance-critical functions of the control plane run on a GPP, while the performance-critical data plane functions run on the PowerNP processing elements (i.e., CLPs).
  • There is no need to use the instruction memory on the network processor for IP options processing, given that options are associated with only a small percentage of IP packets.
  • The software architecture and programming model describes the data plane functions and APIs, the control plane functions and APIs, and the communication model between these components.

Data plane

  • The data plane is structured as two major components: a system library and an application library.
  • These functions provide a hardware abstraction layer that can be used either from the control plane, using a message-passing interface, or from the data plane software, using API calls.
  • These components plus the overall software design help a programmer to develop the PowerNP networking applications quickly.
  • [Figure: PowerNP system software architecture, showing the management application, NPAS core, and middleware; the PowerNP application library and system library; and the PowerNP software development toolkit.]
  • The MBC model provides scalability in configuration, supporting multiple network processors and control processors, and achieves a scalable system architecture.

Control plane

  • The control plane networking applications interface with the data plane functions using the network processor application services (NPAS) layer, which exposes two types of APIs; one type is the protocol services API (such as IPv4 and MPLS).
  • This set of services handles hardware-independent protocol objects.
  • The control plane software provides a way to manage the PowerNP, and it also provides a set of APIs which can be used to support generic control plane protocol stacks.
  • The look and feel of the APIs is consistent inside the NPAS.
  • Because the NPAS is designed to be operating-system-independent and to connect to the network processor via any transport mechanism, it can easily be ported and used with any application, as sketched below.
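
One common way to realize such an OS- and transport-independent control-plane API is to marshal each call into a message handed to a pluggable transport. The C sketch below shows that pattern only; the message layout, constant, and names (npas_add_ipv4_route, transport_send_fn) are invented for illustration and are not the actual NPAS API.

```c
#include <stdint.h>
#include <string.h>

/* Pluggable transport: could be PCI, Ethernet, or an in-process queue. */
typedef int (*transport_send_fn)(const void *msg, size_t len);

#define MSG_IPV4_ROUTE_ADD 0x0101        /* invented message type */

/* Illustrative wire format for one protocol-services request. */
struct route_add_msg {
    uint16_t msg_type;
    uint16_t reserved;
    uint32_t prefix;                     /* IPv4 prefix (network byte order assumed) */
    uint8_t  prefix_len;
    uint8_t  pad[3];
    uint32_t next_hop;
};

/* Hardware-independent API call: marshal the request and hand it to the
 * transport; the data-plane side updates its forwarding table on receipt. */
int npas_add_ipv4_route(transport_send_fn send,
                        uint32_t prefix, uint8_t prefix_len, uint32_t next_hop)
{
    struct route_add_msg m;
    memset(&m, 0, sizeof m);
    m.msg_type   = MSG_IPV4_ROUTE_ADD;
    m.prefix     = prefix;
    m.prefix_len = prefix_len;
    m.next_hop   = next_hop;
    return send(&m, sizeof m);
}
```

The same pattern lets one control point drive one or several network processors over whatever transport is available, without changing the calling application.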

Software development toolkit

  • The PowerNP software development toolkit provides a set of tightly integrated development tools which address each phase of the software development process.
  • Through the use of Tcl/Tk scripts, developers can interact with the simulation model and perform a wide range of functions in support of packet traffic generation and analysis.
  • [Figure 8: PowerNP software development toolkit, comprising NPAsm, NPTest, NPSim, NPScope, and NPProfile with the chip-level simulation model, plus the RISCWatch probe and application attached to the PowerNP's embedded PowerPC 405 over the JTAG interface, with an Ethernet connection to the network processor.]
  • NPProfile analyzes simulation event information contained in a message log file produced by NPScope to accumulate relevant data regarding the performance of picocode execution.

Networking applications support

  • The power and flexibility of the PowerNP is useful in supporting a wide range of existing and emerging applications.
  • A number of networking applications have been implemented on the PowerNP chip, and the following sections discuss two of them.

Small group multicast

  • Small group multicast (SGM) [9] is a new approach to IP multicast that makes multicast practical for applications such as IP telephony, videoconferencing, and multimedia “e-meetings.”
  • Like today’s multicast schemes, SGM sends at most one copy of any given packet across any network link, thus minimizing the use of network bandwidth.
  • A router performs a route table lookup to determine the “next hop” for each of the destinations.
  • SGM relies on programmability that allows the above functions to be combined in nontraditional ways (see the sketch after this list).
  • A preliminary version of SGM has been implemented on the PowerNP chip.
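
The forwarding step at the heart of SGM, grouping the destinations listed in the packet by their next hop and sending one copy per distinct next hop, can be sketched as follows. This is an illustration of the idea only, not the PowerNP implementation; route_lookup and send_copy are hypothetical stand-ins.

```c
#include <stdint.h>
#include <stddef.h>

#define SGM_MAX_DESTS 16      /* a "small group" of destinations carried in the packet */

uint32_t route_lookup(uint32_t dest);                               /* stand-in: returns a next-hop id */
void send_copy(uint32_t next_hop, const uint32_t *dests, size_t n); /* stand-in: emit one copy         */

/* Forward one SGM packet: partition its destination list by next hop and
 * transmit at most one copy per next hop, so no link carries duplicates.  */
void sgm_forward(const uint32_t *dests, size_t ndests)
{
    uint32_t hop_of[SGM_MAX_DESTS];
    int      done[SGM_MAX_DESTS] = {0};

    if (ndests > SGM_MAX_DESTS)
        ndests = SGM_MAX_DESTS;

    for (size_t i = 0; i < ndests; i++)
        hop_of[i] = route_lookup(dests[i]);        /* route table lookup per destination */

    for (size_t i = 0; i < ndests; i++) {
        if (done[i])
            continue;
        uint32_t subset[SGM_MAX_DESTS];
        size_t   n = 0;
        for (size_t j = i; j < ndests; j++) {      /* gather destinations behind the same hop */
            if (!done[j] && hop_of[j] == hop_of[i]) {
                subset[n++] = dests[j];
                done[j] = 1;
            }
        }
        send_copy(hop_of[i], subset, n);           /* the copy lists only those destinations */
    }
}
```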

GPRS tunneling protocol

  • General packet radio service (GPRS), a set of protocols for converging mobile data with IP packet data, presents new challenges to the manufacturers of GPRS support nodes (GSNs), because the bandwidth available to mobile terminals increases significantly with advances in wireless technology (a simplified sketch of GTP handling follows this list).
  • The authors consider the support of GTP as a typical networking application that requires a high memory/bandwidth product and deeper packet processing than the common packet forwarding of an IP router.
  • Traffic counters associated with the context are incremented to account for the data transmission.
  • The decapsulation process requires the retrieval of the GTP context from the IP address of the inner IP header.
  • Packet reordering based on the GTP sequence number requires the temporary storage of misordered packets in a per-context associated reordering queue.
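
A hedged sketch of the GTP handling described above (per-context traffic counters, context retrieval from the inner IP address, and sequence-number reordering into a per-context queue) is shown below in C. All structures and names are invented for illustration and simplify real GSN behavior considerably.

```c
#include <stdint.h>
#include <stddef.h>

struct packet { const uint8_t *data; uint32_t len; uint16_t gtp_seq; };

/* Hypothetical per-tunnel GTP context, located via the inner IP address. */
struct gtp_context {
    uint64_t bytes_in, pkts_in;       /* traffic counters used for accounting */
    uint16_t next_seq;                /* next GTP sequence number expected    */
    struct packet *reorder_q[8];      /* small per-context reordering queue   */
};

struct gtp_context *context_by_inner_ip(uint32_t inner_dst);  /* stand-in context lookup   */
void deliver(struct packet *p);                               /* stand-in: forward payload */

/* Decapsulate one GTP packet: account for it, then release it in sequence order. */
void gtp_decap(struct packet *p, uint32_t inner_dst)
{
    struct gtp_context *ctx = context_by_inner_ip(inner_dst);
    if (ctx == NULL)
        return;                                  /* unknown tunnel: drop (error path omitted) */

    ctx->bytes_in += p->len;                     /* per-context traffic counters */
    ctx->pkts_in  += 1;

    if (p->gtp_seq == ctx->next_seq) {           /* in order: deliver immediately */
        deliver(p);
        ctx->next_seq++;
        while (ctx->reorder_q[ctx->next_seq % 8] != NULL) {   /* drain queued successors */
            struct packet *q = ctx->reorder_q[ctx->next_seq % 8];
            ctx->reorder_q[ctx->next_seq % 8] = NULL;
            deliver(q);
            ctx->next_seq++;
        }
    } else {
        ctx->reorder_q[p->gtp_seq % 8] = p;      /* out of order: park by sequence number
                                                    (slot collisions ignored in this sketch) */
    }
}
```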

Performance

  • The PowerNP picoprocessors provide 2128 MIPS of aggregate processing capability.
  • Similarly, some cycles are not usable for instruction execution (e.g., when both threads of a CLP are waiting for a search result).
  • To quantify expected performance for specific applications, associated code paths are profiled in terms of memory accesses, coprocessor use, program flow (i.e., branch instructions), and overlap of coprocessor operations with instructions or other coprocessor operations.
  • Code paths from the PowerNP software package that achieve OC-48 line speed at minimum packet size include Border Gateway Protocol (BGP) layer-3 routing.
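
As a rough, illustrative calculation rather than a figure from the paper: dividing the 2128-MIPS aggregate by the 6.1 million packets per second quoted elsewhere on this page for 48-byte packets at media speed gives the order of magnitude of the per-packet instruction budget, ignoring unusable cycles and any overlap with coprocessor operations:

    2128 × 10⁶ instructions/s ÷ 6.1 × 10⁶ packets/s ≈ 350 instructions per packet (aggregate, across all picoprocessor threads)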

Summary

  • As the demand on network edge equipment increases to provide more services on behalf of the core and the end user, the role of flexible and programmable network processors becomes more critical.
  • The authors have discussed the challenges and demands posed by nextgeneration networks and have described how network processors can address these issues by performing highly sophisticated packet processing at line speed.
  • Its hardware and software design characteristics make it an ideal component for a wide range of networking applications.
  • Its run-to-completion model supports a simple programming model and a scalable system architecture, which provide abundant functionality and headroom at line speed.
  • Because of the availability of associated advanced development and simulation tools, combined with extensive reference implementations, rapid prototyping and development of new high-performance applications are significantly easier than with either GPPs or ASICs.

Acknowledgments

  • The authors gratefully acknowledge the significant contributions of a large number of their colleagues at IBM who, through years of research, design, and development, have helped to create and document the PowerNP network processor.
  • The processor would not have been possible without the appreciable efforts and contributions by those colleagues.
  • *Trademark or registered trademark of International Business Machines Corporation.


J. R. Allen, Jr.
B. M. Bass
C. Basso
R. H. Boivie
J. L. Calvignac
G. T. Davis
L. Frelechoux
M. Heddes
A. Herkersdorf
A. Kind
J. F. Logan
M. Peyravian
M. A. Rinaldi
R. K. Sabhikhi
M. S. Siegel
M. Waldvogel
IBM PowerNP network processor: Hardware, software, and applications
Introduction
The convergence of telecommunications and computer
networking into next-generation networks poses
challenging demands for high performance and flexibility.
Because of the ever-increasing number of connected
end users and end devices, link speeds in the core will
probably exceed 40 Gb/s in the next few years. At the
same time, forwarding intelligence will migrate to the
edges of service provider networks to simplify and speed
up core functions.¹ Since high-speed traffic aggregates will
be switched in the core, more services will be required at
the edge. In addition, more sophisticated end user services
lead to further demands on edge devices, calling for high
flexibility to support evolving high-level services as well as
performance to deal with associated high packet rates.
Whereas, in the past, network products were based either
on GPPs or ASICs, favoring flexibility over speed or vice
versa, the network processor approach achieves both
flexibility and performance.
Current rapid developments in network protocols and
applications push the demands for routers and other
network devices far beyond doing destination address
lookups to determine the output port to which the packet
should be sent. Network devices must inspect deeper into
the packet to achieve content-based forwarding; perform
protocol termination and gateway functionality for server
offloading and load balancing; and require support for
higher-layer protocols. Traditional hardware design, in
which ASICs are used to perform the bulk of processing
load, is not suited for the complex operations required
and the new and evolving protocols that must be
processed. Offloading the entire packet processing to a
GPP, not designed for packet handling, causes additional
difficulties. Recently, field-programmable gate arrays
(FPGAs) have been used. They allow processing to be
offloaded to dedicated hardware without having to
undergo the expensive and lengthy design cycles
commonly associated with ASICs. While FPGAs are
now large enough to accommodate the gates needed
for handling simple protocols, multiple and complex
protocols are still out of reach. This is further intensified
by their relatively slow clock speeds and long on-chip routing delays, which rule out FPGAs for complex applications.

¹The term edge denotes the point at which traffic from multiple customer premises enters the service provider network to begin its journey toward the network core. Core devices aggregate and move traffic from many edge devices.
Typical network processors have a set of programmable
processors designed to efficiently execute an instruction
set specifically designed for packet processing and
forwarding. Overall performance is further enhanced with
the inclusion of specialized coprocessors (e.g., for table
lookup or checksum computation) and enhancements to
the data flow supporting necessary packet modifications.
However, not only is the instruction set customized for
packet processing and forwarding; the entire design
of the network processor, including execution
environment, memory, hardware accelerators, and
bus architecture, is optimized for high-performance
packet handling.
The key advantage of network processors is that
hardware-level performance is complemented by
horizontally layered software architecture. On the lowest
layer, the forwarding instruction set together with the
overall system architecture determines the programming
model. At that layer, compilation tools may help to
abstract some of the specifics of the hardware layer by
providing support for high-level programming language
syntax and packet handling libraries [1]. The interface to
the next layer is typically implemented by an interprocess-
communication protocol so that control path functionality
can be executed on a control point (CP), which provides
extended and high-level control functions through a
traditional GPP. With a defined application programming
interface (API) at this layer, a traditional software
engineering approach for the implementation of network
services can be followed. By providing an additional
software layer and API which spans more than one
network node, a highly programmable and flexible
network can be implemented. These layers are shown
in Figure 1 as hardware, software, and applications,
respectively, and are supported by tools and a reference
implementation.
Flexibility through ease of programmability at line speed
is demanded by continuing increases in the number of
approaches to networking [2–4]:
● Scalability for traffic engineering, quality of service (QoS), and the integration of wireless networks in a unified packet-based next-generation network requires traffic differentiation and aggregation. These functions are based on information in packet headers at various protocol layers. The higher up the protocol stack the information originates, the higher the semantic content, and the more challenging is the demand for flexibility and performance in the data path.
● The adoption of the Internet by businesses, governments, and other institutions has increased the importance of security functions (e.g., encryption, authentication, firewalling, and intrusion detection).
● Large investments in legacy networks have forced network providers to require a seamless migration strategy from existing circuit-switched networks to next-generation networks. Infrastructures must be capable of incremental modification of functionalities.
● Networking equipment should be easy to adapt to emerging standards, since the pace of the introduction of new standards is accelerating.
● Network equipment vendors see the need of service providers for flexible service differentiation and increased time-to-market pressure.
This paper provides an overview of the IBM PowerNP*
NP4GS3² network processor platform, containing the
components of Figure 1, and how it addresses those needs.
The specific hardware and software design characteristics
and the comprehensive base operating software of this
network processor make it a complete solution for a wide
range of applications. Because of its associated advanced
development and testing tools combined with extensive
software and reference implementations, rapid prototyping
and development of new high-performance applications
are significantly easier than with either GPPs or ASICs.
System architecture
From a system architecture viewpoint, network processors can be divided into two general models: the run-to-completion (RTC) and pipeline models, as shown in Figure 2.

²In this paper the abbreviated term PowerNP is used to designate the IBM PowerNP NP4GS3, which is a high-end member of the IBM network processor family.

Figure 1. Components of a network processor platform: network processor applications (e.g., virtual private network, load balancing, firewall); network processor software (e.g., management services, transport services, protocol services, traffic engineering services); network processor hardware (e.g., processors, coprocessors, flow control, packet alteration, classification); network processor tools (e.g., assembler, debugger, simulator); and a network processor reference implementation.
The RTC model provides a simple programming approach
which allows the programmer to see a single thread that
can access the entire instruction memory space and all
of the shared resources such as control memory, tables,
policers, and counters. The model is based on the
symmetric multiprocessor (SMP) architecture, in which
multiple CPUs share the same memory [5]. The CPUs
are used as a pool of processing resources, all executing
simultaneously, either processing data or in an idle mode
waiting for work. The PowerNP architecture is based on
the RTC model.
In the pipeline model, each pipeline CPU is optimized
to handle a certain category of tasks and instructions. The
application program is partitioned among pipeline stages
[6]. A weakness in the pipeline model is the necessity
of evenly distributing the work at each segment of the
pipeline. When the work is not properly distributed,
the flow of work through the pipeline is disrupted. For
example, if one segment is over-allocated, that segment
of the pipeline stalls preceding segments and starves
successive segments.
Even when processing is identical for every packet, the
code path must be partitioned according to the number of
pipeline stages required. Of course, code cannot always be
partitioned ideally, leading to unused processor cycles in
some pipeline stages. Additional processor cycles are
required to pass packet context from one stage to the
next. Perhaps a more significant challenge of a pipelined
programming model is in dealing with changes, since a
relatively minor code change may require a programmer
to start from scratch with code partitioning. The RTC
programming model avoids the problems associated
with pipelined designs by allowing the complete
functionality to reside within a single contiguous
program flow.
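
To make the contrast concrete, the toy C sketch below organizes the same three forwarding steps first as a run-to-completion worker and then as pipeline stages. It is a conceptual illustration only, not PowerNP picocode, and every name in it is invented.

```c
/* Conceptual contrast of the two models (invented names, not picocode). */

struct pkt_ctx { int placeholder; };   /* packet data plus per-packet state */

void parse(struct pkt_ctx *c);         /* stand-ins for the forwarding steps */
void lookup(struct pkt_ctx *c);
void modify_and_enqueue(struct pkt_ctx *c);

/* Run-to-completion: every thread executes the whole code path, so a code
 * change never forces the work to be re-partitioned across processors.    */
void rtc_worker(struct pkt_ctx *c)
{
    parse(c);
    lookup(c);
    modify_and_enqueue(c);
}

/* Pipeline: each stage runs on its own processor and must be kept balanced;
 * extra cycles go into passing the packet context from stage to stage.      */
void stage1(struct pkt_ctx *c) { parse(c);  /* ...then hand c to stage 2 */ }
void stage2(struct pkt_ctx *c) { lookup(c); /* ...then hand c to stage 3 */ }
void stage3(struct pkt_ctx *c) { modify_and_enqueue(c); }
```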
Figure 3 shows the high-level architecture of the
PowerNP, a high-end member of the IBM network processor family, which integrates medium-access controls (MACs), switch interface, processors, search engines, traffic management, and an embedded IBM PowerPC* processor which provides design flexibility for applications. The PowerNP has the following main components: embedded processor complex (EPC), data flow (DF),
scheduler, MACs, and coprocessors.
The EPC processors work with coprocessors to provide
high-performance execution of the application software
and the PowerNP-related management software. The
coprocessors provide hardware-assist functions for
performing common operations such as table searches and
packet alterations. To provide for additional processing
capabilities, there is an interface for attachment of
external coprocessors such as content-addressable
memories (CAMs). The DF serves as the primary data
path for receiving and transmitting network traffic. It provides an interface to multiple large data memories for buffering data traffic as it flows through the network processor. The scheduler enhances the QoS functions provided by the PowerNP. It allows traffic flows to be
scheduled individually per their assigned QoS class for
differentiated services. The MACs provide network
interfaces for Ethernet and packet over SONET (POS).
Figure 2. Network processor architectural models: (a) the run-to-completion (RTC) model, in which a dispatcher feeds a pool of processors sharing a common instruction memory, with a completion unit and an arbiter for the shared memories; (b) the pipeline model, in which processors are chained, each with its own instruction memory and local memory.
Figure 3. PowerNP high-level architecture: network interface ports and MACs, the data flow and traffic management (scheduler), the embedded processor complex with its picoprocessors and embedded PowerPC, on-chip coprocessors and an attachment for external coprocessors, and the switch fabric interface.

Functional blocks
Figure 4 shows the main functional blocks that make up
the PowerNP architecture. In the following sections we
discuss each functional block within the PowerNP.
Physical MAC multiplexer
The physical MAC multiplexer (PMM) moves data
between physical layer devices and the PowerNP. The
PMM interfaces with the external ports of the network
processor in the ingress PMM and egress PMM directions.
The PMM includes four data mover units (DMUs),
labeled A, B, C, and D. Each of the four DMUs can be
independently configured as an Ethernet MAC or a POS
interface. The PMM keeps a set of performance statistics
on a per-port basis in either mode. Each DMU moves
data at 1 Gb/s in both the ingress and the egress
directions. There is also an internal wrap link that
enables traffic generated by the egress side of the
PowerNP to move to the ingress side without going out
of the chip.
When a DMU is configured for Ethernet, it can support
either one port of 1 Gigabit Ethernet or ten ports of Fast
Ethernet (10/100 Mb/s). To support 1 Gigabit Ethernet,
a DMU can be configured as either a gigabit media-
independent interface (GMII) or a ten-bit interface (TBI).
To support Fast Ethernet, a DMU can be configured as a
serial media-independent interface (SMII) supporting ten
Ethernet ports. Operation at 10 or 100 Mb/s is determined
by the PowerNP independently for each port.
When a DMU is configured for POS mode, it can support both clear-channel and channelized optical carrier (OC) interfaces. A DMU supports the following types and speeds of POS framers: OC-3c, OC-12, OC-12c, OC-48, and OC-48c.³

Figure 4. PowerNP functional block diagram: the ingress and egress PMMs (DMUs A–D); the ingress data store (internal) and egress data store (external) with their EDS queue interfaces and DS interfaces/arbiters; the ingress and egress DFs (EDS); the egress scheduler; the ingress and egress SWIs with their SDMs and DASL links A and B; the internal wrap paths; and the EPC (DPPUs with TSEs, hardware classifier, dispatch unit, completion unit, control store arbiter, internal instruction memory, LuDefTable, CompTable, free queues, policy manager, counter manager, semaphore manager, CAB arbiter, interrupts and timers, debug and single-step control, mailbox and PCI macros, and the embedded PowerPC 405), together with internal and external memories.
To provide an OC-48 link, all four DMUs
are attached to a single framer, with each DMU providing
four OC-3c channels or one OC-12c channel to the
framer. To provide an OC-48 clear channel (OC-48c) link,
DMU A is configured to attach to a 32-bit framer and the
other three DMUs are disabled, providing only interface
pins for the data path.
Switch interface
The switch interface (SWI) supports two high-speed data-
aligned synchronous link (DASL)⁴ interfaces, labeled A
and B, supporting standalone operation (wrap), dual-mode
operation (two PowerNPs interconnected), or connection
to an external switch fabric. Each DASL link provides up
to 4 Gb/s of bandwidth. The DASL links A and B can be
used in parallel, with one acting as the primary switch
interface and the other as an alternate switch interface
for increased system availability. The DASL interface
is frequency-synchronous, which removes the need for
asynchronous interfaces that introduce additional interface
latency. The ingress SWI side sends data to the switch
fabric, and the egress SWI side receives data from the
switch fabric. The DASL interface enables up to 64
network processors to be interconnected using an external
switch fabric.
The ingress switch data mover (SDM) is the logical
interface between the ingress enqueuer/dequeuer/
scheduler (EDS) packet data flow, also designated as the ingress DF, and the switch fabric cell data flow. The
ingress SDM segments the packets into 64-byte switch
cells and passes the cells to the ingress SWI. The egress
SDM is the logical interface between the switch fabric cell
data flow and the packet data flow of the egress EDS, also
designated as the egress DF. The egress DF reassembles
the switch fabric cells back into packets. There is also an
internal wrap link which enables traffic generated by the
ingress side of the PowerNP to move to the egress side
without going out of the chip.
Data flow and traffic management
The ingress DF interfaces with the ingress PMM, the
EPC, and the SWI. Packets that have been received on
the ingress PMM are passed to the ingress DF. The
ingress DF collects the packet data in its internal data
store (DS) memory. When it has received sufficient data
(i.e., the packet header), the ingress DF enqueues the
data to the EPC for processing. Once the EPC processes
the packet, it provides forwarding and QoS information to
the ingress DF. The ingress DF then invokes a hardware-
configured flow-control mechanism and then either
discards the packet or places it in a queue to await
transmission. The ingress DF schedules all packets that
cross the ingress SWI. After it selects a packet, the ingress
DF passes the packet to the ingress SWI.
The ingress DF invokes flow control when packet data
enters the network processor. When the ingress DS is
sufficiently congested, the flow-control actions discard packets. The traffic-management software uses the
information about the congestion state of the DF, the rate
at which packets arrive, the current status of the DS, and
the current status of target blades to compute transmit
probabilities for various flows. The ingress DF has hardware-assisted flow control which uses the software-
computed transmit probabilities along with tail drop
congestion indicators to determine whether a forwarding
or discard action should be taken.
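
The per-packet decision described in this paragraph can be sketched as follows, assuming a fixed-point transmit probability that software recomputes periodically and a hard tail-drop indicator derived from data store occupancy. The structure and names are illustrative assumptions, not the actual hardware mechanism.

```c
#include <stdint.h>
#include <stdlib.h>

/* Transmit probability computed periodically by the traffic-management
 * software, scaled so that 65536 represents 1.0.                        */
struct flow_state {
    uint32_t transmit_prob;
};

int ds_tail_drop_asserted(void);       /* stand-in: data store above its hard threshold? */

/* Return 1 to forward the packet, 0 to discard it. */
int flow_control_admit(const struct flow_state *fs)
{
    if (ds_tail_drop_asserted())
        return 0;                       /* hard limit: always discard                  */

    uint32_t r = (uint32_t)rand() & 0xFFFFu;
    return r < fs->transmit_prob;       /* probabilistic early discard (RED-like idea) */
}
```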
The egress DF interfaces with the egress SWI, the EPC,
and the egress PMM. Packets that have been received on
the egress SWI are passed to the egress DF. The egress
DF collects the packet data in its external DS memory.
The egress DF enqueues the packet to the EPC for
processing. Once the EPC processes the packet, it
provides forwarding and QoS information to the egress
DF. The egress DF then enqueues the packet either to the
egress scheduler, when enabled, or to a target port queue
for transmission to the egress PMM. The egress DF
invokes a hardware-assisted flow-control mechanism, like
the ingress DF, when packet data enters the network
processor. When the egress DS is sufficiently congested, the flow-control actions discard packets.
The egress scheduler provides traffic-shaping functions
for the network processor on the egress side. It addresses
functions that enable QoS mechanisms required by
applications such as the Internet protocol (IP)-differentiated
services (DiffServ), multiprotocol label switching (MPLS),
traffic engineering, and virtual private networks (VPNs).
The scheduler manages bandwidth on a per-packet basis
by determining the bandwidth required by a packet (i.e.,
the number of bytes to be transmitted) and comparing this
against the bandwidth permitted by the configuration of the packet flow queue. The bandwidth used by a first packet determines when the scheduler will permit the transmission of a subsequent packet of a flow queue. The scheduler supports traffic shaping for 2K flow queues.
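
The behavior described here resembles rate-based shaping in which the size of the packet just sent, divided by the flow queue's configured rate, sets the earliest departure time of that queue's next packet. The C sketch below illustrates the idea under invented names and units; it is not the PowerNP scheduler implementation.

```c
#include <stdint.h>

/* Illustrative per-flow-queue shaping state. */
struct flow_queue {
    uint64_t rate_bytes_per_sec;   /* configured bandwidth for this flow queue (nonzero) */
    uint64_t next_eligible_ns;     /* earliest departure time of the queue's next packet */
};

/* Called when a packet of 'len' bytes from this queue is sent at time 'now_ns'. */
void shape_after_send(struct flow_queue *q, uint32_t len, uint64_t now_ns)
{
    uint64_t start      = (now_ns > q->next_eligible_ns) ? now_ns : q->next_eligible_ns;
    uint64_t service_ns = ((uint64_t)len * 1000000000ull) / q->rate_bytes_per_sec;

    q->next_eligible_ns = start + service_ns;  /* the scheduler skips this queue until then */
}
```

A complete scheduler must also arbitrate among all queues that are currently eligible; that selection step is omitted here.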
Embedded processor complex
The embedded processor complex (EPC) performs all
processing functions for the PowerNP. It provides and
controls the programmability of the network processor.
In general, the EPC accepts data for processing from
both the ingress and egress DFs. The EPC, under [...]
³The transmission rate of OC-n is n × 51.84 Mb/s. For example, OC-12 runs at 622.08 Mb/s.
⁴Other switch interfaces, such as CSIX, can currently be supported via an interposer chip. On-chip support for CSIX will be provided in a future version of the network processor.

Citations
Patent
23 Jul 2004
TL;DR: In this paper, a single chip protocol converter integrated circuit (IC) capable of receiving packets generating according to a first protocol type and processing said packets to implement protocol conversion and generating converted packets of a second protocol type for output thereof, the process of protocol conversion being performed entirely within the single integrated circuit chip.
Abstract: A single chip protocol converter integrated circuit (IC) capable of receiving packets generating according to a first protocol type and processing said packets to implement protocol conversion and generating converted packets of a second protocol type for output thereof, the process of protocol conversion being performed entirely within the single integrated circuit chip. The single chip protocol converter can be further implemented as a macro core in a system-on-chip (SoC) implementation, wherein the process of protocol conversion is contained within a SoC protocol conversion macro core without requiring the processing resources of a host system. Packet conversion may additionally entail converting packets generated according to a first protocol version level and processing the said packets to implement protocol conversion for generating converted packets according to a second protocol version level, but within the same protocol family type. The single chip protocol converter integrated circuit and SoC protocol conversion macro implementation include multiprocessing capability including processor devices that are configurable to adapt and modify the operating functionality of the chip.

173 citations

Patent
02 Mar 2004
TL;DR: In this paper, a system and method for secure data transfer over a network is described, which includes a processor, having logic configured to retrieve a portion of the data from the memory using the memory controller.
Abstract: A system and method are described for secure data transfer over a network. According to an exemplary embodiment a system for secure data transfer over a network includes memory and a memory controller configured to transfer data received from the network to the memory. The system includes a processor, having logic configured to retrieve a portion of the data from the memory using the memory controller. The processor also includes logic configured to perform security operations on the retrieved portion of the data, and logic configured to store the operated-on portion of the data in the memory using the memory controller. The memory controller is further configured to transfer the operated-on portion of the data from the memory to the network.

94 citations

Patent
28 Oct 2003
TL;DR: In this paper, a storage processing device with an input/output module and a control module is described, where the control module routes processed control path network traffic to the switch for routing to a defined egress port processor.
Abstract: A system including a storage processing device with an input/output module. The input/output module has port processors to receive and transmit network traffic. The input/output module also has a switch connecting the port processors. Each port processor categorizes the network traffic as fast path network traffic or control path network traffic. The switch routes fast path network traffic from an ingress port processor to a specified egress port processor. The storage processing device also includes a control module to process the control path network traffic received from the ingress port processor. The control module routes processed control path network traffic to the switch for routing to a defined egress port processor. The control module is connected to the input/output module. The input/output module and the control module are configured to interactively support data virtualization, data migration, data journaling, and snapshotting. The distributed control and fast path processors achieve scaling of storage network software. The storage processors provide line-speed processing of storage data using a rich set of storage-optimized hardware acceleration engines. The multi-protocol switching fabric provides a low-latency, protocol-neutral interconnect that integrally links all components with any-to-any non-blocking throughput.

62 citations

Patent
30 Jan 2004
TL;DR: In this article, the protocol conversion is performed entirely within the SoC macro core of a system-on-chip (SoC) macro core and does not require the processing resources of a host system.
Abstract: A network processor includes a system-onchip (SoC) macro core and functions as a single chip protocol converter that receives packets generating according to a first protocol type and processes the packets to implement protocol conversion and generates converted packets of a second protocol type for output thereof, the process of protocol conversion being performed entirely within the SoC macro core. The process of protocol conversion contained within the SoC macro core does not require the processing resources of a host system. The system-on chip macro core includes a bridge device for coupling a local bus in the protocol converting multiprocessor SoC macro core local bus to peripheral interfaces coupled to a system bus.

62 citations

Book ChapterDOI
04 Apr 2005
TL;DR: A new approach in which a high-level program is separated from its partitioning into concurrent tasks, and an AMS script that partitions it into a form capable of running at 3Gb/s on an Intel IXP2400 Network Processor is presented.
Abstract: Network processors (NPs) typically contain multiple concurrent processing cores. State-of-the-art programming techniques for NPs are invariably low-level, requiring programmers to partition code into concurrent tasks early in the design process. This results in programs that are hard to maintain and hard to port to alternative architectures. This paper presents a new approach in which a high-level program is separated from its partitioning into concurrent tasks. Designers write their programs in a high-level, domain-specific, architecturally-neutral language, but also provide a separate Architecture Mapping Script (AMS). An AMS specifies semantics-preserving transformations that are applied to the program to re-arrange it into a set of tasks appropriate for execution on a particular target architecture. We (i) describe three such transformations: pipeline introduction, pipeline elimination and queue multiplexing; and (ii) specify when each can be safely applied. As a case study we describe an IP packet-forwarder and present an AMS script that partitions it into a form capable of running at 3Gb/s on an Intel IXP2400 Network Processor.

59 citations


Cites methods from "IBM PowerNP network processor: Hard..."

  • ...The parallel hardware architectures we target are Network Processors (NPs) [ 1 ,6,10,23]: specialised programmable chips designed for high-speed packet processing....

    [...]

References
01 Oct 1996
TL;DR: This document specifies protocol enhancements that allow transparent routing of IP datagrams to mobile nodes in the Internet.
Abstract: This document specifies protocol enhancements that allow transparent routing of IP datagrams to mobile nodes in the Internet. Each mobile node is always identified by its home address, regardless of its current point of attachment to the Internet. While situated away from its home, a mobile node is also associated with a care-of address, which provides information about its current point of attachment to the Internet. The protocol provides for registering the care-of address with a home agent. The home agent sends datagrams destined for the mobile node through a tunnel to the care- of address. After arriving at the end of the tunnel, each datagram is then delivered to the mobile node.

2,094 citations


"IBM PowerNP network processor: Hard..." refers background in this paper

  • ...The routing challenges introduced by the mobility of terminals are supported by tunneling packets between the GSN of a wireless network provider, similarly to the mobile IP scheme [11]....

    [...]

Journal ArticleDOI
TL;DR: This tutorial describes algorithms that are representative of each category of basic search algorithms, and discusses which type of algorithm might be suitable for different applications.
Abstract: The process of categorizing packets into "flows" in an Internet router is called packet classification. All packets belonging to the same flow obey a predefined rule and are processed in a similar manner by the router. For example, all packets with the same source and destination IP addresses may be defined to form a flow. Packet classification is needed for non-best-effort services, such as firewalls and quality of service; services that require the capability to distinguish and isolate traffic in different flows for suitable processing. In general, packet classification on multiple fields is a difficult problem. Hence, researchers have proposed a variety of algorithms which, broadly speaking, can be categorized as basic search algorithms, geometric algorithms, heuristic algorithms, or hardware-specific search algorithms. In this tutorial we describe algorithms that are representative of each category, and discuss which type of algorithm might be suitable for different applications.

774 citations

01 Sep 1999
TL;DR: This document defines a Two Rate Three Color Marker (trTCM), which can be used as a component in a Diffserv traffic conditioner, and meters an IP packet stream and marks its packets based on two rates, Peak Information Rate (PIR) and Committed Information rate (CIR).
Abstract: This document defines a Two Rate Three Color Marker (trTCM), which can be used as a component in a Diffserv traffic conditioner [RFC2475, RFC2474]. The trTCM meters an IP packet stream and marks its packets based on two rates, Peak Information Rate (PIR) and Committed Information Rate (CIR), and their associated burst sizes to be either green, yellow, or red. A packet is marked red if it exceeds the PIR. Otherwise it is marked either yellow or green depending on whether it exceeds or doesn't exceed the CIR.

358 citations

01 Sep 1999
TL;DR: This document defines a Single Rate Three Color Marker (srTCM), which can be used as component in a Diffserv traffic conditioner [RFC2475, RFC2474].
Abstract: This document defines a Single Rate Three Color Marker (srTCM), which can be used as component in a Diffserv traffic conditioner [RFC2475, RFC2474]. The srTCM meters a traffic stream and marks its packets according to three traffic parameters, Committed Information Rate (CIR), Committed Burst Size (CBS), and Excess Burst Size (EBS), to be either green, yellow, or red. A packet is marked green if it doesn't exceed the CBS, yellow if it does exceed the CBS, but not the EBS, and red otherwise.

270 citations

Proceedings ArticleDOI
21 Oct 2001
TL;DR: It is shown it is possible to combine an IXP1200 development board and a PC to build an inexpensive router that forwards minimum-sized packets at a rate of 3.47Mpps, nearly an order of magnitude faster than existing pure PC-based routers, and sufficient to support 1.77Gbps of aggregate link bandwidth.
Abstract: Recent efforts to add new services to the Internet have increased interest in software-based routers that are easy to extend and evolve. This paper describes our experiences using emerging network processors---in particular, the Intel IXP1200---to implement a router. We show it is possible to combine an IXP1200 development board and a PC to build an inexpensive router that forwards minimum-sized packets at a rate of 3.47Mpps. This is nearly an order of magnitude faster than existing pure PC-based routers, and sufficient to support 1.77Gbps of aggregate link bandwidth. At lesser aggregate line speeds, our design also allows the excess resources available on the IXP1200 to be used robustly for extra packet processing. For example, with 8 × 100Mbps links, 240 register operations and 96 bytes of state storage are available for each 64-byte packet. Using a hierarchical architecture we can guarantee line-speed forwarding rates for simple packets with the IXP1200, and still have extra capacity to process exceptional packets with the Pentium. Up to 310Kpps of the traffic can be routed through the Pentium to receive 1510 cycles of extra per-packet processing.

251 citations

Frequently Asked Questions (16)
Q1. What are the contributions in this paper?

Deep packet processing is migrating to the edges of service provider networks to simplify and speed up core functions. This paper provides an overview of the IBM PowerNP NP4GS3 network processor and how it addresses these issues. 

Scalability for traffic engineering, quality of service (QoS), and the integration of wireless networks in a unified packet-based next-generation network requires traffic differentiation and aggregation. 

To sustain media speed with 48-byte packets, 6.1 million packets per second, the egress DS must run with a 10-clock cycle data store access window. 

The PowerNP has the following main components: embedded processor complex (EPC), data flow (DF), scheduler, MACs, and coprocessors. 

The control store arbiter (CSA) controls access to the control store (CS), which allocates memory bandwidth among the threads of all DPPUs. 

Consistency in the assignment and verification of GTP sequence numbers, and in operations on the reordering queues, is ensured by using the semaphore coprocessor. 

Although there are 32 independent threads, each CLP can execute the instructions of only one of its threads at a time, so at any instant up to 16 threads are executing simultaneously. 

The ingress and egress DS interface and arbiters are for controlling accesses to the DS, since only one thread at a time can access either DS. 

To support 1 Gigabit Ethernet, a DMU can be configured as either a gigabit media-independent interface (GMII) or a ten-bit interface (TBI). 

Because of the availability of associated advanced development and simulation tools, combined with extensive reference implementations, rapid prototyping and development of new high-performance applications are significantly easier than with either GPPs or ASICs. 

The software architecture and programming model describes the data plane functions and APIs, the control plane functions and APIs, and the communication model between these components. 

Five types of threads are supported; one is the general data handler (GDH): seven DPPUs contain the GDH threads, for a total of 28 GDH threads. 

In this paper the abbreviated term PowerNP is used to designate the IBM PowerNP NP4GS3, which is a high-end member of the IBM network processor family. 

It generates files used to execute picocode on the chip-level simulation model or the PowerNP, as well as files that picocode programmers can use for debugging. 

The lookup definition table (LuDefTable), an internal memory structure that contains 128 entries to define 128 trees, is the main structure that manages the CS. 

Perhaps a more significant challenge of a pipelined programming model is in dealing with changes, since a relatively minor code change may require a programmer to start from scratch with code partitioning.