Proceedings ArticleDOI

Scalable high speed IP routing lookups

01 Oct 1997 - Vol. 27, Iss. 4, pp. 25-36
TL;DR: This paper describes a new algorithm for best matching prefix using binary search on hash tables organized by prefix lengths that scales very well as address and routing table sizes increase and introduces Mutating Binary Search and other optimizations that considerably reduce the average number of hashes to less than 2.
Abstract: Internet address lookup is a challenging problem because of increasing routing table sizes, increased traffic, higher speed links, and the migration to 128 bit IPv6 addresses. IP routing lookup requires computing the best matching prefix, for which standard solutions like hashing were believed to be inapplicable. The best existing solution we know of, BSD radix tries, scales badly as IP moves to 128 bit addresses. Our paper describes a new algorithm for best matching prefix using binary search on hash tables organized by prefix lengths. Our scheme scales very well as address and routing table sizes increase: independent of the table size, it requires a worst case time of log2(address bits) hash lookups. Thus only 5 hash lookups are needed for IPv4 and 7 for IPv6. We also introduce Mutating Binary Search and other optimizations that, for a typical IPv4 backbone router with over 33,000 entries, considerably reduce the average number of hashes to less than 2, of which one hash can be simplified to an indexed array access. We expect similar average case behavior for IPv6.

Summary (5 min read)

1 Introduction

  • The Internet is becoming ubiquitous: everyone wants to join in.
  • The increasing traffic demand requires three key factors to keep pace if the Internet is to continue to provide good service: link speeds, router data throughput, and packet forwarding rates.
  • In their paper, the authors distinguish between routing (a process that computes a database mapping destination networks to output links) and forwarding (a process by which a routing database is consulted to decide which output link a single packet should be forwarded on).
  • Instead of having multiple routing entries for each subnet in the large network, just two entries are needed: one for the big network, and a more specific one for the small subnet (which has preference, if both should match).
  • Thus, for the current Internet protocol suite (IPv4) with 32 bit addresses, the authors need at most 5 hash lookups.
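The 5 and 7 lookup figures quoted above are simply the ceiling of log2 of the address width; as a quick arithmetic check (ours, not the paper's):

\lceil \log_2 32 \rceil = 5 \quad \text{(IPv4)}, \qquad \lceil \log_2 128 \rceil = 7 \quad \text{(IPv6)}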

2 Existing Approaches to IP Lookup

  • The authors discuss approaches based on modifying exact matching schemes, trie based schemes, hardware solutions based on parallelism, proposals for protocol changes to simplify IP lookup, and caching solutions.
  • For the rest of this paper, the authors use BMP as a shorthand for Best Matching Prefix.

Modifications of Exact Matching Schemes

  • Classical fast lookup techniques such as hashing and binary search do not directly apply to the best matching prefix (BMP) problem since they only do exact matches.
  • This method requires log2(2N) steps, with N being the number of routing table entries.
  • With current routing table sizes, the worst case would be 17 data lookups, each requiring at least one costly memory access (a worked calculation follows this list).
  • A second classical solution would be to reapply any exact match scheme for each possible prefix length [Skl93] .
  • This is even more expensive, requiring W iterations of the exact match scheme used (e.g. W = 128 for IPv6).
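For the roughly 33,000-entry backbone table used later in the paper, the binary-search bound quoted above works out as follows (our arithmetic, taking N = 33,000):

\lceil \log_2 (2N) \rceil = \lceil \log_2 66000 \rceil = 17 \ \text{data lookups}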

Trie Based Schemes

  • The most commonly available IP lookup implementation is found in the BSD kernel, and is a radix trie implementation [Skl93] .
  • Current implementations have made a number of improvements on Sklower's original implementation.
  • The worst case was improved to O(W) by requiring that the prefix be contiguous (previously non-contiguous masks were allowed, a feature which was never used).
  • The implementation requires up to 32 or 128 costly memory accesses (for IPv4 or IPv6, respectively).
  • Tries also can have large storage requirements.

Hardware Solutions

  • Hardware solutions can potentially use parallelism to gain lookup speed.
  • Large CAMs are usually slower and much more expensive than ordinary memory.
  • In their basic form, both systems potentially require the boundary routers between autonomous systems (e.g., between a company and its ISP or between ISPs) to perform the full forwarding decision again, because of trust issues, scarce resources, or different views of the network.
  • Thus while both tag switching and IP switching can provide good performance within a level of hierarchy, neither solution currently does well at hierarchy boundaries without scaling problems.
  • For years, designers of fast routers have resorted to caching to claim high speed IP lookups.

Summary

  • In summary, all existing schemes have problems of either performance, scalability, generality, or cost.
  • The authors now describe a scheme that has good performance, excellent scalability, and does not require protocol changes.
  • The authors' scheme also allows a cheap, fast software implementation, and also a more expensive (but faster) hardware implementation.

3 Basic Binary Search Scheme

  • The authors' basic algorithm is based on three significant ideas, of which only the first has been reported before.
  • Rather than present the final solution directly, the authors will gradually refine these ideas in Section 3.1, Section 3.2, and Section 3.5 to arrive at a working basic scheme.
  • The authors describe further optimizations to the basic scheme in the next section.

3.1 Linear Search of Hash Tables

  • The authors' point of departure is a simple scheme that does linear search of hash tables organized by prefix lengths.
  • The authors will improve this scheme shortly to do binary search on the hash tables.
  • The idea is to look for all prefixes of a certain length L using hashing and use multiple hashes to find the best matching prefix, starting with the largest value of L and working backwards.

3.2 Binary Search of Hash Tables

  • The previous scheme essentially does (in the worst case) linear search among all distinct string lengths.
  • Linear search requires O(W) expected time (more precisely, O(Wdist), where Wdist ≤ W is the number of distinct lengths in the database).
  • Markers are needed to direct binary search to look for matching prefixes of greater length.
  • The authors will use upper half to mean the half of the trie with prefix lengths strictly less than the median length.
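A tiny illustration of why markers are needed (our own sketch, using the paper's example prefixes P1 = 0, P2 = 00, P3 = 111): without a marker, the probe at the middle length gives no hint that a longer matching prefix exists.

    tables = {1: {"0"}, 2: {"00"}, 3: {"111"}}   # one hash table per prefix length

    addr = "111"
    print(addr[:2] in tables[2])   # False: the middle-length probe misses, so binary
                                   # search would wrongly stop looking at longer lengths

    tables[2].add("11")            # marker for P3 = 111, placed at the middle length
    print(addr[:2] in tables[2])   # True: the marker directs the search to length 3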

3.3 Reducing Marker Storage

  • The following definitions are useful before proceeding.
  • In the typical case, many prefixes will share markers (Table 1 ), reducing the marker storage further.
  • (Consider N prefixes whose first log 2 N bits are all distinct and whose remaining bits are all 1's).
  • Unfortunately, this algorithm is not correct as it stands and does not take logarithmic time if implemented naively.
  • In case of failure, the authors would have to modify the binary search (for correctness) to backtrack and search the upper half of R again.

3.5 Precomputation to Avoid Backtracking

  • Suppose every marker node M is a record that contains a variable M.bmp, which is the value of the best matching prefix of the marker M. M.bmp can be precomputed when the marker M is inserted into its hash table.
  • Now, when the authors find M at the mid point of R, they indeed search the lower half, but they also remember the value of M.bmp as the current best matching prefix.
  • The standard invariant for binary search when searching for key K is: "K is in range R".
  • Finally, the invariant implies the correct result when the range shrinks to 1.
  • Thus the algorithm works correctly; also since it has no backtracking, it takes O(log2 Wdist) time.

4 Refinements to Basic Scheme

  • The basic scheme described in Section 3 takes just 7 hash computations, in the worst case, for 128 bit IPv6 addresses.
  • Each hash computation takes at least one access to memory; at gigabit speeds each memory access is significant.
  • Thus, in this section, the authors explore a series of optimizations that exploit the deeper structure inherent in the problem to reduce the average number of hash computations.
  • The authors' main optimization, mutating binary search, is described in the next section.

4.2 Mutating Binary Search

  • Mutating Binary Search example: doing basic binary search for an IPv4 address whose BMP has length 21 requires checking the prefix lengths 16 (hit), 24 (miss), 20 (hit), 22 (miss), and finally 21 (see Figure 11 in the paper).
  • Each binary tree has the root level (i.e., the first length to be searched) at the left; the upper child of each binary tree node is the length to be searched on failure, and whenever there is a match, the search switches to the more specific tree.
  • Two possible disadvantages of mutating binary search immediately present themselves.
  • First, precomputing optimal trees can increase the time to insert a new prefix.
  • The starting Rope corresponds to the default binary search tree.
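The rope idea in the bullets above can be captured in a few lines. The sketch below is our own illustration, not the authors' code: the `tables`/`rope` entry layout is an assumption, and building the ropes themselves (Section 5.2) is not shown. A rope is simply the precomputed sequence of prefix lengths to probe next; on every hit the search adopts the hit entry's (more specific) rope and remembers its bmp.

    def rope_search(addr, tables, initial_rope):
        """Follow a rope: a precomputed list of prefix lengths to probe in order.
        A miss keeps following the current rope; a hit records the entry's bmp and
        switches to the entry's own rope, which covers only longer, more specific
        prefixes that extend the matched bits."""
        rope, bmp = list(initial_rope), None
        while rope:
            length = rope.pop(0)
            entry = tables.get(length, {}).get(addr[:length])
            if entry is not None:
                bmp = entry["bmp"]
                rope = list(entry["rope"])
        return bmp

    # Toy database with prefixes 1, 00 and 111 (marker 11 carries bmp = 1):
    tables = {
        1: {"1":   {"bmp": "1",   "rope": []}},
        2: {"00":  {"bmp": "00",  "rope": []},
            "11":  {"bmp": "1",   "rope": [3]}},
        3: {"111": {"bmp": "111", "rope": []}},
    }
    default_rope = [2, 1]                             # probe the median length, then shorter
    print(rope_search("110", tables, default_rope))   # -> 1
    print(rope_search("111", tables, default_rope))   # -> 111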

4.3 Using Arrays

  • In cases where program complexity and memory use can be traded for speed, it may be desirable to replace the first hash table lookup with a simple indexed array lookup, where the index is formed from the first w0 bits of the address and w0 is the prefix length at which the search would otherwise start (see the sketch after this list).
  • Each array entry for index i will contain the bmp of i as well as a Rope which will guide binary search among all prefixes that begin with i.
  • An initial array lookup is not only faster than a hash lookup, but also reduces the average number of remaining hash lookups (to around 0.5 for the data sets the authors have examined).
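A minimal sketch of the array-first idea (ours, not the paper's code; `first_probe` and the slot layout are illustrative assumptions): each of the 2^w0 slots stores the bmp for that w0-bit value and the rope that guides any remaining binary search.

    def first_probe(addr, first_array, w0):
        """Index the precomputed array with the first w0 address bits; each slot holds
        the bmp for those bits and the rope for the rest of the search."""
        slot = first_array[int(addr[:w0], 2)]
        return slot["bmp"], slot["rope"]

    # Toy example with w0 = 2 and the prefix set {0, 01, 1, 111}:
    first_array = [
        {"bmp": "0",  "rope": []},    # slot 00: best match 0*, nothing longer to probe
        {"bmp": "01", "rope": []},    # slot 01: best match 01*
        {"bmp": "1",  "rope": []},    # slot 10: best match 1*
        {"bmp": "1",  "rope": [3]},   # slot 11: 1* matches; length 3 remains (111*)
    ]
    print(first_probe("1110", first_array, 2))   # -> ('1', [3]); probing length 3 then finds 111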

4.4 Hardware Implementations

  • The inner component, most likely done as a hash table in software implementations, can be implemented using hashing hardware such as described in [Dig95] .
  • The outer loop in the Rope scheme can be implemented as a shift register.
  • Using multiple shift registers, it is possible to pipeline the searches, resulting in one completed routing lookup per hash lookup time.

5 Implementation

  • Besides hashing and binary search, a predominant idea in this paper is precomputation.
  • Every hash table entry has an associated bmp field and a Rope field, both of which are precomputed.
  • Precomputation allows fast search but requires more complex Insertion routines.
  • As mentioned earlier, while routes to prefixes may change frequently, the addition of a new prefix (the expensive case) is much rarer.
  • Thus it is worth paying a penalty for Insertion in return for improved search speed.

5.2 Rope Search from Scratch

  • Building a Rope Search data structure balanced for optimal search speed is more complex, since every possible binary search path needs to be optimized.
  • Thus the authors have two passes: Pass 1 builds a conventional trie.
  • Inserting from shortest to longest prefix has the nice property that all BMPs for the newly inserted markers are identical and thus only need to be calculated once.
  • For typical IPv4 forwarding tables, about half of this maximum number is being used.

5.3 Insertions and Deletions

  • Adding and removing single entries from the tree can also be done, but since no rebalancing occurs, the performance of the lookups might slowly degrade over time.
  • Adding or deleting a single prefix can change the bmp values of a large number of markers, and thus insertion is potentially expensive in the worst case.
  • Such solutions will have adequate throughput (because whenever the build process falls behind, the authors will batch more efficiently), but have poor latency.
  • The authors are working on fast incremental insertion and deletion algorithms, but they do not describe them here for want of space.

6.2 Measurements for IPv4

  • So far the authors have described how long their algorithm takes (in the average or worst case) in terms of the number of hash computations required.
  • It remains to quantify the time taken for a computation on an arbitrary prefix length using software.
  • The forwarding table was the same 33,000 entry forwarding table [Mer96] used before.
  • For the basic scheme, memory usage is close to 1.2 MBytes; the primary data structures (the most commonly accessed hash tables, for lengths 8, 16, and 24) fit mostly into the second-level cache, so the first two steps (which is the average number needed) are very likely to be found in the cache.
  • Later steps, which are seldom needed, will be noticeably slower.

Rope Search starting with Array Lookup

  • This array fully fits into the cache, leaving ample space for the hash tables.
  • The array lookup is much quicker, and fewer total lookups are needed than for the Rope scheme.

6.3 Projections for IP Version 6

  • IPv6 address assignment principles have not been finally decided upon.
  • All these schemes help to reduce routing information.
  • Another new feature of IPv6, Anycast addresses [HD96, DH96] , may (depending on how popular they will become) add a very large number of host routes and other routes with very long prefixes.
  • Depending on the actual data, this may still be a win.
  • All other optimizations are expected to yield similar improvements.


Scalable High Speed IP Routing Lookups

Marcel Waldvogel†, George Varghese‡, Jon Turner‡, Bernhard Plattner†

† Computer Engineering and Networks Laboratory, ETH Zürich, Switzerland
{waldvogel,plattner}@tik.ee.ethz.ch

‡ Computer and Communications Research Center, Washington University in St. Louis, USA
{varghese,jst}@ccrc.wustl.edu
Abstract

Internet address lookup is a challenging problem because of increasing routing table sizes, increased traffic, higher speed links, and the migration to 128 bit IPv6 addresses. IP routing lookup requires computing the best matching prefix, for which standard solutions like hashing were believed to be inapplicable. The best existing solution we know of, BSD radix tries, scales badly as IP moves to 128 bit addresses. Our paper describes a new algorithm for best matching prefix using binary search on hash tables organized by prefix lengths. Our scheme scales very well as address and routing table sizes increase: independent of the table size, it requires a worst case time of log2(address bits) hash lookups. Thus only 5 hash lookups are needed for IPv4 and 7 for IPv6. We also introduce Mutating Binary Search and other optimizations that, for a typical IPv4 backbone router with over 33,000 entries, considerably reduce the average number of hashes to less than 2, of which one hash can be simplified to an indexed array access. We expect similar average case behavior for IPv6.
1 Introduction

The Internet is becoming ubiquitous: everyone wants to join in. Since the advent of the World Wide Web, the number of users, hosts, domains, and networks connected to the Internet seems to be exploding. Not surprisingly, network traffic is doubling every few months. The proliferation of multimedia networking applications and devices is expected to give traffic another major boost.

The increasing traffic demand requires three key factors to keep pace if the Internet is to continue to provide good service: link speeds, router data throughput, and packet forwarding rates.¹ Readily available solutions exist for the first two factors: for example, fiber-optic cables can provide faster links,² and switching technology can be used to move packets from the input interface of a router to the corresponding output interface at gigabit speeds.

¹ In our paper, we distinguish between routing (a process that computes a database mapping destination networks to output links) and forwarding (a process by which a routing database is consulted to decide which output link a single packet should be forwarded on). Route computation is less time critical than forwarding because forwarding is done for each packet, while route computation needs to be done only when the topology changes.

² For example, MCI is currently upgrading its lines from 45 Mbits/s to 155 Mbits/s; they plan to switch to 622 Mbits/s within a year.
Our paper deals with the third factor, packet forwarding, for which current techniques perform poorly as network speeds increase.

The major step in packet forwarding is to look up the destination address (of an incoming packet) in the routing database. While there are other chores, such as updating TTL fields, these are computationally inexpensive compared to the major task of address lookup. Data link bridges have been doing address lookups at 100 Mbps [Dig95] for many years. However, bridges only do exact matching on the destination (MAC) address, while Internet routers have to search their database for the longest prefix matching a destination IP address. Thus standard techniques for exact matching, such as perfect hashing, binary search, and standard Content Addressable Memories (CAMs), cannot directly be used for Internet address lookups.

Prefix matching was introduced in the early 1990s, when it was foreseen that the number of endpoints and the amount of routing information would grow enormously. The address classes A, B, and C (allowing sites to have 24, 16, and 8 bits respectively for addressing) proved too inflexible and wasteful of the address space. To make better use of this scarce resource, especially the class B addresses, bundles of class C networks were given out instead of class B addresses. This resulted in massive growth of routing table entries. So, in turn, Classless Inter-Domain Routing (CIDR) [F+93] was deployed, to allow for arbitrary aggregation of networks to reduce routing table entries.
To reduce routing table space, aggregation is done aggressively. Suppose all the subnets in a big network have identical routing information except for a single, small subnet that has different information. Instead of having multiple routing entries for each subnet in the large network, just two entries are needed: one for the big network, and a more specific one for the small subnet (which has preference, if both should match). This results in better usage of the available IP address space and decreases the amount of routing table entries. On the other hand, the processing power needed for forwarding lookup is increased.

Thus today an IP router's database consists of a number of address prefixes. When an IP router receives a packet, it must compute which of the prefixes in its database has the longest match when compared to the destination address in the packet. The packet is then forwarded to the output link associated with that prefix. For example, a forwarding database may have the prefixes P1 = 0101, P2 = 0101101 and P3 = 010110101011. An address whose first 12 bits are 010101101011 has longest matching prefix P1. On the other hand, an address whose first 12 bits are 010110101101 has longest matching prefix P2.
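To check the example above mechanically, here is a throwaway sketch (our own illustration, not from the paper; `longest_matching_prefix` is a hypothetical helper) that scans the three prefixes and keeps the longest one that matches:

    def longest_matching_prefix(address_bits, prefixes):
        """Return the label of the longest prefix in `prefixes` matching the address."""
        best, best_len = None, -1
        for label, bits in prefixes.items():
            if address_bits.startswith(bits) and len(bits) > best_len:
                best, best_len = label, len(bits)
        return best

    prefixes = {"P1": "0101", "P2": "0101101", "P3": "010110101011"}
    print(longest_matching_prefix("010101101011", prefixes))   # -> P1
    print(longest_matching_prefix("010110101101", prefixes))   # -> P2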
The use of best matching prefix in forwarding has allowed IP routers to accommodate various levels of address hierarchies, and has allowed different parts of the network to have different views of the address hierarchy. Given that best matching prefix forwarding is necessary for hierarchies, and hashing is a natural solution for exact matching, a natural question is: "Why can't we modify hashing to do best matching prefix?" However, for several years now, it was considered not to be "apparent how to accommodate hierarchies while using hashing, other than rehashing for each level of hierarchy possible" [Skl93].
Our paper describes a novel algorithmic solution to longest prefix match, using binary search over hash tables organized by the length of the prefix. Our solution requires a worst case complexity³ of O(log2 W), with W being the length of the address in bits. Thus, for the current Internet protocol suite (IPv4) with 32 bit addresses, we need at most 5 hash lookups. For the upcoming IP version 6 (IPv6) with 128 bit addresses, we can do lookup in at most 7 steps, as opposed to 128 in current algorithms (see Section 2), giving an order of magnitude performance improvement. Using perfect hashing, we can lookup 128 bit IP addresses in at most 7 memory accesses. This is significant because on current RISC processors, hash functions can be found whose computation is cheaper than a memory access.

In addition, we use several optimizations to significantly reduce the average number of hashes needed. For example, our analysis of an IPv4 forwarding table from an Internet backbone router at the Mae-East network access point (NAP) [Mer96] shows an average case performance of less than two hashes, where the first hash can be replaced by a simple index table lookup.

The rest of the paper is organized as follows. Section 2 describes drawbacks with existing approaches to IP lookups. Section 3 describes our basic scheme in a series of refinements that culminate in the basic binary search scheme. Section 4 describes a series of important optimizations to the basic scheme that improve average performance. Section 5 describes our implementation, including algorithms to build the data structure and perform insertions and deletions. Section 6 describes performance measurements using our scheme for IPv4 addresses, and performance projections for IPv6 addresses. We conclude in Section 7 by assessing the theoretical and practical contributions of this paper.
2 Existing Approaches to IP Lookup

We survey existing approaches to IP lookups and their problems. We discuss approaches based on modifying exact matching schemes, trie based schemes, hardware solutions based on parallelism, proposals for protocol changes to simplify IP lookup, and caching solutions. For the rest of this paper, we use BMP as a shorthand for Best Matching Prefix.

Modifications of Exact Matching Schemes

Classical fast lookup techniques such as hashing and binary search do not directly apply to the best matching prefix (BMP) problem since they only do exact matches. A modified binary search technique, originally due to Butler Lampson, is described in [Per92]. However, this method requires log2(2N) steps, with N being the number of routing table entries. With current routing table sizes, the worst case would be 17 data lookups, each requiring at least one costly memory access. As with any binary search scheme, the average number of accesses is log2(2N) - 1. A second classical solution would be to reapply any exact match scheme for each possible prefix length [Skl93]. This is even more expensive, requiring W iterations of the exact match scheme used (e.g. W = 128 for IPv6).

³ This assumes O(1) for hashing, which can be achieved using perfect hashing, although limited collisions do not affect performance significantly.
Trie Based Schemes

The most commonly available IP lookup implementation is found in the BSD kernel, and is a radix trie implementation [Skl93]. If W is the length of an address, the worst-case time in the basic implementation can be shown to be O(W^2). Current implementations have made a number of improvements on Sklower's original implementation. The worst case was improved to O(W) by requiring that the prefix be contiguous (previously non-contiguous masks were allowed, a feature which was never used). Despite this, the implementation requires up to 32 or 128 costly memory accesses (for IPv4 or IPv6, respectively). Tries also can have large storage requirements.
Hardware Solutions

Hardware solutions can potentially use parallelism to gain lookup speed. For exact matches, this is done using Content Addressable Memories (CAMs) in which every memory location, in parallel, compares the input key value to the content of that memory location.

Some CAMs allow a mask of bits that must be matched. Although there are expensive so-called ternary CAMs available allowing a mask to be specified per word, the mask must typically be specified in advance. It has been shown that these CAMs can be used to do BMP lookups [MF93, MTW95], but the solutions are usually expensive.

Large CAMs are usually slower and much more expensive than ordinary memory. Typical CAMs are small, both in the number of bits per entry and the number of entries. Thus the CAM memory for large address/mask pairs (256 bits needed for IPv6) and a huge amount of prefixes appears (currently) to be very expensive. Another possibility is to use a number of CAMs doing parallel lookups for each prefix length. Again, this seems expensive. Probably the most fundamental problem with CAMs is that CAM designs have not historically kept pace with improvements in RAM memory. Thus a CAM based solution (or indeed any hardware solution) runs the risk of being made obsolete, in a few years, by software technology running on faster processors and memory.
Protocol Based Solutions

One way to get around the problems of IP lookup is to have extra information sent along with the packet to simplify or even totally get rid of IP lookups at routers. Two major proposals along these lines are IP Switching [NMH97] and Tag Switching [CV95, CV96, R+96]. Both schemes require large, contiguous parts of the network to adopt their protocol changes before they will show a major improvement. The speedup is achieved by adding information on the destination to every IP packet.

In IP Switching, this is done by associating a flow of packets with an ATM Virtual Circuit; in Tag Switching, this is done by adding a "tag" to each packet, where a "tag" is a small integer that allows direct lookup in the router's forwarding table. Tag switching is based on a concept originally described by Chandranmenon and Varghese ([CV95, CV96]) using the name "threaded indices". The current tag switching proposal [R+96] goes further than threaded indices by adding a stack of indices to deal with hierarchies.

Neither scheme can completely avoid ordinary IP lookups. Both schemes require the ingress router (to the portions of the network implementing their protocol) to perform a full routing decision. In their basic form, both systems potentially require the boundary routers between autonomous systems (e.g., between a company and its ISP or between ISPs) to perform the full forwarding decision again, because of trust issues, scarce resources, or different views of the network. Scarce resources can be ATM VCs or tags, of which only a small amount exists. Thus towards the backbone, they need to be aggregated; away from the backbone, they need to be separated again.

Different views of the network can arise because systems often know more details about their own and adjacent networks than about networks further away. Although Tag Switching addresses that problem by allowing hierarchical stacking of tags, this affects routing scalability. Tag Switching assigns and distributes tags based on routing information; thus every originating network now has to know tags in the destination networks. Thus while both tag switching and IP switching can provide good performance within a level of hierarchy, neither solution currently does well at hierarchy boundaries without scaling problems.

Caching

For years, designers of fast routers have resorted to caching to claim high speed IP lookups. This is problematic for several reasons. First, information is typically cached on the entire address, potentially diluting the cache with hundreds of addresses that map to the same prefix. Second, a typical backbone router of the future may have hundreds of thousands of prefixes and be expected to forward packets at Gigabit rates. Although studies have shown that caching in the backbone can result in hit ratios up to and exceeding 90 percent [Par96, NMH97], the simulations of cache behavior were done on large, fully associative caches which commonly are implemented using CAMs. CAMs, as already mentioned, are usually expensive. It is not clear how set associative caches will perform and whether caching will be able to keep up with the growth of the Internet. So caching does help, but does not avoid the need for fast BMP lookups, especially in view of current network speedups.
Summary

In summary, all existing schemes have problems of either performance, scalability, generality, or cost. Lookup schemes based on tries and binary search are (currently) too slow and do not scale well; CAM solutions are expensive and carry the risk of being quickly outdated; tag and IP switching solutions require widespread agreement on protocol changes, and still require BMP lookups in portions of the network; finally, locality patterns at backbone routers make it infeasible to depend entirely on caching.

We now describe a scheme that has good performance, excellent scalability, and does not require protocol changes. Our scheme also allows a cheap, fast software implementation, and also a more expensive (but faster) hardware implementation.
3 Basic Binary Search Scheme

Our basic algorithm is based on three significant ideas, of which only the first has been reported before. First, we use hashing to check whether an address D matches any prefix of a particular length; second, we use binary search to reduce the number of searches from linear to logarithmic; third, we use precomputation to prevent backtracking in case of failures in the binary search of a range. Rather than present the final solution directly, we will gradually refine these ideas in Section 3.1, Section 3.2, and Section 3.5 to arrive at a working basic scheme. We describe further optimizations to the basic scheme in the next section.
3.1 Linear Search of Hash Tables

Our point of departure is a simple scheme that does linear search of hash tables organized by prefix lengths. We will improve this scheme shortly to do binary search on the hash tables.

The idea is to look for all prefixes of a certain length L using hashing and use multiple hashes to find the best matching prefix, starting with the largest value of L and working backwards. Thus we start by dividing the database of prefixes according to lengths.
[Figure 1: Hash tables for each possible prefix length — the prefixes 01010 (length 5), 0101011 and 0110110 (length 7), and 011011010101 (length 12), each stored in the hash table for its length.]

Assuming a particularly tiny routing table with four prefixes of length 5, 7, 7, and 12, respectively, each of them would be stored in the hash table for its length (Figure 1). So each set of prefixes of distinct length is organized as a hash table. If we have a sorted array L corresponding to the distinct lengths, we only have 3 entries in the array, with a pointer to the longest length hash table in the last entry of the array.
To search for address D, we simply start with the longest length hash table l (i.e. 12 in the example), extract the first l bits of D, and do a search in the hash table for length l entries. If we succeed, we have found a BMP⁴; if not, we look at the first length smaller than l, say l' (this is easy to find if we have the array L by simply indexing one position less than the position of l), and continue the search.

More concretely, let L be an array of records. L[i].length is the length of prefixes found at position i, and L[i].hash is a pointer to a hash table containing all prefixes of length L[i].length. The resulting code is shown in Figure 2.
Function LinearSearch(D) (* search for address D *)
  Initialize BMP to the empty string;
  i := Highest index in array L;
  While (BMP = nil) and (i ≥ 0) do
    Extract the first L[i].length bits of D into D';
    BMP := Search(D', L[i].hash); (* search hash for D' *)
    i := i - 1;
  Endwhile

Figure 2: Linear Search
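The pseudocode in Figure 2 translates almost directly into a runnable form. The following sketch is our own illustration (function names such as `build_tables` and `linear_search` are ours): it builds one Python dict per prefix length and probes them from the longest length down.

    def build_tables(prefixes):
        """Group prefixes (bit strings) into one hash table per distinct length."""
        tables = {}
        for p in prefixes:
            tables.setdefault(len(p), set()).add(p)
        lengths = sorted(tables)            # the sorted array L of distinct lengths
        return tables, lengths

    def linear_search(addr_bits, tables, lengths):
        """Probe the hash tables from the longest length down; the first hit is the BMP."""
        for length in reversed(lengths):
            candidate = addr_bits[:length]
            if candidate in tables[length]:
                return candidate
        return None

    tables, lengths = build_tables(["01010", "0101011", "0110110", "011011010101"])
    print(linear_search("011011010101", tables, lengths))   # -> 011011010101
    print(linear_search("010100000000", tables, lengths))   # -> 01010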
3.2 Binary Search of Hash Tables

The previous scheme essentially does (in the worst case) linear search among all distinct string lengths. Linear search requires O(W) expected time (more precisely, O(Wdist), where Wdist ≤ W is the number of distinct lengths in the database).

A better search strategy is to use binary search on the array L to cut down the number of hashes to O(log2 Wdist). However, for binary search to work, we need markers in tables corresponding to shorter lengths to point to prefixes of greater lengths. Markers are needed to direct binary search to look for matching prefixes of greater length. Here is an example to illustrate the need for markers.
Suppose we have the prefixes P1 = 0, P2 = 00, P3 = 111 (Figure 3 (b)). Assume that the zeroth entry of L points to P1's hash table, the first to P2's hash table, and the second points to P3's hash table. Suppose we search for 111. Binary search (a) would start at the middle hash table and search for 11 in the hash table containing P2 (the triangles denote a pointer to the hash table to search). It would fail and have no indication that it should search among the longer prefix tables for a better matching prefix. To fix this problem, we simply add a marker entry 11 to the middle table. Now when binary search is done for 111, we will lookup 11 in the middle hash table and find the marker node. This can be used to direct binary search to the lower half of the table.

⁴ Recall that BMP stands for Best Matching Prefix. We use this abbreviation through the rest of the paper.

[Figure 3: Binary Search on Hash Tables — (a) binary search over the array L, (b) the hash tables for P1 = 0, P2 = 00, P3 = 111 ordered by increasing prefix length, and (c) the same hash tables with the marker 11 added to the middle table.]
[Figure 4: Binary Search on Trie Levels — the trie of prefixes with horizontal stripes grouping each prefix length, and the binary search tree over the levels shown at the left.]

Each hash table (markers plus real prefixes) can be thought of as a horizontal layer of a trie corresponding to some length L (except that the hash table contains the complete path to that layer of each entry in that layer). Our basic scheme is essentially doing binary search on the levels of a trie (Figure 4). We start by doing a hash on prefixes corresponding to the median length of the trie. If we match, we search the lower half of the trie (longer prefixes); if we fail, we search the upper half of the trie.

Figure 4 and other figures describing search order contain several elements: (1) horizontal stripes grouping all the elements of a specified prefix length, (2) a trie containing the prefixes, shown on the right of the figure and rooted at the top of the figure, and (3) a binary tree, shown on the left of the figure and rooted at the left, which depicts all possible paths that binary search can follow. We will use upper half to mean the half of the trie with prefix lengths strictly less than the median length. We also use lower half for the portion of the trie with prefix lengths strictly greater than the median length. It is important to understand the conventions in Figure 4 to understand the later figures and text.
3.3 Reducing Marker Storage

The following definitions are useful before proceeding. For a prefix P in the table, define Level(P) to be the integer i for which L[i].length = length(P) (i.e., the index of the entry in L that points to P's hash table). Also, "up" refers to shorter, and "down" to longer prefixes.

    Total entries                               33199   100%
    Entries needing no markers                   4743    14%
    Entries needing 1 marker                    22505    68%
    Entries needing 2 markers                    3562    11%
    Entries needing 3 markers                    2389     7%
    Total markers requested (before sharing)    36796   111%
    Total markers                                8699    26%
    Pure markers                                 8336    25%

Table 1: Marker Overhead for Backbone Forwarding Table
How many markers do we need? A naive view would indicate placing a marker for prefix P at all levels in L higher than the level of P. However, it suffices to place markers at all levels in L that could be visited by binary search when looking for an entry whose BMP is P. This reduces the number of markers to at most log2 W per real prefix, which keeps the storage expansion modest. More precisely, if Level(P) is written down in binary as a1 a2 ... an, then we need a marker at each level a1 a2 ... ak 0 0 ... 0 such that ak = 1. (We assume that L is padded so that its size is a power of 2.) In fact, the number of marker nodes is limited by the number of 1 bits in Level(P). Clearly this results in a logarithmic number of markers.

In the typical case, many prefixes will share markers (Table 1), reducing the marker storage further. In our sample routing database [Mer96], the storage required will increase by 25%. However, it is easy to give a worst case example where the storage needs require O(log2 W) markers per prefix. (Consider N prefixes whose first log2 N bits are all distinct and whose remaining bits are all 1's.)
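One way to read the marker-placement rule operationally: a prefix needs a marker exactly at those probe points from which binary search must be steered "down" (toward longer lengths) to reach it. The sketch below is our own illustration; the 0-based midpoint convention is an assumption (the paper states the rule via the binary representation of Level(P)), but it enumerates the same kind of level set and confirms the log2 bound.

    def marker_levels(target, size):
        """Indices into the (padded) length array L at which a prefix stored at index
        `target` needs a marker: the probe points from which binary search must be
        steered toward longer prefixes to reach `target`."""
        lo, hi, levels = 0, size - 1, []
        while lo <= hi:
            mid = (lo + hi) // 2
            if mid == target:
                break
            if mid < target:
                levels.append(mid)   # probe lands on a shorter length: marker needed here
                lo = mid + 1
            else:
                hi = mid - 1         # probe lands on a longer length: no marker needed
        return levels

    print(marker_levels(7, 8))                                   # -> [3, 5, 6]
    print(max(len(marker_levels(t, 32)) for t in range(32)))     # -> 5, i.e. log2(32)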
3.4 Problems with Backtracking

Function NaiveBinarySearch(D) (* search for address D *)
  Initialize search range R to cover the whole array L;
  While R is not a single entry do
    Let i correspond to the middle level in range R;
    Extract the first L[i].length bits of D into D';
    Search(D', L[i].hash); (* search hash table for D' *)
    If found then set R := lower half of R; (* longer prefixes *)
    Else set R := upper half of R; (* shorter prefixes *)
    Endif
  Endwhile

Figure 5: Naive Binary Search

Binary search of hash tables can be expressed as shown in Figure 5. Unfortunately, this algorithm is not correct as it stands and does not take logarithmic time if implemented naively. The problem is that while markers are good things (they lead to potentially better prefixes lower in the table), they can also cause the search to follow false leads which may fail. In case of failure, we would have to modify the binary search (for correctness) to backtrack and search the upper half of R again. Such a naive modification can lead us back to linear time search. An example will clarify this.

First consider the prefixes P1 = 1, P2 = 00, P3 = 111. As discussed above, we add a marker to the middle table so that the middle hash table contains 00 (a real prefix) and 11 (a marker pointing down to P3). Now consider a search for 110. We start at the middle hash table and get a hit; thus we search the third hash table for 110 and fail. But the correct best matching prefix is at the first level hash table, i.e., P1. The marker indicating that there will be longer prefixes, indispensable to find P3, was misleading in this case; so apparently, we have to go back and search the upper half of the range.
The fact that each entry contributes at most log2 W markers may cause some readers to suspect that the worst case with backtracking is limited to O(log2 W). This is incorrect. The worst case is O(W). The worst-case example for say W bits is as follows: we have a prefix Pi of length i, for 1 ≤ i < W, that contains all 0s. In addition we have the prefix Q whose first W-1 bits are all zeroes, but whose last bit is a 1. If we search for the W bit address containing all zeroes then we can show that binary search with backtracking will take O(W) time and visit every level in the table. (The problem is that every level contains a false marker that indicates the presence of something better below.)
3.5 Precomputation to Avoid Backtracking

We use precomputation to avoid backtracking when we shrink the current range R to the lower half of R (which happens when we find a marker at the mid point of R). Suppose every marker node M is a record that contains a variable M.bmp, which is the value of the best matching prefix of the marker M. M.bmp can be precomputed when the marker M is inserted into its hash table. Now, when we find M at the mid point of R, we indeed search the lower half, but we also remember the value of M.bmp as the current best matching prefix. Now if the lower half of R fails to produce anything interesting, we need not backtrack, because the results of the backtracking are already summarized in the value of M.bmp. The new code is shown in Figure 6.
Function BinarySearch(D) (* search for address D *)
  Initialize search range R to cover the whole array L;
  Initialize BMP found so far to null string;
  While R is not empty do
    Let i correspond to the middle level in range R;
    Extract the first L[i].length bits of D into D';
    M := Search(D', L[i].hash); (* search hash for D' *)
    If M is nil Then set R := upper half of R; (* not found *)
    Elseif M is a prefix and not a marker Then BMP := M.bmp; break; (* exit loop *)
    Else (* M is a pure marker, or marker and prefix *)
      BMP := M.bmp; (* update best matching prefix so far *)
      R := lower half of R;
    Endif
  Endwhile

Figure 6: Binary Search
The standard invariant for binary search when searching for key K is: "K is in range R". We then shrink R while preserving this invariant. The invariant for this algorithm, when searching for key K, is: "EITHER (The Best Matching Prefix of K is BMP) OR (There is a longer matching prefix in R)".

It is easy to see that initialization preserves this invariant, and each of the search cases preserves this invariant (this can be established using an inductive proof). Finally, the invariant implies the correct result when the range shrinks to 1. Thus the algorithm works correctly; also since it has no backtracking, it takes O(log2 Wdist) time.
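Putting the pieces of Section 3 together, the following self-contained sketch is our own illustrative code, not the authors' implementation: it builds the length array and per-length hash tables, inserts markers only at the probe points binary search can visit, precomputes each entry's bmp by brute force, and then runs the search of Figure 6 without backtracking. The entry layout, the 0-based midpoint rule, and the brute-force bmp computation are simplifications for clarity.

    def _marker_indices(target, size):
        """Probe points from which binary search over [0, size) must branch toward
        longer lengths to reach `target`; these levels need a marker."""
        lo, hi, idxs = 0, size - 1, []
        while lo <= hi:
            mid = (lo + hi) // 2
            if mid == target:
                break
            if mid < target:
                idxs.append(mid)
                lo = mid + 1
            else:
                hi = mid - 1
        return idxs

    def build(prefixes):
        """Build the sorted length array L and one hash table per length; every entry
        records whether it is a real prefix and/or a marker, plus its precomputed bmp."""
        lengths = sorted({len(p) for p in prefixes})
        tables = {l: {} for l in lengths}

        def bmp_of(bits):
            """Brute-force best matching prefix of `bits` within the prefix set."""
            best = None
            for p in prefixes:
                if bits.startswith(p) and (best is None or len(p) > len(best)):
                    best = p
            return best

        for p in prefixes:
            entry = tables[len(p)].setdefault(
                p, {"prefix": False, "marker": False, "bmp": None})
            entry["prefix"], entry["bmp"] = True, p
            for idx in _marker_indices(lengths.index(len(p)), len(lengths)):
                m = p[:lengths[idx]]
                e = tables[lengths[idx]].setdefault(
                    m, {"prefix": False, "marker": False, "bmp": None})
                e["marker"] = True
                if not e["prefix"]:
                    e["bmp"] = bmp_of(m)       # precomputed: avoids backtracking later
        return lengths, tables

    def lookup(addr, lengths, tables):
        """Binary search on prefix lengths with precomputed bmp (cf. Figure 6)."""
        lo, hi, bmp = 0, len(lengths) - 1, None
        while lo <= hi:
            mid = (lo + hi) // 2
            entry = tables[lengths[mid]].get(addr[:lengths[mid]])
            if entry is None:
                hi = mid - 1                   # miss: only shorter lengths remain
            elif entry["prefix"] and not entry["marker"]:
                return entry["bmp"]            # real prefix with no longer extension
            else:
                bmp = entry["bmp"]             # marker: remember bmp, search longer lengths
                lo = mid + 1
        return bmp

    lengths, tables = build(["1", "00", "111"])
    print(lookup("110", lengths, tables))   # -> 1   (the false lead at 11 does no harm)
    print(lookup("111", lengths, tables))   # -> 111
    print(lookup("001", lengths, tables))   # -> 00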
4 Refinements to Basic Scheme

The basic scheme described in Section 3 takes just 7 hash computations, in the worst case, for 128 bit IPv6 addresses. However, each hash computation takes at least one access to memory; at gigabit speeds each memory access is significant. Thus, in this section, we explore a series of optimizations that exploit the deeper structure inherent in the problem to reduce the average number of hash computations.
4.1 Asymmetric Binary Search

[Figure 7: Histogram of the prefix length distribution — prefix length (8 to 32) versus number of entries, frequency plotted on a logarithmic scale from 1 to 100,000.]
We first describe a series of simple-minded optimizations. Our main optimization, mutating binary search, is described in the next section. A reader can safely skip to Section 4.2 on a first reading.

The current algorithm is a fast, yet very general, BMP search engine. Usually, the performance of general algorithms can be improved by tailoring them to the particular datasets they will be applied to. As can be seen in Figure 7, which shows the distribution of a typical backbone router's forwarding table as obtained from [Mer96], the entries are not equally distributed over the different prefix lengths. All the concepts we describe below apply to any set of addresses; however, we will quantify the potential improvements using the existing table.

As the first improvement, which has already been mentioned and used in the basic scheme, the search can be limited to those prefix lengths which do contain at least one entry, reducing the worst case number of hashes from log2 W (5 with W = 32) to log2 Wdist (4.5 with Wdist = 23, the number of non-empty buckets in the histogram), as shown in Figure 8. (While this is an improvement for the worst case, in this case, it harms the average performance, as we will see later.)
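A quick check of the arithmetic above (ours, not the paper's): restricting the search to the 23 non-empty prefix lengths only changes the logarithm's argument, not its ceiling.

    from math import log2

    W, W_dist = 32, 23              # IPv4 address bits vs. non-empty prefix lengths in [Mer96]
    print(log2(W))                  # 5.0  -> at most 5 hashes over all 32 lengths
    print(round(log2(W_dist), 2))   # 4.52 -> the "4.5" quoted for the 23 used lengths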

Citations
Journal ArticleDOI
TL;DR: On conventional PC hardware, the Click IP router achieves a maximum loss-free forwarding rate of 333,000 64-byte packets per second, demonstrating that Click's modular and flexible architecture is compatible with good performance.
Abstract: Click is a new software architecture for building flexible and configurable routers. A Click router is assembled from packet processing modules called elements. Individual elements implement simple router functions like packet classification, queuing, scheduling, and interfacing with network devices. A router configuration is a directed graph with elements at the vertices; packets flow along the edges of the graph. Several features make individual elements more powerful and complex configurations easier to write, including pull connections, which model packet flow driven by transmitting hardware devices, and flow-based router context, which helps an element locate other interesting elements. Click configurations are modular and easy to extend. A standards-compliant Click IP router has 16 elements on its forwarding path; some of its elements are also useful in Ethernet switches and IP tunnelling configurations. Extending the IP router to support dropping policies, fairness among flows, or Differentiated Services simply requires adding a couple of elements at the right place. On conventional PC hardware, the Click IP router achieves a maximum loss-free forwarding rate of 333,000 64-byte packets per second, demonstrating that Click's modular and flexible architecture is compatible with good performance.

2,595 citations

Proceedings ArticleDOI
12 Dec 1999
TL;DR: The Click IP router can forward 64-byte packets at 73,000 packets per second, just 10% slower than Linux alone, and is easy to extend by adding additional elements, which are demonstrated with augmented configurations.
Abstract: Click is a new software architecture for building flexible and configurable routers. A Click router is assembled from packet processing modules called elements. Individual elements implement simple router functions like packet classification, queueing, scheduling, and interfacing with network devices. Complete configurations are built by connecting elements into a graph; packets flow along the graph's edges. Several features make individual elements more powerful and complex configurations easier to write, including pull processing, which models packet flow driven by transmitting interfaces, and flow-based router context, which helps an element locate other interesting elements.We demonstrate several working configurations, including an IP router and an Ethernet bridge. These configurations are modular---the IP router has 16 elements on the forwarding path---and easy to extend by adding additional elements, which we demonstrate with augmented configurations. On commodity PC hardware running Linux, the Click IP router can forward 64-byte packets at 73,000 packets per second, just 10% slower than Linux alone.

1,608 citations


Cites background from "Scalable high speed IP routing look..."

  • ...Increasing the routing table size would also decrease performance, a problem existing work on fast lookup in large tables could address [10, 29]....


Proceedings ArticleDOI
30 Aug 1999
TL;DR: It is found that a simple multi-stage classification algorithm, called RFC (recursive flow classification), can classify 30 million packets per second in pipelined hardware, or one million packetsper second in software.
Abstract: Routers classify packets to determine which flow they belong to, and to decide what service they should receive. Classification may, in general, be based on an arbitrary number of fields in the packet header. Performing classification quickly on an arbitrary number of fields is known to be difficult, and has poor worst-case performance. In this paper, we consider a number of classifiers taken from real networks. We find that the classifiers contain considerable structure and redundancy that can be exploited by the classification algorithm. In particular, we find that a simple multi-stage classification algorithm, called RFC (recursive flow classification), can classify 30 million packets per second in pipelined hardware, or one million packets per second in software.

822 citations


Cites background from "Scalable high speed IP routing look..."

  • ...Figure 1: Example network of an ISP (ISP1) connected to two enterprise networks (E1 and E2) and to two other ISP networks across a NAP. ...algorithms have been developed (e.g. [1][5][7][9][16]), attention has turned to the more general problem of packet classification....


Journal ArticleDOI
01 Jul 2002
TL;DR: The design involves both a local mechanism for detecting and controlling an aggregate at a single router, and a cooperative pushback mechanism in which a router can ask upstream routers to control an aggregate.
Abstract: The current Internet infrastructure has very few built-in protection mechanisms, and is therefore vulnerable to attacks and failures. In particular, recent events have illustrated the Internet's vulnerability to both denial of service (DoS) attacks and flash crowds in which one or more links in the network (or servers at the edge of the network) become severely congested. In both DoS attacks and flash crowds the congestion is due neither to a single flow, nor to a general increase in traffic, but to a well-defined subset of the traffic --- an aggregate. This paper proposes mechanisms for detecting and controlling such high bandwidth aggregates. Our design involves both a local mechanism for detecting and controlling an aggregate at a single router, and a cooperative pushback mechanism in which a router can ask upstream routers to control an aggregate. While certainly not a panacea, these mechanisms could provide some needed relief from flash crowds and flooding-style DoS attacks. The presentation in this paper is a first step towards a more rigorous evaluation of these mechanisms.

808 citations

01 Jan 1999
TL;DR: In this paper, the authors present a framework for the emerging Internet Quality of Service (QoS). All the important components of this framework, i.e., Integrated Services, RSVP, Differentiated Services, Multi-Protocol Label Switching (MPLS) and Constraint Based Routing, are covered.
Abstract: In this paper we present a framework for the emerging Internet Quality of Service (QoS). All the important components of this framework, i.e., Integrated Services, RSVP, Differentiated Services, Multi-Protocol Label Switching (MPLS) and Constraint Based Routing, are covered. We describe what Integrated Services and Differentiated Services are, how they can be implemented, and the problems they have. We then describe why MPLS and Constraint Based Routing have been introduced into this framework, how they differ from and relate to each other, and where they fit into the Differentiated Services architecture. Two likely service architectures are presented, and the end-to-end service deliveries in these two architectures are illustrated. We also compare ATM networks to router networks with Differentiated Services and MPLS. Putting all these together, we give the readers a grasp of the big picture of the emerging Internet QoS.

803 citations

References
01 Jan 1998
TL;DR: This document specifies version 6 of the Internet Protocol (IPv6), also sometimes referred to as IP Next Generation or IPng.

1,886 citations

01 Dec 1995
TL;DR: This specification defines the addressing architecture of the IP Version 6 protocol [IPV6], which includes the IPv6 addressing model, text representations of IPv6 addresses, definition of IPv 6 unicast addresses, anycast addresses, and multicast addressing, and an IPv6 node's required addresses.
Abstract: This specification defines the addressing architecture of the IP Version 6 protocol [IPV6]. The document includes the IPv6 addressing model, text representations of IPv6 addresses, definition of IPv6 unicast addresses, anycast addresses, and multicast addresses, and an IPv6 node's required addresses.

771 citations

01 Sep 1993
TL;DR: In this paper, the authors discuss strategies for address assignment of the existing IP address space with a view to conserve the address space and stem the explosive growth of routing tables in default-route-free routers.
Abstract: This memo discusses strategies for address assignment of the existing IP address space with a view to conserve the address space and stem the explosive growth of routing tables in default-route-free routers.

487 citations

Proceedings ArticleDOI
28 Mar 1993
TL;DR: The authors investigate fast routing table lookup techniques, where the table is composed of hierarchical addresses such as those found in a national telephone network, and several quick lookup solutions for hierarchical address based on binary and ternary CAMs are presented.
Abstract: The authors investigate fast routing table lookup techniques, where the table is composed of hierarchical addresses such as those found in a national telephone network. The hierarchical addresses provide important benefits in large networks, but existing fast routing table lookup techniques, based on hardware such as content addressable memory (CAM), work only with flat addresses. Several fast routing table lookup solutions for hierarchical address based on binary and ternary CAMs are presented, and their advantages and drawbacks are analyzed. >

441 citations

Book
01 May 1992
TL;DR: In this article, the authors describe how Routers and bridges are needed to form networks of reasonable size and why it is necessary to understand these devices in order to manage a network.
Abstract: Computer networks, once devoted primarily to research, are now an integral part of modern life. People rely on them to do real work, such as handling bank transactions or making airline reservations. Routers and bridges are needed to form networks of reasonable size. To manage a network it is necessary to understand these devices.

334 citations


"Scalable high speed IP routing look..." refers methods in this paper

  • ...A modified binary search technique, originally due to Butler Lampson, is described in [ Per92 ]....


Frequently Asked Questions (11)
Q1. What are the future works mentioned in the paper "Scalable high speed ip routing lookups" ?

The authors expect most of the characteristics of this address structure to strengthen in the future, especially with the transition to IPv6. Future work on their algorithm includes theoretical work on a choice of balancing function, hopefully yielding an improvement over their ad-hoc heuristic functions. With algorithms such as ours, the authors believe that there is no more reason for router throughputs to be limited by the speed of their lookup engine. The authors also do not believe that hardware lookup engines are required because their algorithm can be implemented in software and still perform well. 

The authors also introduce Mutating Binary Search and other optimizations that, for a typical IPv4 backbone router with over 33,000 entries, considerably reduce the average number of hashes to less than 2, of which one hash can be simplified to an indexed array access. 

To minimize storage in the forwarding database, a single bit can be used to decide whether the rope or only a pointer to a rope is stored in a node. 

For each possible n bit prefix, the authors could draw 2^n individual histograms with possibly fewer non-empty buckets, thus reducing the depth of the search tree. 

For IPv6, 64 bits of Rope is more than sufficient, though it seems possible to get away with 32 bits of Rope in most practical cases. 

As long as only a few entries with even fewer distinct prefix lengths dominate the traffic characteristics, the solution can be found easily. 

The authors also do not believe that hardware lookup engines are required because their algorithm can be implemented in software and still perform well. 

To make better use of this scarce resource, especially the class B addresses, bundles of class C networks were given out instead of class B addresses. 

The best matching prefix problem has been around for twenty years in theoretical computer science; to the best of their knowledge, the best theoretical algorithms are based on tries. 

Thus standard techniques for exact matching, such as perfect hashing, binary search, and standard Content Addressable Memories (CAMs) cannot directly be used for Internet address lookups. 

For simplicity of implementation, the list of prefixes is assumed to be sorted by increasing prefix length in advance (O(N) using bucket sort).