Scalable high speed IP routing lookups
Summary
1 Introduction
- The Internet is becoming ubiquitous: everyone wants to join in.
- The increasing traffic demand requires three key factors to keep pace if the Internet is to continue to provide good service: link speeds, router data throughput, and packet forwarding rates.
- In their paper, the authors distinguish between routing (a process that computes a database mapping destination networks to output links) and forwarding (a process by which the routing database is consulted to decide which output link a single packet should be forwarded on).
- Instead of having multiple routing entries for each subnet in a large network, just two entries are needed: one for the big network, and a more specific one for the small subnet (which takes precedence if both match).
- Thus, for the current Internet protocol suite (IPv4) with 32 bit addresses, the authors need at most log2(32) = 5 hash lookups.
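The two-entry aggregation described above can be illustrated with a minimal sketch of longest-prefix matching (the prefixes and link names below are hypothetical, not from the paper):

```python
# Minimal best-matching-prefix illustration (hypothetical table).
# Prefixes are (bit-string, output-link) pairs; the longest matching
# prefix decides the forwarding link.

TABLE = {
    "1000": "link-A",      # the big network
    "10001100": "link-B",  # a more specific subnet inside it
}

def best_match(addr_bits):
    best = None
    for prefix, link in TABLE.items():
        if addr_bits.startswith(prefix):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, link)
    return best[1] if best else None

print(best_match("100011001111"))  # matches both; the subnet wins -> link-B
print(best_match("100000000000"))  # matches only the big network -> link-A
```

An address inside the subnet matches both entries, and the more specific one wins; all other addresses in the big network fall through to the aggregate entry.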
2 Existing Approaches to IP Lookup
- The authors discuss approaches based on modifying exact matching schemes, trie based schemes, hardware solutions based on parallelism, proposals for protocol changes to simplify IP lookup, and caching solutions.
- For the rest of this paper, the authors use BMP as a shorthand for Best Matching Prefix.
Modifications of Exact Matching Schemes
- Classical fast lookup techniques such as hashing and binary search do not directly apply to the best matching prefix (BMP) problem, since they only do exact matches.
- A modified binary search technique [Per92] requires log2(2N) steps, with N being the number of routing table entries.
- With current routing table sizes, the worst case would be 17 data lookups, each requiring at least one costly memory access.
- A second classical solution would be to reapply any exact match scheme for each possible prefix length [Skl93].
- This is even more expensive, requiring W iterations of the exact match scheme used (e.g., W = 128 for IPv6).
Trie Based Schemes
- The most commonly available IP lookup implementation is found in the BSD kernel, and is a radix trie implementation [Skl93].
- Current implementations have made a number of improvements on Sklower's original implementation.
- The worst case was improved to O(W) by requiring that the prefix be contiguous (previously non-contiguous masks were allowed, a feature which was never used).
- The implementation requires up to 32 or 128 costly memory accesses (for IPv4 or IPv6, respectively).
- Tries also can have large storage requirements.
Hardware Solutions
- Hardware solutions can potentially use parallelism to gain lookup speed.
- Large CAMs are usually slower and much more expensive than ordinary memory.
- In their basic form, both systems potentially require the boundary routers between autonomous systems (e.g., between a company and its ISP or between ISPs) to perform the full forwarding decision again, because of trust issues, scarce resources, or different views of the network.
- Thus while both tag switching and IP switching can provide good performance within a level of hierarchy, neither solution currently does well at hierarchy boundaries without scaling problems.
- For years, designers of fast routers have resorted to caching to claim high speed IP lookups.
Summary
- In summary, all existing schemes have problems of either performance, scalability, generality, or cost.
- The authors now describe a scheme that has good performance, excellent scalability, and does not require protocol changes.
- The authors' scheme also allows a cheap, fast software implementation, as well as a more expensive (but faster) hardware implementation.
3 Basic Binary Search Scheme
- The authors' basic algorithm is based on three significant ideas, of which only the first has been reported before.
- Rather than present the final solution directly, the authors will gradually refine these ideas in Section 3.1, Section 3.2, and Section 3.5 to arrive at a working basic scheme.
- The authors describe further optimizations to the basic scheme in the next section.
3.1 Linear Search of Hash Tables
- The authors' point of departure is a simple scheme that does linear search of hash tables organized by prefix lengths.
- The authors will improve this scheme shortly to do binary search on the hash tables.
- The idea is to look for all prefixes of a certain length L using hashing and use multiple hashes to find the best matching prefix, starting with the largest value of L and working backwards.
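A minimal sketch of this linear scheme (the table contents below are hypothetical; a real IPv4 table would have lengths up to 32):

```python
# One hash table (here: a Python dict) per prefix length; probe from
# the longest length down. The first hit is the best matching prefix.

tables = {
    8:  {"10000000": "link-A"},
    16: {"1000000011110000": "link-B"},
}

def linear_search(addr_bits):
    for length in sorted(tables, reverse=True):
        if addr_bits[:length] in tables[length]:
            return tables[length][addr_bits[:length]]
    return None
```

Starting at the longest length means the first match found is automatically the best one, at the cost of probing every distinct length in the worst case.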
3.2 Binary Search of Hash Tables
- The previous scheme essentially does (in the worst case) linear search among all distinct string lengths.
- Linear search requires O(W) expected time (more precisely, O(Wdist), where Wdist ≤ W is the number of distinct prefix lengths in the database).
- Markers are needed to direct binary search to look for matching prefixes of greater length.
- The authors will use upper half to mean the half of the trie with prefix lengths strictly less than the median length.
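Which lengths receive markers for a given prefix can be sketched as the probe path of binary search over the lengths 1..W: markers are needed only at the probed lengths shorter than the prefix, since only those must direct the search toward longer lengths. A small illustration, assuming (for simplicity) that all W lengths are populated:

```python
def probe_path(L, W=32):
    """Lengths that binary search over 1..W probes before reaching L."""
    lo, hi, probes = 1, W, []
    while lo <= hi:
        mid = (lo + hi) // 2
        if mid == L:
            break
        probes.append(mid)
        if mid < L:
            lo = mid + 1   # a hit here must steer the search to longer lengths
        else:
            hi = mid - 1
    return probes

def marker_lengths(L, W=32):
    # Markers are only needed at probed lengths shorter than L,
    # so each prefix adds at most log2(W) markers.
    return [p for p in probe_path(L, W) if p < L]
```

For L = 21 and W = 32 the probe path is [16, 24, 20, 22], so a length-21 prefix leaves markers only at lengths 16 and 20.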
3.3 Reducing Marker Storage
- The following definitions are useful before proceeding.
- In the typical case, many prefixes will share markers (Table 1), reducing the marker storage further.
- (Consider N prefixes whose first log2(N) bits are all distinct and whose remaining bits are all 1's.)
- Unfortunately, this algorithm is not correct as it stands and does not take logarithmic time if implemented naively.
- In case of failure, the authors would have to modify the binary search (for correctness) to backtrack and search the upper half of R again.
3.5 Precomputation to Avoid Backtracking
- Suppose every marker node M is a record that contains a variable M.bmp, which is the value of the best matching prefix of the marker M. M.bmp can be precomputed when the marker M is inserted into its hash table.
- Now, when the authors find M at the midpoint of R, they indeed search the lower half, but they also remember the value of M.bmp as the current best matching prefix.
- The standard invariant for binary search when searching for key K is: "K is in range R".
- Finally, the invariant implies the correct result when the range shrinks to 1.
- Thus the algorithm works correctly; also, since it has no backtracking, it takes O(log2 Wdist) time.
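A runnable sketch (not the authors' exact code) of the basic scheme: binary search over the sorted distinct prefix lengths, with markers left on the search path and each marker's best matching prefix (bmp) precomputed so that no backtracking is ever needed. The toy routing table in the test below is hypothetical.

```python
def build(prefixes):
    """prefixes: {bit-string prefix: next hop}. Returns (levels, tables)."""
    levels = sorted({len(p) for p in prefixes})
    tables = {l: {} for l in levels}

    def bmp_of(bits):
        # Precompute the best matching real prefix of a marker (naive O(W)).
        for l in range(len(bits), 0, -1):
            if bits[:l] in prefixes:
                return prefixes[bits[:l]]
        return None

    for p, hop in prefixes.items():
        tables[len(p)][p] = {"bmp": hop}  # a real prefix is its own bmp
        # Leave markers at the levels binary search probes on the way to p.
        lo, hi, target = 0, len(levels) - 1, levels.index(len(p))
        while lo <= hi:
            mid = (lo + hi) // 2
            if mid == target:
                break
            if mid < target:  # marker needed: it directs the search toward p
                m = p[:levels[mid]]
                tables[levels[mid]].setdefault(m, {"bmp": bmp_of(m)})
                lo = mid + 1
            else:
                hi = mid - 1
    return levels, tables

def lookup(addr, levels, tables):
    best, lo, hi = None, 0, len(levels) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        entry = tables[levels[mid]].get(addr[:levels[mid]])
        if entry:  # hit: remember the precomputed bmp, try longer lengths
            if entry["bmp"] is not None:
                best = entry["bmp"]
            lo = mid + 1
        else:      # miss: try shorter lengths
            hi = mid - 1
    return best
```

With prefixes {"11": "A", "1101": "C", "110000": "B"}, address "110011" first hits the marker "1100" (whose precomputed bmp is "A"), then misses at length 6, and correctly returns "A" with no backtracking.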
4 Refinements to Basic Scheme
- The basic scheme described in Section 3 takes just 7 hash computations, in the worst case, for 128 bit IPv6 addresses.
- Each hash computation takes at least one access to memory; at gigabit speeds each memory access is significant.
- Thus, in this section, the authors explore a series of optimizations that exploit the deeper structure inherent in the problem to reduce the average number of hash computations.
- The authors main optimization, mutating binary search, is described in the next section.
4.1 Asymmetric Binary Search
- The current algorithm is a fast, yet very general, BMP search engine.
- As can be seen in Figure 7, which shows the prefix-length distribution of a typical backbone router's forwarding table obtained from [Mer96], the entries are not equally distributed over the different prefix lengths.
- (While this improves the worst case, it harms the average performance, as the authors show later.)
- To build a useful asymmetrical tree, the authors can recursively split both the upper and lower part of the binary search tree's current node's search space, at a point selected by a heuristic weighting function.
- Two different weighting functions with different goals (one strictly picking the level covering most addresses, the other maximizing the entries while keeping the worst case bound) are shown in Figure 9, with coverage and average/worst case analysis for both weighting functions in Table 2.
4.2 Mutating Binary Search
- The authors further refine the basic binary search tree to change or mutate to more specialized binary trees each time they encounter a partial match in some hash table.
- The resulting histogram led the authors to propose asymmetrical binary search, which can improve average speed.
- Further information about prefix distributions can be extracted by dissecting the histogram.
- There is nothing magic about the 16 bit level, other than it being a good root for a binary search of 32 bit IPv4 addresses.
- In general, every match in the binary search with some marker X means that the authors need only search among the set of prefixes for which X is a prefix.
Structure of Hash Table Entry:
- Mutating Binary Search Example: Doing basic binary search for an IPv4 address whose BMP has length 21 requires checking the prefix lengths 16 (hit), 24 (miss), 20 (hit), 22 (miss), and finally 21, as illustrated in Figure 11.
- Each binary tree has the root level (i.e., the first length to be searched) at the left; the upper child of each binary tree node is the length to be searched on failure, and whenever there is a match, the search switches to the more specific tree.
- Two possible disadvantages of mutating binary search immediately present themselves.
- First, precomputing optimal trees can increase the time to insert a new prefix.
- The starting Rope corresponds to the default binary search tree.
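The rope-following search loop can be sketched as follows (the lengths, ropes, and table contents below are hypothetical, using short 8-bit addresses; a real IPv4 search would begin with the rope of the default binary search tree):

```python
# A "rope" is the sequence of prefix lengths to probe on successive
# failures; every hash-table entry carries its own precomputed rope
# covering the subtree of longer prefixes below it.

def rope_search(addr, start_rope, tables):
    best = None
    rope = list(start_rope)
    while rope:
        length = rope.pop(0)            # next length to try on failure
        entry = tables.get(length, {}).get(addr[:length])
        if entry:
            if entry.get("bmp") is not None:
                best = entry["bmp"]
            rope = list(entry["rope"])  # mutate: switch to the entry's rope
    return best
```

On every hit the search "mutates" to the matched entry's specialized rope, so only lengths that can actually extend the current match are ever probed again.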
4.3 Using Arrays
- In cases where program complexity and memory use can be traded for speed, it might be desirable to change the first hash table lookup to a simple indexed array lookup, with the index being formed from the first w0 bits of the address, with w0 being the prefix length at which the search would be started.
- Each array entry for index i will contain the bmp of i as well as a Rope which will guide binary search among all prefixes that begin with i.
- An initial array lookup is not only faster than a hash lookup, but also reduces the average number of remaining lookups (to around 0.5 on the data sets the authors have examined).
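A sketch of the array first stage (w0 = 3 here to keep the array tiny; a plausible IPv4 choice would be the 16-bit level mentioned earlier, and the entries below are hypothetical):

```python
# First lookup step as a direct array index over the first w0 address
# bits. Each slot stores the bmp for that w0-bit value and the rope
# guiding binary search among longer prefixes.

W0 = 3
array = [{"bmp": None, "rope": []} for _ in range(2 ** W0)]

# Suppose a prefix "10" -> "link-A" was inserted: it covers slots
# 0b100 and 0b101 of the array.
for idx in (0b100, 0b101):
    array[idx]["bmp"] = "link-A"

def first_stage(addr_bits):
    slot = array[int(addr_bits[:W0], 2)]
    return slot["bmp"], slot["rope"]
```

The indexed access replaces one hash computation, and the slot's precomputed bmp already covers every prefix no longer than w0 bits.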
4.4 Hardware Implementations
- The inner component, most likely done as a hash table in software implementations, can be implemented using hashing hardware such as described in [Dig95] .
- The outer loop in the Rope scheme can be implemented as a shift register.
- With multiple shift registers, it is possible to pipeline the searches, resulting in one completed routing lookup per hash lookup time.
5 Implementation
- Besides hashing and binary search, a predominant idea in this paper is precomputation.
- Every hash table entry has an associated bmp field and a Rope field, both of which are precomputed.
- Precomputation allows fast search but requires more complex Insertion routines.
- As mentioned earlier, while routes to prefixes may change frequently, the addition of a new prefix (the expensive case) is much rarer.
- Thus it is worth paying a penalty for Insertion in return for improved search speed.
5.2 Rope Search from Scratch
- Building a Rope Search data structure balanced for optimal search speed is more complex, since every possible binary search path needs to be optimized.
- Thus the authors have two passes: Pass 1 builds a conventional trie.
- Inserting from shortest to longest prefix has the nice property that all BMPs for the newly inserted markers are identical and thus only need to be calculated once.
- For typical IPv4 forwarding tables, about half of this maximum number is being used.
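Shortest-to-longest insertion presupposes the prefixes are sorted by length; as the paper notes, this costs only O(N) with a bucket sort over the W possible lengths. A minimal sketch:

```python
def sort_by_length(prefixes, W=32):
    # One bucket per possible prefix length: O(N + W) overall.
    buckets = [[] for _ in range(W + 1)]
    for p in prefixes:
        buckets[len(p)].append(p)
    return [p for bucket in buckets for p in bucket]
```
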
5.3 Insertions and Deletions
- Adding and removing single entries from the tree can also be done, but since no rebalancing occurs, the performance of the lookups might slowly degrade over time.
- Adding or deleting a single prefix can change the bmp values of a large number of markers, and thus insertion is potentially expensive in the worst case.
- Such solutions will have adequate throughput (because whenever the build process falls behind, the authors will batch more efficiently), but have poor latency.
- The authors are working on fast incremental insertion and deletion algorithms, but they do not describe them here for want of space.
6.2 Measurements for IPv4
- So far the authors have described how long their algorithm takes (in the average or worst case) in terms of the number of hash computations required.
- It remains to quantify the time taken for a computation on an arbitrary prefix length using software.
- The forwarding table was the same 33,000 entry forwarding table [Mer96] used before.
Basic Scheme
- Memory usage is close to 1.2 MByte; since the primary data structures (the most commonly accessed hash tables, for lengths 8, 16, and 24) fit mostly into the second-level cache, the first two steps (the average number needed) are very likely to be found in the cache.
- Later steps, seldom needed, will be noticeably slower.
Rope Search starting with Array Lookup
- This array fully fits into the cache, leaving ample space for the hash tables.
- The array lookup is much quicker, and fewer total lookups are needed than for the plain Rope scheme.
6.3 Projections for IP Version 6
- IPv6 address assignment principles have not yet been finalized.
- All these schemes help to reduce routing information.
- Another new feature of IPv6, Anycast addresses [HD96, DH96] , may (depending on how popular they will become) add a very large number of host routes and other routes with very long prefixes.
- Depending on the actual data, this may still be a win.
- All other optimizations are expected to yield similar improvements.
Frequently Asked Questions (11)
Q2. What are the contributions in "Scalable high speed ip routing lookups" ?
The authors also introduce Mutating Binary Search and other optimizations that, for a typical IPv4 backbone router with over 33,000 entries, considerably reduce the average number of hashes to less than 2, of which one hash can be simplified to an indexed array access.
Q3. How is it decided whether a rope or only a pointer to a rope is stored in a node?
To minimize storage in the forwarding database, a single bit can be used to decide whether the rope or only a pointer to a rope is stored in a node.
Q4. How many n bit prefixes can the authors draw?
For each possible n bit prefix, the authors could draw 2^n individual histograms with possibly fewer non-empty buckets, thus reducing the depth of the search tree.
Q5. How many bits of Rope is enough for a binary search?
For IPv6, 64 bits of Rope is more than sufficient, though it seems possible to get away with 32 bits of Rope in most practical cases.
Q6. How many entries can be found with a few different prefix lengths?
As long as only a few entries with even fewer distinct prefix lengths dominate the traffic characteristics, the solution can be found easily.
Q7. Do you think hardware lookup engines are needed?
The authors also do not believe that hardware lookup engines are required because their algorithm can be implemented in software and still perform well.
Q8. What did the CIDR system do to make better use of the class B addresses?
To make better use of this scarce resource, especially the class B addresses, bundles of class C networks were given out instead of class B addresses.
Q9. What is the best matching prefix problem?
The best matching prefix problem has been around for twenty years in theoretical computer science; to the best of their knowledge, the best theoretical algorithms are based on tries.
Q10. What are the main reasons why a CAM can be used for Internet address lookups?
Thus standard techniques for exact matching, such as perfect hashing, binary search, and standard Content Addressable Memories (CAMs), cannot directly be used for Internet address lookups.
Q11. How is the list of prefixes sorted?
For simplicity of implementation, the list of prefixes is assumed to be sorted by increasing prefix length in advance (O(N) using bucket sort).