Scalable high speed IP routing lookups
Summary
1 Introduction
- The Internet is becoming ubiquitous: everyone wants to join in.
- The increasing traffic demand requires three key factors to keep pace if the Internet is to continue to provide good service: link speeds, router data throughput, and packet forwarding rates.
- In their paper, the authors distinguish between routing (a process that computes a database mapping destination networks to output links) and forwarding (a process by which the routing database is consulted to decide which output link a single packet should be forwarded on).
- Instead of having multiple routing entries for each subnet in a large network, just two entries are needed: one for the big network, and a more specific one for the small subnet (which takes precedence if both match).
- Thus, for the current Internet protocol suite (IPv4) with 32 bit addresses, the authors need at most log2(32) = 5 hash lookups.
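The two-entry aggregation described above can be illustrated with a minimal sketch of longest-prefix matching (the prefixes and link names below are hypothetical, not from the paper):

```python
# Minimal best-matching-prefix illustration (hypothetical table).
# Prefixes are (bit-string, output-link) pairs; the longest matching
# prefix decides the forwarding link.

TABLE = {
    "1000": "link-A",      # the big network
    "10001100": "link-B",  # a more specific subnet inside it
}

def best_match(addr_bits):
    best = None
    for prefix, link in TABLE.items():
        if addr_bits.startswith(prefix):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, link)
    return best[1] if best else None

print(best_match("100011001111"))  # matches both; the subnet wins -> link-B
print(best_match("100000000000"))  # matches only the big network -> link-A
```

An address inside the subnet matches both entries, and the more specific one wins; all other addresses in the big network fall through to the aggregate entry.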
2 Existing Approaches to IP Lookup
- The authors discuss approaches based on modifying exact matching schemes, trie based schemes, hardware solutions based on parallelism, proposals for protocol changes to simplify IP lookup, and caching solutions.
- For the rest of this paper, the authors use BMP as a shorthand for Best Matching Prefix.
Modifications of Exact Matching Schemes
- Classical fast lookup techniques such as hashing and binary search do not directly apply to the best matching prefix (BMP) problem, since they only do exact matches.
- A modified binary search technique [Per92] requires log2(2N) steps, with N being the number of routing table entries.
- With current routing table sizes, the worst case would be 17 data lookups, each requiring at least one costly memory access.
- A second classical solution would be to reapply any exact match scheme for each possible prefix length [Skl93].
- This is even more expensive, requiring W iterations of the exact match scheme used (e.g., W = 128 for IPv6).
Trie Based Schemes
- The most commonly available IP lookup implementation is found in the BSD kernel, and is a radix trie implementation [Skl93].
- Current implementations have made a number of improvements on Sklower's original implementation.
- The worst case was improved to O(W) by requiring that the prefix be contiguous (previously non-contiguous masks were allowed, a feature which was never used).
- The implementation requires up to 32 or 128 costly memory accesses (for IPv4 or IPv6, respectively).
- Tries also can have large storage requirements.
Hardware Solutions
- Hardware solutions can potentially use parallelism to gain lookup speed.
- Large CAMs are usually slower and much more expensive than ordinary memory.
- In their basic form, both systems potentially require the boundary routers between autonomous systems (e.g., between a company and its ISP or between ISPs) to perform the full forwarding decision again, because of trust issues, scarce resources, or different views of the network.
- Thus while both tag switching and IP switching can provide good performance within a level of hierarchy, neither solution currently does well at hierarchy boundaries without scaling problems.
- For years, designers of fast routers have resorted to caching to claim high speed IP lookups.
Summary
- In summary, all existing schemes have problems of either performance, scalability, generality, or cost.
- The authors now describe a scheme that has good performance, excellent scalability, and does not require protocol changes.
- The authors' scheme also allows a cheap, fast software implementation, as well as a more expensive (but faster) hardware implementation.
3 Basic Binary Search Scheme
- The authors' basic algorithm is based on three significant ideas, of which only the first has been reported before.
- Rather than present the final solution directly, the authors will gradually refine these ideas in Section 3.1, Section 3.2, and Section 3.5 to arrive at a working basic scheme.
- The authors describe further optimizations to the basic scheme in the next section.
3.1 Linear Search of Hash Tables
- The authors' point of departure is a simple scheme that does linear search of hash tables organized by prefix lengths.
- The authors will improve this scheme shortly to do binary search on the hash tables.
- The idea is to look for all prefixes of a certain length L using hashing and use multiple hashes to find the best matching prefix, starting with the largest value of L and working backwards.
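A minimal sketch of this linear scheme (the table contents below are hypothetical; a real IPv4 table would have lengths up to 32):

```python
# One hash table (here: a Python dict) per prefix length; probe from
# the longest length down. The first hit is the best matching prefix.

tables = {
    8:  {"10000000": "link-A"},
    16: {"1000000011110000": "link-B"},
}

def linear_search(addr_bits):
    for length in sorted(tables, reverse=True):
        if addr_bits[:length] in tables[length]:
            return tables[length][addr_bits[:length]]
    return None
```

Starting at the longest length means the first match found is automatically the best one, at the cost of probing every distinct length in the worst case.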
3.2 Binary Search of Hash Tables
- The previous scheme essentially does (in the worst case) linear search among all distinct string lengths.
- Linear search requires O(W) expected time (more precisely, O(Wdist), where Wdist ≤ W is the number of distinct prefix lengths in the database).
- Markers are needed to direct binary search to look for matching prefixes of greater length.
- The authors will use upper half to mean the half of the trie with prefix lengths strictly less than the median length.
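Which lengths receive markers for a given prefix can be sketched as the probe path of binary search over the lengths 1..W: markers are needed only at the probed lengths shorter than the prefix, since only those must direct the search toward longer lengths. A small illustration, assuming (for simplicity) that all W lengths are populated:

```python
def probe_path(L, W=32):
    """Lengths that binary search over 1..W probes before reaching L."""
    lo, hi, probes = 1, W, []
    while lo <= hi:
        mid = (lo + hi) // 2
        if mid == L:
            break
        probes.append(mid)
        if mid < L:
            lo = mid + 1   # a hit here must steer the search to longer lengths
        else:
            hi = mid - 1
    return probes

def marker_lengths(L, W=32):
    # Markers are only needed at probed lengths shorter than L,
    # so each prefix adds at most log2(W) markers.
    return [p for p in probe_path(L, W) if p < L]
```

For L = 21 and W = 32 the probe path is [16, 24, 20, 22], so a length-21 prefix leaves markers only at lengths 16 and 20.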
3.3 Reducing Marker Storage
- The following definitions are useful before proceeding.
- In the typical case, many prefixes will share markers (Table 1), reducing the marker storage further.
- (Consider N prefixes whose first log2(N) bits are all distinct and whose remaining bits are all 1's.)
- Unfortunately, this algorithm is not correct as it stands and does not take logarithmic time if implemented naively.
- In case of failure, the authors would have to modify the binary search (for correctness) to backtrack and search the upper half of R again.
3.5 Precomputation to Avoid Backtracking
- Suppose every marker node M is a record that contains a variable M.bmp, which is the value of the best matching prefix of the marker M. M.bmp can be precomputed when the marker M is inserted into its hash table.
- Now, when the authors find M at the midpoint of R, they indeed search the lower half, but they also remember the value of M.bmp as the current best matching prefix.
- The standard invariant for binary search when searching for key K is: "K is in range R".
- Finally, the invariant implies the correct result when the range shrinks to 1.
- Thus the algorithm works correctly; also, since it has no backtracking, it takes O(log2 Wdist) time.
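A runnable sketch (not the authors' exact code) of the basic scheme: binary search over the sorted distinct prefix lengths, with markers left on the search path and each marker's best matching prefix (bmp) precomputed so that no backtracking is ever needed. The toy routing table in the test below is hypothetical.

```python
def build(prefixes):
    """prefixes: {bit-string prefix: next hop}. Returns (levels, tables)."""
    levels = sorted({len(p) for p in prefixes})
    tables = {l: {} for l in levels}

    def bmp_of(bits):
        # Precompute the best matching real prefix of a marker (naive O(W)).
        for l in range(len(bits), 0, -1):
            if bits[:l] in prefixes:
                return prefixes[bits[:l]]
        return None

    for p, hop in prefixes.items():
        tables[len(p)][p] = {"bmp": hop}  # a real prefix is its own bmp
        # Leave markers at the levels binary search probes on the way to p.
        lo, hi, target = 0, len(levels) - 1, levels.index(len(p))
        while lo <= hi:
            mid = (lo + hi) // 2
            if mid == target:
                break
            if mid < target:  # marker needed: it directs the search toward p
                m = p[:levels[mid]]
                tables[levels[mid]].setdefault(m, {"bmp": bmp_of(m)})
                lo = mid + 1
            else:
                hi = mid - 1
    return levels, tables

def lookup(addr, levels, tables):
    best, lo, hi = None, 0, len(levels) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        entry = tables[levels[mid]].get(addr[:levels[mid]])
        if entry:  # hit: remember the precomputed bmp, try longer lengths
            if entry["bmp"] is not None:
                best = entry["bmp"]
            lo = mid + 1
        else:      # miss: try shorter lengths
            hi = mid - 1
    return best
```

With prefixes {"11": "A", "1101": "C", "110000": "B"}, address "110011" first hits the marker "1100" (whose precomputed bmp is "A"), then misses at length 6, and correctly returns "A" with no backtracking.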
4 Refinements to Basic Scheme
- The basic scheme described in Section 3 takes just 7 hash computations, in the worst case, for 128 bit IPv6 addresses.
- Each hash computation takes at least one access to memory; at gigabit speeds each memory access is significant.
- Thus, in this section, the authors explore a series of optimizations that exploit the deeper structure inherent in the problem to reduce the average number of hash computations.
- The authors main optimization, mutating binary search, is described in the next section.
4.1 Asymmetric Binary Search
- The current algorithm is a fast, yet very general, BMP search engine.
- As can be seen in Figure 7, which shows the prefix-length distribution of a typical backbone router's forwarding table obtained from [Mer96], the entries are not equally distributed over the different prefix lengths.
- (While this improves the worst case, it harms the average performance, as the authors show later.)
- To build a useful asymmetrical tree, the authors can recursively split both the upper and lower part of the binary search tree's current node's search space, at a point selected by a heuristic weighting function.
- Two different weighting functions with different goals (one strictly picking the level covering most addresses, the other maximizing the entries while keeping the worst case bound) are shown in Figure 9, with coverage and average/worst case analysis for both weighting functions in Table 2.
4.2 Mutating Binary Search
- The authors further refine the basic binary search tree to change or mutate to more specialized binary trees each time they encounter a partial match in some hash table.
- The resulting histogram led the authors to propose asymmetrical binary search, which can improve average speed.
- Further information about prefix distributions can be extracted by dissecting the histogram.
- There is nothing magic about the 16 bit level, other than it being a good root for a binary search of 32 bit IPv4 addresses.
- In general, every match in the binary search with some marker X means that the authors need only search among the set of prefixes for which X is a prefix.
Structure of Hash Table Entry:
- Mutating Binary Search Example: Doing basic binary search for an IPv4 address whose BMP has length 21 requires checking the prefix lengths 16 (hit), 24 (miss), 20 (hit), 22 (miss), and finally 21, as illustrated in Figure 11.
- Each binary tree has the root level (i.e., the first length to be searched) at the left; the upper child of each binary tree node is the length to be searched on failure, and whenever there is a match, the search switches to the more specific tree.
- Two possible disadvantages of mutating binary search immediately present themselves.
- First, precomputing optimal trees can increase the time to insert a new prefix.
- The starting Rope corresponds to the default binary search tree.
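The rope-following search loop can be sketched as follows (the lengths, ropes, and table contents below are hypothetical, using short 8-bit addresses; a real IPv4 search would begin with the rope of the default binary search tree):

```python
# A "rope" is the sequence of prefix lengths to probe on successive
# failures; every hash-table entry carries its own precomputed rope
# covering the subtree of longer prefixes below it.

def rope_search(addr, start_rope, tables):
    best = None
    rope = list(start_rope)
    while rope:
        length = rope.pop(0)            # next length to try on failure
        entry = tables.get(length, {}).get(addr[:length])
        if entry:
            if entry.get("bmp") is not None:
                best = entry["bmp"]
            rope = list(entry["rope"])  # mutate: switch to the entry's rope
    return best
```

On every hit the search "mutates" to the matched entry's specialized rope, so only lengths that can actually extend the current match are ever probed again.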
4.3 Using Arrays
- In cases where program complexity and memory use can be traded for speed, it might be desirable to change the first hash table lookup to a simple indexed array lookup, with the index being formed from the first w0 bits of the address, with w0 being the prefix length at which the search would be started.
- Each array entry for index i will contain the bmp of i as well as a Rope which will guide binary search among all prefixes that begin with i.
- An initial array lookup is not only faster than a hash lookup, but also reduces the average number of remaining lookups (to around 0.5 on the data sets the authors have examined).
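A sketch of the array first stage (w0 = 3 here to keep the array tiny; a plausible IPv4 choice would be the 16-bit level mentioned earlier, and the entries below are hypothetical):

```python
# First lookup step as a direct array index over the first w0 address
# bits. Each slot stores the bmp for that w0-bit value and the rope
# guiding binary search among longer prefixes.

W0 = 3
array = [{"bmp": None, "rope": []} for _ in range(2 ** W0)]

# Suppose a prefix "10" -> "link-A" was inserted: it covers slots
# 0b100 and 0b101 of the array.
for idx in (0b100, 0b101):
    array[idx]["bmp"] = "link-A"

def first_stage(addr_bits):
    slot = array[int(addr_bits[:W0], 2)]
    return slot["bmp"], slot["rope"]
```

The indexed access replaces one hash computation, and the slot's precomputed bmp already covers every prefix no longer than w0 bits.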
4.4 Hardware Implementations
- The inner component, most likely done as a hash table in software implementations, can be implemented using hashing hardware such as described in [Dig95] .
- The outer loop in the Rope scheme can be implemented as a shift register.
- With multiple shift registers, it is possible to pipeline the searches, resulting in one completed routing lookup per hash lookup time.
5 Implementation
- Besides hashing and binary search, a predominant idea in this paper is precomputation.
- Every hash table entry has an associated bmp field and a Rope field, both of which are precomputed.
- Precomputation allows fast search but requires more complex Insertion routines.
- As mentioned earlier, while routes to prefixes may change frequently, the addition of a new prefix (the expensive case) is much rarer.
- Thus it is worth paying a penalty for Insertion in return for improved search speed.
5.2 Rope Search from Scratch
- Building a Rope Search data structure balanced for optimal search speed is more complex, since every possible binary search path needs to be optimized.
- Thus the authors have two passes: Pass 1 builds a conventional trie.
- Inserting from shortest to longest prefix has the nice property that all BMPs for the newly inserted markers are identical and thus only need to be calculated once.
- For typical IPv4 forwarding tables, about half of this maximum number is being used.
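Shortest-to-longest insertion presupposes the prefixes are sorted by length; as the paper notes, this costs only O(N) with a bucket sort over the W possible lengths. A minimal sketch:

```python
def sort_by_length(prefixes, W=32):
    # One bucket per possible prefix length: O(N + W) overall.
    buckets = [[] for _ in range(W + 1)]
    for p in prefixes:
        buckets[len(p)].append(p)
    return [p for bucket in buckets for p in bucket]
```
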
5.3 Insertions and Deletions
- Adding and removing single entries from the tree can also be done, but since no rebalancing occurs, the performance of the lookups might slowly degrade over time.
- Adding or deleting a single prefix can change the bmp values of a large number of markers, and thus insertion is potentially expensive in the worst case.
- Such solutions will have adequate throughput (because whenever the build process falls behind, the authors will batch more efficiently), but have poor latency.
- The authors are working on fast incremental insertion and deletion algorithms, but they do not describe them here for want of space.
6.2 Measurements for IPv4
- So far the authors have described how long their algorithm takes (in the average or worst case) in terms of the number of hash computations required.
- It remains to quantify the time taken for a computation on an arbitrary prefix length using software.
- The forwarding table was the same 33,000 entry forwarding table [Mer96] used before.
Basic Scheme
- Memory usage is close to 1.2 MByte; since the primary data structures (the most commonly accessed hash tables, for lengths 8, 16, and 24) fit mostly into the second-level cache, the first two steps (the average number needed) are very likely to be found in the cache.
- Later steps, seldom needed, will be noticeably slower.
Rope Search starting with Array Lookup
- This array fully fits into the cache, leaving ample space for the hash tables.
- The array lookup is much quicker, and fewer total lookups are needed than for the plain Rope scheme.
6.3 Projections for IP Version 6
- IPv6 address assignment principles have not yet been finalized.
- All these schemes help to reduce routing information.
- Another new feature of IPv6, Anycast addresses [HD96, DH96] , may (depending on how popular they will become) add a very large number of host routes and other routes with very long prefixes.
- Depending on the actual data, this may still be a win.
- All other optimizations are expected to yield similar improvements.
Frequently Asked Questions (11)
Q2. What are the contributions in "Scalable high speed ip routing lookups" ?
The authors also introduce Mutating Binary Search and other optimizations that, for a typical IPv4 backbone router with over 33,000 entries, considerably reduce the average number of hashes to less than 2, of which one hash can be simplified to an indexed array access.
Q3. How is it decided whether a rope or only a pointer to a rope is stored in a node?
To minimize storage in the forwarding database, a single bit can be used to decide whether the rope or only a pointer to a rope is stored in a node.
Q4. How many n bit prefixes can the authors draw?
For each possible n bit prefix, the authors could draw 2^n individual histograms with possibly fewer non-empty buckets, thus reducing the depth of the search tree.
Q5. How many bits of Rope is enough for a binary search?
For IPv6, 64 bits of Rope is more than sufficient, though it seems possible to get away with 32 bits of Rope in most practical cases.
Q6. How many entries can be found with a few different prefix lengths?
As long as only a few entries with even fewer distinct prefix lengths dominate the traffic characteristics, the solution can be found easily.
Q7. Do you think hardware lookup engines are needed?
The authors also do not believe that hardware lookup engines are required because their algorithm can be implemented in software and still perform well.
Q8. What did the CIDR system do to make better use of the class B addresses?
To make better use of this scarce resource, especially the class B addresses, bundles of class C networks were given out instead of class B addresses.
Q9. What is the best matching prefix problem?
The best matching prefix problem has been around for twenty years in theoretical computer science; to the best of their knowledge, the best theoretical algorithms are based on tries.
Q10. What are the main reasons why a CAM can be used for Internet address lookups?
Thus standard techniques for exact matching, such as perfect hashing, binary search, and standard Content Addressable Memories (CAMs), cannot directly be used for Internet address lookups.
Q11. How is the list of prefixes sorted?
For simplicity of implementation, the list of prefixes is assumed to be sorted by increasing prefix length in advance (O(N) using bucket sort).