Tapestry: a resilient global-scale overlay for service deployment
Summary
Introduction
- Keywords: overlay networks, peer-to-peer (P2P), service deployment, Tapestry.
- Tapestry virtualizes endpoint names; properly implemented, this virtualization enables message delivery to mobile or replicated endpoints in the presence of instability in the underlying infrastructure.
- Its architecture is modular, consisting of an extensible upcall facility wrapped around a simple, high-performance router.
- These results demonstrate Tapestry’s feasibility as a long running service on dynamic, failure-prone networks such as the wide-area Internet.
A. The DOLR Networking API
- Tapestry provides a datagram-like communications interface, with additional mechanisms for manipulating the locations of objects.
- Before describing the API, the authors start with a couple of definitions.
- Tapestry nodes participate in the overlay and are assigned nodeIDs uniformly at random from a large identifier space.
- More than one node may be hosted by one physical host.
- Calls such as PublishObject are best effort and receive no confirmation (a sketch of the four DOLR calls follows this list).
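As a rough illustration, the paper's four DOLR operations (PublishObject, UnpublishObject, RouteToObject, RouteToNode) might be expressed as the following Java interface; the placeholder types Guid, NodeId, and Message are assumptions made for readability, not part of the paper's specification.

```java
// Illustrative Java sketch of the DOLR API. The placeholder types below are
// assumptions; only the four operation names come from the paper.
final class Guid {}      // 160-bit object or application identifier
final class NodeId {}    // node identifier drawn from the same namespace
final class Message {}   // opaque application payload

interface Dolr {
    // Make the object named by objectId available on the local node.
    // Best effort: no confirmation is returned.
    void publishObject(Guid objectId, Guid applicationId);

    // Withdraw a previously published location mapping (also best effort).
    void unpublishObject(Guid objectId, Guid applicationId);

    // Route msg to a (typically nearby) node holding a replica of the object.
    void routeToObject(Guid objectId, Message msg);

    // Route msg to the named node; if exact is false, deliver to the unique
    // live root reached via surrogate routing.
    void routeToNode(NodeId nodeId, Message msg, boolean exact);
}
```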
B. Routing and Object Location
- Tapestry dynamically maps each identifier G to a unique live node, called the identifier's root or G_R.
- When routing toward G, messages are forwarded across neighbor links to nodes whose nodeIDs are progressively closer to G (i.e., matching successively longer prefixes of G) in the ID space.
- When a digit cannot be matched, Tapestry looks for a “close” digit in the routing table; the authors call this surrogate routing [1], where each non-existent ID is mapped to some live node with a similar ID (a next-hop sketch follows this list).
- To help provide resilience, the authors exploit network path diversity in the form of redundant routing paths.
- Each node also stores reverse references to other nodes that point at it.
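A minimal sketch of the per-hop decision, assuming a hexadecimal digit alphabet and a routing table indexed by prefix level and next digit; the class and helper names here are hypothetical, and the authoritative version is the NEXTHOP pseudocode in the paper's Figure 3.

```java
// Sketch of prefix routing with surrogate fallback. The table layout
// (levels x base) and digit encoding are assumptions for illustration.
public final class PrefixRouter {
    private static final int BASE = 16;     // hexadecimal digits
    private final String localId;           // e.g. "4F32...", 40 hex digits
    private final String[][] table;         // table[level][digit] -> neighbor ID, or null

    public PrefixRouter(String localId, String[][] table) {
        this.localId = localId;
        this.table = table;
    }

    /** Returns the next hop for dest, or localId if this node is the root. */
    public String nextHop(String dest) {
        int level = sharedPrefixLength(localId, dest);
        if (level == dest.length()) return localId;        // exact match
        int want = Character.digit(dest.charAt(level), BASE);
        // Surrogate routing: scan forward (mod BASE) for the closest live digit.
        for (int i = 0; i < BASE; i++) {
            String hop = table[level][(want + i) % BASE];
            if (hop != null) return hop;
        }
        return localId;  // no entry at this level: this node is the surrogate root
    }

    private static int sharedPrefixLength(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++)
            if (a.charAt(i) != b.charAt(i)) return i;
        return n;
    }
}
```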
C. Dynamic Node Algorithms
- Tapestry includes a number of mechanisms to maintain routing table consistency and ensure object availability.
- To insert a new node N, the surrogate S sends out an Acknowledged Multicast message that reaches the set of all existing nodes sharing the same prefix as N by traversing a tree based on their nodeIDs (see the sketch after this list).
- As nodes receive the message, they add N to their routing tables and transfer references of locally rooted pointers as necessary, completing items (a) and (b).
- Nodes contacted during the iterative algorithm use N to optimize their routing tables where applicable, completing item (d).
- Prior work has shown Tapestry’s viability as a resilient routing layer [31].
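The Acknowledged Multicast can be pictured as a recursive fan-out over the prefix tree in which a node acknowledges only after its entire subtree has. The sketch below models the network recursion with local threads purely for illustration; the Node interface is an assumption, not the paper's API.

```java
// Hypothetical sketch of Acknowledged Multicast: the message fans out along
// the prefix tree and ACKs collapse back up toward the sender.
import java.util.ArrayList;
import java.util.List;

public final class AckMulticast {
    interface Node {
        String id();
        List<Node> childrenExtendingPrefix(String prefix); // one per next digit
        void addToRoutingTable(String newNodeId);
    }

    /** Delivers newNodeId to every node sharing prefix; returns once all ACK. */
    static void multicast(Node here, String prefix, String newNodeId) {
        here.addToRoutingTable(newNodeId);                 // item (a)
        List<Thread> pending = new ArrayList<>();
        for (Node child : here.childrenExtendingPrefix(prefix)) {
            Thread t = new Thread(() ->
                multicast(child, prefix + child.id().charAt(prefix.length()), newNodeId));
            t.start();
            pending.add(t);
        }
        for (Thread t : pending) {                         // wait for subtree ACKs
            try { t.join(); } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        // Returning here corresponds to sending the ACK upstream.
    }
}
```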
A. Component Architecture
- Figure 6 illustrates the functional layering for a Tapestry node.
- At the bottom are the transport and neighbor link layers, which together provide a cross-node messaging layer.
- The neighbor link layer notifies higher layers whenever link properties change significantly.
- This layer also optimizes message processing by parsing the message headers and only deserializing the message contents when required (a sketch of this follows the list).
- Finally, node authentication and message authentication codes (MACs) can be integrated into this layer for additional security.
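One plausible shape for this header-first message handling, assuming Java serialization for the payload; the paper does not spell out this class, so the sketch only illustrates the stated optimization.

```java
// Sketch of lazy message deserialization in the neighbor link layer:
// headers are parsed eagerly, the payload only on demand.
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;

public final class LazyMessage {
    private final int typeTag;       // parsed from the fixed-size header
    private final byte[] rawBody;    // kept serialized until needed
    private Object body;             // decoded at most once

    public LazyMessage(int typeTag, byte[] rawBody) {
        this.typeTag = typeTag;
        this.rawBody = rawBody;
    }

    public int typeTag() { return typeTag; }  // enough for routing decisions

    /** Deserializes the payload only when a layer actually needs it. */
    public synchronized Object body() throws IOException, ClassNotFoundException {
        if (body == null) {
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(rawBody))) {
                body = in.readObject();
            }
        }
        return body;
    }
}
```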
B. Tapestry Upcall Interface
- While the DOLR API (Section III-A) provides a powerful applications interface, other functionality, such as multicast, requires greater control over the details of routing and object lookup; an upcall sketch follows this list.
- The authors follow their discussion of the Tapestry component architecture with a detailed look at the current implementation, choices made, and the rationale behind them.
- The Core Router utilizes the routing and object reference tables to handle application driven messages, including object publish, object location, and routing of messages to destination nodes.
- UDP alone, however, supports neither flow control nor congestion control, and can consume an unfair share of bandwidth, causing widespread congestion if used across the wide area.
- Virtualized node instances hosted in the same JVM can exchange messages in less than 10 microseconds, making any overlay network processing overhead and scheduling delay much more expensive in comparison.
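In rough Java terms, and reusing the placeholder types from the DOLR sketch above, the upcall interface the paper describes (deliver, forward, and route) might be expressed as follows; the exact signatures are illustrative assumptions.

```java
// Sketch of the router/application upcall interface. The deliver/forward/
// route names follow the paper; the Java signatures are assumptions.
interface TapestryUpcalls {
    // Invoked when a message arrives at its destination (root) node.
    void deliver(Guid target, Guid applicationId, Message msg);

    // Invoked at intermediate hops on upcall-enabled messages, letting the
    // application examine, duplicate, or redirect the message in flight.
    void forward(Guid target, Guid applicationId, Message msg);
}

interface TapestryRouter {
    // Called by a forward() handler to continue routing msg toward target,
    // optionally overriding the router's choice of next hop.
    void route(Guid target, Message msg, NodeId nextHop);
}
```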
D. Toward a Higher-Performance Implementation
- In Section V the authors show that their implementation can handle over 7,000 messages per second.
- A commercial-quality implementation could do much better.
- The simplest piece—computation of NEXTHOP as in Figure 3—is similar to functionality performed by hardware routers: fast table lookup.
- As a result, it is the second aspect of DOLR routing— fast pointer lookup—that presents the greatest challenge to high-throughput routing.
- Assuming that pointers (with all their information) are 100 bytes, the in-memory footprint of a Bloom filter can be two orders of magnitude smaller than the total size of the pointers (a Bloom-filter sketch follows this list).
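The arithmetic behind the claim: a Bloom filter needs on the order of 8 bits (1 byte) per key for a roughly 2% false-positive rate, versus ~100 bytes per stored pointer, i.e., about two orders of magnitude less memory. A minimal sketch over object GUIDs follows; the sizing constants and hash mixing are assumptions, not the paper's design.

```java
// Minimal Bloom filter sketch over object GUIDs. With ~8 bits per key the
// filter is ~100x smaller than 100-byte pointer records; a false positive
// only costs one wasted local pointer lookup.
import java.util.BitSet;

public final class PointerBloomFilter {
    private final BitSet bits;
    private final int size;       // number of bits (assumes expectedKeys > 0)
    private final int hashes;     // number of hash probes per key

    public PointerBloomFilter(int expectedKeys) {
        this.size = expectedKeys * 8;   // ~8 bits/key => ~2% false positives
        this.hashes = 5;
        this.bits = new BitSet(size);
    }

    public void add(byte[] guid) {
        for (int i = 0; i < hashes; i++)
            bits.set(probe(guid, i));
    }

    /** False means "definitely no pointer here"; true means "probably". */
    public boolean mightContain(byte[] guid) {
        for (int i = 0; i < hashes; i++)
            if (!bits.get(probe(guid, i))) return false;
        return true;
    }

    private int probe(byte[] guid, int i) {
        int h = i * 0x9E3779B9;              // simple seeded mix; not
        for (byte b : guid) h = h * 31 + b;  // cryptographic, just a sketch
        return Math.floorMod(h, size);
    }
}
```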
V. EVALUATION
- The authors evaluate their implementation of Tapestry using several platforms.
- The authors run micro-benchmarks on a local cluster, measure the large-scale performance of a deployed Tapestry on the PlanetLab global testbed, and use a local network simulation layer to support controlled, repeatable experiments with up to 1,000 Tapestry instances.
A. Evaluation Methodology
- All experiments used a Java Tapestry implementation (see Section IV-C) running in IBM’s JDK 1.3 with node virtualization (see Section V-C).
- The authors’ micro-benchmarks run on a local cluster of dual Pentium III 1 GHz servers (1.5 GB RAM) and Pentium IV 2.4 GHz servers (1 GB RAM).
- The authors run wide-area experiments on PlanetLab, a network testbed consisting of roughly 100 machines at institutions in North America, Europe, Asia, and Australia.
- Finally, in instances where the authors need large-scale, repeatable and controlled experiments, they perform experiments using the Simple OceanStore Simulator (SOSS) [34].
- SOSS is an event-driven network layer that simulates network time with queues driven by a single local clock.
B. Performance in a Stable Network
- The authors first examine Tapestry performance under stable or static network conditions.
- A raw estimate of processor speed (as reported by the bogomips metric under Linux) shows the P-IV to be 2.3 times faster.
- The gap between this and the estimate the authors get by inverting the per-message routing latency can be attributed to scheduling and queuing delays in the asynchronous I/O layer.
- The authors also measure routing-to-object RDP as the ratio of the one-way Tapestry route-to-object latency to the one-way network latency (1/2 ping time); a sketch of the RDP computation follows this list.
- High variance indicates some client/server combinations will consistently see non-ideal performance and tends to limit the advantages that clients gain through careful object placement.
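RDP (relative delay penalty, or stretch) is simply the overlay latency divided by the direct network latency for the same node pair. A hypothetical helper for the median over a set of measured pairs:

```java
// Hypothetical helper: compute the median RDP (stretch), i.e. overlay
// latency divided by direct network latency over the same src/dst pairs.
import java.util.Arrays;

public final class Rdp {
    /** overlayMs and pingMs are paired latencies per src/dst pair. */
    static double medianRdp(double[] overlayMs, double[] pingMs) {
        double[] rdp = new double[overlayMs.length];
        for (int i = 0; i < rdp.length; i++)
            rdp[i] = overlayMs[i] / pingMs[i];   // >= 1 in the ideal case
        Arrays.sort(rdp);
        int n = rdp.length;
        return (n % 2 == 1) ? rdp[n / 2] : (rdp[n / 2 - 1] + rdp[n / 2]) / 2.0;
    }
}
```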
C. Convergence Under Network Dynamics
- Here, the authors analyze Tapestry’s scalability and stability under dynamic conditions.
- Figure 17 shows that the total bandwidth for a single node insertion scales logarithmically with the network size.
- Figures 19 and 20 demonstrate the ability of Tapestry to recover after massive changes in the overlay network membership.
- For churn tests, the authors measure the success rate of requests on a set of stable nodes while constantly churning a set of dynamic nodes, using insertion and failure rates driven by probability distributions (see the churn sketch after this list).
- Finally, the authors measure the success rate of routing to nodes under different network changes on the PlanetLab testbed.
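One plausible realization of such a probability-driven churn driver, assuming exponentially distributed inter-event times; the distribution choice and the mean values are placeholders, not the paper's parameters.

```java
// Sketch of a churn driver: dynamic nodes join and fail with exponentially
// distributed inter-event times (an assumption for illustration).
import java.util.Random;

public final class ChurnDriver {
    private final Random rng = new Random();

    /** Draws an exponential inter-arrival time with the given mean (seconds). */
    double nextDelay(double meanSeconds) {
        return -meanSeconds * Math.log(1.0 - rng.nextDouble());
    }

    void run(int events, double meanJoinSecs, double meanLifeSecs)
            throws InterruptedException {
        for (int i = 0; i < events; i++) {
            Thread.sleep((long) (nextDelay(meanJoinSecs) * 1000));
            // Join one dynamic node and schedule its failure after an
            // exponentially distributed lifetime.
            double life = nextDelay(meanLifeSecs);
            System.out.printf("join node %d now; fail it after %.1f s%n", i, life);
        }
    }
}
```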
VI. DEPLOYING APPLICATIONS WITH TAPESTRY
- In previous sections, the authors explored the implementation and behavior of Tapestry.
- These applications share new challenges in the wide area: users will find it more difficult to locate nearby resources as the network grows in size, and dependence on more distributed components means a smaller mean time between failures (MTBF) for the system.
- Tapestry scales logarithmically with network size in both per-node routing state and the expected number of overlay hops in a path.
- Applications can achieve additional resilience by replicating data across multiple servers, and relying on Tapestry to direct client requests to nearby replicas.
- OceanStore [4] is a global-scale, highly available storage utility deployed on the PlanetLab testbed.
VII. CONCLUSION
- The authors described Tapestry, an overlay routing network for rapid deployment of new distributed applications and services.
- The authors presented the architecture of Tapestry nodes, highlighting mechanisms for low-overhead routing and dynamic state repair, and showed how these mechanisms can be enhanced through an extensible API.
- The median RDP or stretch starts around a factor of three for nearby nodes and rapidly approaches one: routing is efficient.
- Further, the median RDP for object location is below a factor of two in the wide area.
- Overall, the authors believe that wide-scale Tapestry deployment could be practical, efficient, and useful to a variety of applications.
Frequently Asked Questions (17)
Q2. What is the reason for the inflated RDP?
The use of multiple Tapestry instances per machine means that tests under heavy load will produce scheduling delays between instances, resulting in an inflated RDP for short latency paths.
Q3. How does Tapestry improve object availability and routing in a dynamic network?
Tapestry improves object availability and routing in such an environment by building redundancy into routing tables and object location references (e.g., the backup forwarding pointers for each routing table entry).
Q4. How does Tapestry handle the distribution of nodeIDs and GUIDs?
Tapestry assumes nodeIDs and GUIDs are roughly evenly distributed in the namespace, which can be achieved by using a secure hashing algorithm like SHA-1 [29].
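For illustration, an identifier can be derived in Java with the standard SHA-1 digest; the 40-hex-digit output matches Tapestry's 160-bit namespace, and the helper name here is hypothetical.

```java
// Deriving a 160-bit Tapestry identifier with SHA-1, as the paper suggests
// for keeping nodeIDs and GUIDs roughly uniform in the namespace.
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class Ids {
    /** Hashes an arbitrary name (e.g. "host:port" or a file name) to a GUID. */
    static String toGuid(String name) throws NoSuchAlgorithmException {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] digest = sha1.digest(name.getBytes(StandardCharsets.UTF_8));
        return String.format("%040x", new BigInteger(1, digest)); // 40 hex digits
    }
}
```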
Q5. How long does the second churn increase the dynamic rates of insertion and failure?
The second churn period increases the dynamic rates of insertion and failure, using 10 seconds and 2 minutes as the respective parameters.
Q6. What is the routing to objects test?
The routing to objects test sends messages to previously published objects, located at servers which were guaranteed to stay alive in the network.
Q7. How do the authors compute the RDP for node routing?
The authors compute the RDP for node routing by measuring all-pairs round-trip routing latencies between the 400 Tapestry instances, and dividing each by the corresponding ping round-trip time.
Q8. What are examples of applications that leverage common resources across the network?
Examples include application level multicast, global-scale storage systems, and traffic redirection layers for resiliency or security.
Q9. What is the cost of copying data relative to the message size?
For messages larger than 2 KB, the cost of copying data (memory buffer to network layer) dominates, and processing time becomes linear relative to the message size.
Q10. What is the behavior of the neighbor link layer?
For instance, in response to changing link latencies, the neighbor link layer may reorder the preferences assigned to neighbors occupying the same entry in the routing table.
Q11. What is the name of the digits that are not matched?
When a digit cannot be matched, Tapestry looks for a “close” digit in the routing table; the authors call this surrogate routing [1], where each non-existent ID is mapped to some live node with a similar ID.
Q12. How does the test measure the effects of multiple nodes simultaneously entering the Tapestry?
Next, the authors measure the effects of multiple nodes simultaneously entering the Tapestry by examining the convergence time for parallel insertions.
Q13. What is the paradigm for asynchronous I/O?
This paradigm requires an asynchronous I/O layer as well as an efficient model for internal communication and control between components.
Q14. How many JVMs does Tapestry require to be run?
Note that the additional number of JVMs increases scheduling delays, resulting in request timeouts as the size of the network (and virtualization) increases.
Q15. How does the graph scale with the network size?
For small networks where each node knows most of the network, nodes touched by insertion (and the corresponding bandwidth) will likely scale linearly with network size.
Q16. How does a server advertise or publish an object?
A server S, storing an object O (with GUID O_G and root O_R), periodically advertises or publishes this object by routing a publish message toward O_R (see Figure 4).
Q17. What is the first time a higher layer wishes to communicate with another node?
The first time a higher layer wishes to communicate with another node, it must provide the destination’s physical address (e.g., IP address and port number).