
Tapestry: a resilient global-scale overlay for service deployment

TL;DR: Experimental results show that Tapestry exhibits stable behavior and performance as an overlay, despite the instability of the underlying network layers, illustrating its utility as a deployment infrastructure.
Abstract: We present Tapestry, a peer-to-peer overlay routing infrastructure offering efficient, scalable, location-independent routing of messages directly to nearby copies of an object or service using only localized resources. Tapestry supports a generic decentralized object location and routing applications programming interface using a self-repairing, soft-state-based routing layer. The paper presents the Tapestry architecture, algorithms, and implementation. It explores the behavior of a Tapestry deployment on PlanetLab, a global testbed of approximately 100 machines. Experimental results show that Tapestry exhibits stable behavior and performance as an overlay, despite the instability of the underlying network layers. Several widely distributed applications have been implemented on Tapestry, illustrating its utility as a deployment infrastructure.

Summary (4 min read)

Introduction

  • Overlay networks, peer-to-peer (P2P), service deployment, Tapestry.
  • Properly implemented, this virtualization enables message delivery to mobile or replicated endpoints in the presence of instability in the underlying infrastructure.
  • Its architecture is modular, consisting of an extensible upcall facility wrapped around a simple, high-performance router.
  • These results demonstrate Tapestry’s feasibility as a long running service on dynamic, failure-prone networks such as the wide-area Internet.

A. The DOLR Networking API

  • Tapestry provides a datagram-like communications interface, with additional mechanisms for manipulating the locations of objects.
  • Before describing the API, the authors start with a couple of definitions.
  • Tapestry nodes participate in the overlay and are assigned nodeIDs uniformly at random from a large identifier space.
  • More than one node may be hosted by one physical host.
  • This call is best effort, and receives no confirmation.

B. Routing and Object Location

  • Tapestry dynamically maps each identifier G to a unique live node, called the identifier's root, G_R.
  • When routing toward G, messages are forwarded across neighbor links to nodes whose nodeIDs are progressively closer (i.e., matching larger prefixes) to G in the ID space.
  • When a digit cannot be matched, Tapestry looks for a “close” digit in the routing table; the authors call this surrogate routing [1], where each non-existent ID is mapped to some live node with a similar ID.
  • To help provide resilience, the authors exploit network path diversity in the form of redundant routing paths.
  • Each node also stores reverse references to other nodes that point at it.

C. Dynamic Node Algorithms

  • Tapestry includes a number of mechanisms to maintain routing table consistency and ensure object availability.
  • S sends out an Acknowledged Multicast message that reaches the set of all existing nodes sharing the same prefix by traversing a tree based on their nodeIDs.
  • As nodes receive the message, they add N to their routing tables and transfer references of locally rooted pointers as necessary, completing items (a) and (b).
  • Nodes contacted during the iterative algorithm use N to optimize their routing tables where applicable, completing item (d). Ongoing work has shown Tapestry's viability as a resilient routing layer [31].

A. Component Architecture

  • Figure 6 illustrates the functional layering for a Tapestry node.
  • At the bottom are the transport and neighbor link layers, which together provide a cross-node messaging layer.
  • The neighbor link layer notifies higher layers whenever link properties change significantly.
  • This layer also optimizes message processing by parsing the message headers and only deserializing the message contents when required.
  • Finally, node authentication and message authentication codes (MACs) can be integrated into this layer for additional security.

B. Tapestry Upcall Interface

  • While the DOLR API (Section III-A) provides a powerful applications interface, other functionality, such as multicast, requires greater control over the details of routing and object lookup.
  • The authors follow their discussion of the Tapestry component architecture with a detailed look at the current implementation, choices made, and the rationale behind them.
  • The Core Router utilizes the routing and object reference tables to handle application driven messages, including object publish, object location, and routing of messages to destination nodes.
  • UDP alone, however, does not support flow control or congestion control, and can consume an unfair share of bandwidth, causing widespread congestion if used across the wide area.
  • These node instances can exchange messages in less than 10 microseconds, making any overlay network processing overhead and scheduling delay much more expensive in comparison.

D. Toward a Higher-Performance Implementation

  • In Section V the authors show that their implementation can handle over 7,000 messages per second.
  • A commercial-quality implementation could do much better.
  • The simplest piece—computation of NEXTHOP as in Figure 3—is similar to functionality performed by hardware routers: fast table lookup.
  • As a result, it is the second aspect of DOLR routing— fast pointer lookup—that presents the greatest challenge to high-throughput routing.
  • Assuming that pointers (with all their information) are 100 bytes, the in-memory footprint of a Bloom filter can be two orders of magnitude smaller than the total size of the pointers.
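The footprint savings described above can be illustrated with a Bloom filter over object GUIDs. The sketch below is illustrative only (not Tapestry's implementation, and the SHA-1-based hashing scheme is an assumption of this sketch): it answers "might this node hold a pointer for this GUID?" with no false negatives, using roughly one byte per pointer instead of ~100.

```python
import hashlib

class BloomFilter:
    """Illustrative Bloom filter (not Tapestry's code) for asking
    whether a node *might* hold a location pointer for a GUID before
    paying for a full object-pointer lookup."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a big int used as a bit array

    def _positions(self, key):
        # Derive num_hashes independent bit positions from SHA-1.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def may_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all(self.bits >> pos & 1 for pos in self._positions(key))

bf = BloomFilter()
bf.add("4378")
assert bf.may_contain("4378")  # Bloom filters have no false negatives
```

A negative answer lets the router skip the pointer database entirely; a (rare) false positive only costs one wasted lookup.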

V. EVALUATION

  • The authors evaluate their implementation of Tapestry using several platforms.
  • The authors run micro-benchmarks on a local cluster, measure the large scale performance of a deployed Tapestry on the PlanetLab global testbed, and make use of a local network simulation layer to support controlled, repeatable experiments with up to 1,000 Tapestry instances.

A. Evaluation Methodology

  • All experiments used a Java Tapestry implementation (see Section IV-C) running in IBM’s JDK 1.3 with node virtualization (see Section V-C).
  • The authors micro-benchmarks are run on local cluster machines of dual Pentium III 1GHz servers (1.5 GByte RAM) and Pentium IV 2.4GHz servers (1 GByte RAM).
  • The authors run wide-area experiments on PlanetLab, a network testbed consisting of roughly 100 machines at institutions in North America, Europe, Asia, and Australia.
  • Finally, in instances where the authors need large-scale, repeatable and controlled experiments, they perform experiments using the Simple OceanStore Simulator (SOSS) [34].
  • SOSS is an event-driven network layer that simulates network time with queues driven by a single local clock.

B. Performance in a Stable Network

  • The authors first examine Tapestry performance under stable or static network conditions.
  • A raw estimate of processor speed (as reported by the bogomips metric under Linux) shows the P-IV to be 2.3 times faster.
  • The gap between this and the estimate the authors get from calculating the inverse of the per-message routing latency can be attributed to scheduling and queuing delays from the asynchronous I/O layer.
  • The authors also measure routing-to-object RDP as a ratio of one-way Tapestry route-to-object latency versus the one-way network latency (ping time).
  • High variance indicates some client/server combinations will consistently see non-ideal performance and tends to limit the advantages that clients gain through careful object placement.

C. Convergence Under Network Dynamics

  • Here, the authors analyze Tapestry’s scalability and stability under dynamic conditions.
  • Figure 17 shows that the total bandwidth for a single node insertion scales logarithmically with the network size.
  • Figures 19 and 20 demonstrate the ability of Tapestry to recover after massive changes in the overlay network membership.
  • For churn tests, the authors measure the success rate of requests on a set of stable nodes while constantly churning a set of dynamic nodes, using insertion and failure rates driven by probability distributions.
  • Finally, the authors measure the success rate of routing to nodes under different network changes on the PlanetLab testbed.

VI. DEPLOYING APPLICATIONS WITH TAPESTRY

  • In previous sections, the authors explored the implementation and behavior of Tapestry.
  • These applications share new challenges in the wide-area: users will find it more difficult to locate nearby resources as the network grows in size, and dependence on more distributed components means a smaller mean time between failures (MTBF) for the system.
  • It also scales logarithmically with the network size in both per-node routing state and expected number of overlay hops in a path.
  • Applications can achieve additional resilience by replicating data across multiple servers, and relying on Tapestry to direct client requests to nearby replicas.
  • OceanStore [4] is a global-scale, highly available storage utility deployed on the PlanetLab testbed.

VII. CONCLUSION

  • The authors described Tapestry, an overlay routing network for rapid deployment of new distributed applications and services.
  • The authors presented the architecture of Tapestry nodes, highlighting mechanisms for lowoverhead routing and dynamic state repair, and showed how these mechanisms can be enhanced through an extensible API.
  • The median RDP or stretch starts around a factor of three for nearby nodes and rapidly approaches one, showing that routing is efficient.
  • Further, the median RDP for object location is below a factor of two in the wide-area.
  • Overall, the authors believe that wide-scale Tapestry deployment could be practical, efficient, and useful to a variety of applications.


IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 22, NO. 1, JANUARY 2004 1
Tapestry: A Resilient Global-scale Overlay for Service Deployment
Ben Y. Zhao, Ling Huang, Jeremy Stribling, Sean C. Rhea, Anthony D. Joseph, Member, IEEE, and John D. Kubiatowicz, Member, IEEE
Abstract—We present Tapestry, a peer-to-peer overlay routing infrastructure offering efficient, scalable, location-independent routing of messages directly to nearby copies of an object or service using only localized resources. Tapestry supports a generic Decentralized Object Location and Routing (DOLR) API using a self-repairing, soft-state based routing layer. This paper presents the Tapestry architecture, algorithms, and implementation. It explores the behavior of a Tapestry deployment on PlanetLab, a global testbed of approximately 100 machines. Experimental results show that Tapestry exhibits stable behavior and performance as an overlay, despite the instability of the underlying network layers. Several widely-distributed applications have been implemented on Tapestry, illustrating its utility as a deployment infrastructure.
Index Terms—Overlay networks, peer-to-peer (P2P), service deployment, Tapestry.
I. INTRODUCTION

Internet developers are constantly proposing new and visionary distributed applications. These new applications have a variety of requirements for availability, durability, and performance. One technique for achieving these properties is to adapt to failures or changes in load through migration and replication of data and services. Unfortunately, the ability to place replicas or the frequency with which they may be moved is limited by the underlying infrastructure. The traditional way to deploy new applications is to adapt them somehow to existing infrastructures (often an imperfect match) or to standardize new Internet protocols (encountering significant inertia to deployment). A flexible but standardized substrate on which to develop new applications is needed.

In this article, we present Tapestry [1], [2], an extensible infrastructure that provides Decentralized Object Location and Routing (DOLR) [3]. The DOLR interface focuses on routing of messages to endpoints such as nodes or object replicas. DOLR virtualizes resources, since endpoints are named by opaque identifiers encoding nothing about physical location. Properly implemented, this virtualization enables message delivery to mobile or replicated endpoints in the presence of instability in the underlying infrastructure. As a result, a DOLR network provides a simple platform on which to implement distributed applications—developers can ignore the dynamics of the network except as an optimization. Already, Tapestry has enabled the deployment of global-scale storage applications such as OceanStore [4] and multicast distribution systems such as Bayeux [5]; we return to this in Section VI.

Tapestry is a peer-to-peer overlay network that provides high-performance, scalable, and location-independent routing of messages to close-by endpoints, using only localized resources. The focus on routing brings with it a desire for efficiency: minimizing message latency and maximizing message throughput. Thus, for instance, Tapestry exploits locality in routing messages to mobile endpoints such as object replicas; this behavior is in contrast to other structured peer-to-peer overlay networks [6]–[11].

Tapestry uses adaptive algorithms with soft state to maintain fault tolerance in the face of changing node membership and network faults. Its architecture is modular, consisting of an extensible upcall facility wrapped around a simple, high-performance router. This Applications Programming Interface (API) enables developers to develop and extend overlay functionality when the basic DOLR functionality is insufficient.

This paper was supported in part by the National Science Foundation (NSF) under Career Award #ANI-9985129 and Career Award #ANI-9985250, in part by the NSF Information Technology Research (ITR) under Award 5710001344, in part by the California Micro Fund under Award 02-032 and Award 02-035, and in part by grants from IBM and Sprint. B. Y. Zhao, L. Huang, S. C. Rhea, A. D. Joseph, and J. D. Kubiatowicz are with the University of California, Berkeley, CA 94720 USA (e-mail: {ravenben, hling, srhea, adj, kubitron}@eecs.berkeley.edu). J. Stribling is with the Massachusetts Institute of Technology, Cambridge, MA 02139 USA (e-mail: strib@mit.edu).
In the following pages, we describe a Java-based implementation of Tapestry, and present both micro- and macro-benchmarks from an actual, deployed system. During normal operation, the relative delay penalty (RDP)¹ to locate mobile endpoints is two or less in the wide area. Simulations show that Tapestry operations succeed nearly 100% of the time under both constant network changes and massive failures or joins, with small periods of degraded performance during self-repair. These results demonstrate Tapestry's feasibility as a long-running service on dynamic, failure-prone networks such as the wide-area Internet.

The following section discusses related work. Then, Tapestry's core algorithms appear in Section III, with details of the architecture and implementation in Section IV. Section V evaluates Tapestry's performance. We then discuss the use of Tapestry as an application infrastructure in Section VI and conclude with Section VII.

¹ RDP, or stretch, is the ratio between the distance traveled by a message to an endpoint and the minimal distance from the source to that endpoint.
II. RELATED WORK

The first generation of peer-to-peer (P2P) systems included file-sharing and storage applications: Napster, Gnutella, MojoNation, and Freenet. Napster uses central directory servers to locate files. Gnutella provides a similar, but distributed service using scoped broadcast queries, limiting scalability. MojoNation [12] uses an online economic model to encourage sharing of resources. Freenet [13] is a file-sharing network designed to resist censorship. Neither Gnutella nor Freenet guarantees that files can be located—even in a functioning network.

The second generation of peer-to-peer systems are structured peer-to-peer overlay networks, including Tapestry [1], [2], Chord [8], Pastry [7], and CAN [6]. These overlays implement a basic Key-Based Routing (KBR) interface that supports deterministic routing of messages to a live node that has responsibility for the destination key. They can also support higher-level interfaces such as a distributed hash table (DHT) or a decentralized object location and routing (DOLR) layer [3]. These systems scale well, and guarantee that queries find existing objects under non-failure conditions.

One differentiating property between these systems is that neither CAN nor Chord takes network distances into account when constructing their routing overlay; thus, a given overlay hop may span the diameter of the network. Both protocols route on the shortest overlay hops available, and use runtime heuristics to assist. In contrast, Tapestry and Pastry construct locally optimal routing tables from initialization, and maintain them in order to reduce routing stretch.

While some systems fix the number and location of object replicas by providing a distributed hash table (DHT) interface, Tapestry allows applications to place objects according to their needs. Tapestry "publishes" location pointers throughout the network to facilitate efficient routing to those objects with low network stretch. This technique makes Tapestry locality-aware [14]: queries for nearby objects are generally satisfied in time proportional to the distance between the query source and a nearby object replica.

Both Pastry and Tapestry share similarities with the work of Plaxton, Rajaraman, and Richa [15] for a static network. Others [16], [17] explore distributed object location schemes with provably low search overhead, but they require precomputation, and so are not suitable for dynamic networks. Recent works include systems such as Kademlia [9], which uses XOR for overlay routing, and Viceroy [10], which provides logarithmic hops through nodes with constant-degree routing tables. SkipNet [11] uses a multi-dimensional skip-list data structure to support overlay routing, maintaining both a DNS-based namespace for operational locality and a randomized namespace for network locality. Other overlay proposals [18], [19] attain lower bounds on local routing state. Finally, proposals such as Brocade [20] differentiate between local and inter-domain routing to reduce wide-area traffic and routing latency.

A new generation of applications has been proposed on top of these P2P systems, validating them as novel application infrastructures. Several systems provide application-level multicast: CAN-MC [21] (CAN), Scribe [22] (Pastry), and Bayeux [5] (Tapestry). In addition, several decentralized file systems have been proposed: CFS [23] (Chord), Mnemosyne [24] (Chord, Tapestry), OceanStore [4] (Tapestry), and PAST [25] (Pastry). Structured P2P overlays also support novel applications (e.g., attack-resistant networks [26], network indirection layers [27], and similarity searching [28]).
III. TAPESTRY ALGORITHMS

This section details Tapestry's algorithms for routing and object location, and describes how network integrity is maintained under dynamic network conditions.

A. The DOLR Networking API

Tapestry provides a datagram-like communications interface, with additional mechanisms for manipulating the locations of objects. Before describing the API, we start with a couple of definitions.

Tapestry nodes participate in the overlay and are assigned nodeIDs uniformly at random from a large identifier space. More than one node may be hosted by one physical host. Application-specific endpoints are assigned Globally Unique IDentifiers (GUIDs), selected from the same identifier space. Tapestry currently uses an identifier space of 160-bit values with a globally defined radix (e.g., hexadecimal, yielding 40-digit identifiers). Tapestry assumes nodeIDs and GUIDs are roughly evenly distributed in the namespace, which can be achieved by using a secure hashing algorithm like SHA-1 [29]. We say that node N has nodeID N_id, and an object O has GUID O_G.

Since the efficiency of Tapestry generally improves with network size, it is advantageous for multiple applications to share a single large Tapestry overlay network.

Fig. 1. Tapestry routing mesh from the perspective of a single node. Outgoing neighbor links point to nodes with a common matching prefix. Higher-level entries match more digits. Together, these links form the local routing table.
Fig. 2. Path of a message. The path taken by a message originating from node 5230 destined for node 42AD in a Tapestry mesh.
To enable application coexistence, every message contains an application-specific identifier, A_id, which is used to select a process, or application, for message delivery at the destination (similar to the role of a port in TCP/IP), or an upcall handler where appropriate.
Given the above definitions, we state the four-part DOLR networking API as follows:

1) PUBLISHOBJECT(O_G, A_id): Publish, or make available, object O on the local node. This call is best effort, and receives no confirmation.
2) UNPUBLISHOBJECT(O_G, A_id): Best-effort attempt to remove location mappings for O.
3) ROUTETOOBJECT(O_G, A_id): Routes message to location of an object with GUID O_G.
4) ROUTETONODE(N, A_id, Exact): Route message to application A_id on node N. "Exact" specifies whether the destination ID needs to be matched exactly to deliver the payload.
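As a rough illustration of the four-part API, the calls can be rendered as a Python interface. The class and method names below are a hypothetical mapping of the paper's calls, not actual Tapestry code; the `msg` parameter is an assumption standing in for the routed payload.

```python
class DOLRNode:
    """Hypothetical sketch of the four-part DOLR API from Section III-A.
    Bodies are stubs; only the shape of the interface is illustrated."""

    def publish_object(self, object_guid: str, app_id: str) -> None:
        """PUBLISHOBJECT(O_G, A_id): best effort, no confirmation."""

    def unpublish_object(self, object_guid: str, app_id: str) -> None:
        """UNPUBLISHOBJECT(O_G, A_id): best-effort removal of mappings."""

    def route_to_object(self, object_guid: str, app_id: str, msg: bytes) -> None:
        """ROUTETOOBJECT(O_G, A_id): deliver msg to some replica of O."""

    def route_to_node(self, node_id: str, app_id: str, exact: bool, msg: bytes) -> None:
        """ROUTETONODE(N, A_id, Exact): deliver msg to app_id on N;
        `exact` controls whether the destination ID must match exactly."""
```

Note that all four calls are asynchronous and return nothing: delivery and publication are best effort, matching the datagram-like semantics described above.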
B. Routing and Object Location

Tapestry dynamically maps each identifier G to a unique live node, called the identifier's root or G_R. If a node exists with N_id = G, then this node is the root of G. To deliver messages, each node maintains a routing table consisting of nodeIDs and IP addresses of the nodes with which it communicates. We refer to these nodes as neighbors of the local node. When routing toward G, messages are forwarded across neighbor links to nodes whose nodeIDs are progressively closer (i.e., matching larger prefixes) to G in the ID space.

NEXTHOP(n, G):
1   if n = MAXHOP then
2     return self
3   else
4     d ← G_n; e ← R_{n,d}
5     while e = nil do
6       d ← (d + 1) mod β
7       e ← R_{n,d}
8     endwhile
9     if e = self then
10      return NEXTHOP(n + 1, G)
11    else
12      return e
13    endif
14  endif

Fig. 3. Pseudocode for NEXTHOP() (here R_{n,d} denotes the level-n routing-table entry for digit d, and β the identifier base). This function locates the next hop towards the root given the previous hop number, n, and the destination GUID, G. Returns next hop or self if local node is the root.
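The digit-scanning logic of the NEXTHOP function above can be sketched in Python. The table representation is an assumption of this sketch: each level is a β-entry list, empty slots are `None`, and slots for the local node's own digits hold the node itself (which is what lets the surrogate-routing scan terminate).

```python
BETA = 16      # identifier base (hexadecimal digits)
MAX_HOP = 40   # 160-bit IDs at base 16 => 40 digits

def next_hop(table, self_id, n, guid):
    """Sketch of Fig. 3's NEXTHOP. `table[n][d]` holds the level-n
    routing-table entry for digit d (None marks an empty slot); slots
    for the local node's own ID digits are assumed to hold self_id,
    which guarantees the surrogate-routing scan terminates."""
    if n == MAX_HOP:
        return self_id
    d = int(guid[n], BETA)
    e = table[n][d]
    while e is None:                  # surrogate routing: scan for the
        d = (d + 1) % BETA            # next "close" digit with an entry
        e = table[n][d]
    if e == self_id:
        return next_hop(table, self_id, n + 1, guid)
    return e

# Toy table for node 4227 routing a message toward 42AD:
table = [[None] * BETA for _ in range(MAX_HOP)]
table[0][4] = "4227"
table[1][2] = "4227"
table[2][2] = "4227"
table[2][10] = "42A2"   # level-2 entry for digit "A"
assert next_hop(table, "4227", 0, "42AD") == "42A2"
```

The recursion on matching prefixes is what makes each hop resolve one more digit of the destination ID.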
1) Routing Mesh: Tapestry uses local tables at each node, called neighbor maps, to route overlay messages to the destination ID digit by digit (e.g., 4*** → 42** → 42A* → 42AD, where *'s represent wildcards). This approach is similar to longest-prefix routing used by CIDR IP address allocation [30]. A node N has a neighbor map with multiple levels, where each level contains links to nodes matching a prefix up to a digit position in the ID, and contains a number of entries equal to the ID's base. The primary ith entry in the jth level is the ID and location of the closest node that begins with prefix(N, j-1) + "i" (e.g., the 9th entry of the 4th level for node 325AE is the closest node with an ID that begins with 3259). It is this prescription of "closest node" that provides the locality properties of Tapestry. Figure 1 shows some of the outgoing links of a node.
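The prefix(N, j-1) + "i" rule above determines which slot of a neighbor map a given neighbor occupies. A small hypothetical helper (not Tapestry code) makes the mapping concrete, using zero-indexed levels:

```python
def table_slot(node_id: str, neighbor_id: str):
    """Return the (level, digit) neighbor-map slot that neighbor_id
    fills in node_id's table: level is the length of the shared prefix
    (zero-indexed), digit is the neighbor's next digit after it."""
    level = 0
    for a, b in zip(node_id, neighbor_id):
        if a != b:
            break
        level += 1
    return level, neighbor_id[level]

# The paper's example: for node 325AE, a node beginning with 3259
# fills the slot for digit "9" after the shared prefix "325"
# (zero-indexed level 3, i.e., the paper's "4th level").
assert table_slot("325AE", "3259F") == (3, "9")
```

Each filled slot is the closest such node in network distance, which is where Tapestry's locality properties come from.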
Figure 2 shows a path that a message might take through the infrastructure. The router for the nth hop shares a prefix of length n with the destination ID; thus, to route, Tapestry looks in its (n+1)th level map for the entry matching the next digit in the destination ID. This method guarantees that any existing node in the system will be reached in at most log_β N logical hops, in a system with namespace size N, IDs of base β, and assuming consistent neighbor maps. When a digit cannot be matched, Tapestry looks for a "close" digit in the routing table; we call this surrogate routing [1], where each non-existent ID is mapped to some live node with a similar ID. Figure 3 details the NEXTHOP function for choosing an outgoing link. It is this dynamic process that maps every identifier G to a unique root node G_R.
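The log_β N hop bound can be checked directly for Tapestry's stated parameters (160-bit identifiers, hexadecimal digits):

```python
import math

# At most log_beta(N) logical hops: one resolved digit per hop.
namespace_bits = 160
base = 16
max_hops = namespace_bits / math.log2(base)  # log_16(2^160)
assert max_hops == 40.0                      # one hop per hex digit
```

With 40-digit hexadecimal IDs, the bound is simply the number of digits, since each hop resolves one more digit of the destination.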

Fig. 4. Tapestry object publish example. Two copies of an object (4378) are published to their root node at 4377. Publish messages route to root, depositing a location pointer for the object at each hop encountered along the way.

Fig. 5. Tapestry route to object example. Several nodes send messages to object 4378 from different points in the network. The messages route towards the root node of 4378. When they intersect the publish path, they follow the location pointer to the nearest copy of the object.
The challenge in a dynamic network environment is to continue to route reliably even when intermediate links are changing or faulty. To help provide resilience, we exploit network path diversity in the form of redundant routing paths. Primary neighbor links shown in Figure 1 are augmented by backup links, each sharing the same prefix.² At the jth routing level, the c neighbor links differ only on the jth digit. There are c × β pointers on a level, and the total size of the neighbor map is c × β × log_β N. Each node also stores reverse references (backpointers) to other nodes that point at it. The expected total number of such entries is c × β × log_β N.
2) Object Publication and Location: As shown above, each identifier G has a unique root node G_R assigned by the routing process. Each such root node inherits a unique spanning tree for routing, with messages from leaf nodes traversing intermediate nodes en route to the root. We utilize this property to locate objects by distributing soft-state directory information across nodes (including the object's root).

A server S, storing an object O (with GUID O_G and root O_R³), periodically advertises or publishes this object by routing a publish message toward O_R (see Figure 4). In general, the nodeID of O_R is different from O_G; O_R is the unique [2] node reached through surrogate routing by successive calls to NEXTHOP(*, O_G). Each node along the publication path stores a pointer mapping, <O_G, S>, instead of a copy of the object itself. When there are replicas of an object on separate servers, each server publishes its copy. Tapestry nodes store location mappings for object replicas in sorted order of network latency from themselves.

A client locates O by routing a message to O_R (see Figure 5). Each node on the path checks whether it has a location mapping for O. If so, it redirects the message to S. Otherwise, it forwards the message onwards to O_R (guaranteed to have a location mapping).

Each hop towards the root reduces the number of nodes satisfying the next-hop prefix constraint by a factor of the identifier base. Messages sent to a destination from two nearby nodes will generally cross paths quickly because: each hop increases the length of the prefix required for the next hop; the path to the root is a function of the destination ID only, not of the source nodeID (as in Chord); and neighbor hops are chosen for network locality, which is (usually) transitive. Thus, the closer (in network distance) a client is to an object, the sooner its queries will likely cross paths with the object's publish path, and the faster they will reach the object. Since nodes sort object pointers by distance to themselves, queries are routed to nearby object replicas.

² Current implementations keep two additional backups.
³ Note that objects can be assigned multiple GUIDs mapped to different root nodes for fault-tolerance.
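The publish-and-intersect mechanism of Figures 4 and 5 can be sketched with plain dictionaries. The helper names and flat path lists below are assumptions of this sketch (real Tapestry derives the paths from routing, and pointers are soft state); node and object IDs are borrowed from the figures.

```python
def publish(path_to_root, pointer_maps, object_guid, server):
    """Deposit a <object GUID, server> pointer at every node on the
    path from the storage server up to the object's root node."""
    for node in path_to_root:
        pointer_maps.setdefault(node, {})[object_guid] = server

def locate(query_path, pointer_maps, object_guid):
    """Walk from the client toward the root (last element); stop at
    the first node holding a pointer and redirect to its server."""
    for hop, node in enumerate(query_path):
        mapping = pointer_maps.get(node, {})
        if object_guid in mapping:
            return mapping[object_guid], hop
    raise KeyError("the root is guaranteed to hold a mapping")

maps = {}
publish(["43FE", "4361", "4228", "4377"], maps, "4378", "server-A")
# A query whose path enters the publish path at 4228 stops there,
# after only one hop, without ever reaching the root 4377:
server, hops = locate(["4664", "4228", "4377"], maps, "4378")
assert (server, hops) == ("server-A", 1)
```

This is why clients near an object find it quickly: the nearer the client, the earlier its query path intersects the publish path.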
C. Dynamic Node Algorithms

Tapestry includes a number of mechanisms to maintain routing table consistency and ensure object availability. In this section, we briefly explore these mechanisms. See [2] for complete algorithms and proofs. The majority of control messages described here require acknowledgments, and are retransmitted where required.

1) Node Insertion: There are four components to inserting a new node N into a Tapestry network:

a) Need-to-know nodes are notified of N, because N fills a null entry in their routing tables.
b) N might become the new object root for existing objects. References to those objects must be moved to N to maintain object availability.
c) The algorithms must construct a near-optimal routing table for N.

d) Nodes near N are notified and may consider using N in their routing tables as an optimization.
Node insertion begins at N's surrogate S (the "root" node that N_id maps to in the existing network). S finds p, the length of the longest prefix its ID shares with N_id. S sends out an Acknowledged Multicast message that reaches the set of all existing nodes sharing the same prefix by traversing a tree based on their nodeIDs. As nodes receive the message, they add N to their routing tables and transfer references of locally rooted pointers as necessary, completing items (a) and (b).

Nodes reached by the multicast contact N and become an initial neighbor set used in its routing table construction. N performs an iterative nearest-neighbor search beginning with routing level p. N uses the neighbor set to fill routing level p, trims the list to the closest k nodes⁴, and requests that these k nodes send their backpointers (see Section III-B) at that level. The resulting set contains all nodes that point to any of the k nodes at the previous routing level, and becomes the next neighbor set. N then decrements p, and repeats the process until all levels are filled. This completes item (c). Nodes contacted during the iterative algorithm use N to optimize their routing tables where applicable, completing item (d).
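One refinement step of the iterative nearest-neighbor search can be sketched as follows. The `latency` and `backpointers` parameters are hypothetical stand-ins for measurements and remote requests that the sketch leaves abstract; this is an illustration of the trim-then-expand step, not Tapestry's implementation.

```python
def next_neighbor_set(candidates, latency, backpointers, level, k):
    """One step of the iterative search: keep the k closest candidates
    (by measured latency), then gather their level-`level` backpointers
    to form the candidate set for the next (shorter-prefix) level."""
    closest = sorted(candidates, key=latency)[:k]
    nxt = set()
    for node in closest:
        nxt |= backpointers(node, level)  # nodes that point at `node`
    return closest, nxt

# Toy inputs: three candidates with measured latencies, and a
# backpointer lookup returning who points at each node.
lat = {"A": 10, "B": 5, "C": 30}.get
bp = lambda node, level: {"B": {"D", "E"}, "A": {"E"}}.get(node, set())
closest, nxt = next_neighbor_set({"A", "B", "C"}, lat, bp, 3, 2)
assert closest == ["B", "A"] and nxt == {"D", "E"}
```

The parameter k bounds the work per level; as the footnote notes, it trades resources used against the optimality of the resulting routing table.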
To ensure that nodes inserting into the network in unison do not fail to notify each other about their existence, every node in the multicast keeps state on every node that is still multicasting down one of its neighbors. This state is used to tell each node in its multicast tree about such concurrently inserting nodes. Additionally, the multicast message includes a list of holes in the new node's routing table. Nodes check their tables against this list and notify the new node of entries to fill those holes.
2) Voluntary Node Deletion: If node N leaves Tapestry voluntarily, it tells the set D of nodes in N's backpointers of its intention, along with a replacement node for each routing level from its own routing table. The notified nodes each send object republish traffic to both N and its replacement. Meanwhile, N routes references to locally rooted objects to their new roots, and signals nodes in D when finished.
3) Involuntary Node Deletion: In a dynamic, failure-
prone network such as the wide-area Internet, nodes
generally exit the network far less gracefully due to node
and link failures or network partitions, and may enter and
leave many times in a short interval. Tapestry improves
object availability and routing in such an environment
by building redundancy into routing tables and object
location references (e.g., the backup forwarding
pointers kept for each routing table entry). Ongoing work
has shown Tapestry's viability as a resilient routing
layer [31].

⁴ k is a knob for tuning the tradeoff between resources used and
the optimality of the resulting routing table.

[Figure 6 depicts a single Tapestry node: applications (decentralized
file system, application-level multicast, collaborative text filtering)
sit above the application interface / upcall API; the router and dynamic
node management components, connected through the routing table and
object pointer database, occupy the middle; neighbor link management
and transport protocols sit below.]
Fig. 6. Tapestry component architecture. Messages pass up
from physical network layers and down from application
layers. The Router is a central conduit for communication.
To maintain availability and redundancy, nodes use
periodic beacons to detect outgoing link and node failures.
Such events trigger repair of the routing mesh and
initiate redistribution and replication of object location
references. Furthermore, the repair process is augmented
by soft-state republishing of object references. Tapestry
repair is highly effective, as shown in Section V-C. Despite
continuous node turnover, Tapestry retains nearly
a 100% success rate at routing messages to nodes and
objects.
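The beacon-based detection step can be reduced to a timeout check, sketched below; the data structure and threshold are assumptions for illustration, not the paper's parameters.

```python
# Minimal sketch of soft-state failure detection via periodic beacons.
# A neighbor whose last beacon is older than the timeout is declared
# failed, which would trigger routing-mesh repair and redistribution
# of object location references.

def detect_failures(last_heard, now, timeout):
    """last_heard: {neighbor: timestamp of most recent beacon}.
    Returns the neighbors considered failed at time `now`."""
    return [n for n, t in last_heard.items() if now - t > timeout]
```

Because object references are soft state, anything a failed neighbor loses is eventually rebuilt by periodic republishing even if this detection misfires.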
IV. TAPESTRY NODE ARCHITECTURE AND IMPLEMENTATION
In this section, we present the architecture of a
Tapestry node, an API for Tapestry extension, details
of our current implementation, and an architecture for a
higher-performance implementation suitable for use on
network processors.
A. Component Architecture
Figure 6 illustrates the functional layering for a
Tapestry node. Shown on top are applications that interface
with the rest of the system through the Tapestry
API. Below this are the router and the dynamic node
management components. The former processes routing
and location messages, while the latter handles the
arrival and departure of nodes in the network. These two
components communicate through the routing table. At
the bottom are the transport and neighbor link layers,
which together provide a cross-node messaging layer.
We now describe several of these layers.
1) Transport: The transport layer provides the abstraction
of communication channels from one overlay
node to another, and corresponds to layer 4 in the OSI
layering. Utilizing native Operating System (OS)
functionality, many channel implementations are possible.
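One way to picture this is a single channel interface with swappable implementations; the class names below are illustrative, and a real deployment would back the interface with TCP or UDP sockets rather than the in-memory stand-in shown.

```python
# Sketch of the transport layer's channel abstraction: one interface,
# many possible implementations atop native OS facilities.

class Channel:
    """Abstract one-way channel from this overlay node to another."""
    def send(self, payload: bytes) -> None:
        raise NotImplementedError

class InMemoryChannel(Channel):
    """Loopback stand-in used here for illustration; a deployed node
    would implement send() over a TCP connection or UDP socket."""
    def __init__(self):
        self.delivered = []
    def send(self, payload: bytes) -> None:
        # Record the payload as "delivered" to the remote endpoint.
        self.delivered.append(payload)
```

Because higher layers see only `Channel`, the choice of transport (reliable vs. datagram) stays hidden from the router and application code.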


References

[6] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker, "A scalable content-addressable network," in Proceedings of SIGCOMM, Aug. 2001.
[7] A. Rowstron and P. Druschel, "Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems," in Proceedings of Middleware, Nov. 2001.
[8] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan, "Chord: A scalable peer-to-peer lookup service for internet applications," in Proceedings of SIGCOMM, Aug. 2001.
[32] B. H. Bloom, "Space/time trade-offs in hash coding with allowable errors," Communications of the ACM, vol. 13, no. 7, Jul. 1970.
[34] J. R. Douceur, "The Sybil attack," in Proceedings of IPTPS, Mar. 2002.
Frequently Asked Questions (17)
Q1. What have the authors contributed in "Tapestry: a resilient global-scale overlay for service deployment" ?

The authors present Tapestry, a peer-to-peer overlay routing infrastructure offering efficient, scalable, location-independent routing of messages directly to nearby copies of an object or service using only localized resources. This paper presents the Tapestry architecture, algorithms, and implementation. 

The use of multiple Tapestry instances per machine means that tests under heavy load will produce scheduling delays between instances, resulting in an inflated RDP for short latency paths. 

Tapestry improves object availability and routing in such an environment by building redundancy into routing tables and object location references (e.g., the backup forwarding pointers for each routing table entry). 

Tapestry assumes nodeIDs and GUIDs are roughly evenly distributed in the namespace, which can be achieved by using a secure hashing algorithm like SHA-1 [29]. 

The second churn test increases the dynamic rates of insertion and failure, using 10 seconds and 2 minutes as the respective parameters. 

The routing to objects test sends messages to previously published objects, located at servers which were guaranteed to stay alive in the network. 

The authors compute the RDP for node routing by measuring all-pairs roundtrip routing latencies between the 400 Tapestry instances, and dividing each by the corresponding ping roundtrip time⁶. 
Examples include application level multicast, global-scale storage systems, and traffic redirection layers for resiliency or security. 

For messages larger than 2 KB, the cost of copying data (memory buffer to network layer) dominates, and processing time becomes linear relative to the message size. 

For instance, in response to changing link latencies, the neighbor link layer may reorder the preferences assigned to neighbors occupying the same entry in the routing table. 

When a digit cannot be matched, Tapestry looks for a “close” digit in the routing table; the authors call this surrogate routing [1], where each non-existent ID is mapped to some live node with a similar ID. 

2) Parallel Node Insertion: Next, the authors measure the effects of multiple nodes simultaneously entering the Tapestry by examining the convergence time for parallel insertions. 

This paradigm requires an asynchronous I/O layer as well as an efficient model for internal communication and control between components. 

Note that the additional number of JVMs increases scheduling delays, resulting in request timeouts as the size of the network (and virtualization) increases. 

For small networks where each node knows most of the network, nodes touched by insertion (and the corresponding bandwidth) will likely scale linearly with network size. 

A server S, storing an object O (with GUID O_G and root node O_R), periodically advertises or publishes this object by routing a publish message toward O_R (see Figure 4). 

The first time a higher layer wishes to communicate with another node, it must provide the destination’s physical address (e.g., IP address and port number).