
Tapestry: a resilient global-scale overlay for service deployment

TL;DR: Experimental results show that Tapestry exhibits stable behavior and performance as an overlay, despite the instability of the underlying network layers, illustrating its utility as a deployment infrastructure.
Abstract: We present Tapestry, a peer-to-peer overlay routing infrastructure offering efficient, scalable, location-independent routing of messages directly to nearby copies of an object or service using only localized resources. Tapestry supports a generic decentralized object location and routing applications programming interface using a self-repairing, soft-state-based routing layer. The paper presents the Tapestry architecture, algorithms, and implementation. It explores the behavior of a Tapestry deployment on PlanetLab, a global testbed of approximately 100 machines. Experimental results show that Tapestry exhibits stable behavior and performance as an overlay, despite the instability of the underlying network layers. Several widely distributed applications have been implemented on Tapestry, illustrating its utility as a deployment infrastructure.

Summary (4 min read)

Introduction

  • Overlay networks, peer-to-peer (P2P), service deployment, Tapestry.
  • Properly implemented, this virtualization enables message delivery to mobile or replicated endpoints in the presence of instability in the underlying infrastructure.
  • Its architecture is modular, consisting of an extensible upcall facility wrapped around a simple, high-performance router.
  • These results demonstrate Tapestry’s feasibility as a long running service on dynamic, failure-prone networks such as the wide-area Internet.

A. The DOLR Networking API

  • Tapestry provides a datagram-like communications interface, with additional mechanisms for manipulating the locations of objects.
  • Before describing the API, the authors start with a couple of definitions.
  • Tapestry nodes participate in the overlay and are assigned nodeIDs uniformly at random from a large identifier space.
  • More than one node may be hosted by one physical host.
  • This call is best effort, and receives no confirmation.

B. Routing and Object Location

  • Tapestry dynamically maps each identifier G to a unique live node, called the identifier's root, G_R.
  • When routing toward G, messages are forwarded across neighbor links to nodes whose nodeIDs are progressively closer (i.e., matching larger prefixes) to G in the ID space.
  • When a digit cannot be matched, Tapestry looks for a “close” digit in the routing table; the authors call this surrogate routing [1], where each non-existent ID is mapped to some live node with a similar ID.
  • To help provide resilience, the authors exploit network path diversity in the form of redundant routing paths.
  • Each node also stores reverse references to other nodes that point at it.

C. Dynamic Node Algorithms

  • Tapestry includes a number of mechanisms to maintain routing table consistency and ensure object availability.
  • S sends out an Acknowledged Multicast message that reaches the set of all existing nodes sharing the same prefix by traversing a tree based on their nodeIDs.
  • As nodes receive the message, they add N to their routing tables and transfer references of locally rooted pointers as necessary, completing items (a) and (b).
  • Nodes contacted during the iterative algorithm use N to optimize their routing tables where applicable, completing item (d). Ongoing work has shown Tapestry's viability as a resilient routing layer [31].

A. Component Architecture

  • Figure 6 illustrates the functional layering for a Tapestry node.
  • At the bottom are the transport and neighbor link layers, which together provide a cross-node messaging layer.
  • The neighbor link layer notifies higher layers whenever link properties change significantly.
  • This layer also optimizes message processing by parsing the message headers and only deserializing the message contents when required.
  • Finally, node authentication and message authentication codes (MACs) can be integrated into this layer for additional security.

B. Tapestry Upcall Interface

  • While the DOLR API (Section III-A) provides a powerful applications interface, other functionality, such as multicast, requires greater control over the details of routing and object lookup.
  • The authors follow their discussion of the Tapestry component architecture with a detailed look at the current implementation, choices made, and the rationale behind them.
  • The Core Router utilizes the routing and object reference tables to handle application driven messages, including object publish, object location, and routing of messages to destination nodes.
  • UDP alone, however, does not support flow control or congestion control, and can consume an unfair share of bandwidth, causing widespread congestion if used across the wide area.
  • These node instances can exchange messages in less than 10 microseconds, making any overlay network processing overhead and scheduling delay much more expensive in comparison.

D. Toward a Higher-Performance Implementation

  • In Section V the authors show that their implementation can handle over 7,000 messages per second.
  • A commercial-quality implementation could do much better.
  • The simplest piece—computation of NEXTHOP as in Figure 3—is similar to functionality performed by hardware routers: fast table lookup.
  • As a result, it is the second aspect of DOLR routing— fast pointer lookup—that presents the greatest challenge to high-throughput routing.
  • Assuming that pointers (with all their information) are 100 bytes, the in-memory footprint of a Bloom filter can be two orders of magnitude smaller than the total size of the pointers.
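The footprint savings described above can be illustrated with a Bloom filter over object GUIDs. The sketch below is illustrative only (not Tapestry's implementation, and the SHA-1-based hashing scheme is an assumption of this sketch): it answers "might this node hold a pointer for this GUID?" with no false negatives, using roughly one byte per pointer instead of ~100.

```python
import hashlib

class BloomFilter:
    """Illustrative Bloom filter (not Tapestry's code) for asking
    whether a node *might* hold a location pointer for a GUID before
    paying for a full object-pointer lookup."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a big int used as a bit array

    def _positions(self, key):
        # Derive num_hashes independent bit positions from SHA-1.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def may_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all(self.bits >> pos & 1 for pos in self._positions(key))

bf = BloomFilter()
bf.add("4378")
assert bf.may_contain("4378")  # Bloom filters have no false negatives
```

A negative answer lets the router skip the pointer database entirely; a (rare) false positive only costs one wasted lookup.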

V. EVALUATION

  • The authors evaluate their implementation of Tapestry using several platforms.
  • The authors run micro-benchmarks on a local cluster, measure the large scale performance of a deployed Tapestry on the PlanetLab global testbed, and make use of a local network simulation layer to support controlled, repeatable experiments with up to 1,000 Tapestry instances.

A. Evaluation Methodology

  • All experiments used a Java Tapestry implementation (see Section IV-C) running in IBM’s JDK 1.3 with node virtualization (see Section V-C).
  • The authors micro-benchmarks are run on local cluster machines of dual Pentium III 1GHz servers (1.5 GByte RAM) and Pentium IV 2.4GHz servers (1 GByte RAM).
  • The authors run wide-area experiments on PlanetLab, a network testbed consisting of roughly 100 machines at institutions in North America, Europe, Asia, and Australia.
  • Finally, in instances where the authors need large-scale, repeatable and controlled experiments, they perform experiments using the Simple OceanStore Simulator (SOSS) [34].
  • SOSS is an event-driven network layer that simulates network time with queues driven by a single local clock.

B. Performance in a Stable Network

  • The authors first examine Tapestry performance under stable or static network conditions.
  • A raw estimate of processor speed (as reported by the bogomips metric under Linux) shows the P-IV to be 2.3 times faster.
  • The gap between this and the estimate the authors get from calculating the inverse of the per-message routing latency can be attributed to scheduling and queuing delays from the asynchronous I/O layer.
  • The authors also measure routing-to-object RDP as a ratio of one-way Tapestry route-to-object latency versus the one-way network latency (ping time).
  • High variance indicates some client/server combinations will consistently see non-ideal performance and tends to limit the advantages that clients gain through careful object placement.

C. Convergence Under Network Dynamics

  • Here, the authors analyze Tapestry’s scalability and stability under dynamic conditions.
  • Figure 17 shows that the total bandwidth for a single node insertion scales logarithmically with the network size.
  • Figures 19 and 20 demonstrate the ability of Tapestry to recover after massive changes in the overlay network membership.
  • For churn tests, the authors measure the success rate of requests on a set of stable nodes while constantly churning a set of dynamic nodes, using insertion and failure rates driven by probability distributions.
  • Finally, the authors measure the success rate of routing to nodes under different network changes on the PlanetLab testbed.

VI. DEPLOYING APPLICATIONS WITH TAPESTRY

  • In previous sections, the authors explored the implementation and behavior of Tapestry.
  • These applications share new challenges in the wide-area: users will find it more difficult to locate nearby resources as the network grows in size, and dependence on more distributed components means a smaller mean time between failures (MTBF) for the system.
  • It also scales logarithmically with the network size in both per-node routing state and expected number of overlay hops in a path.
  • Applications can achieve additional resilience by replicating data across multiple servers, and relying on Tapestry to direct client requests to nearby replicas.
  • OceanStore [4] is a global-scale, highly available storage utility deployed on the PlanetLab testbed.

VII. CONCLUSION

  • The authors described Tapestry, an overlay routing network for rapid deployment of new distributed applications and services.
  • The authors presented the architecture of Tapestry nodes, highlighting mechanisms for lowoverhead routing and dynamic state repair, and showed how these mechanisms can be enhanced through an extensible API.
  • The median RDP or stretch starts around a factor of three for nearby nodes and rapidly approaches one, showing that routing is efficient.
  • Further, the median RDP for object location is below a factor of two in the wide-area.
  • Overall, the authors believe that wide-scale Tapestry deployment could be practical, efficient, and useful to a variety of applications.


IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 22, NO. 1, JANUARY 2004 1
Tapestry: A Resilient Global-scale Overlay for Service Deployment
Ben Y. Zhao, Ling Huang, Jeremy Stribling, Sean C. Rhea, Anthony D. Joseph, Member, IEEE, and John D. Kubiatowicz, Member, IEEE
Abstract—We present Tapestry, a peer-to-peer overlay routing infrastructure offering efficient, scalable, location-independent routing of messages directly to nearby copies of an object or service using only localized resources. Tapestry supports a generic Decentralized Object Location and Routing (DOLR) API using a self-repairing, soft-state based routing layer. This paper presents the Tapestry architecture, algorithms, and implementation. It explores the behavior of a Tapestry deployment on PlanetLab, a global testbed of approximately 100 machines. Experimental results show that Tapestry exhibits stable behavior and performance as an overlay, despite the instability of the underlying network layers. Several widely-distributed applications have been implemented on Tapestry, illustrating its utility as a deployment infrastructure.
Index Terms—Overlay networks, peer-to-peer (P2P), service deployment, Tapestry.
I. INTRODUCTION

Internet developers are constantly proposing new and visionary distributed applications. These new applications have a variety of requirements for availability, durability, and performance. One technique for achieving these properties is to adapt to failures or changes in load through migration and replication of data and services. Unfortunately, the ability to place replicas or the frequency with which they may be moved is limited by the underlying infrastructure. The traditional way to deploy new applications is to adapt them somehow to existing infrastructures (often an imperfect match) or to standardize new Internet protocols (encountering significant inertia to deployment). A flexible but standardized substrate on which to develop new applications is needed.

In this article, we present Tapestry [1], [2], an extensible infrastructure that provides Decentralized Object Location and Routing (DOLR) [3]. The DOLR interface focuses on routing of messages to endpoints such as nodes or object replicas. DOLR virtualizes resources, since endpoints are named by opaque identifiers encoding nothing about physical location. Properly implemented, this virtualization enables message delivery to mobile or replicated endpoints in the presence of instability in the underlying infrastructure. As a result, a DOLR network provides a simple platform on which to implement distributed applications—developers can ignore the dynamics of the network except as an optimization. Already, Tapestry has enabled the deployment of global-scale storage applications such as OceanStore [4] and multicast distribution systems such as Bayeux [5]; we return to this in Section VI.

Tapestry is a peer-to-peer overlay network that provides high-performance, scalable, and location-independent routing of messages to close-by endpoints, using only localized resources. The focus on routing brings with it a desire for efficiency: minimizing message latency and maximizing message throughput. Thus, for instance, Tapestry exploits locality in routing messages to mobile endpoints such as object replicas; this behavior is in contrast to other structured peer-to-peer overlay networks [6]–[11].

Tapestry uses adaptive algorithms with soft state to maintain fault tolerance in the face of changing node membership and network faults. Its architecture is modular, consisting of an extensible upcall facility wrapped around a simple, high-performance router. This Applications Programming Interface (API) enables developers to develop and extend overlay functionality when the basic DOLR functionality is insufficient.

This paper was supported in part by the National Science Foundation (NSF) under Career Award #ANI-9985129 and Career Award #ANI-9985250, in part by the NSF Information Technology Research (ITR) under Award 5710001344, in part by the California Micro Fund under Award 02-032 and Award 02-035, and in part by grants from IBM and Sprint. B. Y. Zhao, L. Huang, S. C. Rhea, A. D. Joseph, and J. D. Kubiatowicz are with the University of California, Berkeley, CA 94720 USA (e-mail: {ravenben, hling, srhea, adj, kubitron}@eecs.berkeley.edu). J. Stribling is with the Massachusetts Institute of Technology, Cambridge, MA 02139 USA (e-mail: strib@mit.edu).
In the following pages, we describe a Java-based implementation of Tapestry, and present both micro- and macro-benchmarks from an actual, deployed system. During normal operation, the relative delay penalty (RDP)¹ to locate mobile endpoints is two or less in the wide area. Simulations show that Tapestry operations succeed nearly 100% of the time under both constant network changes and massive failures or joins, with small periods of degraded performance during self-repair. These results demonstrate Tapestry's feasibility as a long-running service on dynamic, failure-prone networks such as the wide-area Internet.

The following section discusses related work. Then, Tapestry's core algorithms appear in Section III, with details of the architecture and implementation in Section IV. Section V evaluates Tapestry's performance. We then discuss the use of Tapestry as an application infrastructure in Section VI and conclude with Section VII.

¹ RDP, or stretch, is the ratio between the distance traveled by a message to an endpoint and the minimal distance from the source to that endpoint.
II. RELATED WORK

The first generation of peer-to-peer (P2P) systems included file-sharing and storage applications: Napster, Gnutella, MojoNation, and Freenet. Napster uses central directory servers to locate files. Gnutella provides a similar, but distributed service using scoped broadcast queries, limiting scalability. MojoNation [12] uses an online economic model to encourage sharing of resources. Freenet [13] is a file-sharing network designed to resist censorship. Neither Gnutella nor Freenet guarantees that files can be located—even in a functioning network.

The second generation of peer-to-peer systems are structured peer-to-peer overlay networks, including Tapestry [1], [2], Chord [8], Pastry [7], and CAN [6]. These overlays implement a basic Key-Based Routing (KBR) interface that supports deterministic routing of messages to a live node that has responsibility for the destination key. They can also support higher-level interfaces such as a distributed hash table (DHT) or a decentralized object location and routing (DOLR) layer [3]. These systems scale well, and guarantee that queries find existing objects under non-failure conditions.

One differentiating property between these systems is that neither CAN nor Chord takes network distances into account when constructing their routing overlay; thus, a given overlay hop may span the diameter of the network. Both protocols route on the shortest overlay hops available, and use runtime heuristics to assist. In contrast, Tapestry and Pastry construct locally optimal routing tables from initialization, and maintain them in order to reduce routing stretch.

While some systems fix the number and location of object replicas by providing a distributed hash table (DHT) interface, Tapestry allows applications to place objects according to their needs. Tapestry "publishes" location pointers throughout the network to facilitate efficient routing to those objects with low network stretch. This technique makes Tapestry locality-aware [14]: queries for nearby objects are generally satisfied in time proportional to the distance between the query source and a nearby object replica.

Both Pastry and Tapestry share similarities with the work of Plaxton, Rajaraman, and Richa [15] for a static network. Others [16], [17] explore distributed object location schemes with provably low search overhead, but they require precomputation, and so are not suitable for dynamic networks. Recent works include systems such as Kademlia [9], which uses XOR for overlay routing, and Viceroy [10], which provides logarithmic hops through nodes with constant-degree routing tables. SkipNet [11] uses a multi-dimensional skip-list data structure to support overlay routing, maintaining both a DNS-based namespace for operational locality and a randomized namespace for network locality. Other overlay proposals [18], [19] attain lower bounds on local routing state. Finally, proposals such as Brocade [20] differentiate between local and inter-domain routing to reduce wide-area traffic and routing latency.

A new generation of applications has been proposed on top of these P2P systems, validating them as novel application infrastructures. Several systems provide application-level multicast: CAN-MC [21] (CAN), Scribe [22] (Pastry), and Bayeux [5] (Tapestry). In addition, several decentralized file systems have been proposed: CFS [23] (Chord), Mnemosyne [24] (Chord, Tapestry), OceanStore [4] (Tapestry), and PAST [25] (Pastry). Structured P2P overlays also support novel applications (e.g., attack-resistant networks [26], network indirection layers [27], and similarity searching [28]).
III. TAPESTRY ALGORITHMS

This section details Tapestry's algorithms for routing and object location, and describes how network integrity is maintained under dynamic network conditions.

A. The DOLR Networking API

Tapestry provides a datagram-like communications interface, with additional mechanisms for manipulating the locations of objects. Before describing the API, we start with a couple of definitions.

Tapestry nodes participate in the overlay and are assigned nodeIDs uniformly at random from a large identifier space. More than one node may be hosted by one physical host. Application-specific endpoints are assigned Globally Unique IDentifiers (GUIDs), selected from the same identifier space. Tapestry currently uses an identifier space of 160-bit values with a globally defined radix (e.g., hexadecimal, yielding 40-digit identifiers). Tapestry assumes nodeIDs and GUIDs are roughly evenly distributed in the namespace, which can be achieved by using a secure hashing algorithm like SHA-1 [29]. We say that node N has nodeID N_id, and an object O has GUID O_G.

Since the efficiency of Tapestry generally improves with network size, it is advantageous for multiple applications to share a single large Tapestry overlay network.

Fig. 1. Tapestry routing mesh from the perspective of a single node. Outgoing neighbor links point to nodes with a common matching prefix. Higher-level entries match more digits. Together, these links form the local routing table.
Fig. 2. Path of a message. The path taken by a message originating from node 5230 destined for node 42AD in a Tapestry mesh.
To enable application coexistence, every message contains an application-specific identifier, A_id, which is used to select a process, or application, for message delivery at the destination (similar to the role of a port in TCP/IP), or an upcall handler where appropriate.
Given the above definitions, we state the four-part DOLR networking API as follows:

1) PUBLISHOBJECT(O_G, A_id): Publish, or make available, object O on the local node. This call is best effort, and receives no confirmation.
2) UNPUBLISHOBJECT(O_G, A_id): Best-effort attempt to remove location mappings for O.
3) ROUTETOOBJECT(O_G, A_id): Routes message to location of an object with GUID O_G.
4) ROUTETONODE(N, A_id, Exact): Route message to application A_id on node N. "Exact" specifies whether the destination ID needs to be matched exactly to deliver the payload.
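As a rough illustration of the four-part API, the calls can be rendered as a Python interface. The class and method names below are a hypothetical mapping of the paper's calls, not actual Tapestry code; the `msg` parameter is an assumption standing in for the routed payload.

```python
class DOLRNode:
    """Hypothetical sketch of the four-part DOLR API from Section III-A.
    Bodies are stubs; only the shape of the interface is illustrated."""

    def publish_object(self, object_guid: str, app_id: str) -> None:
        """PUBLISHOBJECT(O_G, A_id): best effort, no confirmation."""

    def unpublish_object(self, object_guid: str, app_id: str) -> None:
        """UNPUBLISHOBJECT(O_G, A_id): best-effort removal of mappings."""

    def route_to_object(self, object_guid: str, app_id: str, msg: bytes) -> None:
        """ROUTETOOBJECT(O_G, A_id): deliver msg to some replica of O."""

    def route_to_node(self, node_id: str, app_id: str, exact: bool, msg: bytes) -> None:
        """ROUTETONODE(N, A_id, Exact): deliver msg to app_id on N;
        `exact` controls whether the destination ID must match exactly."""
```

Note that all four calls are asynchronous and return nothing: delivery and publication are best effort, matching the datagram-like semantics described above.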
B. Routing and Object Location

Tapestry dynamically maps each identifier G to a unique live node, called the identifier's root or G_R. If a node exists with N_id = G, then this node is the root of G. To deliver messages, each node maintains a routing table consisting of nodeIDs and IP addresses of the nodes with which it communicates. We refer to these nodes as neighbors of the local node. When routing toward G, messages are forwarded across neighbor links to nodes whose nodeIDs are progressively closer (i.e., matching larger prefixes) to G in the ID space.

NEXTHOP(n, G):
1   if n = MAXHOP then
2     return self
3   else
4     d ← G_n; e ← R_{n,d}
5     while e = nil do
6       d ← (d + 1) mod β
7       e ← R_{n,d}
8     endwhile
9     if e = self then
10      return NEXTHOP(n + 1, G)
11    else
12      return e
13    endif
14  endif

Fig. 3. Pseudocode for NEXTHOP() (here R_{n,d} denotes the level-n routing-table entry for digit d, and β the identifier base). This function locates the next hop towards the root given the previous hop number, n, and the destination GUID, G. Returns next hop or self if local node is the root.
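The digit-scanning logic of the NEXTHOP function above can be sketched in Python. The table representation is an assumption of this sketch: each level is a β-entry list, empty slots are `None`, and slots for the local node's own digits hold the node itself (which is what lets the surrogate-routing scan terminate).

```python
BETA = 16      # identifier base (hexadecimal digits)
MAX_HOP = 40   # 160-bit IDs at base 16 => 40 digits

def next_hop(table, self_id, n, guid):
    """Sketch of Fig. 3's NEXTHOP. `table[n][d]` holds the level-n
    routing-table entry for digit d (None marks an empty slot); slots
    for the local node's own ID digits are assumed to hold self_id,
    which guarantees the surrogate-routing scan terminates."""
    if n == MAX_HOP:
        return self_id
    d = int(guid[n], BETA)
    e = table[n][d]
    while e is None:                  # surrogate routing: scan for the
        d = (d + 1) % BETA            # next "close" digit with an entry
        e = table[n][d]
    if e == self_id:
        return next_hop(table, self_id, n + 1, guid)
    return e

# Toy table for node 4227 routing a message toward 42AD:
table = [[None] * BETA for _ in range(MAX_HOP)]
table[0][4] = "4227"
table[1][2] = "4227"
table[2][2] = "4227"
table[2][10] = "42A2"   # level-2 entry for digit "A"
assert next_hop(table, "4227", 0, "42AD") == "42A2"
```

The recursion on matching prefixes is what makes each hop resolve one more digit of the destination ID.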
1) Routing Mesh: Tapestry uses local tables at each node, called neighbor maps, to route overlay messages to the destination ID digit by digit (e.g., 4*** → 42** → 42A* → 42AD, where *'s represent wildcards). This approach is similar to longest-prefix routing used by CIDR IP address allocation [30]. A node N has a neighbor map with multiple levels, where each level contains links to nodes matching a prefix up to a digit position in the ID, and contains a number of entries equal to the ID's base. The primary ith entry in the jth level is the ID and location of the closest node that begins with prefix(N, j-1) + "i" (e.g., the 9th entry of the 4th level for node 325AE is the closest node with an ID that begins with 3259). It is this prescription of "closest node" that provides the locality properties of Tapestry. Figure 1 shows some of the outgoing links of a node.
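The prefix(N, j-1) + "i" rule above determines which slot of a neighbor map a given neighbor occupies. A small hypothetical helper (not Tapestry code) makes the mapping concrete, using zero-indexed levels:

```python
def table_slot(node_id: str, neighbor_id: str):
    """Return the (level, digit) neighbor-map slot that neighbor_id
    fills in node_id's table: level is the length of the shared prefix
    (zero-indexed), digit is the neighbor's next digit after it."""
    level = 0
    for a, b in zip(node_id, neighbor_id):
        if a != b:
            break
        level += 1
    return level, neighbor_id[level]

# The paper's example: for node 325AE, a node beginning with 3259
# fills the slot for digit "9" after the shared prefix "325"
# (zero-indexed level 3, i.e., the paper's "4th level").
assert table_slot("325AE", "3259F") == (3, "9")
```

Each filled slot is the closest such node in network distance, which is where Tapestry's locality properties come from.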
Figure 2 shows a path that a message might take through the infrastructure. The router for the nth hop shares a prefix of length n with the destination ID; thus, to route, Tapestry looks in its (n+1)th level map for the entry matching the next digit in the destination ID. This method guarantees that any existing node in the system will be reached in at most log_β N logical hops, in a system with namespace size N, IDs of base β, and assuming consistent neighbor maps. When a digit cannot be matched, Tapestry looks for a "close" digit in the routing table; we call this surrogate routing [1], where each non-existent ID is mapped to some live node with a similar ID. Figure 3 details the NEXTHOP function for choosing an outgoing link. It is this dynamic process that maps every identifier G to a unique root node G_R.
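The log_β N hop bound can be checked directly for Tapestry's stated parameters (160-bit identifiers, hexadecimal digits):

```python
import math

# At most log_beta(N) logical hops: one resolved digit per hop.
namespace_bits = 160
base = 16
max_hops = namespace_bits / math.log2(base)  # log_16(2^160)
assert max_hops == 40.0                      # one hop per hex digit
```

With 40-digit hexadecimal IDs, the bound is simply the number of digits, since each hop resolves one more digit of the destination.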

Fig. 4. Tapestry object publish example. Two copies of an object (4378) are published to their root node at 4377. Publish messages route to root, depositing a location pointer for the object at each hop encountered along the way.

Fig. 5. Tapestry route to object example. Several nodes send messages to object 4378 from different points in the network. The messages route towards the root node of 4378. When they intersect the publish path, they follow the location pointer to the nearest copy of the object.
The challenge in a dynamic network environment is to continue to route reliably even when intermediate links are changing or faulty. To help provide resilience, we exploit network path diversity in the form of redundant routing paths. Primary neighbor links shown in Figure 1 are augmented by backup links, each sharing the same prefix.² At the jth routing level, the c neighbor links differ only on the jth digit. There are c × β pointers on a level, and the total size of the neighbor map is c × β × log_β N. Each node also stores reverse references (backpointers) to other nodes that point at it. The expected total number of such entries is c × β × log_β N.
2) Object Publication and Location: As shown above, each identifier G has a unique root node G_R assigned by the routing process. Each such root node inherits a unique spanning tree for routing, with messages from leaf nodes traversing intermediate nodes en route to the root. We utilize this property to locate objects by distributing soft-state directory information across nodes (including the object's root).

A server S, storing an object O (with GUID O_G and root O_R³), periodically advertises or publishes this object by routing a publish message toward O_R (see Figure 4). In general, the nodeID of O_R is different from O_G; O_R is the unique [2] node reached through surrogate routing by successive calls to NEXTHOP(*, O_G). Each node along the publication path stores a pointer mapping, <O_G, S>, instead of a copy of the object itself. When there are replicas of an object on separate servers, each server publishes its copy. Tapestry nodes store location mappings for object replicas in sorted order of network latency from themselves.

A client locates O by routing a message to O_R (see Figure 5). Each node on the path checks whether it has a location mapping for O. If so, it redirects the message to S. Otherwise, it forwards the message onwards to O_R (guaranteed to have a location mapping).

Each hop towards the root reduces the number of nodes satisfying the next-hop prefix constraint by a factor of the identifier base. Messages sent to a destination from two nearby nodes will generally cross paths quickly because: each hop increases the length of the prefix required for the next hop; the path to the root is a function of the destination ID only, not of the source nodeID (as in Chord); and neighbor hops are chosen for network locality, which is (usually) transitive. Thus, the closer (in network distance) a client is to an object, the sooner its queries will likely cross paths with the object's publish path, and the faster they will reach the object. Since nodes sort object pointers by distance to themselves, queries are routed to nearby object replicas.

² Current implementations keep two additional backups.
³ Note that objects can be assigned multiple GUIDs mapped to different root nodes for fault-tolerance.
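The publish-and-intersect mechanism of Figures 4 and 5 can be sketched with plain dictionaries. The helper names and flat path lists below are assumptions of this sketch (real Tapestry derives the paths from routing, and pointers are soft state); node and object IDs are borrowed from the figures.

```python
def publish(path_to_root, pointer_maps, object_guid, server):
    """Deposit a <object GUID, server> pointer at every node on the
    path from the storage server up to the object's root node."""
    for node in path_to_root:
        pointer_maps.setdefault(node, {})[object_guid] = server

def locate(query_path, pointer_maps, object_guid):
    """Walk from the client toward the root (last element); stop at
    the first node holding a pointer and redirect to its server."""
    for hop, node in enumerate(query_path):
        mapping = pointer_maps.get(node, {})
        if object_guid in mapping:
            return mapping[object_guid], hop
    raise KeyError("the root is guaranteed to hold a mapping")

maps = {}
publish(["43FE", "4361", "4228", "4377"], maps, "4378", "server-A")
# A query whose path enters the publish path at 4228 stops there,
# after only one hop, without ever reaching the root 4377:
server, hops = locate(["4664", "4228", "4377"], maps, "4378")
assert (server, hops) == ("server-A", 1)
```

This is why clients near an object find it quickly: the nearer the client, the earlier its query path intersects the publish path.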
C. Dynamic Node Algorithms

Tapestry includes a number of mechanisms to maintain routing table consistency and ensure object availability. In this section, we briefly explore these mechanisms. See [2] for complete algorithms and proofs. The majority of control messages described here require acknowledgments, and are retransmitted where required.

1) Node Insertion: There are four components to inserting a new node N into a Tapestry network:

a) Need-to-know nodes are notified of N, because N fills a null entry in their routing tables.
b) N might become the new object root for existing objects. References to those objects must be moved to N to maintain object availability.
c) The algorithms must construct a near-optimal routing table for N.

d) Nodes near N are notified and may consider using N in their routing tables as an optimization.
Node insertion begins at N's surrogate S (the "root" node that N_id maps to in the existing network). S finds p, the length of the longest prefix its ID shares with N_id. S sends out an Acknowledged Multicast message that reaches the set of all existing nodes sharing the same prefix by traversing a tree based on their nodeIDs. As nodes receive the message, they add N to their routing tables and transfer references of locally rooted pointers as necessary, completing items (a) and (b).

Nodes reached by the multicast contact N and become an initial neighbor set used in its routing table construction. N performs an iterative nearest-neighbor search beginning with routing level p. N uses the neighbor set to fill routing level p, trims the list to the closest k nodes⁴, and requests that these k nodes send their backpointers (see Section III-B) at that level. The resulting set contains all nodes that point to any of the k nodes at the previous routing level, and becomes the next neighbor set. N then decrements p, and repeats the process until all levels are filled. This completes item (c). Nodes contacted during the iterative algorithm use N to optimize their routing tables where applicable, completing item (d).
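One refinement step of the iterative nearest-neighbor search can be sketched as follows. The `latency` and `backpointers` parameters are hypothetical stand-ins for measurements and remote requests that the sketch leaves abstract; this is an illustration of the trim-then-expand step, not Tapestry's implementation.

```python
def next_neighbor_set(candidates, latency, backpointers, level, k):
    """One step of the iterative search: keep the k closest candidates
    (by measured latency), then gather their level-`level` backpointers
    to form the candidate set for the next (shorter-prefix) level."""
    closest = sorted(candidates, key=latency)[:k]
    nxt = set()
    for node in closest:
        nxt |= backpointers(node, level)  # nodes that point at `node`
    return closest, nxt

# Toy inputs: three candidates with measured latencies, and a
# backpointer lookup returning who points at each node.
lat = {"A": 10, "B": 5, "C": 30}.get
bp = lambda node, level: {"B": {"D", "E"}, "A": {"E"}}.get(node, set())
closest, nxt = next_neighbor_set({"A", "B", "C"}, lat, bp, 3, 2)
assert closest == ["B", "A"] and nxt == {"D", "E"}
```

The parameter k bounds the work per level; as the footnote notes, it trades resources used against the optimality of the resulting routing table.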
To ensure that nodes inserting into the network in unison do not fail to notify each other about their existence, every node in the multicast keeps state on every node that is still multicasting down one of its neighbors. This state is used to tell each node in its multicast tree about such concurrently inserting nodes. Additionally, the multicast message includes a list of holes in the new node's routing table. Nodes check their tables against this list and notify the new node of entries to fill those holes.
2) Voluntary Node Deletion: If node N leaves Tapestry voluntarily, it tells the set D of nodes in N's backpointers of its intention, along with a replacement node for each routing level from its own routing table. The notified nodes each send object republish traffic to both N and its replacement. Meanwhile, N routes references to locally rooted objects to their new roots, and signals nodes in D when finished.
3) Involuntary Node Deletion: In a dynamic, failure-
prone network such as the wide-area Internet, nodes
generally exit the network far less gracefully due to node
and link failures or network partitions, and may enter and
leave many times in a short interval. Tapestry improves
object availability and routing in such an environment
by building redundancy into routing tables and object
location references (e.g., the backup forwarding
pointers kept for each routing table entry). Ongoing work
has shown Tapestry's viability as a resilient routing
layer [31].

⁴ k is a knob for tuning the tradeoff between resources used and
the optimality of the resulting routing table.

[Figure 6 depicts a single Tapestry node: applications (decentralized
file system, application-level multicast, collaborative text filtering)
sit above the application interface / upcall API; the router and dynamic
node management components, connected through the routing table and
object pointer database, occupy the middle; neighbor link management
and transport protocols sit below.]
Fig. 6. Tapestry component architecture. Messages pass up
from physical network layers and down from application
layers. The Router is a central conduit for communication.
To maintain availability and redundancy, nodes use
periodic beacons to detect outgoing link and node failures.
Such events trigger repair of the routing mesh and
initiate redistribution and replication of object location
references. Furthermore, the repair process is augmented
by soft-state republishing of object references. Tapestry
repair is highly effective, as shown in Section V-C. Despite
continuous node turnover, Tapestry retains nearly
a 100% success rate at routing messages to nodes and
objects.
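The beacon-based detection step can be reduced to a timeout check, sketched below; the data structure and threshold are assumptions for illustration, not the paper's parameters.

```python
# Minimal sketch of soft-state failure detection via periodic beacons.
# A neighbor whose last beacon is older than the timeout is declared
# failed, which would trigger routing-mesh repair and redistribution
# of object location references.

def detect_failures(last_heard, now, timeout):
    """last_heard: {neighbor: timestamp of most recent beacon}.
    Returns the neighbors considered failed at time `now`."""
    return [n for n, t in last_heard.items() if now - t > timeout]
```

Because object references are soft state, anything a failed neighbor loses is eventually rebuilt by periodic republishing even if this detection misfires.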
IV. TAPESTRY NODE ARCHITECTURE AND IMPLEMENTATION
In this section, we present the architecture of a
Tapestry node, an API for Tapestry extension, details
of our current implementation, and an architecture for a
higher-performance implementation suitable for use on
network processors.
A. Component Architecture
Figure 6 illustrates the functional layering for a
Tapestry node. Shown on top are applications that interface
with the rest of the system through the Tapestry
API. Below this are the router and the dynamic node
management components. The former processes routing
and location messages, while the latter handles the
arrival and departure of nodes in the network. These two
components communicate through the routing table. At
the bottom are the transport and neighbor link layers,
which together provide a cross-node messaging layer.
We now describe several of these layers.
1) Transport: The transport layer provides the abstraction
of communication channels from one overlay
node to another, and corresponds to layer 4 in the OSI
layering. Utilizing native Operating System (OS)
functionality, many channel implementations are possible.
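One way to picture this is a single channel interface with swappable implementations; the class names below are illustrative, and a real deployment would back the interface with TCP or UDP sockets rather than the in-memory stand-in shown.

```python
# Sketch of the transport layer's channel abstraction: one interface,
# many possible implementations atop native OS facilities.

class Channel:
    """Abstract one-way channel from this overlay node to another."""
    def send(self, payload: bytes) -> None:
        raise NotImplementedError

class InMemoryChannel(Channel):
    """Loopback stand-in used here for illustration; a deployed node
    would implement send() over a TCP connection or UDP socket."""
    def __init__(self):
        self.delivered = []
    def send(self, payload: bytes) -> None:
        # Record the payload as "delivered" to the remote endpoint.
        self.delivered.append(payload)
```

Because higher layers see only `Channel`, the choice of transport (reliable vs. datagram) stays hidden from the router and application code.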


References

[6] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker, "A scalable content-addressable network," in Proceedings of SIGCOMM, Aug. 2001.
[7] A. Rowstron and P. Druschel, "Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems," in Proceedings of Middleware, Nov. 2001.
[8] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan, "Chord: A scalable peer-to-peer lookup service for internet applications," in Proceedings of SIGCOMM, Aug. 2001.
[32] B. H. Bloom, "Space/time trade-offs in hash coding with allowable errors," Communications of the ACM, vol. 13, no. 7, Jul. 1970.
[34] J. R. Douceur, "The Sybil attack," in Proceedings of IPTPS, Mar. 2002.
Frequently Asked Questions (17)
Q1. What have the authors contributed in "Tapestry: a resilient global-scale overlay for service deployment" ?

The authors present Tapestry, a peer-to-peer overlay routing infrastructure offering efficient, scalable, location-independent routing of messages directly to nearby copies of an object or service using only localized resources. This paper presents the Tapestry architecture, algorithms, and implementation. 

The use of multiple Tapestry instances per machine means that tests under heavy load will produce scheduling delays between instances, resulting in an inflated RDP for short latency paths. 

Tapestry improves object availability and routing in such an environment by building redundancy into routing tables and object location references (e.g., the backup forwarding pointers for each routing table entry). 

Tapestry assumes nodeIDs and GUIDs are roughly evenly distributed in the namespace, which can be achieved by using a secure hashing algorithm like SHA-1 [29]. 

The second churn test increases the dynamic rates of insertion and failure, using 10 seconds and 2 minutes as the respective parameters. 

The routing to objects test sends messages to previously published objects, located at servers which were guaranteed to stay alive in the network. 

The authors compute the RDP for node routing by measuring all-pairs roundtrip routing latencies between the 400 Tapestry instances, and dividing each by the corresponding ping roundtrip time⁶. 
Examples include application level multicast, global-scale storage systems, and traffic redirection layers for resiliency or security. 

For messages larger than 2 KB, the cost of copying data (memory buffer to network layer) dominates, and processing time becomes linear relative to the message size. 

For instance, in response to changing link latencies, the neighbor link layer may reorder the preferences assigned to neighbors occupying the same entry in the routing table. 

When a digit cannot be matched, Tapestry looks for a “close” digit in the routing table; the authors call this surrogate routing [1], where each non-existent ID is mapped to some live node with a similar ID. 

2) Parallel Node Insertion: Next, the authors measure the effects of multiple nodes simultaneously entering the Tapestry by examining the convergence time for parallel insertions. 

This paradigm requires an asynchronous I/O layer as well as an efficient model for internal communication and control between components. 

Note that the additional number of JVMs increases scheduling delays, resulting in request timeouts as the size of the network (and virtualization) increases. 

For small networks where each node knows most of the network, nodes touched by insertion (and the corresponding bandwidth) will likely scale linearly with network size. 

A server S, storing an object O (with GUID O_G and root node O_R), periodically advertises or publishes this object by routing a publish message toward O_R (see Figure 4). 

The first time a higher layer wishes to communicate with another node, it must provide the destination’s physical address (e.g., IP address and port number).