
A Balanced Consistency Maintenance Protocol for Structured P2P Systems

14 Mar 2010, pp. 286-290
TL;DR: A framework for balanced consistency maintenance (BCoM) in structured P2P systems with a fast recovery scheme to strengthen the robustness against node and link failures; and a node migration policy to remove and prevent the bottleneck for better system performance are proposed.
Abstract: A fundamental challenge of managing mutable data replication in a Peer-to-Peer (P2P) system is how to efficiently maintain consistency under various sharing patterns with heterogeneous resource capabilities. This paper presents a framework for balanced consistency maintenance (BCoM) in structured P2P systems. Replica nodes of each object are organized into a tree for disseminating updates, and a sliding window update protocol is developed to bound the consistency. The effect of window size in response to dynamic network conditions, workload updates and resource limits is analyzed through a queueing model. This enables us to balance availability, performance and consistency strictness for various application requirements. On top of the dissemination tree, two enhancements are proposed: a fast recovery scheme to strengthen the robustness against node and link failures; and a node migration policy to remove and prevent the bottleneck for better system performance. Simulations are conducted using P2PSim to evaluate BCoM in comparison to SCOPE [24]. The experimental results demonstrate that BCoM significantly improves the availability of SCOPE by lowering the discard rate from almost 100% to 5% with slight increase in latency.

Summary (4 min read)

Introduction

  • P2P dynamic network characteristics combined with the diverse application consistency requirements and heterogeneous peer resource constraints also impose unique challenges for P2P consistency management.
  • Since each replica node takes charge of its children in update propagation and consistency maintenance, the work of consistency maintenance is evenly distributed.
  • BCoM aims to: (1) provide bounded consistency for maintaining a large number of replicas of a mutable object; (2) balance the consistency strictness, availability and performance in response to dynamic network conditions, update workload, and resource constraints; (3) make the consistency maintenance robust against frequent node churns and failures.
  • The authors first introduce the dDT structure, and then explain the three techniques in detail.

A. Dissemination Tree Structure

  • For each object, BCoM builds a tree with node degree d rooted at the node whose ID is closest to the object ID in the overlay identifier space.
  • Node 1 assigned node 3 as its child, since it had space for a new child.
  • Sec.II-C explains how to contact an ancestor for rejoining.
  • Algorithm 1 dDT Construction(p, q). Input: node p receives node q's join request. Output: parent of node q in dDT. If p does not have d children, then Sub_no(p) += Sub_no(q) and return p; else find the child f of p s.t. f has the smallest Sub_no.
  • This makes the real time maintenance of the tree depth quite difficult and unnecessary when tree nodes are frequently joining and leaving.

B. Sliding Window Update Protocol

  • 1) Basic Operation in Sliding Window Update: Sliding window regulates the consistency bound for update propagations to all replica nodes in a dDT.
  • When receiving an RACK from a child, the root sends the next update to this child if there is a buffered update that has not been sent to this child.
  • The sliding window size k_i plays a critical role in balancing the consistency strictness, the object availability and the update dissemination performance.
  • The aggregation is performed as follows: each leaf node initializes the tree height to zero (L = 0) and the bottleneck service time µ_L to its update propagation time.
  • Once an internal node receives the maintenance messages from all children, it updates L as the maximum of its children's tree heights plus 1, and µ_L as the maximum among its own and every child's service time.

C. Ancestor Cache Maintenance

  • Each replica node maintains a cache of m ancestors starting from its parent leading to the root in the dDT.
  • This can be detected by ACK and maintenance message transmissions.
  • The root is finally contacted for relocation if all the other ancestors crash.
  • The ancestor cache provides fast recovery from node and link failures with a small overhead and high success probability.
  • Each node refers to the newly received ancestor list to refresh its cache.

D. Tree Node Migration

  • An internal node, with the subtree rooted at it, will be blocked from receiving new updates if its slowest child is blocked due to the sliding window constraint.
  • When blocking occurs, node 0 can swap the bottleneck node 1 with a faster descendant with more recent updates, like node 4, to remove the blocking.
  • The performance improvement through node migration is confirmed by their queueing model of dDT in Fig.3.
  • The non-blocking migration helps promote the faster nodes to upper layers, which makes the searching in blocking-triggered migration easier.
  • Since the overlay DHT routing in structured P2P networks relies on cooperative nodes, the authors assume BCoM is run by these cooperative P2P nodes transparent to the end users.
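The blocking-triggered migration described above can be sketched as follows. This is an illustrative sketch, not code from the paper: the dictionary fields (`service_time`, `version`) are hypothetical names, and only the selection of a swap candidate is modeled.

```python
def pick_swap_candidate(bottleneck: dict, descendants: list):
    """Blocking-triggered migration (sketch): among the bottleneck's
    descendants, pick the fastest one whose copy is at least as recent,
    so that swapping it upward removes the blocking."""
    candidates = [d for d in descendants
                  if d["service_time"] < bottleneck["service_time"]
                  and d["version"] >= bottleneck["version"]]
    if not candidates:
        return None                      # no eligible descendant: stay blocked
    return min(candidates, key=lambda d: d["service_time"])
```

Promoting faster nodes this way also serves the non-blocking migration mentioned above: once fast nodes sit in upper layers, a blocking-triggered search rarely needs to look deep into the subtree.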

E. Basic Operations in BCoM

  • BCoM provides three basic operations. Subscribe: when a node p wants to read the object i and keep it updated, p sends the subscription request to the root of dDT_i by overlay routing.
  • The message overhead for a node leaving is O(1), since the number of affected nodes is no more than d, and each has constant overhead to update the related maintenance information.
  • After subscribing, if a node p wants to update the object, it sends the update request directly to the root using IP routing (the Update operation).
  • Updates are serialized at the root by their arrival time.
  • The specific policy for resolving conflicts is application dependent.

III. ANALYTICAL MODEL FOR SLIDING WINDOW SETTING

  • The instability of P2P systems forbids us from using any complicated optimization techniques that require several hours of computation at workstations (e.g. [28]) or information on every node in the entire system (e.g. [31]).
  • BCoM adjusts the sliding window size in a timely manner in dynamic P2P systems, relying on limited information.
  • This section presents the analytical model of the sliding window size k_i of object i, where the update propagation to all replica nodes is modeled by a queueing system.
  • The authors first analyze the queueing behavior when an update is discarded, then calculate the update discard probability and the expected latency for a replica node to receive an update, and finally set k_i to balance the availability and latency constrained by consistency bounds.

B. Availability and Latency Computation

  • Define the update request intensity as ρ: ρ = λ/µ_L (Eq.1). Define the probability of n updates in the queue as π_n.
  • Based on the queueing theory for an M/M/1 finite queue [6], π_n is given as π_n = ρ^n · π_0 (Eq.2). The discard probability is π_{L·k_i}, which indicates buffer overflow.
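Reading Eq.1 as ρ = λ/µ_L (treating µ_L as the bottleneck service rate in the model) and the total buffer capacity as K = L·k_i, the discard probability π_K of the M/M/1 finite queue can be sketched in Python. This is an illustrative reconstruction under those assumptions, not code from the paper.

```python
def discard_probability(lam: float, mu_l: float, L: int, k: int) -> float:
    """Discard (buffer-overflow) probability of an M/M/1 finite queue
    with capacity K = L * k: rho = lam / mu_l, pi_n = rho**n * pi_0."""
    K = L * k
    rho = lam / mu_l
    if abs(rho - 1.0) < 1e-12:
        pi_0 = 1.0 / (K + 1)            # degenerate case rho == 1
    else:
        pi_0 = (1.0 - rho) / (1.0 - rho ** (K + 1))
    return pi_0 * rho ** K              # pi_K: probability the queue is full

# A larger window k lowers the discard probability under the same load.
p_small = discard_probability(lam=0.9, mu_l=1.0, L=4, k=1)
p_large = discard_probability(lam=0.9, mu_l=1.0, L=4, k=5)
```

Enlarging k_i grows the capacity K and pushes π_K down, which is the availability side of the trade-off; the latency side grows with the queue length, which is why k_i must be bounded rather than maximized.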

C. Window Size Setting

  • The effectiveness of a consistency protocol is measured by three attributes: consistency strictness, object availability and latency for receiving an update; the three are in subtle tension with one another.
  • It is hard to accurately model the delay for an update to be received by each replica node, since besides the queueing delay at each node, the dynamic node joining and leaving cause disturbance on the update propagation process.
  • In their simulation, empirically setting Ts to 1.3 achieves good results, shown in Fig.6 and Fig.7: the discard probability is improved from almost 100% to 5%, at the cost of a latency increase of less than one third most of the time.
  • The authors extend the P2PSim tool [1] to simulate the heterogeneous node capacities and transmission latency.

A. Simulation Setting

  • The authors simulate a network of 1000 nodes because anything larger cannot be executed stably in P2PSim.
  • Given that transmitting one update uses only 10 to 100 slots, the number of time slots covered in a simulation cycle (i.e. 7.2 × 10^6) is large enough to generate sustainable results.
  • Network topology is simulated by two transit-stub topologies generated by GT-ITM [9] to model dense and sparse networks: (1) ts1k-small: 2 transit domains, each with 4 transit nodes, 4 stub domains attached to each transit node, and 31 nodes in each stub domain.
  • The node degree is set to 5, since the average Gnutella node degree is 3 to 5.

B. Efficiency of the Window Size

  • This simulation explores the efficiency of applying sliding window protocol.
  • The curves in Fig.4 and Fig.5 show that by increasing the window size from 1 to 20, the discard rate drops from 80% to around 5% while the latency increases by only 20%, which confirms that BCoM significantly improves availability with a slight sacrifice in latency compared to sequential consistency.

C. Scalability of BCoM

  • This simulation verifies the scalability of BCoM with comparison to SCOPE by varying the number of replica nodes and the update rate of each object.
  • The sliding window protocol and the adaptive window size setting contribute to good availability maintenance under dynamic system conditions.
  • But the increase is controlled within one third of the latency of SCOPE, which matches the latency increase bound in the window size setting for an improved discard rate.
  • The results of Fig.9 show that the latency of BCoM is similar to that of SCOPE when update rates are low, and longer than SCOPE when update rates are high.
  • Such good balance confirms the objectives in the analytical model of the window size setting.

E. Fault Tolerance of BCoM

  • This simulation examines BCoM’s robustness against node failures by varying the node mean life time.
  • The smaller the life time is, the more frequently the nodes join and leave.
  • The results of SCOPE are not presented because its discard rate is nearly 100% when nodes are joining or leaving.
  • The results of Fig.11, Fig.12, and Fig.13 show that BCoM keeps the tree depth, the discard rate and the latency in good status for different frequencies of node joining and leaving.
  • The adaptive window size setting keeps the availability and latency performance stable.

A. Consistency Maintenance in P2P systems

  • In structured P2P systems, strong consistency is provided by organizing replica nodes into an auxiliary structure on top of the overlay for update propagation, like the tree structure in SCOPE [24], the two-tiered structure in OceanStore [16], and a hybrid of tree and two-tiered structure in [29].
  • The tree constructions in [24] [29] follow the node ID partitioning; instead, dDT inserts the new node into the smallest subtree to keep it balanced under dynamic node joining and leaving.
  • Weaker consistency approaches require a replica to check validity with the source to serve the following read requests.

B. Overlay Content Distribution

  • Update delivery in P2P overlay has four requirements: (1) a bounded delay for update delivery, (2) robustness to frequent node churns and update workload changes, (3) awareness of heterogeneous peer capacities, and (4) scalability with a large number of peers.
  • The major difference is that LagOver improves the performance to meet the individual replica node’s requirement, while node migration improves performance system-wide.
  • The “side link” is used in the content dissemination tree of [11] to address (2), where each node keeps multiple side links from other subtrees to minimize the impact of loss multiplication in a tree structure.
  • The authors' ancestor cache achieves the same goal by only caching ancestors and contacting the ancestor one layer above the failed nodes.
  • Besides, in BCoM a node sequentially contacts the cached ancestors to avoid conflict relocation decisions while in [11] a node uses multiple side links in parallel to retrieve the lost packets, serving different aims.

C. Tunable Consistency Models

  • Previous works have explored continuous models for consistency maintenance [17] [27], which have been extended by a composable consistency model [23] for P2P applications.
  • Hybrid push and pull methods are also used to provide application tailored cache consistency [32] [25].
  • In BCoM, by contrast, updates are serialized to eliminate update conflicts and potential cascading effects.


A Balanced Consistency Maintenance Protocol for
Structured P2P Systems
Yi Hu, Min Feng, Laxmi N. Bhuyan
Department of Computer Science and Engineering
University of California at Riverside, CA, U.S.A
yihu, mfeng, bhuyan@cs.ucr.edu
Abstract—A fundamental challenge of managing mutable data
replication in a Peer-to-Peer (P2P) system is how to efficiently
maintain consistency under various sharing patterns with hetero-
geneous resource capabilities. This paper presents a framework
for balanced consistency maintenance (BCoM) in structured P2P
systems. Replica nodes of each object are organized into a tree
for disseminating updates, and a sliding window update protocol
is developed to bound the consistency. The effect of window size
in response to dynamic network conditions, workload updates
and resource limits is analyzed through a queueing model. This
enables us to balance availability, performance and consistency
strictness for various application requirements. On top of the
dissemination tree, two enhancements are proposed: a fast
recovery scheme to strengthen the robustness against node and
link failures; and a node migration policy to remove and prevent
the bottleneck for better system performance. Simulations are
conducted using P2PSim to evaluate BCoM in comparison to
SCOPE [24]. The experimental results demonstrate that BCoM
significantly improves the availability of SCOPE by lowering the
discard rate from almost 100% to 5% with slight increase in
latency.
I. INTRODUCTION
Structured P2P systems have been effectively designed for
wide area data applications [21] [10] [16] [18] [22] [20]. While
most of them are designed for read-only or low-write sharing content, many promising P2P applications demand support for mutable content, such as modifiable storage systems
(e.g. OceanStore [16], Publius [19]), mutable content sharing
(e.g. P2P WiKi [13]), even interactive ones (e.g. P2P online
games [2] [5] and P2P collaborative workspace [12]). P2P
organization improves availability, fault tolerance, and scala-
bility for static content sharing. But mutable content sharing
raises issues of replication and consistency management. P2P
dynamic network characteristics combined with the diverse
application consistency requirements and heterogeneous peer
resource constraints also impose unique challenges for P2P
consistency management. This requires a consistency solution
to work efficiently in such dynamic conditions.
P2P systems are typically large scale, where peers with var-
ious resource capabilities experience diverse network latency.
Also, their dynamic joining and leaving make the P2P overlay
failure prone. Neither sequential consistency [15] nor eventual
consistency [8] individually works well in a P2P environment. It
has been proved [14] that among the three properties, atomic
consistency, availability and partition-tolerance, only two can
be satisfied at a time. Applying sequential consistency leads
to prohibitively long synchronization delay due to the large
number of peers and unreliable overlay. Even “deadlock”
may occur when a crashed replica node makes other replica
nodes wait forever. Hence, the system scalability is restricted
due to the lowered availability from long synchronization
delay for a large number of nodes. At the other extreme,
eventual consistency allows replica nodes to concurrently update
their local copies and only requires that all replica copies
become identical after a long enough failure-free and update-
free interval. Since in P2P systems replica nodes are highly
unreliable, the update-issuing node may have gone offline by
the time update conflicts are detected, leading to unresolvable
conflicts. It is infeasible to rely on a long duration without any
failure or further updates, due to which eventual consistency
fails to provide any end-to-end performance guarantee to P2P
users. As surveyed in [23], wide area data sharing applications
vary widely in their frequency of reads and updates among
replicas, in their tolerance of stale data and handling of update
conflicts.
This paper presents a Balanced Consistency Maintenance
(BCoM) protocol for structured P2P systems for balancing
the consistency strictness, availability and performance. Due
consideration is given to dynamic workload, frequent replica
node churns, heterogeneous resource capabilities, and different
application consistency requirements. BCoM protocol serial-
izes all updates to eliminate the complicated conflict handling
in P2P systems, while allowing certain obsoleteness in each
replica node to improve the availability and performance. A
sliding window update protocol is used to specify the number
of allowable updates buffered by each replica node. This
provides bounded consistency, the performance of which falls
between the sequential and the eventual consistency.
Two main categories of bounded consistency are proposed
for P2P systems: probabilistic consistency [4] [30] and time-
bounded consistency [25] [26], both of which have limitations
that BCoM relaxes. (1) In the probabilistic
consistency the probability is guaranteed with regard to all
replica nodes but not for an individual node. BCoM ensures
node level as well as system-wide consistency bound. (2)
Time-bounded consistency sets the validation timer so that the
estimated number of updates within the timer valid duration is
small. To avoid the inaccuracy in this translation, BCoM uses
the sliding window to directly bound the number of updates
allowed to be buffered at each node. (3) BCoM eliminates both

redundant propagations in probabilistic bounded consistency
and the individual computations of the timer in time-bounded
consistency. Since redundancy is not needed for consistency
probability and the window size does not depend on the
latency at individual nodes, it is convenient to assign one node
to set and adjust the window size.
An update window protocol has been designed for web-
server systems [31] to bound the uncommitted updates in
each replica node. But update conflicts and potential cascading
impacts can hardly be addressed when optimizing the
window size. Moreover, there are two challenges for applying
this technique to P2P systems: (1) unlike the web-servers,
P2P replica nodes are highly dynamic and unreliable; (2) the
number of replicas in P2P systems is orders of magnitude
larger than that in web-server systems. (1) and (2) together
make any optimization model impractical for P2P systems
because it requires information on each node’s update rate,
propagating latency, etc. BCoM analyzes the window size
through a queueing model based on dynamic network condi-
tion, update workload and available resources. It periodically
collects general system information, such as the total number of
replica layers and the bottleneck latency, and guides the
window size setting with extremely low overhead. In this way,
the consistency maintenance and performance optimization in
BCoM scale well with the P2P systems and adapt promptly
to the dynamic conditions.
In BCoM, replica nodes of each object are organized into a
d-ary dissemination tree (dDT ) on top of the overlay structure.
The system-wide consistency bound is incrementally achieved
by each internal tree node through applying the sliding window
update protocol to its children. This makes the consistency
scalable with the total number of replica nodes. Since each
replica node takes charge of its children in update propagation
and consistency maintenance, the work of consistency mainte-
nance is evenly distributed. Even though the root is responsible
for serializing updates and accepting newly joining nodes, we
show that it will not become a bottleneck. The overhead of
dDT is lightweight and evenly distributed to prevent “hot
spot” and “single node failure” problems as efficiently as the
previous identifier space partitioning methods in [24] [29].
Another primary goal of constructing a dDT is to reduce the
latency experienced by each replica node to receive an update
from the root. Thus dDT inserts newly joining or rejoining nodes
to the smallest subtree and tries to balance the tree to shorten
the overlay distance.
BCoM presents two enhancements to further improve the
performance of a dDT . One is the ancestor cache scheme,
where each node maintains a cache of ancestors for fast
recovery from parent node failures. This also relieves tree-
structure’s “multiplication of loss” problem [11] (i.e. all the
subtree nodes rooted at the crashed node will lose the updates),
which is especially critical in P2P systems. Maintaining the
ancestor cache does not introduce extra overhead since the
needed information conveniently piggybacks on update prop-
agation. A small size of cache can also significantly improve
the robustness against node failures. The other is the node
migration scheme, which migrates more capable nodes
to upper layers and less capable nodes to lower layers to
minimize the side effect of the bottleneck node and maximize
the overall performance. If an upper layer node is slow in
propagating updates, the consistency constraint blocks ances-
tors from receiving new updates, and all its subtree nodes
do not receive updates in a timely manner. Two forms of
node migration are presented, one is to remove the blocking
and the other is to prevent the blocking so that unnecessary
performance and availability degradations are removed.
The contributions of our paper are the following:
Propose a consistency maintenance framework in struc-
tured P2P systems for balancing the consistency strict-
ness, availability and performance through a sliding win-
dow update protocol with two enhancement schemes.
Analyze the problem of optimizing the window size in
response to dynamic network conditions, update work-
load, and resource constraints through a queueing model
to serve diverse consistency requirements from various
mutable data sharing applications.
Evaluate the performance of BCoM with comparison to
SCOPE using the P2PSim simulation tool.
The rest of the paper is organized as follows: Sec.II intro-
duces the three core techniques in BCoM and the protocol
deployment. Sec.III presents the analytical model for window
size setting. The performance evaluation is given in Sec.IV
and the existing literature is reviewed in Sec.V. The paper is
concluded in Sec.VI.
II. DESCRIPTION OF BCOM
BCoM aims to: (1) provide bounded consistency for main-
taining a large number of replicas of a mutable object; (2) bal-
ance the consistency strictness, availability and performance in
response to dynamic network conditions, update workload, and
resource constraints; (3) make the consistency maintenance
robust against frequent node churns and failures. To fulfill
these objectives, BCoM organizes all replica nodes of an
object into a d-ary dissemination tree (dDT ) on top of the
P2P overlay for disseminating updates. It applies three core
techniques: sliding window update, ancestor cache, and tree
node migration on the dDT for consistency maintenance. In
this section, we first introduce the dDT structure, and then
explain the three techniques in detail.
A. Dissemination Tree Structure
For each object BCoM builds a tree with node degree d
rooted at the node whose ID is closest to the object ID in the
overlay identifier space. We denote this d-ary dissemination
tree of object i as dDT_i, which consists of only the peers
holding copies of object i. We name such a peer a “replica
node” of i, or simply a replica node. An update can
be issued by any replica node, but it should be submitted
to the root. The root serializes the updates to eliminate the
complicated handling of update conflicts because the update-
issuing nodes may have gone offline.

The dynamic node behavior requires the construction of
dDT to serve two cases (1) single node joining and (2)
node with subtree rejoining. The goal of tree construction is
to minimize the tree height under both cases, which lowers
the update propagation latency and object discard rate for
consistency maintenance.
We show an example of dDT_i construction for case (1)
with node degree d set to 2 in Fig.1. The replica nodes are
ordered by their joining time as node 0, node 1 and so on.
At the beginning, when node 1 and node 2 joined, both were
assigned by node 0 (i.e. the root) as children. Then node 3
joined when node 0's degree was full, so node 0 passed node 3
to the child with the smallest number of subtree nodes, denoted
as Sub_no. Since both children (i.e. node 1 and node 2) had
the same Sub_no, it randomly selected one to break the tie,
say node 1, and updated Sub_no(1) accordingly. The Sub_no of
a joining node is one, standing for itself. Node 1 assigned node 3
as its child, since it had space for a new child. When node 4
joined, node 0 did not have space for a new child and passed
node 4 to the child with the smallest Sub_no, node 2. Similarly,
node 5 and node 6 joined. The tree construction algorithm is
given in Alg.1. For case (2), when node 6 crashed, all of its
children detected the crash independently and contacted other
ancestors to rejoin the tree, each acting as a delegate of its
subtree to save individual rejoining of subtree nodes. Sub_no
counts all the node's subtree nodes and itself. Sec.II-C explains
how to contact an ancestor for rejoining.
Fig. 1. Dissemination Tree Example
Algorithm 1 dDT Construction (p, q)
Input: node p receives node q's join request
Output: parent of node q in dDT
if p does not have d children then
    Sub_no(p) += Sub_no(q)
    return p
else
    find a child f of p s.t. f has the smallest Sub_no
    Sub_no(f) += Sub_no(q)
    return dDT Construction (f, q)
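Alg.1 can be sketched in Python as follows. This is a minimal sketch, not the paper's code: ties in Sub_no are broken deterministically rather than randomly, and the bookkeeping is arranged so that every node on the routing path counts the joining subtree exactly once.

```python
class Replica:
    """Tree node of a d-ary dissemination tree (dDT) sketch."""
    def __init__(self, d: int = 2):
        self.d = d
        self.children = []
        self.sub_no = 1                # subtree size, counting itself

def ddt_insert(p: "Replica", q: "Replica") -> "Replica":
    """Alg.1 sketch: route q down to the child with the smallest
    subtree until a node with spare degree is found."""
    p.sub_no += q.sub_no               # every node on the path counts q's subtree
    if len(p.children) < p.d:
        p.children.append(q)           # p has spare degree: adopt q
        return p
    f = min(p.children, key=lambda c: c.sub_no)
    return ddt_insert(f, q)            # recurse into the smallest subtree

# Rebuild the Fig.1 example: nodes 1..6 join one by one under node 0.
root = Replica(d=2)
for _ in range(6):
    ddt_insert(root, Replica(d=2))
```

Because q may itself carry a subtree (case (2), rejoining after a crash), the insertion adds q's whole Sub_no rather than 1, which is exactly why subtree size is cheaper to maintain than tree depth.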
dDT directs a joining node, or a rejoining node with its subtree,
to the child with the smallest subtree when the parent's node
degree is full. The reason for not using tree depth, as
traditional tree-balancing algorithms do, is that rejoining with
a subtree may increase the tree depth by more than 1, which
is beyond the one-by-one height increase those algorithms
handle. Another important reason is that maintaining the total
number of nodes in each subtree is simpler and more time
efficient than the depth of each subtree. Since the internal
nodes need to wait until the insertion completes, the updated
tree depth can be collected layer by layer from the leaves
back to the root. This makes the real time maintenance of the
tree depth quite difficult and unnecessary when tree nodes are
frequently joining and leaving. However, the internal nodes
can immediately update the total number of nodes in the
subtree after forwarding the joining node to a child. The tree
depth is periodically collected to help set the sliding window
size as discussed in Sec.II-B2, where its result does not need
to be updated in real time. But using an outdated tree depth
for dDT construction will lead to an unbalanced tree and degrade
the performance.
B. Sliding Window Update Protocol
1) Basic Operation in Sliding Window Update:
Sliding window regulates the consistency bound for update
propagations to all replica nodes in a dDT . “Sliding” refers
to the incremental adjustment of window size in response to
dynamic system conditions. If dDT_i of object i is assigned a
sliding window size k_i, any replica node in dDT_i can buffer
up to k_i unacknowledged updates before being blocked from
receiving new updates. At the beginning, the root receives the first
update, sends it to all children and waits for their ACKs. There
are two types of ACKs, R_ACK and NR_ACK, both indicating
successful receipt of the update. R_ACK indicates that
the sender is ready to receive the next update; NR_ACK means
the sender is not ready. While waiting, the root accepts and
buffers the incoming updates as long as its k_i-size buffer does
not overflow. When receiving an R_ACK from a child, the
root sends the next update to this child if there is a buffered
update that has not been sent to this child. When receiving an
NR_ACK from a child, it will not send the next update, but
the update is marked as received by this child.
After receiving ACKs from all children, the update is removed
from the buffer. There are two cases of buffer overflow:
1) when the root's buffer is full, the new updates are discarded
until there is space; 2) when an internal node's buffer is full,
the node sends NR_ACK to its parent for the last received
update. An R_ACK is sent to its parent when there is space in
the buffer. A leaf node does not maintain such an update buffer.
After receiving an update, it immediately sends R_ACK to its
parent. Fig.2 shows an example with the window size set to 8; V
stands for the version number of an update, e.g. V10 to V13
means the node keeps the updates from the 10th version to the 13th
version. Each internal node keeps the versions from the one
expected by its slowest child up to the latest version it received,
and each leaf node only keeps the latest version it received.
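The per-node buffer discipline just described can be sketched as follows. Class and method names are illustrative, not from the paper; only the root's discard path and an internal node's R_ACK/NR_ACK choice are modeled, not the actual message exchange.

```python
from collections import deque

R_ACK, NR_ACK = "R_ACK", "NR_ACK"

class WindowNode:
    """Internal node: holds at most k unacknowledged updates; the ACK
    it returns tells its parent whether it is ready for the next one."""
    def __init__(self, k: int):
        self.k = k
        self.buffer = deque()          # unacknowledged update versions

    def receive(self, version: int) -> str:
        """Called by the parent when delivering an update."""
        self.buffer.append(version)
        # NR_ACK: received, but the buffer is now full
        return R_ACK if len(self.buffer) < self.k else NR_ACK

    def children_acked(self, version: int) -> None:
        """All children have ACKed: the update leaves the buffer."""
        self.buffer.remove(version)

class RootNode(WindowNode):
    def submit(self, version: int) -> bool:
        """Root behaviour: an incoming update is discarded when the
        k-sized buffer is full (overflow case 1)."""
        if len(self.buffer) >= self.k:
            return False               # update discarded
        self.buffer.append(version)
        return True
```

A leaf node has no buffer in the protocol; in this sketch it would simply answer every delivery with R_ACK.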
2) Setting of Sliding Window Size:
The sliding window size k_i plays a critical role in balancing
the consistency strictness, the object availability and the update

Fig. 2. An example of sliding window update protocol
dissemination performance. The value of k_i is an indicator of
consistency strictness. A larger k_i helps mask long network
latency and temporary unavailability of replica nodes,
lowers update discards and improves availability. The
disadvantages of a larger k_i are (1) discrepancy between the
replica's local view and the most updated view at the root, giving
rise to weaker consistency; and (2) longer queueing delay in
update propagation, thus lowering the update dissemination
performance. At the extremes, an infinite buffer size provides
eventual consistency without discarding updates, and a buffer
size of zero provides sequential consistency with the worst update
discards.
We explain here how the root updates the window size; the
analytical model in Sec.III gives the specific formula to
guide the update. The root measures the input metrics every T
seconds and adjusts the k_i value only when the metrics are stable
and the old k_i violates the constraint in Eq.7. In this way,
unnecessary changes due to temporary disturbances
are eliminated to keep dDT_i stable. In case k_i needs to
be adjusted, it is incrementally increased or decreased one by
one until the constraints are satisfied.
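The one-step-at-a-time adjustment can be sketched as below. Since Eq.7 is not reproduced in this text, `violates` is a hypothetical predicate standing in for the Eq.7 check: it returns +1 when k is too small (discards violate the availability bound), -1 when k is too large (latency violates its bound), and 0 when the constraints hold.

```python
def adjust_window(k: int, violates, k_min: int = 1, k_max: int = 64) -> int:
    """One adjustment round (sketch): move k one step at a time until
    `violates(k)` reports 0, staying within [k_min, k_max]."""
    d = violates(k)
    while d > 0 and k < k_max:         # discards too high: grow the window
        k += 1
        d = violates(k)
    while d < 0 and k > k_min:         # latency too high: shrink the window
        k -= 1
        d = violates(k)
    return k
```

The incremental movement mirrors the paper's intent of avoiding abrupt reconfiguration of dDT_i under temporary disturbances.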
The computation of k_i requires information on the update arrival rate λ, the tree height L, and the bottleneck service time µ_L. The arrival rate is directly measured by the root. The tree height and bottleneck service time are collected periodically from the leaf nodes to the root in a bottom-up fashion. The two metrics are aggregated at every internal node, so that the maintenance message always keeps the same size. The aggregation is performed as follows: each leaf node initializes the tree height to zero (L = 0) and the bottleneck service time µ_L to its update propagation time. Each node sends the maintenance message to its parent. Once an internal node has received the maintenance messages from all its children, it updates L as the maximum of its children's tree heights plus 1, and µ_L as the maximum of its own and every child's service time. If its service time is longer than a child's, a non-blocking migration is executed to swap the parent with the child. This aggregation continues until the root is reached.
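The bottom-up aggregation can be sketched as below (Python; the dict-based node representation with `service_time` and `children` fields is a hypothetical stand-in for the maintenance-message state):

```python
def aggregate(node):
    """Return (L, mu_L) for the subtree rooted at `node`.

    Leaves report height 0 and their own update propagation time; each
    internal node takes the max child height plus 1, and the max of its
    own and all children's service times, so the message size stays
    constant regardless of subtree size.
    """
    if not node["children"]:
        return 0, node["service_time"]
    results = [aggregate(child) for child in node["children"]]
    height = 1 + max(h for h, _ in results)
    mu = max(node["service_time"], max(m for _, m in results))
    return height, mu
```

Each node forwards only the pair (L, µ_L) to its parent, which is what keeps the maintenance message constant-size.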
C. Ancestor Cache Maintenance
Each replica node maintains a cache of m ancestors, starting from its parent and leading toward the root in the dDT. The value of m is set based on the node churn rate (i.e., the number of nodes leaving the system during a given period) so that all m cached nodes failing simultaneously is unlikely. When the node does not have m ancestors, it caches information for all the nodes beginning from the root.
A node contacts its cached ancestors sequentially, layer by layer upwards, when its parent becomes unreachable; unreachability is detected via ACK and maintenance message transmissions. The sequential contact operation will find the closest live ancestor, no matter how many layers of crashed nodes exist. The root is contacted for relocation only if all the other cached ancestors have crashed. We assume the root is reliable, since overlay routing automatically handles root failure by letting the node with the nearest ID replace the crashed root of the dDT.
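The layer-by-layer recovery lookup amounts to a simple sequential probe (a sketch; `is_alive` is a hypothetical reachability check, and the cache is ordered from the parent up toward the root):

```python
def find_live_ancestor(cache, is_alive):
    """Probe cached ancestors from the parent upward; the last entry is
    the root (or the node nearest to it), which BCoM assumes reliable."""
    for ancestor in cache:
        if is_alive(ancestor):
            return ancestor
    return cache[-1]  # all others failed: fall back to the root
```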
The contacted ancestor runs the tree construction Alg. 1 to find a new position for the rejoining node together with its subtree. BCoM does not replace the crashed node with a leaf node to maintain the original tree structure, since migration brings the bottleneck node down to the leaf layer for performance improvement. The new parent transfers the latest version of the object to this new child if necessary. Since each node only keeps the k_i most recent updates, content transmission is used to avoid the communication overhead of fetching the missing updates from other nodes. The sliding window update propagation then resumes for incoming updates.
The ancestor cache provides fast recovery from node and link failures with a small overhead and a high success probability. Assuming the probability of a replica node failure is p, an ancestor cache of size m yields a successful recovery probability of 1 − p^m. It is very unlikely that all of the m cached ancestors fail simultaneously; even if this occurs, the root can be contacted for relocation. An ancestor cache is easily maintained by piggybacking an ancestor list onto each update. Whenever a node receives an update, it adds itself to the ancestor list before propagating the update to its children. Each node refers to the newly received ancestor list to refresh its cache. There is no extra communication for the piggyback, and the storage overhead of keeping information on m ancestors is also negligible.
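Both the recovery probability and the piggyback refresh are easy to make concrete (a sketch; the function names are ours, not BCoM's):

```python
def recovery_success_prob(p, m):
    """P(at least one of m cached ancestors survives) = 1 - p^m,
    assuming independent failures with probability p each."""
    return 1 - p ** m

def piggyback(ancestor_list, node_id, m):
    """A forwarding node appends itself to the update's ancestor list;
    the receiver keeps the nearest m entries as its refreshed cache."""
    return (ancestor_list + [node_id])[-m:]
```

For example, with a per-node failure probability of 0.1 and m = 3 cached ancestors, recovery succeeds with probability 0.999.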
D. Tree Node Migration
Any internal node, together with the subtree rooted at it, will be blocked from receiving new updates if its slowest child is blocked due to the sliding window constraint. It is quite possible that a lower-layer node performs faster than the bottleneck node, so we should promote the faster node to a higher level and demote the bottleneck node to a lower level. For example, in Fig. 1, assume node 1 is the bottleneck that gets the root 0 blocked. The faster node may be a descendant of the bottleneck node (A) or a descendant of a sibling of the bottleneck node (B). When blocking occurs, node 0 can swap the bottleneck node 1 with a faster descendant holding more recent updates, such as node 4, to remove the blocking. Before blocking occurs, node 1 can be swapped with its fastest child holding the same update version to prevent the blocking.
The performance improvement through node migration is confirmed by our queueing model of the dDT in Fig. 3. There are two forms of node migration, as described below.
  • Blocking-triggered migration: the blocked node searches for a faster descendant that has a more recent update than the bottleneck node, and swaps them to remove the blocking.
  • Non-blocking migration: when a node observes a child performing faster than itself, it swaps positions with this child. This migration prevents potential blocking and speeds up the update propagation for the subtree rooted at the parent.
The swapping of (A) in Fig. 1 is an example of blocking-triggered migration, and (B) is an example of non-blocking migration. Both forms of migration swap one layer at a time; hence, multiple migrations are needed for multi-layer swapping. Non-blocking migration helps promote the faster nodes to upper layers, which makes the search in blocking-triggered migration easier. Since the overlay DHT routing in structured P2P networks relies on cooperative nodes, we assume BCoM is run by these cooperative P2P nodes, transparently to the end users. Tree node migration uses only local information yet improves the overall system performance.
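A non-blocking migration decision, for instance, reduces to picking the fastest eligible child (a sketch with hypothetical dict-based nodes carrying `service_time` and `version` fields):

```python
def pick_new_parent(parent, children):
    """Return the node that should occupy the parent's position: the
    fastest child that already holds the parent's update version and is
    strictly faster, else the parent itself (no migration)."""
    eligible = [c for c in children
                if c["version"] == parent["version"]
                and c["service_time"] < parent["service_time"]]
    if not eligible:
        return parent
    return min(eligible, key=lambda c: c["service_time"])
```

Requiring the same update version ensures the swap never moves a stale node upward, matching the constraint stated for non-blocking migration above.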
E. Basic Operations in BCoM
BCoM provides three basic operations:
  • Subscribe: when a node p wants to read object i and keep it updated, p sends a subscription request to the root of dDT_i via overlay routing. After receiving the request, the root runs Alg. 1 to locate a parent for p in dDT_i, which transfers its most updated version to p. Subsequent updates are received under the sliding window protocol. The message overhead of a subscription is O(log_d N), since locating a new node searches at most along one path from the root to a leaf in dDT_i.
  • Unsubscribe: when a node p no longer wants object i, it promotes its fastest child as the new parent and transfers its parent's and other children's information to the newly promoted node. p also notifies them of the newly promoted node so they can update their related maintenance information. The message overhead of a node leaving is O(1), since the number of affected nodes is no more than d, and each has constant overhead to update the related maintenance information.
  • Update: after subscribing, if a node p wants to update the object, it sends the update request directly to the root using IP routing. The root's IP address is obtained through the subscription or the ancestor cache. If the root crashes, p submits the update to the new root through overlay routing. Updates are serialized at the root by their arrival time; the specific policy for resolving conflicts is application dependent. The message overhead of an update is O(1) for the direct submission to the root.
III. ANALYTICAL MODEL FOR SLIDING WINDOW SETTING
The instability of P2P systems forbids the use of complicated optimization techniques that require hours of computation at workstations (e.g., [28]) or information about every node in the entire system (e.g., [31]). BCoM adjusts the sliding window size in a timely manner for dynamic P2P systems, relying only on limited information.
This section presents the analytical model of the sliding window size k_i of object i, where the update propagation to all replica nodes is modeled by a queueing system. We first analyze the queueing behavior when an update is discarded, then calculate the update discard probability and the expected latency for a replica node to receive an update, and finally set k_i to balance availability and latency under the consistency bounds.
A. Queueing Model
Assume the total number of replica nodes is N, the node degree is d, and there are L (L = O(log_d N)) layers of internal nodes with update buffer size k_i (i.e., nodes in layers 0 … L−1 hold a sliding window of size k_i). The leaf nodes are in layer L and do not need a buffer. The update arrivals are modeled by a Poisson process with average arrival rate λ_i (written simply as λ), as each update is issued by a replica node independently and identically at random. The latency of receiving an update from the parent and having it acknowledged by the child is denoted as the service time for update propagation. The service time from one layer to the adjacent layer below is the longest parent-child service time between these two layers. µ_l denotes the service time for update propagation from layer l to layer l+1. For example, µ_0 is the service time from the root to its slowest child, and µ_{L−1} is the longest service time from a layer-(L−1) node to its child (i.e., a leaf node). The update propagation delay is assumed to be exponentially distributed. The update propagations in dDT_i are modeled as the queueing process shown in Fig. 3(a): updates arrive with average rate λ at the root, then go to the layer-0 buffer of size k_i. The service time for propagating from layer 0 to layer 1 is µ_0. After that, the updates go to the layer-1 nodes' buffers of size k_i, with service time µ_1 for propagating to the layer-2 nodes. The propagation ends when the updates are received by the leaves in layer L.
Fig. 3. Queuing Model of Update Propagation
An update may only be discarded by the root when its buffer overflows. This happens when the root is waiting for an ACK from its slowest child in layer 1, which in turn is waiting for an ACK from its slowest child in layer 2. The waiting cascades until the bottleneck node of the dDT_i is reached, say in layer l, 0 ≤ l ≤ L. The nodes in layers l+1 … L (if l < L) do not receive any update even when their buffers are not full.
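To build intuition for how the window size trades discards against delay, one can approximate the cascade with a single M/M/1/K loss queue served at the bottleneck rate. This closed form is a standard queueing-theory approximation we add purely for illustration; it is not the paper's Eq. 7:

```python
def mm1k_loss_prob(lam, mu_rate, k):
    """Blocking probability of an M/M/1/K queue with arrival rate lam,
    service rate mu_rate, and K = k buffer slots. Larger windows k
    discard fewer updates, at the price of longer queueing delay."""
    rho = lam / mu_rate
    if abs(rho - 1.0) < 1e-12:
        return 1.0 / (k + 1)  # limit of the formula as rho -> 1
    return (1 - rho) * rho ** k / (1 - rho ** (k + 1))
```

Note that the paper's µ_l denotes a service time; `mu_rate` here is its reciprocal, a rate, as is conventional in Kendall notation.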

Citations
Proceedings ArticleDOI
01 Oct 2013
TL;DR: This paper aims to reduce inter-datacenter communications while still achieving low service latency, and proposes Selective Data replication mechanism in Distributed Datacenters that incorporates three strategies to further enhance its performance: locality-aware multicast update tree, replica deactivation, and datacenter congestion control.
Abstract: Though the new OSN model with many worldwide distributed small datacenters helps reduce service latency, it brings a problem of higher inter-datacenter communication load. In Facebook, each datacenter has a full copy of all data and the master datacenter updates all other datacenters, which obviously generates tremendous load in this new model. Distributed data storage that only stores a user's data to his/her geographically-closest datacenters mitigates the problem. However, frequent interactions between far-away users lead to frequent inter-datacenter communication and hence long service latency. In this paper, we aim to reduce inter-datacenter communications while still achieve low service latency. We first verify the benefits of the new model and present OSN typical properties that lay the basis of our design. We then propose Selective Data replication mechanism in Distributed Datacenters (SD3). In SD3, a datacenter jointly considers update rate and visit rate to select user data for replication, and further atomizes a user's different types of data (e.g., status update, friend post) for replication, making sure that a replica always reduces inter-datacenter communication. The results of trace-driven experiments on the real-world PlanetLab testbed demonstrate the higher efficiency and effectiveness of SD3 in comparison to other replication methods.

61 citations

Journal ArticleDOI
TL;DR: This paper proposes Selective Data replication mechanism in Distributed Datacenters (SD3), where in SD3, a datacenter jointly considers update rate and visit rate to select user data for replication, and further atomizes a user's different types of data for replicate, making sure that a replica always reduces inter-datacenter communication.
Abstract: Though the new OSN model, which deploys datacenters globally, helps reduce service latency, it causes higher inter-datacenter communication load. In Facebook, each datacenter has a full copy of all data, and the master datacenter updates all other datacenters, generating tremendous load in this new model. Distributed data storage, which only stores a user's data to his/her geographically closest datacenters mitigates the problem. However, frequent interactions between distant users lead to frequent inter-datacenter communication and hence long service latencies. In this paper, we aim to reduce inter-datacenter communications while still achieving low service latency. We first verify the benefits of the new model and present OSN typical properties that underlie the basis of our design. We then propose Selective Data replication mechanism in Distributed Datacenters ( $SD^3$ ). Since replicas need inter-datacenter data updates, datacenters in $SD^3$ jointly consider update rates and visit rates to select user data for replication; furthermore, $SD^3$ atomizes users’ different types of data (e.g., status update, friend post, music) for replication, ensuring that a replica always reduces inter-datacenter communication. $SD^3$ also incorporates three strategies to further enhance its performance: locality-aware multicast update tree, replica deactivation, and datacenter congestion control. The results of trace-driven experiments on the real-world PlanetLab testbed demonstrate the higher efficiency and effectiveness of $SD^3$ in comparison to other replication methods and the effectiveness of its three schemes.

49 citations


Cites background from "A Balanced Consistency Maintenance ..."

  • ...Many structures for data updating [43], [44], [45]...


Journal ArticleDOI
TL;DR: SWARM is presented, a file replication mechanism based on swarm intelligence that can reduce querying latency, reduce the number of replicas, and reduce the consistency maintenance overhead by 49-99 percent compared to previous consistency maintenance methods.
Abstract: In peer-to-peer file sharing systems, file replication helps to avoid overloading file owners and improve file query efficiency. There exists a tradeoff between minimizing the number of replicas (i.e., replication overhead) and maximizing the replica hit rate (which reduces file querying latency). More replicas lead to increased replication overhead and higher replica hit rates and vice versa. An ideal replication method should generate a low overhead burden to the system while providing low query latency to the users. However, previous replication methods either achieve high hit rates at the cost of many replicas or produce low hit rates. To reduce replicas while guaranteeing high hit rate, this paper presents SWARM, a file replication mechanism based on swarm intelligence. Recognizing the power of collective behaviors, SWARM identifies node swarms with common node interests and close proximity. Unlike most earlier methods, SWARM determines the placement of a file replica based on the accumulated query rates of nodes in a swarm rather than a single node. Replicas are shared by the nodes in a swarm, leading to fewer replicas and high querying efficiency. In addition, SWARM has a novel consistency maintenance algorithm that propagates an update message between proximity-close nodes in a tree fashion from the top to the bottom. Experimental results from the real-world PlanetLab testbed and the PeerSim simulator demonstrate the effectiveness of the SWARM mechanism in comparison with other file replication and consistency maintenance methods. SWARM can reduce querying latency by 40-58 percent, reduce the number of replicas by 39-76 percent, and achieves more than 84 percent higher hit rates compared to previous methods. It also can reduce the consistency maintenance overhead by 49-99 percent compared to previous consistency maintenance methods.

19 citations


Cites background from "A Balanced Consistency Maintenance ..."

  • ...They generally can be classified into two categories: structure based [7], [8], [13], [14], [15], [23], [24], [25], [26] and message spreading based [16]....


  • ...[15] presented a tree-like framework for balanced...


  • ...Most consistency maintenance methods update files by relying on structures [7], [8], [13], [14], [15] or message spreading [16], [17]....


Journal ArticleDOI
TL;DR: A poll-based distributed file consistency maintenance method called geographically aware wave (GeWave), which dramatically reduces the overhead and yields significant improvements on effectiveness, scalability, and churn resilience of previous consistency maintenance methods.
Abstract: File consistency maintenance in P2P systems is a technique for maintaining consistency between files and their replicas. Most previous consistency maintenance methods depend on either message spreading or structure-based pushing. Message spreading generates high overhead due to a large amount of messages; structure-based pushing methods reduce this overhead. However, both approaches cannot guarantee that every replica node receives an update in churn, because replica nodes passively wait for updates. As opposed to push-based methods that are not effective in high-churn and low-resource P2P systems, polling is churn resilient and generates low overhead. However, it is faced with a number of challenges: 1) ensuring a limited inconsistency; 2) realizing polling in a distributed manner; 3) considering physical proximity in polling; and 4) leveraging polling to further reduce polling overhead. To handle these challenges, this paper introduces a poll-based distributed file consistency maintenance method called geographically aware wave (GeWave). GeWave further reduces update overhead, enhances the fidelity of file consistency, and takes proximity into account. Using adaptive polling in a dynamic structure, GeWave avoids redundant file updates and ensures that every node receives an update in a limited time period even in churn. Furthermore, it propagates updates between geographically close nodes in a distributed manner. Extensive experimental results from the PlanetLab real-world testbed demonstrate the efficiency and effectiveness of GeWave in comparison with other representative consistency maintenance schemes. It dramatically reduces the overhead and yields significant improvements on effectiveness, scalability, and churn resilience of previous file consistency maintenance methods.

18 citations


Cites background from "A Balanced Consistency Maintenance ..."

  • ...[35] presented a framework for balanced consistency maintenance (BCoM) in structured P2P systems....


Journal ArticleDOI
TL;DR: Several asynchronous replica consistencies are classified and analyzed based on various strategies such as topology, level of abstraction, update propagation, and locality to enhance the performance and ensure the fault tolerant results to the users.
Abstract: Summary Data grid provides an efficient solution for data-oriented applications that need to manage and process large data sets located at geographically distributed storage resources. Data grid relies on data replicas to enhance the performance and to ensure the fault tolerant results to the users. Replicas are developed to increase the availability of data and to provide better data access. Replicas have their own advantages, but there are a number of issues that must be resolved. Among various existing issues, the critical concern is replica consistency. Various replica consistency strategies are available in the literature. These strategies rationalize and investigate various parameters like bandwidth consumption, access cost, scalability, execution time, storage consumption, staleness, and freshness of replicas. In this paper, several asynchronous replica consistencies are classified and analyzed based on various strategies such as topology, level of abstraction, update propagation, and locality. Some other strategies are also discussed and analyzed like adaptive consistency, quorum-based consistency, load balancing, and agent-based economically efficient, check-pointing, fault tolerance, and conflict management. Parameters on which these strategies are analyzed are methodology, replication classification, consistency, grid topology, environment, evaluation parameters, and performance. Copyright © 2016 John Wiley & Sons, Ltd.

15 citations


Cites background from "A Balanced Consistency Maintenance ..."

  • ...Gossip-based [62] replica consistency maintenance is the push and pull-combined algorithm, namely, balanced consistency maintenance, in which an update message is pushed to the replica nodes actively by the node that creates the update message and in pull approach, replica node sends query messages to obtain the current updated replica....


  • ...The spatial constraints of consistency describe coherence predicate [62] that indicates the degree of replica consistency to the primary copy....


References
Proceedings ArticleDOI
25 Jun 2007
TL;DR: In order to do the aforementioned joint optimization, it is sufficient to find random nodes based on only the latency constraint, since even if the capacity of individual nodes is saturated it does not matter since the LagOver network can potentially be reconfigured.
Abstract: We propose a new genre of overlay network for disseminating information from popular but resource constrained sources. We call this communication primitive as latency gradated overlay, where information consumers self- organize themselves according to their individual resource constraints and the latency they are willing to tolerate in receiving the information from the source. Such a communication primitive finds immediate use in applications like RSS feeds aggregation. We propose heuristic algorithms to construct LagOver based on preferably some partial knowledge of the network at users (no knowledge slows the construction process) but no global coordination. The algorithms are evaluated based on simulations and show good characteristics including convergence, satisfying peers' latency and bandwidth constraints even in presence of moderately high membership dynamics. There are two points worth noting. First, optimizing jointly for latency and capacity (i.e., placing nodes that have free capacity close to the source) as long as latency constraint of other nodes are not violated performs better than optimizing for latency only. The joint optimization strategy has faster convergence of the LagOver network, and can deal with adversarial workloads that optimization of only latency can not deal with. Secondly, somewhat counter-intuitively, in order to do the aforementioned joint optimization, it is sufficient to find random nodes based on only the latency constraint, since even if the capacity of individual nodes is saturated it does not matter since the LagOver network can potentially be reconfigured.

25 citations


"A Balanced Consistency Maintenance ..." refers methods in this paper

  • ...The LagOver [3] constructed an update delivery tree by jointly considering each user’s capacity and latency requirements to address (1) and (3), both of which are also handled by tree node migration in BCoM....


Journal ArticleDOI
TL;DR: Experimental results show that UPTReC can significantly reduce (up to 70%) overhead messages and also achieve smaller stale query ratio for files prone to frequent updates.

21 citations


"A Balanced Consistency Maintenance ..." refers background in this paper

  • ...Two main categories of bounded consistency are proposed for P2P systems: probabilistic consistency [3] [15] and timebounded consistency [10] [12], both of which have limita-...


Proceedings ArticleDOI
04 Jul 2006
TL;DR: Experimental evaluation based on microbenchmarks and scientific applications show that with application-tailored cache consistency, GVFS is able to both improve application runtimes and reduce server load significantly, compared to kernel-level NFS in WAN.
Abstract: The inability to perform optimizations based on application-specific information presents a hurdle to the deployment of pervasive LAN file systems across WAN environments. This paper proposes a novel approach addressing this problem through application-tailored caching and consistency in widearea file systems. It leverages widely available Network File System (NFS) deployments without any modifications to kernels nor applications, and employs middleware to dynamically establish Grid-wide Virtual File System (GVFS) sessions with application-tailored cache consistency. Two consistency models are discussed in this paper: a relaxed model based on invalidation polling, and a stronger model based on delegation and callback. Experimental evaluation based on microbenchmarks and scientific applications show that with application-tailored cache consistency, GVFS is able to both improve application runtimes and reduce server load significantly, compared to kernel-level NFS in WAN.

13 citations


"A Balanced Consistency Maintenance ..." refers methods in this paper

  • ...Hybrid push and pull methods are also used to provide application tailored cache consistency [32] [25]....


Frequently Asked Questions (1)
Q1. What contributions have the authors mentioned in the paper "A balanced consistency maintenance protocol for structured p2p systems" ?

This paper presents a framework for balanced consistency maintenance ( BCoM ) in structured P2P systems.