(Open Access) A Balanced Consistency Maintenance Protocol for Structured P2P Systems (2010) | Yi Hu

A Balanced Consistency Maintenance Protocol for

Structured P2P Systems

Yi Hu, Min Feng, Laxmi N. Bhuyan

Department of Computer Science and Engineering

University of California at Riverside, CA, U.S.A

{yihu, mfeng, bhuyan}@cs.ucr.edu

Abstract—A fundamental challenge of managing mutable data

replication in a Peer-to-Peer (P2P) system is how to efﬁciently

maintain consistency under various sharing patterns with hetero-

geneous resource capabilities. This paper presents a framework

for balanced consistency maintenance (BCoM) in structured P2P

systems. Replica nodes of each object are organized into a tree

for disseminating updates, and a sliding window update protocol

is developed to bound the consistency. The effect of window size

in response to dynamic network conditions, workload updates

and resource limits is analyzed through a queueing model. This

enables us to balance availability, performance and consistency

strictness for various application requirements. On top of the

dissemination tree, two enhancements are proposed: a fast

recovery scheme to strengthen the robustness against node and

link failures; and a node migration policy to remove and prevent

the bottleneck for better system performance. Simulations are

conducted using P2PSim to evaluate BCoM in comparison to

SCOPE [24]. The experimental results demonstrate that BCoM

signiﬁcantly improves the availability of SCOPE by lowering the

discard rate from almost 100% to 5% with slight increase in

latency.

I. INTRODUCTION

Structured P2P systems have been effectively designed for

wide area data applications [21] [10] [16] [18] [22] [20]. While

most of them are designed for read-only or low-write shared

content, many promising P2P applications demand support

for mutable content, such as modifiable storage systems

(e.g. OceanStore [16], Publius [19]), mutable content sharing

(e.g. P2P WiKi [13]), even interactive ones (e.g. P2P online

games [2] [5] and P2P collaborative workspace [12]). P2P

organization improves availability, fault tolerance, and scala-

bility for static content sharing. But mutable content sharing

raises issues of replication and consistency management. P2P

dynamic network characteristics combined with the diverse

application consistency requirements and heterogeneous peer

resource constraints also impose unique challenges for P2P

consistency management. This requires a consistency solution

to work efﬁciently in such dynamic conditions.

P2P systems are typically large scale, where peers with var-

ious resource capabilities experience diverse network latency.

Also, their dynamic joining and leaving make the P2P overlay

failure prone. Neither sequential consistency [15] nor eventual

consistency [8] individually works well in a P2P environment. It

has been proved [14] that among the three properties, atomic

consistency, availability and partition-tolerance, only two can

be satisﬁed at a time. Applying sequential consistency leads

to prohibitively long synchronization delay due to the large

number of peers and unreliable overlay. Even “deadlock”

may occur when a crashed replica node makes other replica

nodes wait forever. Hence, the system scalability is restricted

due to the lowered availability from long synchronization

delay for a large number of nodes. At the other extreme,

eventual consistency allows replica nodes to concurrently update

their local copies and only requires that all replica copies

become identical after a long enough failure-free and update-

free interval. Since in P2P systems replica nodes are highly

unreliable, the update-issuing node may have gone ofﬂine by

the time update conﬂicts are detected, leading to unresolvable

conﬂicts. It is infeasible to rely on a long duration without any

failure or further updates, due to which eventual consistency

fails to provide any end-to-end performance guarantee to P2P

users. As surveyed in [23], wide area data sharing applications

vary widely in their frequency of reads and updates among

replicas, in their tolerance of stale data and handling of update

conﬂicts.

This paper presents a Balanced Consistency Maintenance

(BCoM) protocol for structured P2P systems for balancing

the consistency strictness, availability and performance. Due

consideration is given to dynamic workload, frequent replica

node churns, heterogeneous resource capabilities, and different

application consistency requirements. BCoM protocol serial-

izes all updates to eliminate the complicated conﬂict handling

in P2P systems, while allowing certain obsoleteness in each

replica node to improve the availability and performance. A

sliding window update protocol is used to specify the number

of allowable updates buffered by each replica node. This

provides bounded consistency, the performance of which falls

between the sequential and the eventual consistency.

Two main categories of bounded consistency are proposed

for P2P systems: probabilistic consistency [4] [30] and time-

bounded consistency [25] [26]. Both have significant

limitations that BCoM relaxes. (1) In probabilistic

consistency the probability is guaranteed with regard to all

replica nodes but not for an individual node. BCoM ensures

node level as well as system-wide consistency bound. (2)

Time-bounded consistency sets the validation timer so that the

estimated number of updates within the timer valid duration is

small. To avoid the inaccuracy in this translation, BCoM uses

the sliding window to directly bound the number of updates

allowed to be buffered at each node. (3) BCoM eliminates both

redundant propagations in probabilistic bounded consistency

and the individual computations of the timer in time-bounded

consistency. Since redundancy is not needed for consistency

probability and the window size does not depend on the

latency at individual nodes, it is convenient to assign one node

to set and adjust the window size.

An update window protocol has been designed for web-

server systems [31] to bound the uncommitted updates in

each replica node. But update conflicts and potential cascading

impacts can hardly be addressed when optimizing the

window size. Moreover, there are two challenges for applying

this technique to P2P systems: (1) unlike the web-servers,

P2P replica nodes are highly dynamic and unreliable; (2) the

number of replicas in P2P systems is orders of magnitude

larger than that in web-server systems. (1) and (2) together

make any optimization model impractical for P2P systems

because it requires information on each node’s update rate,

propagating latency, etc. BCoM analyzes the window size

through a queueing model based on dynamic network condi-

tion, update workload and available resources. It periodically

collects the general system information, such as the total layers

of replica node and the bottleneck latency, and guides the

window size setting with extremely low overhead. In this way,

the consistency maintenance and performance optimization in

BCoM scale well with the P2P systems and adapt promptly

to the dynamic conditions.

In BCoM, replica nodes of each object are organized into a

d-ary dissemination tree (dDT ) on top of the overlay structure.

The system-wide consistency bound is incrementally achieved

by each internal tree node through applying the sliding window

update protocol to its children. This makes the consistency

scalable with the total number of replica nodes. Since each

replica node takes charge of its children in update propagation

and consistency maintenance, the work of consistency mainte-

nance is evenly distributed. Even though the root is responsible

for serializing updates and accepting newly joining nodes, we

show that it will not become a bottleneck. The overhead of

dDT is lightweight and evenly distributed to prevent “hot

spot” and “single node failure” problems as efﬁciently as the

previous identiﬁer space partitioning methods in [24] [29].

Another primary goal of constructing a dDT is to reduce the

latency experienced by each replica node to receive an update

from the root. Thus dDT inserts the new join or re-join nodes

to the smallest subtree and tries to balance the tree to shorten

the overlay distance.

BCoM presents two enhancements to further improve the

performance of a dDT . One is the ancestor cache scheme,

where each node maintains a cache of ancestors for fast

recovery from parent node failures. This also relieves tree-

structure’s “multiplication of loss” problem [11] (i.e. all the

subtree nodes rooted at the crashed node will lose the updates),

which is especially critical in P2P systems. Maintaining the

ancestor cache does not introduce extra overhead since the

needed information conveniently piggybacks on update prop-

agation. A small size of cache can also signiﬁcantly improve

the robustness against node failures. The other is the node

migration scheme, that is to migrate more capable nodes

to upper layers and less capable nodes to lower layers to

minimize the side effect of the bottleneck node and maximize

the overall performance. If an upper layer node is slow in

propagating updates, the consistency constraint blocks ances-

tors from receiving new updates, and all its subtree nodes

do not receive updates in a timely manner. Two forms of

node migration are presented, one is to remove the blocking

and the other is to prevent the blocking so that unnecessary

performance and availability degradations are removed.

The contributions of our paper are the following:

• Propose a consistency maintenance framework in struc-

tured P2P systems for balancing the consistency strict-

ness, availability and performance through a sliding win-

dow update protocol with two enhancement schemes.

• Analyze the problem of optimizing the window size in

response to dynamic network conditions, update work-

load, and resource constraints through a queueing model

to serve diverse consistency requirements from various

mutable data sharing applications.

• Evaluate the performance of BCoM with comparison to

SCOPE using the P2PSim simulation tool.

The rest of the paper is organized as follows: Sec.II intro-

duces the three core techniques in BCoM and the protocol

deployment. Sec.III presents the analytical model for window

size setting. The performance evaluation is given in Sec.IV

and the existing literature is reviewed in Sec.V. The paper is

concluded in Sec.VI.

II. DESCRIPTION OF BCOM

BCoM aims to: (1) provide bounded consistency for main-

taining a large number of replicas of a mutable object; (2) bal-

ance the consistency strictness, availability and performance in

response to dynamic network conditions, update workload, and

resource constraints; (3) make the consistency maintenance

robust against frequent node churn and failures. To fulfill

these objectives, BCoM organizes all replica nodes of an

object into a d-ary dissemination tree (dDT ) on top of the

P2P overlay for disseminating updates. It applies three core

techniques: sliding window update, ancestor cache, and tree

node migration on the dDT for consistency maintenance. In

this section, we ﬁrst introduce the dDT structure, and then

explain the three techniques in detail.

A. Dissemination Tree Structure

For each object BCoM builds a tree with node degree d

rooted at the node whose ID is closest to the object ID in the

overlay identiﬁer space. We denote this d-ary dissemination

tree of object i as $dDT_i$, which consists of only the peers

holding copies of object i. We name such a peer as a “replica

node” of i, or simply as a replica node. An update can

be issued by any replica node, but it should be submitted

to the root. The root serializes the updates to eliminate the

complicated handling of update conﬂicts because the update-

issuing nodes may have gone ofﬂine.

The dynamic node behavior requires the construction of

dDT to serve two cases (1) single node joining and (2)

node with subtree rejoining. The goal of tree construction is

to minimize the tree height under both cases, which lowers

the update propagation latency and object discard rate for

consistency maintenance.

We show an example of $dDT_i$ construction for case (1) with node degree d set to 2 in Fig.1. The replica nodes are ordered by their joining time as node 0, node 1 and so on. At the beginning when node 1 and node 2 joined, both were assigned by node 0 (i.e. the root) as its children. Then, node 3 joined when node 0’s degree was full, so it passed node 3 to its child with the smallest number of subtree nodes, denoted as $Sub_{no.}$. Since both children (i.e. node 1 and node 2) had the same $Sub_{no.}$, it randomly selected one to break the tie, say node 1, and updated $Sub_{no.}(1)$ accordingly. $Sub_{no.}$ of a join node is one, standing for itself. Node 1 assigned node 3 as its child, since it had space for a new child. When node 4 joined, node 0 did not have space for a new child and passed node 4 to the child with the smallest $Sub_{no.}$, node 2. Similarly, node 5 and node 6 joined. The tree construction algorithm is given in Alg.1. For case (2) when node 6 crashed, all of its children detected the crash independently and contacted other ancestors to rejoin the tree, each acting as a delegate of its subtree to save individual rejoining of subtree nodes. $Sub_{no.}$ of a rejoining node counts all its subtree nodes and itself. Sec.II-C explains how to contact an ancestor for rejoining.

Fig. 1. Dissemination Tree Example

Algorithm 1 dDT Construction (p, q)
Input: node p receives node q’s join request
Output: parent of node q in dDT
if p does not have d children then
    $Sub_{no.}(p)$ += $Sub_{no.}(q)$
    return p
else
    find a child f of p s.t. f has the smallest $Sub_{no.}$
    $Sub_{no.}(f)$ += $Sub_{no.}(q)$
    return dDT Construction (f, q)
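The insertion rule of Alg.1 can be sketched in a few lines of Python. This is only an illustrative reading of the algorithm, not the authors' implementation: the `Node` class, the `ddt_insert` and `height` helpers, and the choice to let every node on the insertion path add the joiner's subtree count to its own counter exactly once are assumptions made for the sketch.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional

D = 2  # node degree d; 2 matches the Fig.1 example


@dataclass
class Node:
    name: int
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None
    sub_no: int = 1  # nodes in the subtree rooted here, counting itself


def ddt_insert(p: Node, q: Node, d: int = D) -> Node:
    """Attach q (possibly carrying a subtree) below p; return q's new parent."""
    while True:
        p.sub_no += q.sub_no          # each node on the path counts q once (assumption)
        if len(p.children) < d:       # p has a free slot: adopt q here
            p.children.append(q)
            q.parent = p
            return p
        smallest = min(c.sub_no for c in p.children)
        # break ties at random, as in the Fig.1 walk-through
        p = random.choice([c for c in p.children if c.sub_no == smallest])


def height(n: Node) -> int:
    """Number of edges on the longest path from n down to a leaf."""
    return 0 if not n.children else 1 + max(height(c) for c in n.children)


root = Node(0)
for i in range(1, 7):                 # nodes 1..6 join one by one
    ddt_insert(root, Node(i))
print(root.sub_no, height(root))      # 7 nodes in total, tree height 2
```

Because a rejoining node simply arrives with `sub_no` greater than one, the same function covers both the single-node join and the subtree rejoin case.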

dDT directs a join node and a rejoin node with its subtree to

the child node with the smallest subtree nodes when the parent

node degree is full. The reason for not using the tree depth, as

traditional balanced tree algorithms do, is that rejoining with a

subtree may increase the tree depth by more than 1, which

is beyond the one-by-one height increase those algorithms

handle. Another important reason is that maintaining the total

number of nodes in each subtree is simpler and more time

efﬁcient than the depth of each subtree. Since the internal

nodes need to wait until the insertion completes, the updated

tree depth can be collected layer by layer from the leaves

back to the root. This makes the real time maintenance of the

tree depth quite difﬁcult and unnecessary when tree nodes are

frequently joining and leaving. However, the internal nodes

can immediately update the total number of nodes in the

subtree after forwarding the joining node to a child. The tree

depth is periodically collected to help set the sliding window

size as discussed in Sec.II-B2, where its result does not need

to be updated in real time. But using an outdated tree depth

for dDT construction will lead to unbalanced tree and degrade

the performance.

B. Sliding Window Update Protocol

1) Basic Operation in Sliding Window Update:

Sliding window regulates the consistency bound for update propagations to all replica nodes in a dDT. “Sliding” refers to the incremental adjustment of window size in response to dynamic system conditions. If $dDT_i$ of object i is assigned a sliding window size $k_i$, any replica node in $dDT_i$ can buffer up to $k_i$ unacknowledged updates before being blocked from receiving new updates. At the beginning, the root receives the first update, sends it to all children and waits for their ACKs. There are two types of ACKs, R_ACK and NR_ACK, both indicating the successful receipt of the update. R_ACK indicates that the sender is ready to receive the next update; NR_ACK means the sender is not ready. While waiting, the root accepts and buffers the incoming updates as long as its buffer of size $k_i$ does not overflow. When receiving an R_ACK from a child, the root sends the next update to this child if there is a buffered update that has not been sent to this child. When receiving an NR_ACK from a child, it will not send the next update, but the update is marked as received by this child.

After receiving ACKs from all children, the update is removed from its buffer. There are two cases of buffer overflow: 1) when the root’s buffer is full, new updates are discarded until there is space; 2) when an internal node’s buffer is full, the node sends an NR_ACK to its parent for the last received update. An R_ACK is sent to its parent when there is space in the buffer. A leaf node does not maintain such an update buffer. After receiving an update, it immediately sends an R_ACK to its parent. Fig.2 shows an example with the window size set to 8, where V stands for the version number of the update; for instance, V10 − V13 means the node keeps the updates from the 10th version to the 13th version. Each internal node keeps the versions from the next version for its slowest child up to the latest version it received, and each leaf node only keeps the latest version it received.
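The buffering rules above can be illustrated with a small Python model of one node's parent-side state. It is a minimal sketch under stated assumptions: the `WindowNode` class, its method names, and the string labels for the ready and not-ready ACKs are hypothetical, and message transport, per-child send ordering, and the later ready ACK sent once space frees up are left out.

```python
from collections import deque
from typing import Deque, Dict, List, Set


class WindowNode:
    """Buffer state of the root or an internal node with window size k."""

    def __init__(self, k: int, children: List[str]):
        self.k = k
        self.children = list(children)
        self.buffer: Deque[int] = deque()       # buffered update versions
        self.pending: Dict[int, Set[str]] = {}  # version -> children yet to ACK
        self.discarded = 0

    def receive(self, version: int) -> str:
        """Take in an update; return the ACK an internal node would send upstream."""
        if len(self.buffer) >= self.k:
            # Only the root reaches this branch: a full internal node has
            # already told its parent it is not ready, so nothing more arrives.
            self.discarded += 1
            return "DISCARD"
        self.buffer.append(version)
        self.pending[version] = set(self.children)
        # A node whose buffer just filled up reports "not ready" for this update.
        return "NR_ACK" if len(self.buffer) == self.k else "R_ACK"

    def on_child_ack(self, child: str, version: int) -> bool:
        """Record a child's ACK; return True if the update left the buffer."""
        self.pending[version].discard(child)
        if not self.pending[version]:           # all children have the update
            del self.pending[version]
            self.buffer.remove(version)
            return True
        return False


root = WindowNode(k=2, children=["a", "b"])
results = [root.receive(v) for v in (1, 2, 3)]  # third update finds the buffer full
root.on_child_ack("a", 1)
freed = root.on_child_ack("b", 1)               # version 1 is now fully acknowledged
after = root.receive(4)                         # accepted again once space exists
print(results, freed, after, root.discarded)
```

With `k=2`, the third update is dropped at the root, and a new update is accepted only after both children acknowledge version 1.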

2) Setting of Sliding Window Size:

The sliding window size $k_i$ plays a critical role in balancing the consistency strictness, the object availability and the update dissemination performance. The value of $k_i$ is an indicator of consistency strictness. A larger $k_i$ helps mask long network latency and temporary unavailability of the replica nodes, lowers the update discards and improves the availability. The disadvantages of a larger $k_i$ are (1) a discrepancy between the replica's local view and the most updated view at the root, giving rise to weaker consistency; and (2) a longer queueing delay in update propagation, thus lowering the update dissemination performance. At the extremes, an infinite buffer size provides eventual consistency without discarding updates, and a buffer size of zero provides sequential consistency with the most update discards.

Fig. 2. An example of sliding window update protocol

We explain here how the root updates the window size; the analytical model in Sec.III gives the specific formula to guide the update. The root measures the input metrics every T seconds and adjusts the $k_i$ value only when the metrics are stable and the old $k_i$ violates the constraint in Eq.7. In this way, unnecessary changes due to temporary disturbances are eliminated to keep the $dDT_i$ stable. In case $k_i$ needs to be adjusted, it is incrementally increased or decreased one by one until the constraints are satisfied.

The computation of $k_i$ requires information on the update arrival rate λ, the tree height L, and the bottleneck service time µ. The arrival rate is directly measured by the root. The tree height and bottleneck service time are collected periodically from leaf nodes to the root in a bottom-up fashion. The two metrics are aggregated at every internal node, so that the maintenance message always keeps the same size. The aggregation is performed as follows: each leaf node initializes the tree height to zero (L = 0) and the bottleneck service time µ to its update propagation time. Each node sends the maintenance message to its parent. Once an internal node receives the maintenance messages from all children, it updates L as the maximum value of its children’s tree height plus 1 and µ as the maximum value among its own and every child’s service time. If its service time is longer than a child’s, a non-blocking migration is executed to swap the parent with the child. This aggregation continues until the root is reached.
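The bottom-up aggregation can be sketched as a recursive function that returns the fixed-size pair (tree height, bottleneck service time) for each subtree. The `Replica` class and `aggregate` function are hypothetical names, the service times are made-up sample values, and the non-blocking migration that may be triggered during aggregation is omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Replica:
    service_time: float                      # this node's update propagation time
    children: List["Replica"] = field(default_factory=list)


def aggregate(node: Replica) -> Tuple[int, float]:
    """Return (tree height L, bottleneck service time) for the subtree at node."""
    if not node.children:
        return 0, node.service_time          # leaf: L = 0, own propagation time
    reports = [aggregate(c) for c in node.children]
    height = 1 + max(h for h, _ in reports)  # tallest child subtree plus one
    bottleneck = max([node.service_time] + [t for _, t in reports])
    return height, bottleneck


# Sample tree: the slowest node (0.05 s) sits one layer below the root.
tree = Replica(0.02, [Replica(0.05, [Replica(0.01)]), Replica(0.03)])
print(aggregate(tree))                       # (2, 0.05)
```

Each node forwards only two numbers regardless of subtree size, which is why the maintenance message stays the same size at every layer.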

C. Ancestor Cache Maintenance

Each replica node maintains a cache of m ancestors starting

from its parent leading to the root in the dDT . The value of m

is set based on the node churn rate (i.e. the number of nodes

leaving the system during a given period) so that all m nodes

are unlikely to fail simultaneously. When the

node does not have m ancestors, it caches information for all

the nodes beginning from the root.

A node contacts its cached ancestors sequentially layer by

layer upwards when its parent becomes unreachable. This can

be detected by ACK and maintenance message transmissions.

The sequential contact operation will ﬁnd the closest ancestor,

no matter how many layers of node crashes exist. The root is

ﬁnally contacted for relocation if all the other ancestors crash.

We assume the root is reliable, since the overlay routing will

automatically handle the root failure by letting the node with

the nearest ID replace the crashed root of dDT.

The contacted ancestor runs the tree construction Alg.1 to

ﬁnd a new position for this rejoining node with its subtree.

BCoM does not replace the crashed node by a leaf node to

maintain the original tree structure, since migration brings

the bottleneck node down to the leaf layer for performance

improvement. The new parent transfers the latest version of

the object to this new child position if necessary. Since each

node only keeps k

previous updates, content transmission

is used to avoid the communication overhead for getting the

missing updates from other nodes. The sliding window update

propagation resumes for incoming updates.

The ancestor cache provides fast recovery from node and

link failures with a small overhead and high success probability.

Assuming the probability of a replica node failure is

p, the ancestor cache with size m has a successful recovery

probability of $1 - p^m$. It is very unlikely that all of the m

cached ancestors fail simultaneously; even if it occurs, the

root can be contacted for the relocation. An ancestor cache

is easily maintained by piggybacking an ancestor list to each

update. Whenever a node receives this update it adds itself to

the ancestor list before propagating the update to the children.

Each node refers to the newly received ancestor list to refresh

its cache. There is no extra communication for the piggyback,

and the storage overhead is also negligible for keeping the

information of m ancestors.
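A minimal sketch of the ancestor cache follows, assuming the piggybacked list is ordered from the root down to the sender and that node failures are independent. All function names and the string node identifiers are hypothetical and chosen only for illustration.

```python
from typing import List, Set


def forward_list(received: List[str], me: str) -> List[str]:
    """Add this node to the ancestor list before propagating the update."""
    return received + [me]


def refresh_cache(received: List[str], m: int) -> List[str]:
    """Keep the m nearest ancestors, parent first; fewer if the path is short."""
    return received[-m:][::-1]


def find_rejoin_point(cache: List[str], alive: Set[str], root: str) -> str:
    """Contact cached ancestors layer by layer upwards; fall back to the root."""
    for ancestor in cache:
        if ancestor in alive:
            return ancestor
    return root


def recovery_probability(p: float, m: int) -> float:
    """Chance that at least one of m cached ancestors is reachable."""
    return 1 - p ** m


path = forward_list(["root", "a"], "b")      # list as seen by b's children
cache = refresh_cache(path, m=2)             # nearest first: b, then a
target = find_rejoin_point(cache, alive={"a", "root"}, root="root")
print(cache, target, recovery_probability(0.2, 3))
```

Here the parent `b` has crashed, so the node rejoins through `a`, the closest cached ancestor that still responds. With an assumed failure probability of 0.2 and three cached ancestors, recovery through the cache succeeds with probability about 0.992.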

D. Tree Node Migration

Any internal node with the subtree rooted at it will be

blocked from receiving new updates if its slowest

child is blocked due to the sliding window constraint. It is

quite possible that a lower layer node performs faster than the

bottleneck node, so we should promote the faster node to a

higher level and degrade the bottleneck node to a lower level.

For example in Fig.1, assume node 1 is the bottleneck getting

the root 0 blocked. The faster node may be a descendant

of the bottleneck node (A) or a descendant of a sibling of

the bottleneck node (B). When blocking occurs, node 0 can

swap the bottleneck node 1 with a faster descendant with

more recent updates, like node 4, to remove the blocking.

Before blocking occurs, node 1 can be swapped with its fastest

child with the same update version to prevent the blocking.

The performance improvement through node migration is

conﬁrmed by our queuing model of dDT in Fig.3. There are

two forms of node migration, as described below.

• Blocking triggered migration: the blocked node searches

for a faster descendant, which has a more recent update

than the bottleneck node and swaps them to remove the

blocking.

• Non-blocking migration: when a node observes a child

performing faster than itself, it swaps with this child.

This migration prevents the potential blocking and speeds

up the update propagation for the subtree rooted at the

parent.

The swapping of (A) in Fig.1 is an example of blocking

triggered migration and (B) is an example of non-blocking

migration. Both forms of migration swap one layer at a time

and, hence, multiple migrations are needed for multi-layer

swapping. The non-blocking migration helps promote

the faster nodes to upper layers, which makes the searching in

blocking-triggered migration easier. Since the overlay DHT

routing in structured P2P networks relies on cooperative

nodes, we assume BCoM is run by these cooperative P2P

nodes transparent to the end users. Tree node migration uses

only the local information and improves the overall system

performance.
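The non-blocking form of migration can be sketched as a one-layer swap between a tree position and its fastest child. This is a simplified illustration, not the protocol itself: the `Slot` class and `non_blocking_migration` function are hypothetical, peers are moved between fixed tree positions rather than relinked, and the update version check that the protocol performs before swapping is not modeled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Slot:
    """A position in the dDT; migration moves peers between positions."""
    peer: str
    service_time: float
    children: List["Slot"] = field(default_factory=list)


def non_blocking_migration(parent: Slot) -> bool:
    """Swap the parent with its fastest child if that child is faster."""
    if not parent.children:
        return False
    fastest = min(parent.children, key=lambda c: c.service_time)
    if fastest.service_time >= parent.service_time:
        return False                         # parent is already the fastest
    # Exchange the occupants; the tree shape itself is left untouched.
    parent.peer, fastest.peer = fastest.peer, parent.peer
    parent.service_time, fastest.service_time = (
        fastest.service_time,
        parent.service_time,
    )
    return True


slow = Slot("n1", 0.09, [Slot("n3", 0.02), Slot("n4", 0.04)])
moved = non_blocking_migration(slow)
print(moved, slow.peer, [c.peer for c in slow.children])
```

Calling the function repeatedly from the leaves towards the root moves a fast peer up one layer per call, which mirrors the one-layer-at-a-time behavior described above.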

E. Basic Operations in BCoM

BCoM provides three basic operations:

• Subscribe: when a node p wants to read the object i and keep it updated, p sends the subscription request to the root of $dDT_i$ by overlay routing. After receiving the request, the root runs Alg.1 to locate a parent for p in $dDT_i$, which will transfer its most updated version to p. The subsequent updates are received under the sliding window protocol. The message overhead for a subscription is $O(\log_d N)$, since locating a position for a new node at most searches along a path from the root to a leaf in $dDT_i$.

• Unsubscribe: when a node p does not want object i

anymore, it promotes its fastest child as the new parent

and transfers its parent and other children’s information

to the newly promoted node. p also notiﬁes them of the

newly promoted node to update their related maintenance

information. The message overhead for a node leaving is

O(1), since the number of the affected node is no more

than d, and each has constant overhead to update the

related maintenance information.

• Update: after subscribing, if a node p wants to update

the object, it sends the update request directly to the

root using IP routing. The root’s IP address is obtained

through the subscription or the ancestor cache. If the root

crashes, p submits the update to the new root through

overlay routing. Updates are serialized at the root by their

arrival time. The speciﬁc policy for resolving conﬂicts

is application dependent. The message overhead of an

update is O(1) for the direct submission to the root.

III. ANALYTICAL MODEL FOR SLIDING WINDOW SETTING

The instability of P2P systems prevents us from using complicated optimization techniques that require several hours of computation on workstations (e.g. [28]) or information about every node in the entire system (e.g. [31]). BCoM adjusts the sliding window size in a timely manner for dynamic P2P systems while relying on limited information.

This section presents the analytical model of the sliding window size $k_i$ of object i, where the update propagation to all replica nodes is modeled by a queuing system. We first analyze the queueing behavior when an update is discarded, then calculate the update discard probability and the expected latency for a replica node to receive an update. Finally, we set $k_i$ to balance the availability and latency constrained by consistency bounds.

A. Queueing Model

Assume the total number of replica nodes is N, the node degree is d, and there are L ($L = O(\log_d N)$) layers of internal nodes with update buffer size $k_i$ (i.e. layer 0 . . . L − 1 nodes with sliding window $k_i$). The leaf nodes are in layer-$L$ and do not need a buffer. The update arrivals are modeled by a Poisson process with average arrival rate $\lambda_i$ (written simply as λ), as each update is issued by a replica node independently and identically at random. The latency of receiving an update from the parent and acknowledged by the child is denoted as the service time for update propagation. The service time for one layer to its adjacent layer below is the longest parent-child service time in these two layers. $\mu_l$ denotes the service time for update propagation from layer-$l$ to layer-$(l+1)$. For example, $\mu_0$ is the service time from the root to its slowest child, and $\mu_{L-1}$ is the longest service time from a layer-$(L-1)$ node to its child (i.e. a leaf node). The update propagation delay is assumed to be exponentially distributed. The update propagations in $dDT_i$ are modeled as a queuing process shown in Fig.3 (a): the updates arrive with average rate λ at the root, then go to the layer-0 buffer with size $k_i$. The service time for propagating from layer-0 to layer-1 is $\mu_0$. After that, the updates go to the layer-1 nodes’ buffer of size $k_i$ with service time $\mu_1$ for propagating to layer-2 nodes. The propagations end when updates are received by the leaves in layer-$L$.

Fig. 3. Queuing Model of Update Propagation

An update may only be discarded by the root when its buffer overflows. This happens when the root is waiting for an R_ACK from the slowest child in layer-1, which is waiting for an R_ACK from its slowest child in layer-2. The waiting cascades until the bottleneck node of the $dDT_i$ is reached, say in layer-$l$, 0 ≤ l ≤ L. The nodes in layers l + 1 . . . L (if l < L) do not receive any update even when their buffers are not full.
