What are the contributions mentioned in the paper "Maintaining data consistency in structured p2p systems" ?

This paper presents a framework for balanced consistency maintenance (BCoM) in structured P2P systems with heterogeneous node capabilities and various workload patterns. The authors present an analytical model to optimize the window size according to the dynamic network conditions, workload patterns and resource limits.

(Open Access) Maintaining Data Consistency in Structured P2P Systems (2012) | Yi Hu

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS 1

Maintaining Data Consistency in Structured P2P Systems

Yi Hu, Student Member, IEEE, Laxmi N. Bhuyan, Fellow, IEEE, and Min Feng

Abstract—A fundamental challenge of supporting mutable data replication in a Peer-to-Peer (P2P) system is to efficiently maintain consistency. This paper presents a framework for balanced consistency maintenance (BCoM) in structured P2P systems with heterogeneous node capabilities and various workload patterns. Replica nodes of each object are organized into a tree structure for disseminating updates, and a sliding window update protocol is developed for consistency maintenance. We present an analytical model to optimize the window size according to the dynamic network conditions, workload patterns and resource limits. In this way, BCoM balances the consistency strictness, object availability for updates, and update propagation performance for various application requirements. On top of the dissemination tree, two enhancements are proposed: (1) a fast recovery scheme to strengthen the robustness against node and link failures, and (2) a node migration policy to remove and prevent bottlenecks, allowing more efficient update delivery. Simulations are conducted using P2PSim to evaluate BCoM in comparison to SCOPE [1]. The experimental results demonstrate that BCoM outperforms SCOPE with lower discard rates. BCoM achieves a discard rate as low as 5% in most cases while SCOPE has almost 100% discard rate.

Index Terms—Peer-to-Peer, consistency, protocol design, simulations.

1 INTRODUCTION

Structured P2P systems have been effectively designed for wide area data applications [2] [3] [4] [5] [6] [7]. While most of them are designed for read-only or low-write sharing contents, many promising P2P applications demand support for mutable contents. Examples include modifiable storage systems (e.g. OceanStore [4], Publius [8]), mutable content sharing (e.g. P2P WiKi [9]), and even interactive applications (e.g. P2P online games [10], P2P Social Networking [11], and P2P collaborative workspace [12]). The P2P approach improves data availability, fault tolerance, and scalability for static content sharing. But mutable content sharing raises issues of replication and consistency management. P2P dynamic network characteristics combined with diverse application requirements and heterogeneous resource constraints pose unique challenges for P2P consistency management [13].

• Y. Hu, L. N. Bhuyan, and M. Feng are with the Department of Computer Science and Engineering, University of California at Riverside, 900 University Ave., Riverside, CA, 92521.

E-mail: yihu@cs.ucr.edu

P2P systems are typically large, where peers with heterogeneous resource capabilities experience varying network latencies. Also, the frequent joining and leaving of nodes make the P2P overlay failure prone. Neither sequential consistency [14] nor eventual consistency [15] individually works well in a P2P environment. It has been proved [16] that among three properties, atomic consistency, availability and partition-tolerance, only two can be satisfied at a time. Applying sequential consistency leads to prohibitively long synchronization delays due to the large number of peers and the unreliable overlay. Even “deadlock” may occur when a crashed replica node causes other replica nodes to wait forever. Hence, system scalability is restricted due to low data availability resulting from long synchronization delay. At the other extreme, eventual consistency allows replica nodes to concurrently update their local copies, only requiring that all replica copies become identical after a long enough failure-free and update-free interval. Since replica nodes are highly unreliable in P2P systems, the node issuing an update may have gone offline by the time update conflicts are detected, leading to unresolvable conflicts. It is infeasible to rely on a long duration without any failure or further updates. As a result, eventual consistency fails to provide any end-to-end performance guarantee to P2P users.

This paper presents a Balanced Consistency Maintenance (BCoM) protocol for structured P2P systems to balance between consistency strictness, object availability for updates, and update dissemination latency. BCoM is designed for P2P implementations of social networking [11] (e.g. Facebook) and collaborative editing [9], [12] (e.g. WiKi or CVS). Users of these P2P applications frequently update common objects. They prefer objects highly available for updating although they can tolerate a certain extent of temporary inconsistency as long as they get the latest version within a time bound. Usually these updates are insertions where conflicts are infrequent.

The BCoM protocol serializes all updates to eliminate the complicated conflict handling in P2P systems. It also allows certain obsolescence in each replica node to reduce the update discard rate incurred by implementing sequential consistency. BCoM limits the extent of temporary inconsistency by developing a sliding window update protocol. The size of the sliding window regulates the number of allowable updates buffered by each replica node. Thus, BCoM provides a measure of consistency guarantee which is specified by an application rather than eventual consistency. BCoM develops an analytical model to set the window size as follows: given an inconsistency bound, the window size is set to minimize the update discard rate while ensuring the expected delay is no worse than the baseline by a small given threshold.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

Existing bounded consistency techniques for P2P systems can be divided into two categories: probabilistic consistency [17] [18] and time-bounded consistency [19] [20], both of which have limitations. Probabilistic consistency only guarantees consistency on most nodes. It cannot guarantee that every node receives all updates. Thus, it is not applicable to the situation where all intermediate updates are valued by every user. BCoM overcomes this problem by ensuring consistency bounds on every node. Besides, existing probabilistic consistency protocols involve redundant update propagation, which is eliminated in BCoM. Time-bounded consistency sets a uniform temporal constraint on inconsistency for all nodes. In the situation where nodes have various update frequencies, it is impossible to set a temporal constraint that works for all nodes. To solve this problem, BCoM uses a sliding window to directly limit the number of updates that have not been received by all replica nodes.

An update window protocol has been designed for web-server systems [21] to limit the number of uncommitted updates at each replica node. But the authors of [21] do not address update conflicts and potential cascading impacts. Moreover, their window size optimization model requires information on each node. There are two obstacles making it impractical to apply this technique to P2P systems: (1) unlike the web-servers, P2P replica nodes are highly dynamic and unreliable, which makes the update conflict problem worse, (2) the number of replicas in P2P systems is orders of magnitude larger than that in web-server systems. Hence, collecting information from each node is infeasible in P2P systems. BCoM develops a sliding window protocol that avoids these obstacles. In BCoM, updates are serialized to avoid conflicts and a distributed analytical model is developed to optimize the window size with simple system-wide information, such as the total layers of replica nodes and the bottleneck service time. This information can be collected periodically with low overhead. Therefore, the consistency maintenance provided by BCoM scales well in dynamic P2P systems.

In BCoM, replica nodes of each object are organized into a d-ary dissemination tree (dDT) to propagate updates. The dDT is built on top of the overlay structure as an auxiliary structure for consistency maintenance of an object. We evaluate the efficiency of BCoM in comparison to SCOPE [1], which also builds an auxiliary tree structure on top of the overlay for sending updates. SCOPE proposed an ID partitioning algorithm to construct its update dissemination tree for maintaining sequential consistency in structured P2P networks. There are other tree-based consistency management schemes for structured P2P systems (e.g., [22]), but their tree construction methods are fundamentally inherited from SCOPE. Therefore, we choose to compare the performance of BCoM with SCOPE. In SCOPE, the update dissemination tree is built by recursively partitioning the identifier space and selecting a representative node as the tree node for each partition. The drawback is that a tree node may not be a replica node, so not all tree nodes are interested in receiving or propagating updates about the object. Including such nodes in the dissemination tree introduces extra overhead for sending updates. The ID partitioning algorithm may also assign a node to be several tree nodes in SCOPE because of its ID. Such nodes may be easily overloaded when sending updates. BCoM avoids these two problems by constructing the update dissemination tree dDT only from replica nodes, with each replica node mapped to a tree node. BCoM also builds a dDT as balanced as possible to reduce the tree height. Smaller tree height reduces the number of hops for update propagation and thus the delay, which improves the object availability for updates.

For each object in BCoM, replica nodes join the dDT of this object through the root node, and all updates about the object are sent to the root for serialization. However, the root will not become a bottleneck because of a large number of replicas, as the root only sends updates to its children, who in turn send the updates to their children. The root only has a small constant number of children, and the node degree is independent of the total number of tree nodes (i.e. replicas). The update rate will not overload the root either, because the root only serializes the updates it receives. No communication overhead is imposed on the root and the computation overhead for serializing updates is negligible for any modern computer. A root node may be overloaded by being the root for too many objects. Since the root of an object is selected through hashing the object ID to the node ID in a structured P2P overlay, load balance schemes may solve the problem, which is beyond the scope of this paper.

BCoM presents two enhancements to further improve the performance of a dDT. One is the ancestor cache scheme, where each node maintains a cache of ancestors for fast recovery from parent node failures or departures. This relieves the tree-structure “multiplication of loss” problem [23] (i.e. all the subtree nodes rooted at the crashed node will lose updates), which is especially critical in P2P systems. Maintaining the ancestor cache does not introduce extra overhead since the needed information conveniently piggybacks on update propagation. A small cache significantly improves the robustness of dDT against node churn and failures. The other is the node migration scheme, where more capable nodes are migrated to upper layers and less capable nodes are migrated to lower layers. The reason is that if an upper layer node is slow in propagating updates, the consistency constraint blocks its ancestors from receiving new updates, and all its subtree nodes will not receive timely updates. The node migration scheme is designed to prevent and remove bottleneck nodes. Two forms of node migration are presented: one removes blocking and the other prevents blocking.

The contributions of our paper are the following:

• We propose a consistency maintenance framework (BCoM) in structured P2P systems. A sliding window update protocol and two enhancement schemes are presented. BCoM balances consistency strictness, object availability for updating, and update dissemination latency.

• We analyze the problem of setting the window size in response to dynamic network conditions, changing workload patterns, and different resource constraints through a queueing model. Our model serves diverse consistency requirements from various data sharing applications.

• We evaluate the performance of BCoM with comparison to SCOPE [1] using the P2PSim simulation tool. SCOPE is the most relevant work to BCoM, and it is widely studied in structured P2P systems for consistency management.

The rest of the paper is organized as follows: Section 2 describes BCoM techniques and deployment. Section 3 presents the analytical model for window size setting. The performance of BCoM is evaluated in Section 4, and case study results are presented in Section 5. The scholarly literature is reviewed in Section 6. The paper is concluded in Section 7.

A preliminary conference version of this paper was published in [24].

2 DESCRIPTION OF BCOM

BCoM aims to: (1) provide bounded consistency for maintaining a large number of replicas of a mutable object; (2) balance the consistency strictness, object availability for updating, and update propagation performance based on dynamic network conditions, workload patterns, and resource constraints; (3) make the consistency maintenance robust against frequent node churn and failures. To fulfill these objectives, BCoM organizes all replica nodes of an object into a d-ary dissemination tree (dDT) on top of the P2P overlay for disseminating updates. It applies three core techniques: the sliding window update protocol, the ancestor cache scheme, and the tree node migration policy on a dDT for consistency maintenance. In this section, we first introduce the dDT structure, and then explain the three techniques in detail.

2.1 Dissemination Tree Structure

For each object, BCoM builds a tree with node degree d rooted at the node whose ID is the closest to the object ID in the overlay identifier space. We denote this d-ary dissemination tree of object i as dDT_i. Each node in dDT_i is a peer who holds a copy of object i. We name such a peer a “replica node” of i, or simply a replica node. An update can be issued by any replica node, but it should be submitted to the root. The root serializes updates to eliminate conflicts.

With node churn and failures in P2P systems, a dDT serves two cases of insertions: (1) a single node joining, and (2) a node with subtree rejoining. The goal of constructing a dDT is to minimize the tree height with low overhead in both cases.

We show an example of dDT_i construction with node degree d set to 2 in Figure 1. The replica nodes are ordered by their arrival times as node 0, node 1, node 2, etc. At the beginning, node 1 and node 2 joined. Both were assigned by node 0 (i.e., the root) as its children. Then, node 3 joined. Since node 0 could not have more children, it passed node 3 to the child with the smallest number of subtree nodes. Since both children (i.e., node 1 and node 2) had the same number of subtree nodes, node 0 randomly selected one to break the tie, say node 1, and increased the number of subtree nodes at node 1 by one. Node 1 assigned node 3 as its child because it had space for a new child. When node 4 joined, node 0 did not have space for a new child and passed node 4 to the child with the smallest number of subtree nodes, node 2. Similarly, node 5 and node 6 joined. When node 6 crashed, all of its children detected the crash independently and contacted other ancestors to rejoin the tree. Every child of node 6 acts as a delegate of its subtree to save individual rejoining of the subtree nodes. Section 2.3 explains how to contact an ancestor for rejoining. The tree construction algorithm is given in Algorithm 1. We use Sub_no.(x) to count the number of subtree nodes of node x, including itself.

Fig. 1. Dissemination Tree Example

The dDT construction algorithm uses the number of subtree nodes as the metric for insertions, instead of the tree depth used in traditional balanced tree algorithms. This is because a rejoining node with a subtree may increase the tree depth by more than one, which is beyond the one-by-one tree height increase handled by traditional balanced tree algorithms. In addition, maintaining the total number of nodes in each subtree is simpler and more time efficient than maintaining the depth of each subtree. With depth as the metric, internal nodes need to wait until an insertion completes before the updated tree depth can be collected layer by layer from leaf nodes back to the root. This makes the real time maintenance of the tree depth difficult and unnecessary when tree nodes are frequently joining and leaving. However, internal nodes can immediately update the total number of subtree nodes after forwarding a new node to a child. In BCoM, the tree depth is periodically collected to help set the sliding window size, where its result does not need to be updated in real time as discussed in Section 2.2.2. But using an outdated tree depth for insertions to dDT will lead to an unbalanced tree and degrade the performance.

Algorithm 1 dDT Construction (p, q)
Input: node p receives node q’s join request
Output: parent of node q in dDT
if p does not have d children then
    Sub_no.(p) += Sub_no.(q)
    return p
else
    find a child f of p s.t. f has the smallest Sub_no.
    Sub_no.(f) += Sub_no.(q)
    return dDT Construction (f, q)
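
As a rough illustration of Algorithm 1, the following Python sketch inserts a joining node (or a rejoining node with its subtree) by always descending into the child with the fewest subtree nodes. The class `Node`, the functions `ddt_insert` and `height`, and the field `sub_no` are hypothetical names chosen for this example. Two simplifications are assumptions of the sketch: every node on the insertion path increments its own counter exactly once, and ties are broken by picking the first smallest child rather than a random one.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A replica node in a d-ary dissemination tree (illustrative names)."""
    name: str
    children: List["Node"] = field(default_factory=list)
    sub_no: int = 1  # number of nodes in this subtree, including the node itself


def ddt_insert(p: Node, q: Node, d: int) -> Node:
    """Insert q (possibly carrying its own subtree) below p; return q's parent."""
    p.sub_no += q.sub_no            # q ends up somewhere inside p's subtree
    if len(p.children) < d:         # p still has a free child slot
        p.children.append(q)
        return p
    # Otherwise forward q to the child with the fewest subtree nodes.
    # min() returns the first such child; the paper breaks ties randomly.
    f = min(p.children, key=lambda c: c.sub_no)
    return ddt_insert(f, q, d)


def height(n: Node) -> int:
    """Tree height counted in edges; a single node has height 0."""
    return 0 if not n.children else 1 + max(height(c) for c in n.children)
```

Inserting six single nodes into a tree with d = 2 under these assumptions yields a complete binary tree of height 2, and a rejoining two-node subtree is placed under one of the leaves while every counter on the path grows by two.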

2.2 Sliding Window Update Protocol

2.2.1 Basic Operations in Sliding Window Update

The sliding window update protocol regulates the consistency strictness in a dDT. “Sliding” refers to the incremental adjustment of the window size based on dynamic system conditions. If dDT_i of object i is assigned a sliding window size k_i, any replica node in dDT_i can buffer up to k_i unacknowledged updates before getting blocked from receiving new updates. In other words, each node in dDT_i is given a buffer of size k_i. At the beginning, the root receives the first update, sends it to all children and waits for their ACKs. There are two types of ACKs, R_ACK and NR_ACK. Both indicate that the update has been received. The difference is that R_ACK means the sender is ready to receive the next update; NR_ACK means the sender is not ready. While waiting, the root accepts and buffers the incoming updates as long as its buffer does not overflow. When receiving an R_ACK from a child, the root sends the next update to this child if there is a buffered update that has not been sent to this child. When receiving an NR_ACK from a child, it marks the update as received by this child and stops sending updates to this child. After receiving ACKs from all children, the root removes the update from its buffer.

There are two cases of buffer overflow: 1) when the root’s buffer is full, the new updates are discarded until there is a space; 2) when an internal node’s buffer is full, the node sends NR_ACK to its parent for the last received update. An R_ACK is sent to its parent when the internal node has a space in its buffer to resume receiving updates. A leaf node does not have any buffer. After receiving an update, it immediately sends an R_ACK to its parent.
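
The ACK rules above can be sketched in a few lines of Python. This is a simplified, single-process model under stated assumptions, not the protocol implementation: the classes `ReplicaNode` and `RootNode` and their methods are hypothetical, the window size k is assumed to be at least 1, and per-child bookkeeping is collapsed into one call that signals "the oldest buffered update has been acknowledged by all children".

```python
from collections import deque
from typing import Deque, Optional

R_ACK, NR_ACK = "R_ACK", "NR_ACK"


class ReplicaNode:
    """One dDT node with a window (buffer) of size k >= 1."""

    def __init__(self, k: int, is_leaf: bool = False) -> None:
        self.k = k
        self.is_leaf = is_leaf
        self.buffer: Deque[int] = deque()   # versions not yet ACKed by all children
        self.latest: Optional[int] = None   # newest version seen by this node

    def is_full(self) -> bool:
        return len(self.buffer) >= self.k

    def receive(self, version: int) -> str:
        """Accept an update from the parent and return the ACK type to send back."""
        self.latest = version
        if self.is_leaf:
            return R_ACK                    # leaves keep no buffer
        self.buffer.append(version)
        return NR_ACK if self.is_full() else R_ACK

    def oldest_acked_by_all_children(self) -> Optional[str]:
        """Drop the oldest buffered update; return R_ACK if this unblocks the node."""
        if not self.buffer:
            return None
        was_full = self.is_full()
        self.buffer.popleft()
        return R_ACK if was_full else None


class RootNode(ReplicaNode):
    """The root serializes updates and discards them while its buffer is full."""

    def __init__(self, k: int) -> None:
        super().__init__(k)
        self.next_version = 1
        self.discarded = 0

    def submit(self) -> Optional[int]:
        """Try to accept a new update; return its version, or None if discarded."""
        if self.is_full():
            self.discarded += 1
            return None
        version = self.next_version
        self.next_version += 1
        self.buffer.append(version)
        self.latest = version
        return version
```

With k = 2, the root accepts two updates, discards a third while both are still unacknowledged, and accepts again once the oldest one has been acknowledged by all children. An internal node answers R_ACK for its first update and NR_ACK for the one that fills its buffer.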

Figure 2 shows an example of the sliding window update protocol with the window size set to 8. V stands for the version number of an update; for example, V10−V13 means that the node keeps the updates from the 10th version to the 13th version. Each internal node keeps the versions from the next version for its slowest child up to the latest version it received. Each leaf node only keeps the latest version it received.


Fig. 2. An example of sliding window update protocol.

2.2.2 Window Size Setting

The sliding window size k_i is critical for balancing the consistency strictness, object availability for updating, and update dissemination latency. A large k_i masks the long network latency and the temporary unavailability of replica nodes, and thus lowers the update discard rate. But a large k_i enlarges the discrepancy between the local version of a replica node and the latest version at the root. Thus, a large window size k_i weakens the consistency and increases the queueing delay of update propagation in dDT_i. On the extremes, infinite buffer size provides eventual consistency without discarding updates, and buffer size zero provides sequential consistency with the worst update discard rate.

We present an analytical model in Section 3 to set the sliding window size k_i so that the discard rate is minimized under a delay constraint and a consistency constraint. The detailed formula is given in Section 3. Here, we explain the procedure for setting the window size. The root sets the window size for all tree nodes and adjusts it periodically when needed. The root measures the input metrics for computing the window size every T seconds and adjusts the value of k_i only after the metrics stabilize and the old k_i violates certain constraints. In this way, unnecessary changes due to temporary disturbances are eliminated to keep dDT_i stable. If k_i needs to be adjusted, it is incrementally increased or decreased until the constraints are satisfied.


The input metrics for computing the window size k_i include the update arrival rate λ, the tree height L, and the bottleneck service time μ. The arrival rate is directly measured by the root. The tree height and bottleneck service time are collected periodically from leaf nodes to the root in a bottom-up approach. The values of the two metrics are aggregated at every internal node so that the maintenance message keeps the same size. The aggregation is performed as follows: each leaf node initializes the tree height to zero (L = 0) and the bottleneck service time μ to the update propagation delay between itself and its parent. Each node sends the maintenance message to its parent. Once an internal node receives the maintenance messages from all children, it updates L as the maximum value of its children’s tree height plus 1 and μ as the maximum value among its and every child’s service time. If its service time is longer than a child’s, a non-blocking migration is executed to swap the parent with the child. The aggregation continues until the root is reached.
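
The bottom-up aggregation can be illustrated with a short recursive Python sketch. It is only a model of the message flow: the class `TreeNode`, its field `service_time`, and the function `aggregate` are hypothetical names, the recursion stands in for the periodic maintenance messages, and for a leaf `service_time` is assumed to hold the update propagation delay to its parent. The migration trigger mentioned above is left out.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TreeNode:
    service_time: float                 # this node's own service time (illustrative)
    children: List["TreeNode"] = field(default_factory=list)


def aggregate(node: TreeNode) -> Tuple[int, float]:
    """Return (tree height L, bottleneck service time mu) for the subtree at node."""
    if not node.children:
        # A leaf reports L = 0 and its own delay towards its parent.
        return 0, node.service_time
    reports = [aggregate(child) for child in node.children]
    height = 1 + max(h for h, _ in reports)
    mu = max([node.service_time] + [m for _, m in reports])
    return height, mu                   # constant-size message passed to the parent
```

Each node forwards a single pair to its parent regardless of subtree size, which is why the maintenance message keeps the same size at every layer.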

2.3 Ancestor Cache Maintenance

Each replica node maintains a cache of m ancestors starting from its parent leading to the root in dDT_i. The value of m is set based on the node churn rate (i.e., the number of nodes joining and leaving the system during a given period) so that the probability that all m nodes simultaneously fail is negligibly small. If a node does not have m ancestors, it caches all the ancestors from its parent to the root.

A node contacts its cached ancestors sequentially layer by layer upwards when its parent becomes unreachable. This can be detected by ACKs and maintenance messages. Sequentially contacting enables a node to find the closest ancestor. The root is finally contacted if all the other ancestors are unavailable. The root failure is handled by the overlay routing, as a node with the nearest ID will replace the crashed node to be the new root. Different replication schemes may be used to reduce the cost of root failure, which is specific to a structured P2P overlay and beyond the scope of this paper.

The contacted ancestor runs the tree construction Algorithm 1 to find a new position for a rejoining node with its subtree. BCoM does not replace a crashed node with a leaf node to maintain the original tree structure because migration brings down a bottleneck node to the leaf layer for performance improvement. The new parent node transfers the latest version of the object to the new child if necessary. Since each node only keeps k previous updates, content transmission is used to avoid the communication overhead for getting the missing updates from other nodes. The sliding window update protocol resumes for incoming updates.

The ancestor cache provides fast recovery from node failures with a small overhead. Assuming the probability of a replica node failure is p, an ancestor cache of size m has a successful recovery probability of 1 − p^m. An ancestor cache is easily maintained by piggybacking an ancestor list on each update. Whenever a node receives an update it adds itself to the ancestor list before propagating the update to its children. Each node uses the newly received ancestor list to refresh its cache. There is no extra communication, and the storage overhead is also negligible for keeping the information of m ancestors.
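
A small Python sketch can make the recovery probability and the piggybacked ancestor list concrete. All function names below are hypothetical, the cache size m is assumed to be at least 1, and the probability helper follows the text in treating the m cached ancestors as failing independently with probability p.

```python
from typing import List


def recovery_probability(p: float, m: int) -> float:
    """Probability that at least one of m cached ancestors is alive."""
    return 1.0 - p ** m


def min_cache_size(p: float, target: float) -> int:
    """Smallest m with 1 - p**m >= target, assuming 0 < p < 1 and 0 < target < 1."""
    m = 1
    while recovery_probability(p, m) < target:
        m += 1
    return m


def refresh_cache(ancestor_list: List[str], m: int) -> List[str]:
    """Keep the m nearest ancestors, parent first, from a root-first list."""
    return list(reversed(ancestor_list[-m:]))


def list_for_children(ancestor_list: List[str], self_id: str) -> List[str]:
    """Append this node to the list before forwarding the update downwards."""
    return ancestor_list + [self_id]
```

For instance, with p = 0.5 a cache of seven ancestors is the smallest that pushes the recovery probability above 0.99 under these assumptions, which illustrates why a small cache is usually enough.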

2.4 Tree Node Migration

Any non-leaf node will be blocked from receiving new updates if one of its descendants has a buffer overflow in the sliding window update protocol. It is quite possible that a lower layer node performs faster than a bottleneck node. This motivates us to promote the faster node to a higher level and demote the bottleneck node to a lower level. For example, in Figure 1, assume node 1 is the bottleneck node causing the root 0 to be blocked. The faster node may be a descendant of the bottleneck node as shown in (A) or a descendant of a sibling of the bottleneck node as shown in (B). When blocking occurs, node 0 can swap the bottleneck node 1 with a faster descendant who has more recent updates, like node 4, to remove blocking. Before blocking occurs, node 1 can be swapped with its fastest child who has the same update version to prevent blocking. The performance improvement through node migration is confirmed by our analytical model in Section 3.

There are two forms of node migration, as described below.

• Blocking triggered migration: the blocked node searches for a faster descendant who has a more recent update than the bottleneck node, and swaps them to remove blocking.

• Non-blocking migration: when a node observes a child performing faster than itself, it swaps with this child. Such migration prevents blocking and speeds up the update propagation for the subtree rooted at the parent node.

The swapping of (B) in Figure 1 is an example of blocking triggered migration and (A) is an example of non-blocking migration. Both forms of migration swap one layer at a time and, hence, multiple migrations are needed for multi-layer swapping. The non-blocking migration helps promote faster nodes to upper layers, which makes the searching in blocking-triggered migration easier. Since the overlay DHT routing in structured P2P networks relies on cooperative nodes, we assume BCoM is run by these cooperative P2P nodes transparently to end users. Tree node migration uses only the local information and improves the overall system performance.
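
The non-blocking form of migration can be sketched as a one-layer swap between a tree position and its fastest child. The following Python fragment is a minimal sketch under assumptions: `Slot` and `non_blocking_migration` are hypothetical names, speed is reduced to a single `service_time` value, and the check that the child holds the same update version as its parent is omitted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Slot:
    """A position in the dissemination tree, occupied by one peer."""
    peer: str
    service_time: float     # lower means the peer propagates updates faster
    children: List["Slot"] = field(default_factory=list)


def non_blocking_migration(parent: Slot) -> bool:
    """Swap parent with its fastest child if that child is faster; one layer only."""
    if not parent.children:
        return False
    fastest = min(parent.children, key=lambda c: c.service_time)
    if fastest.service_time >= parent.service_time:
        return False        # no child is faster, nothing to do
    # Exchange the occupants; the tree shape itself stays unchanged.
    parent.peer, fastest.peer = fastest.peer, parent.peer
    parent.service_time, fastest.service_time = (
        fastest.service_time,
        parent.service_time,
    )
    return True
```

Because each call moves a peer by at most one layer, promoting a fast peer across several layers takes repeated calls along the path, in line with the one-layer-at-a-time behaviour described above.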

2.5 Basic Operations in BCoM

BCoM provides three basic operations:

• Subscribe: if a node p wants to read the object i and keep it updated, p sends a subscription request to the root of dDT_i through the overlay routing. After receiving the request, the root runs Algorithm 1 to
