What future works have the authors mentioned in the paper "Epidemic-style management of semantic overlays for content-based searching" ?

To the best of their knowledge, all earlier work on implicit building of semantic overlays relies on using heuristics to decide which of the peers that served a node recently are likely to be useful again in future queries [10, 12, 5].

Why does the gossiping protocol have a high usage of bandwidth?

Due to the periodic behavior of gossiping, the price of having rapidly converging protocols may be a high usage of network resources (i.e., bandwidth).

What is the purpose of this paper?

In this paper the authors introduce the idea of applying epidemics to build and dynamically maintain semantic lists in a large-scale file-sharing system.

How fast does gv = gc = 1 adapt to changes?

With gv = gc = 3 the system adapts a little faster to changes, but if bandwidth is of high concern, gv = gc = 1 can also provide very good results.

How many items are transferred to and from a node?

In each gossip 2 · (gv + gc) items are transferred to and from the node, resulting in a total traffic of 4 · (gv + gc) items for a node per cycle.

What is the way to build a file sharing service?

The authors show that using a two-layered approach combining two epidemic protocols is the appropriate way to build such a service.

How do you select gc items from the CYCLON view?

In the CYCLON policy table (Figure 3(b)), selectPeer() selects the peer from the item with the oldest timestamp. For selectItemsToSend(), the active thread selects its own item and randomly gc − 1 others from the CYCLON view, while the passive thread randomly selects gc items from the CYCLON view.

(Open Access) Epidemic-Style management of semantic overlays for content-based searching (2005) | Spyros Voulgaris

Q: What contributions have the authors mentioned in the paper "Epidemic-style management of semantic overlays for content-based searching" ?

In this paper the authors follow a different approach, proposing a proactive method to build a semantic overlay.

Q: What should be the main argument for a two-layered approach for managing semantic overlay networks?

In particular, the authors advocate that when it comes to constructing and using semantic lists, these lists should be optimized for search only, regardless of any other desirable property of the resulting overlay.

VU Research Portal

Epidemic-Style Management of Semantic Overlays for Content-based Searching

Voulgaris, S.; van Steen, M.

2004

document version

Publisher's PDF, also known as Version of record

Link to publication in VU Research Portal

citation for published version (APA)

Voulgaris, S., & van Steen, M. (2004). Epidemic-Style Management of Semantic Overlays for Content-based Searching. (VU Technical Report; No. IR-CS-011.04). Vrije Universiteit, Faculty of Mathematics and Computer Science.

Epidemic-style Management of Semantic Overlays for Content-Based Searching

Spyros Voulgaris

Vrije Universiteit Amsterdam

spyros@cs.vu.nl

Maarten van Steen

Vrije Universiteit Amsterdam

steen@cs.vu.nl

Abstract

A lot of recent research on content-based P2P searching for file-sharing applications has focused on exploiting semantic relations between peers to facilitate searching. To the best of our knowledge, all methods proposed to date suggest reactive ways to seize peers’ semantic relations. That is, they rely on the usage of the underlying search mechanism, and infer semantic relations based on the queries placed and the corresponding replies received. In this paper we follow a different approach, proposing a proactive method to build a semantic overlay. Our method is based on an epidemic protocol that clusters peers with similar content. It is worth noting that this peer clustering is done in a completely implicit way, that is, without requiring the user to specify his preferences or to characterize the content of files he shares.

1. Introduction

File sharing peer-to-peer (P2P) systems have gained enormous popularity in recent years. This has stimulated significant research activity in the area of content-based searching. Sparked by the legal adventures of Napster, and challenged to defeat the inherent limitations concerning the scalability and failure resilience of centralized systems, research has focused on decentralized solutions for content-based searching, which by now has resulted in a wealth of proposals for peer-to-peer networks.

In this paper, we are interested in the group of networks in which searching is based on grouping semantically related nodes. In these networks, a node first queries its semantically close peers before resorting to search methods that span the entire network. In particular, we are interested in solutions where semantic relationships between nodes are captured implicitly. This capturing is generally achieved through analysis of query results, leading to the construction of a local semantic list at each peer, consisting of references to other, semantically close peers.

Only very recently, an extensive study has been published on search methods in peer-to-peer networks, be they structured, unstructured, or of a hybrid form [9]. This study reveals that virtually all peer-to-peer search methods in semantic overlay networks follow an integrated approach towards the construction of the semantic lists, while at the same time accounting for changes occurring in the set of nodes. These changes involve the joining and leaving of nodes, as well as changes in a node’s preferences.

The problem we are faced with is that the construction of semantic lists should result in highly clustered overlay networks. These networks excel for searching content when nothing changes. However, handling dynamics requires the discovery and propagation of changes that may happen anywhere in the network. For this reason, overlay networks should also reflect desirable properties of random graphs and complex networks in general [3, 8]. These two conflicting demands generally lead to complexity when integrating solutions into a single protocol.

Protocols for content-based searching in peer-to-peer networks should separate these concerns. In particular, we advocate that when it comes to constructing and using semantic lists, these lists should be optimized for search only, regardless of any other desirable property of the resulting overlay. Instead, a separate protocol should be used to handle network dynamics, and provide up-to-date information that will allow proper adjustments in the semantic lists (and thus leading to adjustments in the semantic overlay network itself).

In this paper we propose such a two-layered approach for managing semantic overlay networks. The top layer contains a gossip-based protocol that strives to optimize semantic lists for searching only. The bottom layer offers a fully decentralized service for delivering, in an unbiased fashion, information on new events, similar in nature to the peer-sampling service recently described in [7]. Again, this service is implemented using a gossip-based protocol (which, by the way, is very different from those described in [7]).

Our main contribution is that we demonstrate that this two-layered approach leads to high-quality semantic overlay networks. We substantiate our claims through extensive simulations using traces collected from the eDonkey file-sharing network [4].

The paper is organized as follows. We start by presenting our protocols in the next section, followed by a description of our experimental setup in Section 3. Performance evaluation is discussed in Section 4, followed by an analysis of consumed bandwidth in Section 5. We conclude with a discussion in Section 6.

2. The Protocol

2.1. Outline

In our model each peer maintains a dynamic list of semantic neighbors, called its semantic view, of fixed small size $\ell$. A peer searches for a file by first querying its semantic neighbors. If no results are returned, the peer then resorts to the default search mechanism.

Our aim is to organize the semantic views so as to maximize the hit ratio of the first phase of the search. We will call this the semantic hit ratio. We anticipate that the probability of a neighbor satisfying a peer’s query is proportional to the semantic proximity between the peer and its neighbor. We aim, therefore, at filling a peer’s semantic view with its $\ell$ semantically closest peers out of the whole network.

We assume the existence of a semantic proximity function $S(F_P, F_Q)$, which given the file lists $F_P$ and $F_Q$ of peers $P$ and $Q$, respectively, provides a numeric metric of the semantic proximity between the two peers. The more semantically similar the file lists of $P$ and $Q$ are, the higher the value of $S(F_P, F_Q)$. We are essentially seeking to pick peers $Q_1, \ldots, Q_\ell$ for peer $P$’s semantic view, such that the sum $\sum_{i=1}^{\ell} S(F_P, F_{Q_i})$ is maximized.

We assume that the semantic proximity function exhibits a form of transitivity, that is, for the largest percentage of node triplets $\{P, Q, R\}$, if $S(F_P, F_Q) > C$ and $S(F_Q, F_R) > C$, then also $S(F_P, F_R) > C$, for some system-defined constant $C$. In other words, if $P$ and $Q$ are semantic neighbors, as well as $Q$ and $R$, then so are $P$ and $R$. The higher $C$ is, the stronger the transitivity.
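The selection goal stated above can be illustrated with a minimal sketch. Because each term of the sum depends on a single candidate peer, taking the $\ell$ candidates with the largest proximity values maximizes the sum. The function name `pick_semantic_view` and the precomputed `scores` mapping (peer name to its proximity with $P$) are hypothetical and only serve the illustration; they are not part of the paper's protocol, which has to discover candidates by gossiping.

```python
import heapq
from typing import Dict, List


def pick_semantic_view(scores: Dict[str, float], ell: int) -> List[str]:
    """Return the ell peers with the highest proximity to P.

    `scores` maps each candidate peer Q to S(F_P, F_Q); it is assumed
    to be known here, which a real node would not have for the whole network.
    """
    return heapq.nlargest(ell, scores, key=scores.get)


# Toy proximity values of four candidates with respect to some peer P.
scores = {"Q1": 4, "Q2": 0, "Q3": 7, "Q4": 2}
view = pick_semantic_view(scores, ell=2)
total = sum(scores[q] for q in view)
```

With these toy values the view holds Q3 and Q1, the two highest-scoring candidates.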

2.2. Design Motivation

From our previous discussion, we are seeking a means to construct, for each node, a semantic view from all the current nodes in the system. There are two sides to this construction.

First, based on the assumption of transitivity in the semantic proximity function $S$, a peer should explore the semantically close peers that its neighbors have found. In other words, if $Q$ is in $P$’s semantic view, and $R$ is in $Q$’s view, it makes sense to check whether $R$ is also semantically close to $P$. Exploiting the transitivity in $S$ should then quickly lead to high-quality semantic views.

Second, it is important that all nodes are examined. The problem with following only transitivity is that we eventually will be searching only in a single semantic cluster. Similar to the special “long” links in small-world networks [13], we need to establish links to other semantically related clusters. Likewise, when new nodes join the network, they should easily find an appropriate cluster to join. These issues call for a randomization when selecting nodes to inspect for adding to a semantic view.

Figure 1. The two-layered framework

In our design we decouple these two aspects by adopting a two-layered set of gossip protocols, as can be seen in Figure 1. The lower layer, called CYCLON [11], is responsible for maintaining a connected overlay and for periodically feeding the top-layer protocol with nodes selected uniformly at random from the network. In its turn, the top-layer protocol, called VICINITY, is in charge of focusing on discovering peers that are semantically as close as possible, and of adding these nodes to the semantic views.

2.3. Gossiping Framework

All information exchange between peers is carried out by means of gossip items, or simply items. A gossip item created by peer P is a tuple containing the following three fields:

1. P’s contact information (network address and port)

2. The item’s creation time

3. Application-specific data; in this case P’s file list
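The three fields listed above map naturally onto a small record type. The following is a minimal sketch, assuming hypothetical field names (`address`, `created_at`, `file_list`) and a hypothetical helper `newest`; the paper only specifies which pieces of information an item carries, not how they are represented.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class GossipItem:
    address: Tuple[str, int]    # contact information: (network address, port)
    created_at: float           # the item's creation time
    file_list: FrozenSet[str]   # application-specific data: the creator's file list


def newest(a: GossipItem, b: GossipItem) -> GossipItem:
    """Of two items describing the same node, return the more recent one."""
    return a if a.created_at >= b.created_at else b


old = GossipItem(("10.0.0.1", 4000), 1.0, frozenset({"a.mp3"}))
new = GossipItem(("10.0.0.1", 4000), 2.0, frozenset({"a.mp3", "b.mp3"}))
```

Keeping the creation time in the item is what lets a node prefer the most recent item when it holds several items from the same peer.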

Each node maintains locally a number of items per protocol, called the protocol’s view. This number is the same for all nodes, and is called the protocol’s view size ($c_v$ for VICINITY, and $c_c$ for CYCLON).

Figure 2 presents a generic skeleton forming the basis for both VICINITY and CYCLON gossiping protocols. Each node runs two threads. An active one, which periodically wakes up and initiates communication to another peer, and a passive one, which responds to the communication initiated by another peer.

The functions appearing in boldface, namely selectPeer(), selectItemsToSend(), and selectItemsToKeep() form the three hooks of this skeleton. Different protocols can be instantiated from this skeleton by implementing specific policies for these three functions, in turn, leading to different emergent behaviors.

The number of items exchanged in each communication is predefined, and is called the protocol’s gossip length ($g_v$ for VICINITY, and $g_c$ for CYCLON).

    /*** Active thread ***/
    // Runs periodically every T time units
    q = selectPeer()
    myItem = (myAddress, timeNow, myFileList)
    buf_send = selectItemsToSend()
    send buf_send to q
    receive buf_recv from q
    view = selectItemsToKeep()

    /*** Passive thread ***/
    // Runs when contacted by some peer
    receive buf_recv from p
    myItem = (myAddress, timeNow, myFileList)
    buf_send = selectItemsToSend()
    send buf_send to p
    view = selectItemsToKeep()

Figure 2. Epidemic protocol skeleton
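The skeleton of Figure 2 can be sketched as an abstract class whose three hooks are filled in by a concrete protocol. This is an illustrative sketch under several assumptions: the network is replaced by a plain dictionary of in-process objects, threads are replaced by direct method calls, items are plain `(address, timestamp, file_list)` tuples, and `ToyProtocol` is a made-up policy set that is neither VICINITY nor CYCLON.

```python
from abc import ABC, abstractmethod


class GossipProtocol(ABC):
    def __init__(self, address, file_list):
        self.address = address
        self.file_list = file_list
        self.view = []  # list of (address, timestamp, file_list) items

    @abstractmethod
    def select_peer(self): ...

    @abstractmethod
    def select_items_to_send(self, my_item, peer, active): ...

    @abstractmethod
    def select_items_to_keep(self, sent, received): ...

    def active_step(self, network, now):
        """Active thread body: contact a peer and exchange items."""
        q = self.select_peer()
        my_item = (self.address, now, self.file_list)
        buf_send = self.select_items_to_send(my_item, q, active=True)
        buf_recv = network[q].passive_step(self.address, buf_send, now)
        self.view = self.select_items_to_keep(buf_send, buf_recv)

    def passive_step(self, p, buf_recv, now):
        """Passive thread body: answer a peer that contacted us."""
        my_item = (self.address, now, self.file_list)
        buf_send = self.select_items_to_send(my_item, p, active=False)
        self.view = self.select_items_to_keep(buf_send, buf_recv)
        return buf_send


class ToyProtocol(GossipProtocol):
    """Toy policies for illustration only (not VICINITY or CYCLON)."""

    def select_peer(self):
        return min(self.view, key=lambda item: item[1])[0]  # oldest timestamp

    def select_items_to_send(self, my_item, peer, active):
        return [my_item]

    def select_items_to_keep(self, sent, received):
        merged = {item[0]: item for item in self.view}
        for item in received:
            known = merged.get(item[0])
            if item[0] != self.address and (known is None or item[1] > known[1]):
                merged[item[0]] = item  # keep the most recent item per node
        return list(merged.values())


a = ToyProtocol("A", frozenset({"x"}))
b = ToyProtocol("B", frozenset({"y"}))
a.view = [("B", 0, frozenset({"y"}))]
a.active_step({"A": a, "B": b}, now=1)
```

After the exchange, B has learned about A, and A holds a fresher item for B.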

For VICINITY, we chose the policies shown in Figure 3(a). We note that the RANDOM protocol resembles T-Man [6]. The only difference is that in T-Man peers exchange their whole views, instead of just a subset of them. As we discuss below, AGGRESSIVELY BIASED will turn out to be an excellent choice for forming semantic clusters. Note that selectItemsToKeep() takes into account CYCLON’s cache too in selecting the best $c_v$ items to keep. This is the default link between the two layers.

For CYCLON, we made the choices shown in Figure 3(b). CYCLON is a protocol we developed previously; it is extensively described and analyzed in [11].

Effectively, what selectItemsToSend() and selectItemsToKeep() establish is an exchange of some neighbors between the caches of the two communicating peers. In addition to that, the selected peer’s item in the initiator’s cache is always removed, but the initiator’s (new) item is always placed in the selected peer’s cache.
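The exchange described above can be sketched as follows. This is a simplified illustration, not the CYCLON implementation of [11]: caches are reduced to dictionaries mapping a neighbor's address to its item timestamp (file lists are omitted), both peers live in one process, and slots freed by sent items are not refilled when a received item is a duplicate or refers to the receiver itself. The names `cyclon_shuffle` and `_merge` are hypothetical.

```python
import random


def _merge(cache, owner, sent, received):
    """Replace the sent items by the received ones, newest timestamp wins."""
    for n in sent:
        cache.pop(n, None)
    for n, ts in received.items():
        if n != owner and ts > cache.get(n, float("-inf")):
            cache[n] = ts


def cyclon_shuffle(p, caches, g, now, rng):
    """One exchange initiated by peer p; returns the selected peer q."""
    cache_p = caches[p]
    q = min(cache_p, key=cache_p.get)  # item with the oldest timestamp
    cache_q = caches[q]

    # Active side: q's item is removed; send own fresh item plus g-1 others.
    del cache_p[q]
    others = rng.sample(sorted(cache_p), min(g - 1, len(cache_p)))
    sent_by_p = {p: now, **{n: cache_p[n] for n in others}}

    # Passive side: send g randomly selected items.
    picked = rng.sample(sorted(cache_q), min(g, len(cache_q)))
    sent_by_q = {n: cache_q[n] for n in picked}

    _merge(cache_p, p, sent=others, received=sent_by_q)
    _merge(cache_q, q, sent=picked, received=sent_by_p)
    return q


caches = {
    "A": {"B": 0, "C": 5, "D": 6},
    "B": {"C": 1, "D": 2, "E": 3},
}
selected = cyclon_shuffle("A", caches, g=2, now=10, rng=random.Random(0))
```

Whatever the random choices, the initiator drops its link to the selected peer, and the selected peer ends up holding the initiator's fresh item.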

CYCLON creates an overlay with completely random, uncorrelated links between nodes, such that the in-degree (number of incoming links) is practically the same for each node. Importantly, it can achieve this property fairly quickly even when a small number of items (such as 3 or 4) is exchanged in each communication, even for large caches of several dozens of items. Therefore, it is ideal as a lightweight service that can offer a node a randomly selected peer from the current set of nodes.

3. Experimental Environment and Settings

All experiments presented here have been carried out with PeerSim [2], an open source simulator in Java for P2P protocols, developed at the University of Bologna.

To evaluate our protocol, we used real world traces from the eDonkey file sharing system [1], collected by Le Fessant et al. in November 2003 [4]. A set of 12,000 world-wide distributed peers along with the files each one shares is logged in these traces. A total number of 923,000 unique files is being collectively shared by these peers.

In order to simplify the analysis of our system’s emergent behavior, we determined equal gossiping periods for both layers. More specifically, once every T time units each node initiates first a gossip exchange with respect to its bottom (CYCLON) layer, immediately followed by a gossip exchange at its top (VICINITY) layer. Note that even though nodes initiate gossiping at universally fixed intervals, they are not synchronized with each other.

Even though both protocols are asynchronous, it is convenient to introduce the notion of cycles in order to study their evolutionary behavior with respect to time. We define a cycle to be the time period during which each node has initiated gossiping exactly once. Since each node initiates gossiping periodically, once every T time units, a cycle is equal to T time units.

A number of parameters had to be set for these experiments, listed here.

Proximity Function $S$. We chose a rather simple, yet intuitive proximity function to test our protocol with. The proximity $S$ between two nodes $P$ and $Q$, with file lists $F_P$ and $F_Q$ respectively, is defined as the number of files that lie in both lists. More formally: $S(F_P, F_Q) = |F_P \cap F_Q|$. As stated in 2.1, the semantically closer two nodes are, the higher the value of $S$ is.
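The proximity function defined above is a plain set-intersection count, which can be written in a few lines. In this sketch the function name `proximity` and the use of file names as identifiers are assumptions; the traces may identify files differently.

```python
from typing import Iterable


def proximity(files_p: Iterable[str], files_q: Iterable[str]) -> int:
    """S(F_P, F_Q) = |F_P ∩ F_Q|: the number of files present in both lists."""
    return len(set(files_p) & set(files_q))


files_p = {"a.mp3", "b.mp3", "c.avi"}
files_q = {"b.mp3", "c.avi", "d.iso"}
shared = proximity(files_p, files_q)
```

The function is symmetric in its arguments and returns 0 for peers that share no files.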

Semantic view size $\ell$. In all experiments the semantic view consisted of the 10 semantically closest peers in the VICINITY cache. As shown in [5], a semantic view size of $\ell = 10$ provides a good tradeoff between the number of nodes contacted in the semantic search phase and the expected semantic hit ratio.

Cache size. For the cache size selection, we are faced with the following tradeoff for both protocols. A large cache size provides higher chances of making better item selections, and therefore accelerates the construction of (near-)optimal semantic views. On the other hand, the larger the cache size, the longer it takes to contact all peers in it, resulting in the existence of older (and therefore more likely to be invalid) links. Of course, a larger cache also takes up more memory, although this is generally not a significant constraint nowadays.

Considering this tradeoff, and after an extensive set of experiments that cannot be presented here due to space limitations, we chose to present experiments with cache size 50 for each of the two layers.

    Hook                    Description
    selectPeer()            Select peer from the item with the oldest timestamp
    selectItemsToSend():
      RANDOM                Randomly select $g_v$ items
      BIASED                Select the $g_v$ items of nodes semantically closest to the
                            selected peer
      AGGRESSIVELY BIASED   Select the $g_v$ items of nodes semantically closest to the
                            selected peer from the VICINITY view and the CYCLON view
    selectItemsToKeep()     Keep the $c_v$ items of nodes that are semantically closest,
                            out of items in its current view, items received, and items
                            in the local CYCLON view. In case of multiple items from the
                            same node, keep the one with the most recent timestamp.

    (a)

    Hook                    Description
    selectPeer()            Select peer from the item with the oldest timestamp
    selectItemsToSend():
      active thread         Select own item and randomly $g_c - 1$ others from the
                            CYCLON view
      passive thread        Randomly select $g_c$ items from the CYCLON view
    selectItemsToKeep()     Keep all $g_c$ received items, replacing the $g_c$ selected
                            ones to send. In case of multiple items from the same node,
                            keep the one with the most recent timestamp.

    (b)

Figure 3. The chosen policies for (a) the VICINITY protocol and (b) the CYCLON protocol.
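The selectItemsToKeep() policy of Figure 3(a) can be sketched as a merge, a per-node deduplication by timestamp, and a ranking by proximity. This is a minimal sketch under assumptions: items are `(address, timestamp, file_list)` tuples, proximity is the shared-file count, and the function name `vicinity_keep` is hypothetical. Skipping items that describe the node itself is also an assumption, since the figure does not say how such items are handled.

```python
def vicinity_keep(own_address, own_files, vicinity_view, received, cyclon_view, c_v):
    """Keep the c_v semantically closest items out of all known items."""
    newest = {}
    for address, timestamp, files in [*vicinity_view, *received, *cyclon_view]:
        if address == own_address:
            continue  # assumption: a node never keeps an item about itself
        if address not in newest or timestamp > newest[address][1]:
            newest[address] = (address, timestamp, files)  # most recent per node
    ranked = sorted(
        newest.values(),
        key=lambda item: len(own_files & item[2]),  # shared-file count
        reverse=True,
    )
    return ranked[:c_v]


own = frozenset("abc")
view = [("Q", 1, frozenset("ab")), ("R", 1, frozenset("x"))]
received = [("Q", 2, frozenset("abc")), ("T", 1, frozenset("a"))]
cyclon = [("U", 1, frozenset("bc"))]
kept = vicinity_keep("P", own, view, received, cyclon, c_v=2)
```

In this toy run the newer item for Q replaces the older one, and the random candidate U supplied by the CYCLON view displaces the less similar R and T.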

Gossip length. The gossip length, that is, the number of items gossiped per gossip exchange per protocol, is a crucial factor for the amount of bandwidth used. This becomes of greater consequence, considering that an item carries the file list of its respective node. So, even though exchanging more items per gossip exchange allows information to disseminate faster, we are inclined to keep the gossip lengths as low as possible, as long as the system’s performance is reasonable.

The gossip length we selected for each of the two layers is 3, an admittedly low value.
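As a rough worked example of why the gossip length matters, the question-and-answer summary at the top of this page quotes $2 \cdot (g_v + g_c)$ items transferred per gossip and $4 \cdot (g_v + g_c)$ items per node per cycle. Plugging in the chosen value of 3, and for comparison the value 1 also mentioned in that summary, gives the item counts below. The factor of two between the per-gossip and per-cycle figures is read here as each node initiating one gossip per cycle and being contacted once on average; that reading is an assumption, not something stated in this section.

```latex
\begin{align*}
g_v = g_c = 3:&\quad 2\,(g_v + g_c) = 2 \cdot 6 = 12 \text{ items per gossip},
  &\quad 4\,(g_v + g_c) = 4 \cdot 6 = 24 \text{ items per cycle} \\
g_v = g_c = 1:&\quad 2\,(g_v + g_c) = 2 \cdot 2 = 4 \text{ items per gossip},
  &\quad 4\,(g_v + g_c) = 4 \cdot 2 = 8 \text{ items per cycle}
\end{align*}
```

Since every item carries a file list, the per-cycle traffic in bytes scales with these counts times the average file-list size.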

Gossip period T. The gossip period is a parameter that does not affect the protocol’s behavior. The protocol evolves as a function of the number of messages exchanged, or, consequently, of the number of cycles elapsed. The gossip period only affects how fast the protocol’s evolution will take place in time. The single constraint is that the gossip period T should be adequately longer than the worst latency throughout the network, so that gossip exchanges are not favored or hindered due to latency heterogeneity. A typical gossip period for our protocol would be 1 minute, even though this does not affect the following analysis.

4. Performance Evaluation

4.1. Convergence

In our first experiment, we evaluate the convergence speed of our algorithm by considering how quickly it finds nodes having files in common. The proximity function’s objective is for each node to discover the $\ell$ peers that have the most common files with it. Therefore, a good metric of the progress towards this goal is the average number of common files between a node and each one of its semantic neighbors. From our traces, we measured that in the optimal organization, this metric has a value of 3.87.

Figure 4. Convergence of sem. views’ quality. (Plot of the average number of common files per semantic neighbor against cycles, 0 to 300, for Random Vicinity, Random Vicinity + Cyclon, Biased Vicinity + Cyclon, Aggr. Biased Vicinity + Cyclon, and the optimal semantic lists.)
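The convergence metric described above can be computed as sketched below. The sketch assumes the average is taken over all (node, semantic neighbor) pairs in the network and that proximity is the shared-file count; the function name `avg_common_files` and the dictionary-based inputs are hypothetical.

```python
from typing import Dict, List, Set


def avg_common_files(file_lists: Dict[str, Set[str]],
                     semantic_views: Dict[str, List[str]]) -> float:
    """Average number of shared files over all (node, neighbor) pairs."""
    total = 0
    pairs = 0
    for node, neighbors in semantic_views.items():
        for neighbor in neighbors:
            total += len(file_lists[node] & file_lists[neighbor])
            pairs += 1
    return total / pairs if pairs else 0.0


file_lists = {"P": {"a", "b"}, "Q": {"a", "b", "c"}, "R": {"c"}}
semantic_views = {"P": ["Q", "R"], "Q": ["P"], "R": ["Q"]}
metric = avg_common_files(file_lists, semantic_views)
```

In this toy network the four pairs share 2, 0, 2, and 1 files, so the metric is 1.25.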

Figure 4 shows this metric as a function of the cycle for four distinct experiments. For a fair comparison, the cache size and gossip length are 50 and 3, respectively, in each layer, for all experiments. The only exception is the first experiment, which has a single layer. In this case, the cache size and gossip length are 100 and 6, respectively. All experiments start with each node knowing 5 random other ones, simply to ensure initial connectivity in a single con-

References

Statistical mechanics of complex networks

Small Worlds: The Dynamics of Networks between Order and Randomness

Efficient content location using interest-based locality in peer-to-peer systems
