
Epidemic-Style management of semantic overlays for content-based searching

TLDR
In this paper, the authors propose a proactive method to build a semantic overlay based on an epidemic protocol that clusters peers with similar content, without requiring the user to specify his preferences or to characterize the content of files he shares.
Abstract
A lot of recent research on content-based P2P searching for file- sharing applications has focused on exploiting semantic relations between peers to facilitate searching. To the best of our knowledge, all methods proposed to date suggest reactive ways to seize peers' semantic relations. That is, they rely on the usage of the underlying search mechanism, and infer semantic relations based on the queries placed and the corresponding replies received. In this paper we follow a different approach, proposing a proactive method to build a semantic overlay. Our method is based on an epidemic protocol that clusters peers with similar content. It is worth noting that this peer clustering is done in a completely implicit way, that is, without requiring the user to specify his preferences or to characterize the content of files he shares.


VU Research Portal
Epidemic-Style Management of Semantic Overlays for Content-based Searching
Voulgaris, S.; van Steen, M.
2004
document version
Publisher's PDF, also known as Version of record
Link to publication in VU Research Portal
citation for published version (APA)
Voulgaris, S., & van Steen, M. (2004). Epidemic-Style Management of Semantic Overlays for Content-based
Searching. (VU Technical Report; No. IR-CS-011.04). Vrije Universiteit, Faculty of Mathematics and Computer
Science.
Epidemic-style Management of Semantic Overlays for Content-Based Searching
Spyros Voulgaris
Vrije Universiteit Amsterdam
spyros@cs.vu.nl
Maarten van Steen
Vrije Universiteit Amsterdam
steen@cs.vu.nl
1. Introduction
File-sharing peer-to-peer (P2P) systems have gained enormous popularity in recent years. This has stimulated significant research activity in the area of content-based searching. Sparked by the legal troubles of Napster, and challenged to overcome the inherent limitations concerning the scalability and failure resilience of centralized systems, research has focused on decentralized solutions for content-based searching, which by now has resulted in a wealth of proposals for peer-to-peer networks.
In this paper, we are interested in the group of networks in which searching is based on grouping semantically related nodes. In these networks, a node first queries its semantically close peers before resorting to search methods that span the entire network. In particular, we are interested in solutions where semantic relationships between nodes are captured implicitly. This capturing is generally achieved through analysis of query results, leading to the construction of a local semantic list at each peer, consisting of references to other, semantically close peers.
Only very recently, an extensive study has been pub-
lished on search methods in peer-to-peer networks, be they
structured, unstructured, or of a hybrid form [9]. This study
reveals that virtually all peer-to-peer search methods in se-
mantic overlay networks follow an integrated approach to-
wards the construction of the semantic lists, while at the
same time accounting for changes occurring in the set of
nodes. These changes involve the joining and leaving of
nodes, as well as changes in a node’s preferences.
The problem we are faced with is that the construction of semantic lists should result in highly clustered overlay networks. These networks excel at searching content when nothing changes. However, handling dynamics requires the discovery and propagation of changes that may happen anywhere in the network. For this reason, overlay networks should also reflect desirable properties of random graphs and complex networks in general [3, 8]. These two conflicting demands generally lead to complexity when integrating solutions into a single protocol.
Protocols for content-based searching in peer-to-peer
networks should separate these concerns. In particular, we
advocate that when it comes to constructing and using se-
mantic lists, these lists should be optimized for search only,
regardless of any other desirable property of the resulting
overlay. Instead, a separate protocol should be used to han-
dle network dynamics, and provide up-to-date information
that will allow proper adjustments in the semantic lists (and
thus leading to adjustments in the semantic overlay network
itself).
In this paper we propose such a two-layered approach for managing semantic overlay networks. The top layer contains a gossip-based protocol that strives to optimize semantic lists for searching only. The bottom layer offers a fully decentralized service for delivering, in an unbiased fashion, information on new events, similar in nature to the peer-sampling service recently described in [7]. Again, this service is implemented using a gossip-based protocol (which is very different from those described in [7]).
Our main contribution is that we demonstrate that this
two-layered approach leads to high-quality semantic over-
lay networks. We substantiate our claims through extensive
simulations using traces collected from the eDonkey file-
sharing network [4].
The paper is organized as follows. We start by presenting our protocols in the next section, followed by a description of our experimental setup in Section 3. Performance evaluation is discussed in Section 4, followed by an analysis of consumed bandwidth in Section 5. We conclude with a discussion in Section 6.
2. The Protocol
2.1. Outline
In our model each peer maintains a dynamic list of semantic neighbors, called its semantic view, of fixed small size $\ell$. A peer searches for a file by first querying its semantic neighbors. If no results are returned, the peer then resorts to the default search mechanism.
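The two-phase search described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the name `two_phase_search`, the in-memory `file_lists` dictionary standing in for remote queries, and the `default_search` callback are all assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, List, Set


def two_phase_search(
    filename: str,
    semantic_view: Iterable[str],
    file_lists: Dict[str, Set[str]],
    default_search: Callable[[str], List[str]],
) -> List[str]:
    """Return the peers that hold `filename`, asking semantic neighbors first.

    `file_lists` simulates what each neighbor would answer (hypothetical);
    `default_search` is whatever network-wide search mechanism is available.
    """
    # Phase 1: query only the semantic neighbors.
    hits = [peer for peer in semantic_view
            if filename in file_lists.get(peer, set())]
    if hits:
        return hits
    # Phase 2: no semantic neighbor has the file, fall back to the default search.
    return default_search(filename)
```

A query answered in phase 1 counts towards the semantic hit ratio; only the remaining queries load the default search mechanism.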
Our aim is to organize the semantic views so as to maximize the hit ratio of the first phase of the search. We will call this the semantic hit ratio. We anticipate that the probability of a neighbor satisfying a peer’s query is proportional to the semantic proximity between the peer and its neighbor. We aim, therefore, at filling a peer’s semantic view with its $\ell$ semantically closest peers out of the whole network.
We assume the existence of a semantic proximity function $S(F_P, F_Q)$, which, given the file lists $F_P$ and $F_Q$ of peers $P$ and $Q$, respectively, provides a numeric metric of the semantic proximity between the two peers. The more semantically similar the file lists of $P$ and $Q$ are, the higher the value of $S(F_P, F_Q)$. We are essentially seeking to pick peers $Q_1, Q_2, \ldots, Q_\ell$ for peer $P$’s semantic view, such that the sum $\sum_{i=1}^{\ell} S(F_P, F_{Q_i})$ is maximized.
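Because each candidate contributes to the sum independently, the sum is maximized by taking the candidates with the highest individual proximity values. The sketch below illustrates this for a node that (unrealistically) knows every candidate's file list; the function name and the overlap-based proximity used in the usage note are assumptions for illustration only.

```python
import heapq
from typing import Callable, Dict, List, Set


def select_semantic_view(
    own_files: Set[str],
    candidates: Dict[str, Set[str]],
    proximity: Callable[[Set[str], Set[str]], float],
    ell: int,
) -> List[str]:
    """Pick the `ell` candidates with the highest proximity to `own_files`.

    `candidates` maps a peer identifier to that peer's file list.
    Ties are resolved in favor of the candidate encountered first.
    """
    return heapq.nlargest(
        ell,
        candidates,
        key=lambda peer: proximity(own_files, candidates[peer]),
    )
```

For example, with `proximity = lambda a, b: len(a & b)` a peer sharing two files with the node is ranked ahead of a peer sharing one. In the actual protocol no node has such global knowledge, which is why the view is approximated through gossiping.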
We assume that the semantic proximity function exhibits a form of transitivity, that is, for the largest percentage of node triplets $\{P, Q, R\}$, if $S(F_P, F_Q) > C$ and $S(F_Q, F_R) > C$, then also $S(F_P, F_R) > C$, for some system-defined constant $C$. In other words, if $P$ and $Q$ are semantic neighbors, as well as $Q$ and $R$, then so are $P$ and $R$. The higher $C$ is, the stronger the transitivity.
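One way to check how well a given proximity function satisfies this assumption on a data set is to count, among all ordered triplets whose premise holds, the fraction for which the conclusion also holds. The following is a brute-force sketch under that interpretation; the function name is hypothetical, and the cubic cost makes it suitable only for small samples of peers.

```python
from itertools import permutations
from typing import Callable, Dict, Optional, Set


def transitivity_fraction(
    file_lists: Dict[str, Set[str]],
    proximity: Callable[[Set[str], Set[str]], float],
    c: float,
) -> Optional[float]:
    """Fraction of ordered triplets (P, Q, R) with S(P,Q) > c and S(Q,R) > c
    for which S(P,R) > c also holds. Returns None if no triplet meets the premise."""
    premises = 0
    satisfied = 0
    for p, q, r in permutations(file_lists, 3):
        if (proximity(file_lists[p], file_lists[q]) > c
                and proximity(file_lists[q], file_lists[r]) > c):
            premises += 1
            if proximity(file_lists[p], file_lists[r]) > c:
                satisfied += 1
    return satisfied / premises if premises else None
```

A value close to 1 suggests that exploring the neighbors of one's neighbors is likely to uncover semantically close peers.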
2.2. Design Motivation
From our previous discussion, we are seeking a means to construct, for each node, a semantic view from all the current nodes in the system. There are two sides to this construction.
First, based on the assumption of transitivity in the semantic proximity function S, a peer should explore the semantically close peers that its neighbors have found. In other words, if Q is in P’s semantic view, and R is in Q’s view, it makes sense to check whether R is also semantically close to P. Exploiting the transitivity in S should then quickly lead to high-quality semantic views.
Second, it is important that all nodes are examined. The problem with following only transitivity is that we eventually will be searching only in a single semantic cluster. Similar to the special “long” links in small-world networks [13], we need to establish links to other semantically related clusters. Likewise, when new nodes join the network, they should easily find an appropriate cluster to join. These issues call for randomization when selecting nodes to inspect for adding to a semantic view.
Figure 1. The two-layered framework
In our design we decouple these two aspects by adopting a two-layered set of gossip protocols, as can be seen in Figure 1. The lower layer, called CYCLON [11], is responsible for maintaining a connected overlay and for periodically feeding the top-layer protocol with nodes selected uniformly at random from the network. In turn, the top-layer protocol, called VICINITY, is in charge of discovering peers that are semantically as close as possible, and of adding these nodes to the semantic views.
2.3. Gossiping Framework
All information exchange between peers is carried out
by means of gossip items, or simply items. A gossip item
created by peer P is a tuple containing the following three
fields:
1. P’s contact information (network address and port)
2. The item’s creation time
3. Application-specific data; in this case P’s file list
Each node locally maintains a number of items per protocol, called the protocol’s view. This number is the same for all nodes, and is called the protocol’s view size ($c_v$ for VICINITY, and $c_c$ for CYCLON).
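A gossip item as described above maps naturally onto a small immutable record. The sketch below is one possible representation, with field names chosen for illustration; the helper `newest_per_node` anticipates the rule, stated in the policies of Figure 3, that when several items from the same node are present the one with the most recent timestamp is kept.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class GossipItem:
    address: Tuple[str, int]   # contact information: (network address, port)
    created: float             # the item's creation time
    file_list: FrozenSet[str]  # application-specific data: the creator's file list


def newest_per_node(items: Iterable[GossipItem]) -> List[GossipItem]:
    """Keep, for every originating node, only its most recently created item."""
    newest = {}
    for item in items:
        current = newest.get(item.address)
        if current is None or item.created > current.created:
            newest[item.address] = item
    return list(newest.values())
```

A view is then simply a collection of at most $c_v$ (or $c_c$) such items.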
Figure 2 presents a generic skeleton forming the basis for both the VICINITY and CYCLON gossiping protocols. Each node runs two threads: an active one, which periodically wakes up and initiates communication with another peer, and a passive one, which responds to communication initiated by another peer.
The functions selectPeer(), selectItemsToSend(), and selectItemsToKeep() form the three hooks of this skeleton. Different protocols can be instantiated from this skeleton by implementing specific policies for these three functions, in turn leading to different emergent behaviors.

The number of items exchanged in each communication is predefined, and is called the protocol’s gossip length ($g_v$ for VICINITY, and $g_c$ for CYCLON).
/*** Active thread ***/
// Runs periodically every T time units
q = selectPeer()
myItem = (myAddress, timeNow, myFileList)
buf_send = selectItemsToSend()
send buf_send to q
receive buf_recv from q
view = selectItemsToKeep()
/*** Passive thread ***/
// Runs when contacted by some peer
receive buf_recv from p
myItem = (myAddress, timeNow, myFileList)
buf_send = selectItemsToSend()
send buf_send to p
view = selectItemsToKeep()
Figure 2. Epidemic protocol skeleton
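To make the skeleton of Figure 2 concrete, the following sketch runs both thread bodies inside a single process, with a direct method call standing in for the network send and receive. The class name `GossipNode`, the tuple layout of items, and in particular the three placeholder policies are assumptions made for illustration; they are not the VICINITY or CYCLON policies.

```python
import random


class GossipNode:
    """In-process stand-in for one peer; a real peer would exchange messages over the network."""

    def __init__(self, address, file_list, view_size, gossip_length, clock):
        self.address = address
        self.file_list = frozenset(file_list)
        self.view_size = view_size
        self.gossip_length = gossip_length  # assumed to be at least 1
        self.clock = clock  # callable returning the current time
        self.view = {}      # address -> item, where item = (address, created, file_list)

    def my_item(self):
        return (self.address, self.clock(), self.file_list)

    # --- The three hooks, filled with placeholder policies (assumptions) ---
    def select_peer(self):
        # Contact the node whose item carries the oldest timestamp.
        return min(self.view.values(), key=lambda item: item[1])[0]

    def select_items_to_send(self, my_item, peer):
        # Own item plus random other items, never the peer's own item.
        others = [item for address, item in self.view.items() if address != peer]
        k = min(self.gossip_length - 1, len(others))
        return [my_item] + random.sample(others, k)

    def select_items_to_keep(self, received):
        # Merge, keep the newest item per node, then the freshest view_size items.
        merged = dict(self.view)
        for item in received:
            address, created, _ = item
            if address == self.address:
                continue
            if address not in merged or created > merged[address][1]:
                merged[address] = item
        freshest = sorted(merged.values(), key=lambda item: item[1], reverse=True)
        return {item[0]: item for item in freshest[: self.view_size]}

    # --- Active thread body: runs once every T time units ---
    def active_step(self, network):
        q = self.select_peer()
        buf_send = self.select_items_to_send(self.my_item(), q)
        buf_recv = network[q].passive_step(buf_send, self.address)  # send, then receive
        self.view = self.select_items_to_keep(buf_recv)

    # --- Passive thread body: runs when contacted by peer p ---
    def passive_step(self, buf_recv, p):
        buf_send = self.select_items_to_send(self.my_item(), p)
        self.view = self.select_items_to_keep(buf_recv)
        return buf_send
```

Swapping in different implementations of the three hook methods, while leaving `active_step` and `passive_step` untouched, is what yields different protocols from the same skeleton.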
For VICINITY, we chose the policies shown in Figure 3(a). We note that the RANDOM protocol resembles T-Man [6]. The only difference is that in T-Man peers exchange their whole views, instead of just a subset of them. As we discuss below, AGGRESSIVELY BIASED will turn out to be an excellent choice for forming semantic clusters. Note that selectItemsToKeep() takes into account CYCLON’s cache too in selecting the best $c_v$ items to keep. This is the default link between the two layers.
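The two VICINITY hooks that rank items by semantic proximity can be sketched as below, following the descriptions in Figure 3(a). Items are modeled as `(address, created, file_list)` tuples and proximity as file-list overlap; excluding the node's own item and the selected peer's item from the candidates is an assumption of this sketch, not something the figure states.

```python
def overlap(files_a, files_b):
    """Proximity placeholder: number of files present in both lists."""
    return len(files_a & files_b)


def newest_per_node(items, exclude=frozenset()):
    """Keep the most recent item per node, skipping addresses in `exclude`."""
    newest = {}
    for item in items:
        address, created, _ = item
        if address in exclude:
            continue
        if address not in newest or created > newest[address][1]:
            newest[address] = item
    return list(newest.values())


def vicinity_select_items_to_keep(own_address, own_files, vicinity_view,
                                  received, cyclon_view, c_v, proximity=overlap):
    """Keep the c_v semantically closest items out of the current VICINITY view,
    the received items, and the local CYCLON view."""
    candidates = newest_per_node(
        [*vicinity_view, *received, *cyclon_view], exclude={own_address})
    candidates.sort(key=lambda item: proximity(own_files, item[2]), reverse=True)
    return candidates[:c_v]


def aggressively_biased_select_items_to_send(peer_item, vicinity_view,
                                             cyclon_view, g_v, proximity=overlap):
    """Send the g_v items semantically closest to the selected peer, drawn from
    both the VICINITY view and the CYCLON view."""
    candidates = newest_per_node(
        [*vicinity_view, *cyclon_view], exclude={peer_item[0]})
    candidates.sort(key=lambda item: proximity(peer_item[2], item[2]), reverse=True)
    return candidates[:g_v]
```

Feeding `cyclon_view` into both functions is what lets randomly sampled nodes from the bottom layer enter the semantic view whenever they turn out to be closer than the current entries.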
For CYCLON, we made the choices shown in Figure 3(b).
CYCLON is a protocol we previously developed, and which
is extensively described and analyzed in [11].
Effectively, what selectItemsToSend() and se-
lectItemsToKeep() establish is an exchange of some
neighbors between the caches of the two communicating
peers. In addition to that, the selected peer’s item in the ini-
tiator’s cache is always removed, but the initiator’s (new)
item is always placed in the selected peer’s cache.
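The exchange just described can be sketched as a single function operating on in-memory caches. This is a simplified reading of the policies in Figure 3(b), not the reference CYCLON implementation described in [11]: caches map a neighbor's address to the creation time of its item (file lists are omitted), and the function and parameter names are assumptions.

```python
import random


def _merge(cache, received, sent, self_address, cache_size):
    """Insert received items, making room by dropping items that were just sent."""
    for address, created in received.items():
        if address == self_address:
            continue  # never keep an item pointing at ourselves
        if address in cache:
            cache[address] = max(cache[address], created)  # keep the newest item
            continue
        if len(cache) >= cache_size:
            victim = next((a for a in sent if a in cache), None)
            if victim is None:
                continue  # no room and nothing left to replace
            del cache[victim]
        cache[address] = created


def cyclon_exchange(initiator, caches, now, g_c, cache_size, rng=random):
    """Perform one gossip exchange started by `initiator`; return the selected peer."""
    cache_p = caches[initiator]
    q = min(cache_p, key=cache_p.get)  # item with the oldest timestamp
    cache_q = caches[q]

    # Active side: the selected peer's item is removed from the initiator's cache;
    # the initiator sends a fresh item of itself plus g_c - 1 random others.
    del cache_p[q]
    sent_p = rng.sample(sorted(cache_p), min(g_c - 1, len(cache_p)))
    buf_p = {initiator: now, **{a: cache_p[a] for a in sent_p}}

    # Passive side: reply with g_c randomly selected items.
    sent_q = rng.sample(sorted(cache_q), min(g_c, len(cache_q)))
    buf_q = {a: cache_q[a] for a in sent_q}

    _merge(cache_p, buf_q, sent_p, initiator, cache_size)
    _merge(cache_q, buf_p, sent_q, q, cache_size)
    return q
```

After the call, the link from the initiator to the selected peer has effectively been reversed: the initiator no longer holds the peer's item, while the peer holds a fresh item of the initiator.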
CYCLON creates an overlay with completely random,
uncorrelated links between nodes, such that the in-degree
(number of incoming links) is practically the same for
each node. Importantly, it can achieve this property fairly
quickly even when a small number of items (such as 3
or 4) is exchanged in each communication, even for large
caches of several dozens of items. Therefore, it is ideal as
a lightweight service that can offer a node a randomly se-
lected peer from the current set of nodes.
3. Experimental Environment and Settings
All experiments presented here have been carried out
with PeerSim [2], an open source simulator in Java for P2P
protocols, developed at the University of Bologna.
To evaluate our protocol, we used real-world traces from the eDonkey file-sharing system [1], collected by Le Fessant et al. in November 2003 [4]. A set of 12,000 peers distributed world-wide, along with the files each one shares, is logged in these traces. A total of 923,000 unique files are collectively shared by these peers.
In order to simplify the analysis of our system’s emer-
gent behavior, we determined equal gossiping periods for
both layers. More specifically, once every T time units each
node initiates first a gossip exchange with respect to its bot-
tom (CYCLON) layer, immediately followed by a gossip ex-
change at its top (VICINITY) layer. Note that even though
nodes initiate gossiping at universally fixed intervals, they
are not synchronized with each other.
Even though both protocols are asynchronous, it is con-
venient to introduce the notion of cycles in order to study
their evolutionary behavior with respect to time. We define
a cycle to be the time period during which each node has ini-
tiated gossiping exactly once. Since each node initiates gos-
siping periodically, once every T time units, a cycle is equal
to T time units.
A number of parameters had to be set for these experi-
ments, listed here.
Proximity Function S We chose a rather simple, yet intuitive proximity function to test our protocol with. The proximity S between two nodes P and Q, with file lists $F_P$ and $F_Q$ respectively, is defined as the number of files that lie in both lists. More formally: $S(F_P, F_Q) = |F_P \cap F_Q|$. As stated in Section 2.1, the semantically closer two nodes are, the higher the value of S is.
Semantic view size $\ell$ In all experiments the semantic view consisted of the 10 semantically closest peers in the VICINITY cache. As shown in [5], a semantic view size of $\ell = 10$ provides a good tradeoff between the number of nodes contacted in the semantic search phase and the expected semantic hit ratio.
Cache size For the cache size selection, we are faced with the following tradeoff for both protocols. A large cache size provides higher chances of making better item selections, and therefore accelerates the construction of (near-)optimal semantic views. On the other hand, the larger the cache size, the longer it takes to contact all peers in it, resulting in the existence of older (and therefore more likely to be invalid) links. Of course, a larger cache also takes up more memory, although this is generally not a significant constraint nowadays.
Considering this tradeoff, and after an extensive set of experiments that cannot be presented here due to space limitations, we chose to present experiments with cache size 50 for each of the two layers.

(a)
| Hook | Variant | Description |
|---|---|---|
| selectPeer() | | Select peer from the item with the oldest timestamp |
| selectItemsToSend() | RANDOM | Randomly select $g_v$ items |
| | BIASED | Select the $g_v$ items of nodes semantically closest to the selected peer |
| | AGGRESSIVELY BIASED | Select the $g_v$ items of nodes semantically closest to the selected peer from the VICINITY view and the CYCLON view |
| selectItemsToKeep() | | Keep the $c_v$ items of nodes that are semantically closest, out of items in its current view, items received, and items in the local CYCLON view. In case of multiple items from the same node, keep the one with the most recent timestamp. |

(b)
| Hook | Variant | Description |
|---|---|---|
| selectPeer() | | Select peer from the item with the oldest timestamp |
| selectItemsToSend() | active thread | Select own item and randomly $g_c - 1$ others from the CYCLON view |
| | passive thread | Randomly select $g_c$ items from the CYCLON view |
| selectItemsToKeep() | | Keep all $g_c$ received items, replacing the $g_c$ selected ones to send. In case of multiple items from the same node, keep the one with the most recent timestamp. |

Figure 3. The chosen policies for (a) the VICINITY protocol and (b) the CYCLON protocol.
Gossip length The gossip length, that is, the number of items gossiped per gossip exchange per protocol, is a crucial factor for the amount of bandwidth used. This becomes of greater consequence considering that an item carries the file list of its respective node. So, even though exchanging more items per gossip exchange allows information to disseminate faster, we are inclined to keep the gossip lengths as low as possible, as long as the system’s performance is reasonable.
The gossip length we selected for each of the two layers is 3, an admittedly low value.
Gossip period T The gossip period is a parameter that does not affect the protocol’s behavior. The protocol evolves as a function of the number of messages exchanged, or, consequently, of the number of cycles elapsed. The gossip period only affects how fast the protocol’s evolution will take place in time. The single constraint is that the gossip period T should be adequately longer than the worst latency throughout the network, so that gossip exchanges are not favored or hindered due to latency heterogeneity. A typical gossip period for our protocol would be 1 minute, even though this does not affect the following analysis.
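The parameter choices listed above can be collected in a few lines. The sketch below encodes the overlap-based proximity function and the stated settings; the class and field names are illustrative assumptions, and the 60-second period reflects the "typical" value mentioned in the text rather than a value used in the simulations.

```python
from dataclasses import dataclass
from typing import Iterable


def proximity(files_p: Iterable[str], files_q: Iterable[str]) -> int:
    """S(F_P, F_Q) = |F_P intersect F_Q|: the number of files present in both lists."""
    return len(set(files_p) & set(files_q))


@dataclass(frozen=True)
class ExperimentSettings:
    semantic_view_size: int = 10     # ell: closest peers taken from the VICINITY cache
    cache_size: int = 50             # view size of each of the two layers
    gossip_length: int = 3           # items per exchange, for each of the two layers
    gossip_period_s: float = 60.0    # T; affects wall-clock speed only
```

Note that `proximity` is symmetric and is bounded by the size of the smaller file list, so peers sharing very few files can never score highly against anyone.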
4. Performance Evaluation
4.1. Convergence
In our first experiment, we evaluate the convergence speed of our algorithm by considering how quickly it finds nodes having files in common. The proximity function’s objective is for each node to discover the $\ell$ peers that have the most common files with it. Therefore, a good metric of the progress towards this goal is the average number of common files between a node and each one of its semantic neighbors. From our traces, we measured that in the optimal organization, this metric has a value of 3.87.
Figure 4. Convergence of sem. views’ quality. (Plot of the average number of common files per semantic neighbor, from 0 to 4, against cycles, from 0 to 300, with curves for Random Vicinity, Random Vicinity + Cyclon, Biased Vicinity + Cyclon, Aggr. Biased Vicinity + Cyclon, and Optimal semantic lists.)
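The convergence metric can be computed from a snapshot of the simulation as sketched below. The function name is hypothetical, and averaging over all (node, semantic neighbor) pairs is an assumption; the text does not specify whether the average is taken per pair or per node first.

```python
from typing import Dict, List, Set


def avg_common_files(
    file_lists: Dict[str, Set[str]],
    semantic_views: Dict[str, List[str]],
) -> float:
    """Average number of common files between a node and each of its semantic
    neighbors, taken over all (node, neighbor) pairs in the snapshot."""
    total = 0
    pairs = 0
    for node, neighbors in semantic_views.items():
        for neighbor in neighbors:
            total += len(file_lists[node] & file_lists[neighbor])
            pairs += 1
    return total / pairs if pairs else 0.0
```

Evaluating this function once per cycle yields a convergence curve, and evaluating it on views built with global knowledge yields the optimal reference value.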
Figure 4 shows this metric as a function of the cycle for
four distinct experiments. For a fair comparison,
the cache size and gossip length are 50 and 3, respectively,
in each layer, for all experiments. The only exception is the
first experiment, which has a single layer. In this case, the
cache size and gossip length are 100 and 6, respectively. All
experiments start with each node knowing 5 random other
ones, simply to ensure initial connectivity in a single con-

References
Journal ArticleDOI

Statistical mechanics of complex networks

TL;DR: In this paper, a simple model based on the power-law degree distribution of real networks was proposed, which was able to reproduce the power law degree distribution in real networks and to capture the evolution of networks, not just their static topology.
Book

Small Worlds: The Dynamics of Networks between Order and Randomness

TL;DR: Duncan Watts uses the small-world phenomenon--colloquially called "six degrees of separation"--as a prelude to a more general exploration: under what conditions can a small world arise in any kind of network?
Journal ArticleDOI

Small Worlds: The Dynamics of Networks between Order and Randomness

TL;DR: The dynamics of networks between order and randomness, characteristics of small world networks, and the structure and dynamic of networks mark newman.
Proceedings ArticleDOI

Efficient content location using interest-based locality in peer-to-peer systems

TL;DR: This work proposes a content location solution in which peers loosely organize themselves into an interest- based structure on top of the existing Gnutella network, and demonstrates the existence of interest-based locality in five diverse traces of content distribution applications, two of which are traces of popular peer-to-peer file-sharing applications.