Remember where you came from: on the second-order random walk based proximity measures

Yubao Wu, +2 more
- Vol. 10, Iss: 1, pp 13-24
Abstract
Measuring the proximity between different nodes is a fundamental problem in graph analysis. Random walk based proximity measures have been shown to be effective and widely used. Most existing random walk measures are based on the first-order Markov model, i.e., they assume that the next step of the random surfer only depends on the current node. However, this assumption neither holds in many real-life applications nor captures the clustering structure in the graph. To address the limitation of the existing first-order measures, in this paper, we study the second-order random walk measures, which take the previously visited node into consideration. While the existing first-order measures are built on node-to-node transition probabilities, in the second-order random walk, we need to consider the edge-to-edge transition probabilities. Using incidence matrices, we develop simple and elegant matrix representations for the second-order proximity measures. A desirable property of the developed measures is that they degenerate to their original first-order forms when the effect of the previous step is zero. We further develop Monte Carlo methods to efficiently compute the second-order measures and provide theoretical performance guarantees. Experimental results show that in a variety of applications, the second-order measures can dramatically improve the performance compared to their first-order counterparts.


Yubao Wu, Yuchen Bian, Xiang Zhang
Department of Electrical Engineering and Computer Science, Case Western Reserve University
{yubao.wu, yuchen.bian, xiang.zhang}@case.edu
1. INTRODUCTION
A fundamental problem in graph analysis is to measure the proximity (or closeness) between different nodes. It serves as the basis of many advanced tasks such as ranking and querying [22, 25, 11, 27], community detection [2, 26], link prediction [21, 19], and graph-based semi-supervised learning [29, 28].
Designing effective proximity measures is a challenging task. The simplest notion of proximity is based on the shortest path or the network flow between two nodes [6]. Random walk based measures have recently been shown to be effective and are widely used in various applications. The basic idea is to allow a surfer to randomly explore the graph. The probabilities of the nodes being visited by the random surfer are used to measure the importance of the nodes or the similarity between different nodes. The most commonly used random walk based proximity measures include PageRank [22], random walk with restart [25], and SimRank [11].

Figure 1: An example of the web domain graph (left community: smart phone; right community: flight)

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org.
Proceedings of the VLDB Endowment, Vol. 10, No. 1
Copyright 2016 VLDB Endowment 2150-8097/16/09.
Most existing random walk measures are based on the first-order Markov model [15], i.e., they assume the next node to be visited depends only on the current node and is independent of the previous step. However, this assumption does not hold in many real-life applications. For example, consider clickstream data, which records the sequences of web domains visited by individual users [3]. The existing first-order random walk measures assume that the next page a user will visit depends only on the current page and is independent of the previous page the user has visited. This is clearly not true.
Figure 1 shows a subgraph of the real-life web domain graph¹ [17]. Each node in the graph represents a domain, and two domains share an edge if there are hyperlinks between them. The domains in the graph form two communities. The domains in the left community are about smart phones, and those in the right community are about flights. Suppose the random surfer is currently on google.com and the previously visited node is apple.com, i.e., the surfer came from the smart phone community. The existing first-order random walk measures do not consider where the surfer came from, and the transition probability depends only on the edges incident to the current node. Based on this assumption and the graph topology, in the next step, the probabilities to visit att.com and delta.com are $2.4\times10^{-5}$ and $3.1\times10^{-5}$ respectively. That is, the surfer has a higher probability to visit a domain about flights even though she just visited a smart phone domain. However, using the real-life clickstream data (collected from comScore Inc.), given that the previous node is apple.com, the probabilities to visit att.com and delta.com are $8.5\times10^{-4}$ and $3.7\times10^{-6}$ respectively. That is, the probability to visit a smart phone domain is more than 200 times higher than the probability to visit a flight domain.

¹ The entire graph is publicly available at http://webdatacommons.org

Figure 2: An example of the Twitter follower network (left community: basketball; right community: football)
Figure 3: An example of the research collaboration network
As another example, consider the Twitter follower network. Figure 2 shows a subgraph of the real Twitter follower network. The users form two communities, the basketball community on the left and the football community on the right. If the NBA player LeBron James (@KingJames) posts a tweet, it is likely to be propagated among the users in the basketball community instead of the football community. Similarly, if the Pittsburgh Steelers (@steelers) post a tweet, it is likely to be propagated in the football community. The real tweet cascade data supports this intuition: given that the tweet is from @KingJames, the transition probabilities from @espn to @NBA and from @espn to @NFL are $5.6\times10^{-2}$ and $2.1\times10^{-4}$ respectively. However, using the first-order random walk, the transition probabilities are both $4.5\times10^{-3}$. That is, the probabilities to go to both communities are similar.
In the examples above, the visiting sequences are recorded in the network flow data. When such data is not available, it is still important to know where the surfer came from. Consider the local community detection problem, whose goal is to find a community near a given query node [2, 26]. Using the DBLP data², Figure 3 shows three different research communities involving Prof. Jiawei Han at UIUC. The authors in the left community are senior researchers in the core data mining research areas. The authors in the upper right community have published many works on social media mining. The authors in the lower right community mostly collaborate on information retrieval. Suppose that the random surfer came from the left community, e.g., from Prof. Wei Wang, and is currently at node Prof. Jiawei Han. Intuitively, in the next step, the surfer should walk to a node in the left community, since the authors in this community are more similar to Prof. Wei Wang. However, using the first-order random walk model, the probabilities of the surfer walking into the three communities are similar.
To address the limitation of the existing first-order random walk based proximity measures, in this paper, we investigate the second-order random walk measures, which take the previously visited node into consideration. We systematically study the theoretical foundations of the second-order measures. Specifically, the existing first-order measures are all built on the node-to-node transition probabilities, which can be defined using the adjacency matrix of the graph. To take the previous step into consideration, in the second-order random walk, we need to consider edge-to-edge transition probabilities. We show that such probabilities can be conveniently represented by incidence matrices of the graph [15]. Based on these mathematical tools, we develop simple and elegant matrix representations for the second-order measures including PageRank [22], random walk with restart [25], SimRank [11], and SimRank* [27], which are among the most widely used proximity measures. A desirable property of the developed second-order measures is that they degenerate to their original first-order forms when the effect of the previous step is zero. Furthermore, to efficiently compute the second-order measures, we design Monte Carlo algorithms, which effectively simulate paths of the random surfer and estimate proximity values. We formally prove that the estimated proximity value is sharply concentrated around the exact value and converges to the exact value when the sample size is large. We perform extensive experiments to evaluate the effectiveness of the developed second-order measures and the efficiency of the Monte Carlo algorithms using both real and synthetic networks.

² The data is publicly available at http://dblp.uni-trier.de/xml/
2. RELATED WORK
In the first-order random walk, a random surfer explores the graph according to the node-to-node transition probabilities determined by the graph topology. If the random walk on the graph is irreducible and aperiodic, there is a stationary probability for visiting each node [15]. Various random walk based proximity measures have been developed, among which PageRank [22], random walk with restart [25], SimRank [11], and SimRank* [27] have gained significant popularity and been extensively studied. In PageRank, in addition to following the transition probability, at each time point, the surfer also has a constant probability to jump to any node in the graph. Random walk with restart is the query biased version of PageRank: at each time point, the surfer has a constant probability to jump to the query node. SimRank is based on the intuition that two nodes are similar if their neighbors are similar. The SimRank value between two nodes measures the expected number of steps required before two surfers, one starting from each node, meet at the same node if they walk in lock-step. SimRank* is a variant of SimRank, which allows the two surfers not to walk in lock-step.
Very limited work has been done on the second-order random walk measure. In [24], the authors study memory-based PageRank, which considers the previously visited node. However, the developed measure does not degenerate to the original PageRank when the effect from the previous node is zero. Along the same line, multilinear PageRank [9] also tries to generalize PageRank to the second-order. It approximates the probability of visiting an edge by the product of the probabilities of visiting its two end nodes. This may not be reasonable, e.g., the probability of visiting a nonexistent edge would be non-zero. Both methods are specifically designed for PageRank and do not apply to other measures.
3. THE SECOND-ORDER RANDOM WALK
In this section, we study the foundation of the second-order random walk. The first-order random walk is based on node-to-node transition probabilities. In the second-order random walk, we need to consider edge-to-edge transition probabilities. The main symbols used in this paper and their definitions are listed in Table 1.

Table 1: Main symbols

| symbols | definitions |
|---|---|
| $G(V,E)$ | directed graph $G$ with node set $V$ and edge set $E$ |
| $I_i$, $O_i$ | set of in-/out-neighbor nodes of node $i$ |
| $n$, $m$, $\sigma$ | number of nodes; number of edges; $\sigma = \sum_{i\in V} \lvert I_i\rvert \cdot \lvert O_i\rvert$ |
| $B$ | $n\times m$ incidence matrix, $[B]_{i,u} = 1$: $u$ is an out-edge of $i$ |
| $E$ | $m\times n$ incidence matrix, $[E]_{u,i} = 1$: $u$ is an in-edge of $i$ |
| $w_{i,j}$, $w_i$ | weight of edge $(i,j)$; out-degree of $i$: $w_i = \sum_{j\in O_i} w_{i,j}$ |
| $W$ | $m\times m$ diagonal matrix, $[W]_{u,u} = w_{i,j}$ if edge $u = (i,j)$ |
| $D$ | $n\times n$ diagonal matrix, $[D]_{i,i} = w_i$ |
| $p_{i,j}$ | transition probability from node $i$ to $j$ |
| $p_{i,j,k}$ | transition prob. from $j$ to $k$ if the surfer came from $i$ |
| $p_{u,v}$ | transition prob. from edge $u$ to $v$, $p_{(i,j),(j,k)} = p_{i,j,k}$ |
| $P$ | $n\times n$ node-to-node transition matrix, $[P]_{i,j} = p_{i,j}$ |
| $H$ | $n\times m$ node-to-edge transition matrix, $[H]_{i,(i,j)} = p_{i,j}$ |
| $M$ | $m\times m$ edge-to-edge transition matrix, $[M]_{u,v} = p_{u,v}$ |
| $r_{i,j}$, $r_i$ | $r_{i,j}$: proximity value of node $i$ w.r.t. node $j$; $r_i = r_{i,q}$ |
| $\mathbf{r}$, $R$ | $\mathbf{r}$: $n\times 1$ vector, $\mathbf{r}_i = r_i$; $R$: $n\times n$ matrix, $[R]_{i,j} = r_{i,j}$ |
| $s_u$, $s_{(i,j)}$ | proximity value of edge $u$ or $(i,j)$ w.r.t. query node $q$ |
| $s_{u,v}$ | proximity value between edges $u$ and $v$ |
| $\mathbf{s}$, $S$ | $\mathbf{s}$: $m\times 1$ vector, $\mathbf{s}_u = s_u$; $S$: $m\times m$ matrix, $[S]_{u,v} = s_{u,v}$ |

Figure 4: An example graph and its line graph: (a) a toy graph; (b) the line graph
3.1 The Edge-to-Edge Transition Probability
Consider the first-order random walk, where a surfer walks from node $i$ to $j$ with probability $p_{i,j}$. Let $X_t$ be a random variable representing the node visited by the surfer at time point $t$. The node-to-node transition probability $p_{i,j}$ can be represented as a conditional probability $\mathbb{P}[X_t = j \mid X_{t-1} = i]$. Let $r^t_j = \mathbb{P}[X_t = j]$ represent the probability of the surfer visiting node $j$ at time $t$. We have

$$ r^{t}_{j} = \sum_{i\in I_j} p_{i,j}\cdot r^{t-1}_{i}, $$

where $I_j$ is the set of in-neighbors of $j$.
Now consider the second-order random walk. We need to consider where the surfer came from, i.e., the node visited before the current node. We use $p_{i,j,k}$ to represent the transition probability from node $j$ to $k$ given that the previous step was from node $i$ to $j$, i.e.,

$$ p_{i,j,k} = \mathbb{P}[X_{t+1} = k \mid X_{t-1} = i, X_t = j] = \mathbb{P}[X_t = j, X_{t+1} = k \mid X_{t-1} = i, X_t = j]. $$

Let $Y_t = (i,j)$ represent the joint event $(X_{t-1} = i, X_t = j)$, i.e., the surfer is at node $i$ at time $(t-1)$ and at node $j$ at time $t$. Then, the second-order transition probability can be written as $p_{i,j,k} = \mathbb{P}[Y_{t+1} = (j,k) \mid Y_t = (i,j)]$, which can be interpreted as the transition probability between edges: let $u = (i,j)$ be the edge from node $i$ to $j$, and $v = (j,k)$ be the edge from node $j$ to $k$; we can then rewrite $p_{i,j,k}$ as $p_{u,v}$.
Probability $p_{i,j,k}$ can be treated as the node-to-node transition probability in the line graph of the original graph. For example, Figures 4(a) and 4(b) show an example graph and its line graph. The second-order transition probability $p_{4,2,1}$ in Figure 4(a) is the same as the first-order transition probability $p_{e,b}$ in Figure 4(b).
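The line-graph view can be made concrete with a minimal sketch. The edge list below is an assumption: it is read off the incidence matrices given for Figure 4(a) in Section 3.2, since the figure itself is not reproduced here, and the helper name `line_graph` is hypothetical.

```python
# Edges of the toy graph, labelled a-f with 1-based node ids
# (assumed from the incidence matrices of Section 3.2).
edges = {"a": (1, 4), "b": (2, 1), "c": (3, 1),
         "d": (3, 5), "e": (4, 2), "f": (5, 3)}

def line_graph(edges):
    """Return the directed line graph as a set of (u, v) edge-label pairs.

    Edge u = (i, j) points to edge v = (j2, k) whenever j == j2, i.e.
    whenever v leaves the node that u enters.
    """
    return {
        (u, v)
        for u, (_, head) in edges.items()
        for v, (tail, _) in edges.items()
        if head == tail
    }

L = line_graph(edges)
# The second-order step 4 -> 2 -> 1 is the line-graph step e -> b.
print(("e", "b") in L)
```

Each pair in `L` is one candidate edge-to-edge transition, so the second-order probabilities $p_{u,v}$ only need to be defined on these pairs.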
Let $s^{t}_{(i,j)} = \mathbb{P}[Y_t = (i,j)]$ denote the probability of visiting edge $(i,j)$ between time $(t-1)$ and $t$. We have that

$$ s^{t+1}_{(j,k)} = \sum_{i\in I_j} p_{i,j,k}\cdot s^{t}_{(i,j)}. $$

In the following, we introduce incidence matrices, which will be used as building blocks in the second-order random walk measures.
3.2 Incidence Matrices as the Basic Tool
A graph can be represented by its adjacency matrix $A$, whose element $[A]_{i,j}$ represents the weight of edge $(i,j)$. Let $D$ denote the diagonal matrix with $[D]_{i,i}$ being the out-degree of node $i$. In the first-order random walk, the node-to-node transition matrix can be represented as $P = D^{-1}A$. In the second-order random walk, instead of using the adjacency matrix, we will use incidence matrices [15].
The incidence matrices $B$ and $E$ represent the out-edges and in-edges of the nodes respectively. In matrix $B$, each row represents a node and each column represents an edge. In matrix $E$, each row represents an edge and each column represents a node. The elements in matrices $B$ and $E$ are defined as follows.

$$ [B]_{i,u} = \begin{cases} 1, & \text{if edge } u \text{ is an out-edge of node } i, \\ 0, & \text{otherwise.} \end{cases} $$

$$ [E]_{u,i} = \begin{cases} 1, & \text{if edge } u \text{ is an in-edge of node } i, \\ 0, & \text{otherwise.} \end{cases} $$
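For example, the incidence matrices of the graph in Figure 4(a) are

$$
B = \begin{array}{c|cccccc}
  & a & b & c & d & e & f \\ \hline
1 & 1 & 0 & 0 & 0 & 0 & 0 \\
2 & 0 & 1 & 0 & 0 & 0 & 0 \\
3 & 0 & 0 & 1 & 1 & 0 & 0 \\
4 & 0 & 0 & 0 & 0 & 1 & 0 \\
5 & 0 & 0 & 0 & 0 & 0 & 1
\end{array}
\quad \text{and} \quad
E^{\top} = \begin{array}{c|cccccc}
  & a & b & c & d & e & f \\ \hline
1 & 0 & 1 & 1 & 0 & 0 & 0 \\
2 & 0 & 0 & 0 & 0 & 1 & 0 \\
3 & 0 & 0 & 0 & 0 & 0 & 1 \\
4 & 1 & 0 & 0 & 0 & 0 & 0 \\
5 & 0 & 0 & 0 & 1 & 0 & 0
\end{array}
$$

Note that in the above definitions, the orders of nodes and edges are consistent in $B$ and $E$.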
The incidence matrices can be conveniently used to reconstruct other commonly used matrices in graph analytics. For example, let $W$ be a diagonal matrix with $[W]_{u,u}$ being the weight of edge $u$; then the adjacency matrix can be represented by the incidence matrices as $A = BWE$. The out-degree matrix $D$ can be represented as $D = BWB^{\top}$.
Let $H$ denote the node-to-edge transition probability matrix, with $[H]_{i,u}$ representing the probability that the surfer will go through an out-edge $u$ of node $i$, i.e.,

$$ [H]_{i,u} = \begin{cases} w_u / w_i, & \text{if edge } u \text{ is an out-edge of node } i, \\ 0, & \text{otherwise,} \end{cases} $$

where $w_u$ is the weight of edge $u$, and $w_i$ is the out-degree of node $i$. $H$ can be represented using incidence matrices as $H = D^{-1}BW$. The node-to-node transition probability matrix can then be represented as $P = HE$.
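The identities of this subsection can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: the edge list is read off the example matrices above (nodes renumbered from 0), and unit edge weights are an assumption made only for illustration.

```python
import numpy as np

# Toy graph of Figure 4(a): a: 1->4, b: 2->1, c: 3->1, d: 3->5, e: 4->2, f: 5->3,
# written with 0-based node ids.
edges = [(0, 3), (1, 0), (2, 0), (2, 4), (3, 1), (4, 2)]
n, m = 5, len(edges)

B = np.zeros((n, m))  # [B]_{i,u} = 1 if u is an out-edge of i
E = np.zeros((m, n))  # [E]_{u,i} = 1 if u is an in-edge of i
for u, (src, dst) in enumerate(edges):
    B[src, u] = 1.0
    E[u, dst] = 1.0

W = np.eye(m)                  # unit edge weights (assumption)
A = B @ W @ E                  # adjacency matrix, A = BWE
D = B @ W @ B.T                # diagonal out-degree matrix, D = BWB^T
H = np.linalg.inv(D) @ B @ W   # node-to-edge transition matrix, H = D^{-1}BW
P = H @ E                      # node-to-node transition matrix, P = HE

print(np.diag(D))              # out-degrees of the five nodes
print(np.allclose(P, np.linalg.inv(D) @ A))
```

Inverting $D$ requires every node to have at least one out-edge, which holds for this toy graph.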
3.3 Obtaining Edge Transition Probabilities
In the first-order random walk, the element $p_{i,j}$ in the node-to-node transition matrix $P$ is calculated as $p_{i,j} = w_{i,j}/w_i$, where $w_{i,j}$ and $w_i$ are the weight of edge $(i,j)$ and the out-degree of node $i$ respectively.
In the second-order random walk, we use $M$ to represent the edge-to-edge transition matrix, with element $p_{u,v} = p_{i,j,k}$, where $u = (i,j)$ and $v = (j,k)$. Next, we discuss two different ways to obtain the edge-to-edge transition probability.

Utilizing Network Flow Data: In many applications, the information on the node visiting sequences is available. For example, as discussed in Section 1, we may know the sequences of web domains browsed by different users, or we may have the tweet cascade information. In this case, we can break each sequence into trigrams, i.e., segments consisting of two consecutive edges [24]. For example, sequence $i \to j \to k \to l$ can be broken into two trigrams, $i \to j \to k$ and $j \to k \to l$.
To obtain the second-order transition probability, recall that $p_{i,j,k}$ is the conditional probability of visiting edge $(j,k)$ given that the previously visited edge is $(i,j)$. Let $\gamma_{i,j,k}$ be the number of trigrams $i \to j \to k$. Then $p_{i,j,k}$ can be calculated as

$$ p_{i,j,k} = \frac{\gamma_{i,j,k}}{\sum_{l\in O_j} \gamma_{i,j,l}}, $$

where $O_j$ is the set of out-neighbor nodes of $j$. That is, $p_{i,j,k}$ is the proportion of $i \to j \to k$ trigrams in all trigrams with $(i,j)$ being the first edge.
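The trigram estimate can be sketched in a few lines of Python. The function name `second_order_from_sequences` and the clickstream sequences are hypothetical; they are made up for illustration and are not the comScore data.

```python
from collections import Counter

def second_order_from_sequences(sequences):
    """Estimate p_{i,j,k} as the share of i->j->k among trigrams starting with (i, j)."""
    gamma = Counter()                      # gamma[(i, j, k)]: trigram counts
    for seq in sequences:
        for i, j, k in zip(seq, seq[1:], seq[2:]):
            gamma[(i, j, k)] += 1
    totals = Counter()                     # totals[(i, j)]: sum over l of gamma[(i, j, l)]
    for (i, j, _k), count in gamma.items():
        totals[(i, j)] += count
    return {(i, j, k): count / totals[(i, j)]
            for (i, j, k), count in gamma.items()}

# Hypothetical browsing sequences.
sequences = [
    ["apple.com", "google.com", "att.com"],
    ["apple.com", "google.com", "att.com", "google.com", "apple.com"],
    ["apple.com", "google.com", "delta.com"],
    ["delta.com", "google.com", "delta.com"],
]
p = second_order_from_sequences(sequences)
print(p[("apple.com", "google.com", "att.com")])
```

Only trigrams that actually occur receive a probability; triples that never appear in the data are implicitly zero.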
When the network flow data is not available, we can use the following approach to obtain $p_{i,j,k}$.
Autoregressive Model: By taking the previous step into consideration, the autoregressive model calculates the second-order transition probability as follows [23]

$$ p_{i,j,k} = \frac{\tilde{p}_{i,j,k}}{\sum_{l\in O_j} \tilde{p}_{i,j,l}}, $$

where $\tilde{p}_{i,j,k} = (1-\alpha)\,p_{j,k} + \alpha\,p_{i,k}$. The parameter $\alpha$ ($0 \le \alpha < 1$) is a constant to control the strength of the effect from the previous step. If $\alpha = 0$, the second-order transition probability degenerates to the first-order transition probability, i.e., $p_{i,j,k} = p_{j,k}$.

The edge-to-edge transition matrix $M$ based on the autoregressive model can be represented using incidence matrices. Let

$$ \tilde{M} = (1-\alpha)\,EH + \alpha\,(EB)\circ(B^{\top}PE^{\top}), $$

where $\circ$ denotes the Hadamard (entry-wise) product. Then $M$ is the row-normalized $\tilde{M}$ such that $\sum_{v} p_{u,v} = 1$. If $\alpha = 0$, it degenerates to the first-order form and we have $M = EH$.
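The autoregressive construction can be tried on a small graph. This is a sketch under stated assumptions: the four-node edge list is hypothetical (chosen so that some two-step paths $i \to j \to k$ also have a direct edge $i \to k$), edge weights default to 1, and the helper names are not from the paper.

```python
import numpy as np

def incidence(edges, n):
    """Build the out-edge matrix B (n x m) and in-edge matrix E (m x n)."""
    m = len(edges)
    B, E = np.zeros((n, m)), np.zeros((m, n))
    for u, (src, dst) in enumerate(edges):
        B[src, u] = 1.0
        E[u, dst] = 1.0
    return B, E

def autoregressive_M(edges, n, alpha, weights=None):
    """Row-normalized (1 - alpha) EH + alpha (EB) o (B^T P E^T)."""
    B, E = incidence(edges, n)
    w = np.ones(len(edges)) if weights is None else np.asarray(weights, float)
    W = np.diag(w)
    D = B @ W @ B.T
    H = np.linalg.inv(D) @ B @ W          # requires non-zero out-degrees
    P = H @ E
    M_unnorm = (1 - alpha) * (E @ H) + alpha * (E @ B) * (B.T @ P @ E.T)
    M = M_unnorm / M_unnorm.sum(axis=1, keepdims=True)
    return M, H, E

# Hypothetical graph: 0->1, 1->2, 1->3, 0->2, 2->0, 3->0.
edges = [(0, 1), (1, 2), (1, 3), (0, 2), (2, 0), (3, 0)]
M, H, E = autoregressive_M(edges, n=4, alpha=0.5)
print(M[0])   # transitions out of edge (0, 1)
```

In this example, a surfer who arrived at node 1 from node 0 prefers edge $(1,2)$ over $(1,3)$, because node 0 also links to node 2 directly; with `alpha=0.0` the two continuations are equally likely.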
Note that in addition to the two methods discussed above, other methods, such as calculating the edge similarity based on the line graph [20], can also be applied to calculate the edge-to-edge transition probability. In this paper, we only focus on the two methods discussed here.
3.4 Matrix Form
Next we represent the second-order random walk in its matrix form. Let $\mathbf{s}^{t}$ denote the edge visiting probability vector between time points $(t-1)$ and $t$, i.e., $\mathbf{s}^{t}_{u} = s^{t}_{(i,j)}$ for $u = (i,j)$. We have

$$ \mathbf{s}^{t+1} = M^{\top}\mathbf{s}^{t}. $$

If $M$ is primitive, $\mathbf{s}^{t}$ converges according to the Perron-Frobenius theorem [15]. Let $\mathbf{s} = \lim_{t\to\infty}\mathbf{s}^{t}$ denote the edge stationary probability. After having $\mathbf{s}$, the node stationary probability is simply the sum of all in-edge stationary probabilities, i.e., $\mathbf{r} = E^{\top}\mathbf{s}$.
In the following, we show how to generalize the commonly used proximity measures to their second-order forms. Table 2 summarizes the recursive equations of these measures in their first-order and second-order forms. In the table, RW, PR, RR, SR, and SS are shorthand notations for random walk, PageRank, random walk with restart, SimRank, and SimRank* respectively.
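Table 2: Recursive equations of various measures

| measure | first-order | second-order |
|---|---|---|
| RW | $\mathbf{r} = P^{\top}\mathbf{r}$ | $\mathbf{s} = M^{\top}\mathbf{s}$; $\mathbf{r} = E^{\top}\mathbf{s}$ |
| PR | $\mathbf{r} = cP^{\top}\mathbf{r} + (1-c)\mathbf{1}/n$ | $\mathbf{s} = cM^{\top}\mathbf{s} + (1-c)H^{\top}\mathbf{1}/n$; $\mathbf{r} = cE^{\top}\mathbf{s} + (1-c)\mathbf{1}/n$ |
| RR | $\mathbf{r} = cP^{\top}\mathbf{r} + (1-c)\mathbf{q}$ | $\mathbf{s} = cM^{\top}\mathbf{s} + (1-c)H^{\top}\mathbf{q}$; $\mathbf{r} = cE^{\top}\mathbf{s} + (1-c)\mathbf{q}$ |
| SR | $R = cPRP^{\top} + (1-c)I$ | $S = cMSM^{\top} + (1-c)EE^{\top}$; $R = cHSH^{\top} + (1-c)I$ |
| SS | $R = \frac{c}{2}(PR + RP^{\top}) + (1-c)I$ | $S = \frac{c}{2}(MS + SM^{\top}) + (1-c)EE^{\top}$; $\hat{R} = \frac{c}{2}M\hat{R} + \frac{c}{2}SH^{\top} + (1-c)E$; $R = \frac{c}{2}(H\hat{R} + \hat{R}^{\top}H^{\top}) + (1-c)I$ |

Figure 5: The jumping process in PageRank: (a) first-order; (b) second-order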
4. THE SECOND-ORDER PAGERANK
In the first-order PageRank, the surfer has a probability of $c$ to follow the node-to-node transition probabilities, and a probability of $(1-c)$ to randomly jump to any node in the graph. Figure 5(a) illustrates the jumping process.

The matrix form of the first-order PageRank is $\mathbf{r} = cP^{\top}\mathbf{r} + (1-c)\mathbf{1}/n$, where $\mathbf{r}$ is the node visiting probability vector, $P$ is the node-to-node transition matrix, and $\mathbf{1}$ is a vector of all 1's.
Similarly, in the second-order PageRank, the surfer has a probability of $c$ to follow the edge-to-edge transition probabilities, and a probability of $(1-c)$ to randomly jump to any node in the graph. Its matrix form can be written as $\mathbf{s}^{t+1} = cM^{\top}\mathbf{s}^{t} + (1-c)\mathbf{v}$, where $\mathbf{v}$ is the vector corresponding to the jumping process. To determine $\mathbf{v}$, we consider the jumping process in further detail.
Figure 5(b) shows the jumping process in the second-order PageRank. At time point $(t-1)$, starting from node $i$, with probability $c$, the surfer first visits node $j$ and then $k$ by following the second-order transition probability $p_{i,j,k}$, and with probability $(1-c)$, the surfer randomly jumps to any node $x$ first and then visits node $y$. After jumping, the effect of the previous step is lost, thus $p_{i,x,y} = p_{x,y}$, which is the first-order transition probability. Since the sum of probabilities to jump to node $x$ is $(1-c)/n$, the probability of visiting edge $(x,y)$ between time points $t$ and $(t+1)$ is $(1-c)\,p_{x,y}/n$. Thus we have $\mathbf{v} = H^{\top}\mathbf{1}/n$, where $H$ is the node-to-edge transition matrix introduced in Section 3.2. Finally, we have

$$ \mathbf{s}^{t+1} = cM^{\top}\mathbf{s}^{t} + (1-c)\,H^{\top}\mathbf{1}/n . $$
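The recursion can be iterated directly as a power method. The sketch below is illustrative only and is not the Monte Carlo algorithm developed in the paper: the function name, the tolerance, and the toy graph with unit weights are assumptions. It uses the first-order case $M = EH$ so that the result can be compared with ordinary PageRank.

```python
import numpy as np

def second_order_pagerank(M, H, E, c=0.85, tol=1e-12, max_iter=10_000):
    """Iterate s <- c M^T s + (1 - c) H^T 1/n, then map edges to nodes."""
    n = H.shape[0]
    jump = H.T @ np.ones(n) / n            # v = H^T 1 / n
    s = jump.copy()                        # any distribution over edges works
    for _ in range(max_iter):
        s_next = c * (M.T @ s) + (1 - c) * jump
        converged = np.abs(s_next - s).sum() < tol
        s = s_next
        if converged:
            break
    r = c * (E.T @ s) + (1 - c) / n        # node stationary distribution
    return s, r

# Toy graph of Figure 4(a) with 0-based node ids and unit weights (assumption).
edges = [(0, 3), (1, 0), (2, 0), (2, 4), (3, 1), (4, 2)]
n, m = 5, len(edges)
B, E = np.zeros((n, m)), np.zeros((m, n))
for u, (src, dst) in enumerate(edges):
    B[src, u] = 1.0
    E[u, dst] = 1.0
H = np.linalg.inv(B @ B.T) @ B             # H = D^{-1} B W with W = I
M = E @ H                                  # first-order case p_{i,j,k} = p_{j,k}

s, r = second_order_pagerank(M, H, E)
print(r)
```

Replacing `M` with an edge-to-edge matrix estimated from flow data or from the autoregressive model gives the genuinely second-order scores; nothing else in the loop changes.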
Theorem 1. In the second-order PageRank, if the out-degree of every node is non-zero, there is a unique edge stationary distribution, i.e., $\lim_{t\to\infty} s^{t}_{u} = s_{u}$, where $s_{u}$ is a constant.
Proof. Since the probability distribution vector $\mathbf{s}^{t}$ sums to 1, i.e., $\mathbf{1}^{\top}\mathbf{s}^{t} = 1$, we have

$$ \mathbf{s}^{t+1} = \Bigl( cM^{\top} + \frac{1-c}{n}\,H^{\top}\mathbf{1}_{n\times m} \Bigr)\mathbf{s}^{t}, $$

where $\mathbf{1}_{n\times m}$ is an $n\times m$ matrix of all 1's. The matrix $T = cM^{\top} + \frac{1-c}{n}\,H^{\top}\mathbf{1}_{n\times m}$ is primitive since $T$ is irreducible and has positive diagonal elements [15]. Since the out-degree of every node is non-zero, every column of $T$ sums to 1. Thus, 1 is an eigenvalue of $T$. By the Perron-Frobenius theorem [15], there is a unique edge stationary distribution and the power method converges.
The node stationary distribution $\mathbf{r}$ can be obtained from the edge stationary distribution $\mathbf{s}$. The stationary probability of node $i$ equals $c$ times the sum of the edge stationary probabilities on the in-edges of $i$, plus an additional jumping probability $(1-c)/n$. The formula of $\mathbf{r}$ is given in Table 2.
Random walk with restart is the query biased version of PageRank. In random walk with restart, instead of jumping to every node uniformly, the surfer jumps to the given query node $q$. Thus, for random walk with restart, the jumping vector is $\mathbf{v} = H^{\top}\mathbf{q}$, where $\mathbf{q}_{q} = 1$, and $\mathbf{q}_{i} = 0$ if $i \neq q$.
The developed second-order PageRank and random walk with restart degenerate to their original first-order forms when the second-order transition probability is the same as the first-order transition probability, i.e., when $p_{i,j,k} = p_{j,k}$. Please see Appendix A [1] for the proofs.
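For PageRank, the degeneration can be sketched in a few lines. This is an informal reconstruction from the identities $M = EH$ (the first-order case of Section 3.3) and $P = HE$ (Section 3.2), and is not claimed to match the authors' proof in Appendix A, which is not reproduced here.

```latex
% Assume p_{i,j,k} = p_{j,k}, so that M = EH; recall P = HE.
\begin{align*}
\mathbf{s} &= cM^{\top}\mathbf{s} + (1-c)H^{\top}\mathbf{1}/n
            = cH^{\top}E^{\top}\mathbf{s} + (1-c)H^{\top}\mathbf{1}/n \\
           &= H^{\top}\bigl(cE^{\top}\mathbf{s} + (1-c)\mathbf{1}/n\bigr)
            = H^{\top}\mathbf{r}, \\
\mathbf{r} &= cE^{\top}\mathbf{s} + (1-c)\mathbf{1}/n
            = cE^{\top}H^{\top}\mathbf{r} + (1-c)\mathbf{1}/n
            = cP^{\top}\mathbf{r} + (1-c)\mathbf{1}/n .
\end{align*}
```

The same steps go through for random walk with restart once $\mathbf{1}/n$ is replaced by the query vector $\mathbf{q}$.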
5. THE SECOND-ORDER SIMRANK
In SimRank, the random walk process involves two random surfers [11]. Next, we first review the preliminaries of SimRank and discuss its representation based on meeting paths.
5.1 SimRank and Meeting Paths
The intuition behind SimRank is that two nodes are similar if their in-neighbors are also similar. Let $r_{i,j}$ denote the SimRank proximity value between nodes $i$ and $j$. SimRank is defined as

$$ r_{i,j} = \begin{cases} 1, & \text{if } i = j, \\ \dfrac{c}{|I_i|\cdot|I_j|} \sum_{k\in I_i}\sum_{l\in I_j} r_{k,l}, & \text{if } i \neq j, \end{cases} $$

where $I_i$ denotes the set of in-neighbors of node $i$, and $c \in (0,1)$ is a constant.
The SimRank value $r_{i,j}$ measures the expected number of steps required before two surfers, one starting at node $i$ and the other at node $j$, meet at the same node if they randomly walk backward, i.e., from a node to one of its in-neighbor nodes, in lock-step [11].
Since walking backward is counter-intuitive, in the following, we study SimRank in the reverse graph, which is obtained by reversing the direction of every edge in the original graph. In the reverse graph, the two random surfers walk forward to a meeting node, and SimRank can be defined as

$$ r_{i,j} = \begin{cases} 1, & \text{if } i = j, \\ \dfrac{c}{|O_i|\cdot|O_j|} \sum_{k\in O_i}\sum_{l\in O_j} r_{k,l}, & \text{if } i \neq j, \end{cases} $$

where $O_i$ denotes the set of out-neighbors of node $i$ in the reverse graph.
In matrix form, the above recursive definition can be denoted as $R = cPRP^{\top} + (1-c)I$, where matrix $R$ records proximity values for all node pairs with $[R]_{i,j} = r_{i,j}$ [18, 27, 13].

SimRank values can also be represented as the weighted sum of probabilities of visiting all meeting paths [27].
Definition 1. [Meeting Path] A meeting path $\phi$ of length $\{a,b\}$ between nodes $i$ and $j$ in a graph $G(V,E)$ is a sequence of nodes, denoted as $z_0 \to z_1 \to \cdots \to z_a \leftarrow \cdots \leftarrow z_{b-1} \leftarrow z_b$, such that $i = z_0$, $j = z_b$, $(z_{t-1}, z_t) \in E$ for $t = 1, 2, \cdots, a$, and $(z_t, z_{t-1}) \in E$ for $t = a+1, a+2, \cdots, b$.

A meeting path of length $\{a,b\}$ is symmetric if $b = 2a$, such as the meeting path $4 \to 2 \to 1 \leftarrow 3 \leftarrow 5$ in Figure 4(a).
A meeting path $\phi: i = z_0 \to z_1 \to \cdots \to z_a \leftarrow \cdots \leftarrow z_{b-1} \leftarrow z_b = j$ can be decomposed into two paths $\rho_1: i = z_0 \to z_1 \to \cdots \to z_a$ and $\rho_2: j = z_b \to z_{b-1} \to \cdots \to z_a$. In the first-order random walk, starting from $i$, the probability to visit path $\rho_1$ is $\mathbb{P}[\rho_1] = \prod_{t=1}^{a} p_{z_{t-1},z_t}$. Similarly, starting from $j$, the probability to visit path $\rho_2$ is $\mathbb{P}[\rho_2] = \prod_{t=a+1}^{b} p_{z_t,z_{t-1}}$. The probability for the two surfers to visit $\phi$ and meet at node $z_a$ is thus $\mathbb{P}[\phi] = \mathbb{P}[\rho_1]\cdot\mathbb{P}[\rho_2]$.
Let $\Phi^{a,b}_{i,j}$ denote the set of all meeting paths of length $\{a,b\}$ between nodes $i$ and $j$, and $\mathbb{P}[\Phi^{a,b}_{i,j}]$ be the sum of probabilities of visiting the meeting paths in $\Phi^{a,b}_{i,j}$. We have the following lemma [27].

Lemma 1. $\mathbb{P}[\Phi^{a,b}_{i,j}] = [P^{a}(P^{\top})^{b-a}]_{i,j}$
Thus the SimRank value $r_{i,j}$ can be represented as

$$ r_{i,j} = (1-c)\sum_{t=0}^{\infty} c^{t}\,\mathbb{P}[\Phi^{t,2t}_{i,j}] = (1-c)\sum_{t=0}^{\infty} c^{t}\,[P^{t}(P^{\top})^{t}]_{i,j} \qquad (1) $$

That is, $r_{i,j}$ is the weighted sum of probabilities of visiting all symmetric meeting paths between nodes $i$ and $j$. In matrix form, we have

$$ R = (1-c)\sum_{t=0}^{\infty} c^{t}\,P^{t}(P^{\top})^{t} . $$
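The series form suggests a simple way to approximate $R$: truncate the sum after a fixed number of terms. The sketch below is an illustration, not the authors' algorithm; the function name, the truncation length, the decay value `c=0.6`, and the toy transition matrix (Figure 4(a) with unit weights) are assumptions.

```python
import numpy as np

def simrank_series(P, c=0.6, terms=100):
    """Truncated series (1 - c) * sum_t c^t P^t (P^T)^t."""
    n = P.shape[0]
    R = np.zeros((n, n))
    Pt = np.eye(n)                    # holds P^t
    for t in range(terms):
        R += (c ** t) * (Pt @ Pt.T)   # P^t (P^T)^t = P^t (P^t)^T
        Pt = Pt @ P
    return (1 - c) * R

# Row-stochastic transition matrix of the toy graph (0-based node ids).
edges = [(0, 3), (1, 0), (2, 0), (2, 4), (3, 1), (4, 2)]
n = 5
A = np.zeros((n, n))
for src, dst in edges:
    A[src, dst] = 1.0
P = A / A.sum(axis=1, keepdims=True)

c = 0.6
R = simrank_series(P, c=c)
# The truncated series should satisfy the matrix recursion up to c^terms.
residual = np.abs(R - (c * P @ R @ P.T + (1 - c) * np.eye(n))).max()
print(residual)
```

Because the omitted tail is bounded by $c^{\text{terms}}$, a hundred terms are far more than needed for `c=0.6`; fewer terms trade accuracy for speed.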
5.2 Visiting Meeting Paths in the Second-Order
To develop the second-order SimRank, we need to know the probability of visiting the meeting paths in the second-order. Consider the second-order visiting probability of path $\rho_1: i = z_0 \to z_1 \to \cdots \to z_a$. Starting from node $i$, in the first step, the surfer follows the first-order transition probability, and then in the subsequent steps, the surfer follows the second-order transition probabilities. Thus in the second-order random walk, starting from $i$, the probability to visit path $\rho_1$ is $\mathbb{M}[\rho_1] = p_{z_0,z_1}\prod_{t=1}^{a-1} p_{z_{t-1},z_t,z_{t+1}}$. Similarly, the probability to visit path $\rho_2$ is $\mathbb{M}[\rho_2] = p_{z_b,z_{b-1}}\prod_{t=a+1}^{b-1} p_{z_{t+1},z_t,z_{t-1}}$. The probability for the two surfers to visit the meeting path $\phi$ and meet at node $z_a$ is $\mathbb{M}[\phi] = \mathbb{M}[\rho_1]\cdot\mathbb{M}[\rho_2]$.
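Let $\mathbb{M}[\Phi^{a,b}_{i,j}] = \sum_{\phi\in\Phi^{a,b}_{i,j}} \mathbb{M}[\phi]$ be the sum of probabilities of visiting the meeting paths in $\Phi^{a,b}_{i,j}$ in the second-order. The following lemma shows how to compute $\mathbb{M}[\Phi^{a,b}_{i,j}]$ for different cases.

Lemma 2.

$$ \mathbb{M}[\Phi^{a,b}_{i,j}] = \begin{cases} I_{i,j}, & \text{if } 0 = a = b, \\ [HM^{a-1}E]_{i,j}, & \text{if } 0 < a = b, \\ [E^{\top}(M^{\top})^{b-1}H^{\top}]_{i,j}, & \text{if } 0 = a < b, \\ [HM^{a-1}EE^{\top}(M^{\top})^{b-a-1}H^{\top}]_{i,j}, & \text{if } 0 < a < b. \end{cases} $$

Please see Appendix B [1] for the proof.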
Lemma 1 for the first-order random walk is a special case of Lemma 2 when the second-order transition probability is the same as the first-order transition probability.

Lemma 3. If $p_{i,j,k} = p_{j,k}$, we have that $\mathbb{M}[\Phi^{a,b}_{i,j}] = \mathbb{P}[\Phi^{a,b}_{i,j}]$.

Please see Appendix B [1] for the proof.
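Lemma 3 can be sanity-checked numerically on a small example. The sketch below is a numerical check on one assumed graph, not a proof: it evaluates the four cases of Lemma 2 with $M = EH$ and compares them with $P^{a}(P^{\top})^{b-a}$ from Lemma 1. The function name `meeting_prob` and the toy graph with unit weights are hypothetical.

```python
import numpy as np
from numpy.linalg import matrix_power

def meeting_prob(H, M, E, a, b):
    """Matrix of second-order meeting-path probabilities for length {a, b}, 0 <= a <= b."""
    n = H.shape[0]
    if a == 0 and b == 0:
        return np.eye(n)
    if a == b:                                    # 0 < a = b
        return H @ matrix_power(M, a - 1) @ E
    if a == 0:                                    # 0 = a < b
        return E.T @ matrix_power(M.T, b - 1) @ H.T
    return (H @ matrix_power(M, a - 1) @ E        # 0 < a < b
            @ E.T @ matrix_power(M.T, b - a - 1) @ H.T)

# Toy graph of Figure 4(a), 0-based node ids, unit weights (assumption).
edges = [(0, 3), (1, 0), (2, 0), (2, 4), (3, 1), (4, 2)]
n, m = 5, len(edges)
B, E = np.zeros((n, m)), np.zeros((m, n))
for u, (src, dst) in enumerate(edges):
    B[src, u], E[u, dst] = 1.0, 1.0
H = np.linalg.inv(B @ B.T) @ B
P = H @ E
M = E @ H                                         # first-order case p_{i,j,k} = p_{j,k}

max_gap = max(
    np.abs(meeting_prob(H, M, E, a, b)
           - matrix_power(P, a) @ matrix_power(P.T, b - a)).max()
    for b in range(5) for a in range(b + 1)
)
print(max_gap)
```

The agreement follows from $H(EH)^{a-1}E = (HE)^{a} = P^{a}$, which is the algebra the check exercises.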
Replacing $\mathbb{P}[\Phi^{t,2t}_{i,j}]$ by $\mathbb{M}[\Phi^{t,2t}_{i,j}]$ in Equation (1), we have the second-order SimRank proximity

$$ r_{i,j} = (1-c)\sum_{t=0}^{\infty} c^{t}\,\mathbb{M}[\Phi^{t,2t}_{i,j}] . $$
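Combining this series with the symmetric case of Lemma 2 ($a = t$, $b = 2t$) gives a direct, if naive, way to approximate the second-order SimRank matrix. The sketch below truncates the series; the function name, truncation length, decay value, and toy graph are assumptions, and this is not the Monte Carlo method the paper develops.

```python
import numpy as np

def second_order_simrank(H, M, E, c=0.6, terms=100):
    """Truncated (1 - c) * [I + sum_{t>=1} c^t H M^{t-1} E E^T (M^T)^{t-1} H^T]."""
    n = H.shape[0]
    R = np.eye(n)                     # t = 0 term: meeting paths of length {0, 0}
    HM = H.copy()                     # holds H M^{t-1}, starting at t = 1
    for t in range(1, terms):
        G = HM @ E                    # H M^{t-1} E, an n x n matrix
        R += (c ** t) * (G @ G.T)     # G G^T = H M^{t-1} E E^T (M^T)^{t-1} H^T
        HM = HM @ M
    return (1 - c) * R

# Toy graph of Figure 4(a), 0-based node ids, unit weights (assumption).
edges = [(0, 3), (1, 0), (2, 0), (2, 4), (3, 1), (4, 2)]
n, m = 5, len(edges)
B, E = np.zeros((n, m)), np.zeros((m, n))
for u, (src, dst) in enumerate(edges):
    B[src, u], E[u, dst] = 1.0, 1.0
H = np.linalg.inv(B @ B.T) @ B
M = E @ H                             # swap in a second-order M to see its effect

R2 = second_order_simrank(H, M, E)
print(np.round(R2, 3))
```

With $M = EH$ the output coincides with the first-order series of Equation (1); with an $M$ obtained from flow data or the autoregressive model, only the edge-to-edge steps inside the loop change.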

References
Proceedings Article

The PageRank Citation Ranking : Bringing Order to the Web

TL;DR: This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them, and shows how to efficiently compute PageRank for large numbers of pages.
Book ChapterDOI

Probability Inequalities for sums of Bounded Random Variables

TL;DR: In this article, upper bounds for the probability that the sum S of n independent random variables exceeds its mean ES by a positive number nt are derived for certain sums of dependent random variables such as U statistics.
Journal IssueDOI

The link-prediction problem for social networks

TL;DR: Experiments on large coauthorship networks suggest that information about future interactions can be extracted from network topology alone, and that fairly subtle measures for detecting node proximity can outperform more direct measures.
Proceedings Article

Semi-supervised learning using Gaussian fields and harmonic functions

TL;DR: An approach to semi-supervised learning is proposed that is based on a Gaussian random field model, and methods to incorporate class priors and the predictions of classifiers obtained by supervised learning are discussed.
Proceedings Article

Handwritten Digit Recognition with a Back-Propagation Network

TL;DR: Minimal preprocessing of the data was required, but architecture of the network was highly constrained and specifically designed for the task, and has 1% error rate and about a 9% reject rate on zipcode digits provided by the U.S. Postal Service.