Remember where you came from: on the second-order random walk based proximity measures

doi:10.14778/3015270.3015272


Yubao Wu, Yuchen Bian, Xiang Zhang

Department of Electrical Engineering and Computer Science, Case Western Reserve University

{yubao.wu, yuchen.bian, xiang.zhang}@case.edu

ABSTRACT

Measuring the proximity between different nodes is a fundamental problem in graph analysis. Random walk based proximity measures have been shown to be effective and are widely used. Most existing random walk measures are based on the first-order Markov model, i.e., they assume that the next step of the random surfer depends only on the current node. However, this assumption neither holds in many real-life applications nor captures the clustering structure in the graph. To address the limitation of the existing first-order measures, in this paper, we study the second-order random walk measures, which take the previously visited node into consideration. While the existing first-order measures are built on node-to-node transition probabilities, in the second-order random walk, we need to consider the edge-to-edge transition probabilities. Using incidence matrices, we develop simple and elegant matrix representations for the second-order proximity measures. A desirable property of the developed measures is that they degenerate to their original first-order forms when the effect of the previous step is zero. We further develop Monte Carlo methods to efficiently compute the second-order measures and provide theoretical performance guarantees. Experimental results show that in a variety of applications, the second-order measures can dramatically improve the performance compared to their first-order counterparts.

1. INTRODUCTION

A fundamental problem in graph analysis is to measure the proximity (or closeness) between different nodes. It serves as the basis of many advanced tasks such as ranking and querying [22, 25, 11, 27], community detection [2, 26], link prediction [21, 19], and graph-based semi-supervised learning [29, 28].

Designing effective proximity measures is a challenging task. The simplest notion of proximity is based on the shortest path or the network flow between two nodes [6]. Random walk based measures have recently been shown to be effective and are widely used in various applications. The basic idea is to allow a surfer to randomly explore the graph. The probabilities of the nodes being visited by the random surfer are used to measure the importance of the nodes or the similarity between different nodes. The most commonly used random walk based proximity measures include PageRank [22], random walk with restart [25], and SimRank [11].

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Proceedings of the VLDB Endowment, Vol. 10, No. 1

Figure 1: An example of the web domain graph (left community: smart phone; right community: flight)

Most existing random walk measures are based on the first-order Markov model [15], i.e., they assume the next node to be visited depends only on the current node and is independent of the previous step. However, this assumption does not hold in many real-life applications. For example, consider the clickstream data which records the sequences of web domains visited by individual users [3]. The existing first-order random walk measures assume that the next page a user will visit depends only on the current page and is independent of the previous page the user has visited. This is clearly not true.

Figure 1 shows a subgraph of the real-life web domain graph [17] (see footnote 1). Each node in the graph represents a domain, and two domains share an edge if there are hyperlinks between them. The domains in the graph form two communities. The domains in the left community are about smart phones, and those in the right community are about flights. Suppose the random surfer is currently on google.com and the previously visited node is apple.com, i.e., the surfer came from the smart phone community. The existing first-order random walk measures do not consider where the surfer came from, and the transition probability depends only on the edges incident to the current node. Based on this assumption and the graph topology, in the next step, the probabilities to visit att.com and delta.com are $2.4 \times 10^{-5}$ and $3.1 \times 10^{-5}$ respectively. That is, the surfer has a higher probability to visit a domain about flights even though she just visited a smart phone domain. However, using the real-life clickstream data (collected from comScore Inc.), given that the previous node is apple.com, the probabilities to visit att.com and delta.com are $8.5 \times 10^{-4}$ and $3.7 \times 10^{-6}$ respectively. That is, the probability to visit a smart phone domain is more than 200 times higher than the probability to visit a flight domain.

Footnote 1: The entire graph is publicly available at http://webdatacommons.org

Figure 2: An example of the Twitter follower network (left community: basketball; right community: football)

Figure 3: An example of the research collaboration network

As another example, consider the Twitter follower network. Figure 2 shows a subgraph of the real Twitter follower network. The users form two communities, the basketball community on the left and the football community on the right. If the NBA player LeBron James (@KingJames) posts a tweet, it is likely to be propagated among the users in the basketball community instead of the football community. Similarly, if the Pittsburgh Steelers (@steelers) post a tweet, it is likely to be propagated in the football community. The real tweet cascade data supports this intuition: given that the tweet is from @KingJames, the transition probabilities from @espn to @NBA and from @espn to @NFL are $5.6 \times 10^{-2}$ and $2.1 \times 10^{-4}$ respectively. However, using the first-order random walk, the transition probabilities are both $4.5 \times 10^{-3}$. That is, the probabilities to go to both communities are similar.

In the examples above, the visiting sequences are recorded in the network flow data. When such data is not available, it is still important to know where the surfer came from. Consider the local community detection problem, whose goal is to find a community near a given query node [2, 26]. Using the DBLP data (see footnote 2), Figure 3 shows three different research communities involving Prof. Jiawei Han at UIUC. The authors in the left community are senior researchers in the core data mining research areas. The authors in the upper right community have published many works on social media mining. The authors in the lower right community mostly collaborate on information retrieval. Suppose that the random surfer came from the left community, e.g., from Prof. Wei Wang, and is currently at node Prof. Jiawei Han. Intuitively, in the next step, the surfer should walk to a node in the left community, since the authors in this community are more similar to Prof. Wei Wang. However, using the first-order random walk model, the probabilities of the surfer walking into the three communities are similar.

To address the limitation of the existing first-order random walk based proximity measures, in this paper, we investigate the second-order random walk measures, which take the previously visited node into consideration. We systematically study the theoretical foundations of the second-order measures. Specifically, the existing first-order measures are all built on the node-to-node transition probabilities, which can be defined using the adjacency matrix of the graph. To take the previous step into consideration, in the second-order random walk, we need to consider edge-to-edge transition probabilities. We show that such probabilities can be conveniently represented by incidence matrices of the graph [15]. Based on these mathematical tools, we develop simple and elegant matrix representations for the second-order measures including PageRank [22], random walk with restart [25], SimRank [11], and SimRank* [27], which are among the most widely used proximity measures. A desirable property of the developed second-order measures is that they degenerate to their original first-order forms when the effect of the previous step is zero. Furthermore, to efficiently compute the second-order measures, we design Monte Carlo algorithms, which effectively simulate paths of the random surfer and estimate proximity values. We formally prove that the estimated proximity value is sharply concentrated around the exact value and converges to the exact value when the sample size is large. We perform extensive experiments to evaluate the effectiveness of the developed second-order measures and the efficiency of the Monte Carlo algorithms using both real and synthetic networks.

Footnote 2: The data is publicly available at http://dblp.uni-trier.de/xml/

2. RELATED WORK

In the first-order random walk, a random surfer explores the graph according to the node-to-node transition probabilities determined by the graph topology. If the random walk on the graph is irreducible and aperiodic, there is a stationary probability for visiting each node [15]. Various random walk based proximity measures have been developed, among which PageRank [22], random walk with restart [25], SimRank [11], and SimRank* [27] have gained significant popularity and been extensively studied. In PageRank, in addition to following the transition probability, at each time point, the surfer also has a constant probability to jump to any node in the graph. Random walk with restart is the query-biased version of PageRank: at each time point, the surfer has a constant probability to jump to the query node. SimRank is based on the intuition that two nodes are similar if their neighbors are similar. The SimRank value between two nodes measures the expected number of steps required before two surfers, one starting from each node, meet at the same node if they walk in lock-step. SimRank* is a variant of SimRank, which allows the two surfers not to walk in lock-step.

Very limited work has been done on the second-order random walk measures. In [24], the authors study memory-based PageRank, which considers the previously visited node. However, the developed measure does not degenerate to the original PageRank when the effect from the previous node is zero. Along the same line, multilinear PageRank [9] also tries to generalize PageRank to the second order. It approximates the probability of visiting an edge by the product of the probabilities of visiting its two end nodes. This may not be reasonable, e.g., the probability of visiting a nonexistent edge would be non-zero. Both methods are specifically designed for PageRank and do not apply to other measures.

3. THE SECOND-ORDER RANDOM WALK

In this section, we study the foundation of the second-order random walk. The first-order random walk is based on node-to-node transition probabilities. In the second-order random walk, we need to consider edge-to-edge transition probabilities. The main symbols used in this paper and their definitions are listed in Table 1.

Table 1: Main symbols

symbols | definitions
$G(V, E)$ | directed graph $G$ with node set $V$ and edge set $E$
$I_i$, $O_i$ | set of in-/out-neighbor nodes of node $i$
$n$, $m$, $\sigma$ | number of nodes; number of edges; $\sigma = \sum_{i \in V} |I_i| \cdot |O_i|$
$\mathbf{B}$ | $n \times m$ incidence matrix, $[\mathbf{B}]_{i,u} = 1$ if $u$ is an out-edge of $i$
$\mathbf{E}$ | $m \times n$ incidence matrix, $[\mathbf{E}]_{u,i} = 1$ if $u$ is an in-edge of $i$
$w_{i,j}$, $w_i$ | weight of edge $(i,j)$; out-degree of $i$: $w_i = \sum_{j \in O_i} w_{i,j}$
$\mathbf{W}$ | $m \times m$ diagonal matrix, $[\mathbf{W}]_{u,u} = w_{i,j}$ if edge $u = (i,j)$
$\mathbf{D}$ | $n \times n$ diagonal matrix, $[\mathbf{D}]_{i,i} = w_i$
$p_{i,j}$ | transition probability from node $i$ to $j$
$p_{i,j,k}$ | transition probability from $j$ to $k$ if the surfer came from $i$
$p_{u,v}$ | transition probability from edge $u$ to $v$, $p_{(i,j),(j,k)} = p_{i,j,k}$
$\mathbf{P}$ | $n \times n$ node-to-node transition matrix, $[\mathbf{P}]_{i,j} = p_{i,j}$
$\mathbf{H}$ | $n \times m$ node-to-edge transition matrix, $[\mathbf{H}]_{i,(i,j)} = p_{i,j}$
$\mathbf{M}$ | $m \times m$ edge-to-edge transition matrix, $[\mathbf{M}]_{u,v} = p_{u,v}$
$r_{i,j}$, $r_i$ | $r_{i,j}$: proximity value of node $i$ w.r.t. node $j$; $r_i = r_{i,q}$
$\mathbf{r}$, $\mathbf{R}$ | $\mathbf{r}$: $n \times 1$ vector, $[\mathbf{r}]_i = r_i$; $\mathbf{R}$: $n \times n$ matrix, $[\mathbf{R}]_{i,j} = r_{i,j}$
$s_u$, $s_{(i,j)}$ | proximity value of edge $u$ or $(i,j)$ w.r.t. query node $q$
$s_{u,v}$ | proximity value between edges $u$ and $v$
$\mathbf{s}$, $\mathbf{S}$ | $\mathbf{s}$: $m \times 1$ vector, $[\mathbf{s}]_u = s_u$; $\mathbf{S}$: $m \times m$ matrix, $[\mathbf{S}]_{u,v} = s_{u,v}$

Figure 4: An example graph and its line graph. (a) a toy graph; (b) the line graph

3.1 The Edge-to-Edge Transition Probability

Consider the first-order random walk, where a surfer walks from node $i$ to $j$ with probability $p_{i,j}$. Let $X_t$ be a random variable representing the node visited by the surfer at time point $t$. The node-to-node transition probability $p_{i,j}$ can be represented as a conditional probability $P[X_t = j \mid X_{t-1} = i]$. Let $r^t_j = P[X_t = j]$ represent the probability of the surfer visiting node $j$ at time $t$. We have

$$ r^t_j = \sum_{i \in I_j} p_{i,j} \cdot r^{t-1}_i , $$

where $I_j$ is the set of in-neighbors of $j$.

Now consider the second-order random walk. We need to consider where the surfer came from, i.e., the node visited before the current node. We use $p_{i,j,k}$ to represent the transition probability from node $j$ to $k$ given that the previous step was from node $i$ to $j$, i.e.,

$$ p_{i,j,k} = P[X_{t+1} = k \mid X_{t-1} = i, X_t = j] = P[X_t = j, X_{t+1} = k \mid X_{t-1} = i, X_t = j]. $$

Let $Y_t = (i,j)$ represent the joint event $(X_{t-1} = i, X_t = j)$, i.e., the surfer is at node $i$ at time $(t-1)$ and at node $j$ at time $t$. Then, the second-order transition probability can be written as $p_{i,j,k} = P[Y_{t+1} = (j,k) \mid Y_t = (i,j)]$, which can be interpreted as the transition probability between edges: letting $u = (i,j)$ be the edge from node $i$ to $j$, and $v = (j,k)$ be the edge from node $j$ to $k$, we can rewrite $p_{i,j,k}$ as $p_{u,v}$.

Probability $p_{i,j,k}$ can be treated as the node-to-node transition probability in the line graph of the original graph. For example, Figures 4(a) and 4(b) show an example graph and its line graph. The second-order transition probability $p_{4,2,1}$ in Figure 4(a) is the same as the first-order transition probability $p_{e,b}$ in Figure 4(b).

Let $s^t_{(i,j)} = P[Y_t = (i,j)]$ denote the probability of visiting edge $(i,j)$ between time $(t-1)$ and $t$. We have

$$ s^{t+1}_{(j,k)} = \sum_{i \in I_j} p_{i,j,k} \cdot s^t_{(i,j)} . $$
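As a small worked instance of the edge recursion, the following sketch assumes the toy graph of Figure 4(a) has the edges a: 1→4, b: 2→1, c: 3→1, d: 3→5, e: 4→2, f: 5→3, as read off the example incidence matrices in Section 3.2. Node 1 has in-neighbors 2 and 3, so the sum for edge $a = (1,4)$ has two terms:

```latex
s^{t+1}_{(1,4)} \;=\; p_{2,1,4}\, s^{t}_{(2,1)} \;+\; p_{3,1,4}\, s^{t}_{(3,1)}
```

Under this reading, $(1,4)$ is the only out-edge of node 1, so both second-order probabilities equal 1 and the update reduces to $s^{t+1}_a = s^t_b + s^t_c$, which is the first-order update at node $a$ of the line graph.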

In the following, we introduce incidence matrices, which will be used as building blocks in the second-order random walk measures.

3.2 Incidence Matrices as the Basic Tool

A graph can be represented by its adjacency matrix $\mathbf{A}$, whose element $[\mathbf{A}]_{i,j}$ represents the weight of edge $(i,j)$. Let $\mathbf{D}$ denote the diagonal matrix with $[\mathbf{D}]_{i,i}$ being the out-degree of node $i$. In the first-order random walk, the node-to-node transition matrix can be represented as $\mathbf{P} = \mathbf{D}^{-1}\mathbf{A}$. In the second-order random walk, instead of using the adjacency matrix, we will use incidence matrices [15].

The incidence matrices $\mathbf{B}$ and $\mathbf{E}$ represent the out-edges and in-edges of the nodes respectively. In matrix $\mathbf{B}$, each row represents a node and each column represents an edge. In matrix $\mathbf{E}$, each row represents an edge and each column represents a node. The elements in matrices $\mathbf{B}$ and $\mathbf{E}$ are defined as follows.

$$ [\mathbf{B}]_{i,u} = \begin{cases} 1, & \text{if edge } u \text{ is an out-edge of node } i, \\ 0, & \text{otherwise.} \end{cases} \qquad [\mathbf{E}]_{u,i} = \begin{cases} 1, & \text{if edge } u \text{ is an in-edge of node } i, \\ 0, & \text{otherwise.} \end{cases} $$

For example, the incidence matrices of the graph in Figure 4(a) are

$$ \mathbf{B} = \begin{array}{c|cccccc} & a & b & c & d & e & f \\ \hline 1 & 1 & 0 & 0 & 0 & 0 & 0 \\ 2 & 0 & 1 & 0 & 0 & 0 & 0 \\ 3 & 0 & 0 & 1 & 1 & 0 & 0 \\ 4 & 0 & 0 & 0 & 0 & 1 & 0 \\ 5 & 0 & 0 & 0 & 0 & 0 & 1 \end{array} \quad \text{and} \quad \mathbf{E}^{\top} = \begin{array}{c|cccccc} & a & b & c & d & e & f \\ \hline 1 & 0 & 1 & 1 & 0 & 0 & 0 \\ 2 & 0 & 0 & 0 & 0 & 1 & 0 \\ 3 & 0 & 0 & 0 & 0 & 0 & 1 \\ 4 & 1 & 0 & 0 & 0 & 0 & 0 \\ 5 & 0 & 0 & 0 & 1 & 0 & 0 \end{array} $$

Note that in the above definitions, the orders of nodes and edges are consistent in $\mathbf{B}$ and $\mathbf{E}$.

The incidence matrices can be conveniently used to reconstruct other commonly used matrices in graph analytics. For example, let $\mathbf{W}$ be a diagonal matrix with $[\mathbf{W}]_{u,u}$ being the weight of edge $u$; then the adjacency matrix can be represented by the incidence matrices as $\mathbf{A} = \mathbf{B}\mathbf{W}\mathbf{E}$. The out-degree matrix $\mathbf{D}$ can be represented as $\mathbf{D} = \mathbf{B}\mathbf{W}\mathbf{B}^{\top}$.

Let $\mathbf{H}$ denote the node-to-edge transition probability matrix, with $[\mathbf{H}]_{i,u}$ representing the probability that the surfer will go through an out-edge $u$ of node $i$, i.e.,

$$ [\mathbf{H}]_{i,u} = \begin{cases} w_u / w_i, & \text{if edge } u \text{ is an out-edge of node } i, \\ 0, & \text{otherwise,} \end{cases} $$

where $w_u$ is the weight of edge $u$, and $w_i$ is the out-degree of node $i$. $\mathbf{H}$ can be represented using incidence matrices as $\mathbf{H} = \mathbf{D}^{-1}\mathbf{B}\mathbf{W}$. The node-to-node transition probability matrix can then be represented as $\mathbf{P} = \mathbf{H}\mathbf{E}$.
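The identities above can be checked numerically. The following is a minimal NumPy sketch, assuming the toy graph has the edges a: 1→4, b: 2→1, c: 3→1, d: 3→5, e: 4→2, f: 5→3 (read off the example incidence matrices), unit edge weights, and 0-based node indices; the weights and variable names are illustrative assumptions, not part of the paper.

```python
import numpy as np

# Edges a..f of the toy graph, 0-indexed: (source, target).
edges = [(0, 3), (1, 0), (2, 0), (2, 4), (3, 1), (4, 2)]
weights = np.ones(len(edges))  # assumption: every edge has weight 1
n, m = 5, len(edges)

B = np.zeros((n, m))  # [B]_{i,u} = 1 if u is an out-edge of i
E = np.zeros((m, n))  # [E]_{u,i} = 1 if u is an in-edge of i
for u, (src, dst) in enumerate(edges):
    B[src, u] = 1.0
    E[u, dst] = 1.0

W = np.diag(weights)
A = B @ W @ E                  # adjacency matrix A = B W E
D = B @ W @ B.T                # diagonal out-degree matrix D = B W B^T
H = np.linalg.inv(D) @ B @ W   # node-to-edge transitions H = D^{-1} B W
P = H @ E                      # node-to-node transitions P = H E
```

Inverting $\mathbf{D}$ requires every node to have at least one out-edge, which holds for this graph; a graph with dangling nodes would need separate handling.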

3.3 Obtaining Edge Transition Probabilities

In the first-order random walk, the element $p_{i,j}$ in the node-to-node transition matrix $\mathbf{P}$ is calculated as $p_{i,j} = w_{i,j} / w_i$, where $w_{i,j}$ and $w_i$ are the weight of edge $(i,j)$ and the out-degree of node $i$ respectively.

In the second-order random walk, we use $\mathbf{M}$ to represent the edge-to-edge transition matrix, with element $p_{u,v} = p_{i,j,k}$, where $u = (i,j)$ and $v = (j,k)$. Next, we discuss two different ways to obtain the edge-to-edge transition probability.

Utilizing Network Flow Data: In many applications, the information on the node visiting sequences is available. For example, as discussed in Section 1, we may know the sequences of web domains browsed by different users, or we may have the tweet cascade information. In this case, we can break each sequence into trigrams, i.e., segments consisting of two consecutive edges [24]. For example, sequence $i \to j \to k \to l$ can be broken into two trigrams, $i \to j \to k$ and $j \to k \to l$.

To obtain the second-order transition probability, recall that $p_{i,j,k}$ is the conditional probability of visiting edge $(j,k)$ given that the previously visited edge is $(i,j)$. Let $\gamma_{i,j,k}$ be the number of trigrams $i \to j \to k$. Then $p_{i,j,k}$ can be calculated as

$$ p_{i,j,k} = \frac{\gamma_{i,j,k}}{\sum_{l \in O_j} \gamma_{i,j,l}}, $$

where $O_j$ is the set of out-neighbor nodes of $j$. That is, $p_{i,j,k}$ is the proportion of $i \to j \to k$ trigrams in all trigrams with $(i,j)$ being the first edge.
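The trigram counting described above can be sketched in a few lines of Python. The function name `second_order_from_sequences` and the three input sequences are hypothetical, made up purely for illustration; real inputs would be clickstream or cascade sequences.

```python
from collections import Counter, defaultdict

def second_order_from_sequences(sequences):
    """Estimate p[(i, j, k)] = gamma_{i,j,k} / sum_l gamma_{i,j,l}."""
    gamma = Counter()
    for seq in sequences:
        # Slide a window of three consecutive nodes over the sequence.
        for i, j, k in zip(seq, seq[1:], seq[2:]):
            gamma[(i, j, k)] += 1
    # Total number of trigrams whose first edge is (i, j).
    totals = defaultdict(int)
    for (i, j, _), count in gamma.items():
        totals[(i, j)] += count
    return {tri: count / totals[tri[:2]] for tri, count in gamma.items()}

# Hypothetical visiting sequences over abstract node labels.
sequences = [["i", "j", "k", "l"], ["i", "j", "k"], ["i", "j", "l"]]
p = second_order_from_sequences(sequences)
```

Here the first edge $(i,j)$ is followed twice by $k$ and once by $l$, so the estimate is $p_{i,j,k} = 2/3$ and $p_{i,j,l} = 1/3$. Trigrams that never occur simply get no entry, which corresponds to a probability of zero.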

When the network flow data is not available, we can use the following approach to obtain $p_{i,j,k}$.

Autoregressive Model: By taking the previous step into consideration, the autoregressive model calculates the second-order transition probability as follows [23]:

$$ p_{i,j,k} = \frac{p'_{i,j,k}}{\sum_{l \in O_j} p'_{i,j,l}}, $$

where $p'_{i,j,k} = (1-\alpha)\,p_{j,k} + \alpha\,p_{i,k}$. The parameter $\alpha$ ($0 \le \alpha < 1$) is a constant that controls the strength of the effect from the previous step. If $\alpha = 0$, the second-order transition probability degenerates to the first-order transition probability, i.e., $p_{i,j,k} = p_{j,k}$.

The edge-to-edge transition matrix $\mathbf{M}$ based on the autoregressive model can be represented using incidence matrices. Let

$$ \mathbf{M}' = (1-\alpha)\,\mathbf{E}\mathbf{H} + \alpha\,(\mathbf{E}\mathbf{B}) \odot (\mathbf{B}^{\top}\mathbf{P}\mathbf{E}^{\top}), $$

where $\odot$ denotes the Hadamard (entry-wise) product. Then $\mathbf{M}$ is the row-normalized $\mathbf{M}'$ such that $\sum_v p_{u,v} = 1$. If $\alpha = 0$, it degenerates to the first-order form and we have $\mathbf{M} = \mathbf{E}\mathbf{H}$.
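A minimal NumPy sketch of the autoregressive construction follows. It assumes unit edge weights and a hypothetical three-node graph (not from the paper) chosen so that the term $\alpha\,p_{i,k}$ is non-zero for at least one edge pair; the function name `autoregressive_M` is likewise an illustrative choice.

```python
import numpy as np

def autoregressive_M(edges, n, alpha):
    """Return (M, H, E) for a directed graph with unit edge weights."""
    m = len(edges)
    B = np.zeros((n, m))
    E = np.zeros((m, n))
    for u, (src, dst) in enumerate(edges):
        B[src, u] = 1.0
        E[u, dst] = 1.0
    # H = D^{-1} B W with W = I; assumes every node has an out-edge.
    H = B / B.sum(axis=1, keepdims=True)
    P = H @ E
    # M' = (1 - alpha) E H + alpha (E B) ⊙ (B^T P E^T); '*' is entry-wise.
    M_prime = (1 - alpha) * (E @ H) + alpha * (E @ B) * (B.T @ P @ E.T)
    M = M_prime / M_prime.sum(axis=1, keepdims=True)  # row normalization
    return M, H, E

# Hypothetical graph: edge indices 0..4 in the order listed.
edges = [(0, 1), (0, 2), (1, 0), (1, 2), (2, 0)]
M, H, E = autoregressive_M(edges, n=3, alpha=0.5)
```

In this example the surfer arriving at node 1 over edge (0,1) can continue to node 0 or node 2. Because node 0 also links directly to node 2, the autoregressive term raises the probability of continuing to node 2 from 1/2 (first-order) to 2/3.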

Note that in addition to the two methods discussed above, other methods, such as calculating the edge similarity based on the line graph [20], can also be applied to calculate the edge-to-edge transition probability. In this paper, we only focus on the two methods discussed here.

3.4 Matrix Form

Next we represent the second-order random walk in its matrix form. Let $\mathbf{s}^t$ denote the edge visiting probability vector between time points $(t-1)$ and $t$, i.e., $[\mathbf{s}^t]_u = s^t_{(i,j)}$ with $u = (i,j)$. We have

$$ \mathbf{s}^{t+1} = \mathbf{M}^{\top}\mathbf{s}^{t}. $$

If $\mathbf{M}$ is primitive, $\mathbf{s}^t$ converges according to the Perron-Frobenius theorem [15]. Let $\mathbf{s} = \lim_{t \to \infty} \mathbf{s}^t$ denote the edge stationary probability. After having $\mathbf{s}$, the node stationary probability is simply the sum of all in-edge stationary probabilities, i.e., $\mathbf{r} = \mathbf{E}^{\top}\mathbf{s}$.

In the following, we show how to generalize the commonly used proximity measures to their second-order forms. Table 2 summarizes the recursive equations of these measures in their first-order and second-order forms. In the table, RW, PR, RR, SR, and SS are shorthand notations for random walk, PageRank, random walk with restart, SimRank, and SimRank* respectively.

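Table 2: Recursive equations of various measures

measure | first-order | second-order
RW | $\mathbf{r} = \mathbf{P}^{\top}\mathbf{r}$ | $\mathbf{s} = \mathbf{M}^{\top}\mathbf{s}$; $\mathbf{r} = \mathbf{E}^{\top}\mathbf{s}$
PR | $\mathbf{r} = c\mathbf{P}^{\top}\mathbf{r} + (1-c)\mathbf{1}/n$ | $\mathbf{s} = c\mathbf{M}^{\top}\mathbf{s} + (1-c)\mathbf{H}^{\top}\mathbf{1}/n$; $\mathbf{r} = c\mathbf{E}^{\top}\mathbf{s} + (1-c)\mathbf{1}/n$
RR | $\mathbf{r} = c\mathbf{P}^{\top}\mathbf{r} + (1-c)\mathbf{q}$ | $\mathbf{s} = c\mathbf{M}^{\top}\mathbf{s} + (1-c)\mathbf{H}^{\top}\mathbf{q}$; $\mathbf{r} = c\mathbf{E}^{\top}\mathbf{s} + (1-c)\mathbf{q}$
SR | $\mathbf{R} = c\mathbf{P}\mathbf{R}\mathbf{P}^{\top} + (1-c)\mathbf{I}$ | $\mathbf{S} = c\mathbf{M}\mathbf{S}\mathbf{M}^{\top} + (1-c)\mathbf{E}\mathbf{E}^{\top}$; $\mathbf{R} = c\mathbf{H}\mathbf{S}\mathbf{H}^{\top} + (1-c)\mathbf{I}$
SS | $\mathbf{R} = \frac{c}{2}(\mathbf{P}\mathbf{R} + \mathbf{R}\mathbf{P}^{\top}) + (1-c)\mathbf{I}$ | $\mathbf{S} = \frac{c}{2}(\mathbf{M}\mathbf{S} + \mathbf{S}\mathbf{M}^{\top}) + (1-c)\mathbf{E}\mathbf{E}^{\top}$; $\mathbf{R}' = \frac{c}{2}\mathbf{M}\mathbf{R}' + \frac{c}{2}\mathbf{S}\mathbf{H}^{\top} + (1-c)\mathbf{E}$; $\mathbf{R} = \frac{c}{2}(\mathbf{H}\mathbf{R}' + (\mathbf{R}')^{\top}\mathbf{H}^{\top}) + (1-c)\mathbf{I}$

Figure 5: The jumping process in PageRank. (a) first-order; (b) second-order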

4. THE SECOND-ORDER PAGERANK

In the first-order PageRank, the surfer has a probability of $c$ to follow the node-to-node transition probabilities, and a probability of $(1-c)$ to randomly jump to any node in the graph. Figure 5(a) illustrates the jumping process.

The matrix form of the first-order PageRank is $\mathbf{r} = c\mathbf{P}^{\top}\mathbf{r} + (1-c)\mathbf{1}/n$, where $\mathbf{r}$ is the node visiting probability vector, $\mathbf{P}$ is the node-to-node transition matrix, and $\mathbf{1}$ is a vector of all 1's.

Similarly, in the second-order PageRank, the surfer has a probability of $c$ to follow the edge-to-edge transition probabilities, and a probability of $(1-c)$ to randomly jump to any node in the graph. Its matrix form can be written as $\mathbf{s}^{t+1} = c\mathbf{M}^{\top}\mathbf{s}^{t} + (1-c)\mathbf{v}$, where $\mathbf{v}$ is the vector corresponding to the jumping process. To determine $\mathbf{v}$, we consider the jumping process in further detail.

Figure 5(b) shows the jumping process in the second-order PageRank. At time point $(t-1)$, starting from node $i$, with probability $c$, the surfer first visits node $j$ and then $k$ by following the second-order transition probability $p_{i,j,k}$, and with probability $(1-c)$, the surfer randomly jumps to any node $x$ first and then visits node $y$. After jumping, the effect of the previous step is lost, thus $p_{i,x,y} = p_{x,y}$, which is the first-order transition probability. Since the sum of probabilities to jump to node $x$ is $(1-c)/n$, the probability of visiting edge $(x,y)$ between time points $t$ and $(t+1)$ is $(1-c)\,p_{x,y}/n$. Thus we have $\mathbf{v} = \mathbf{H}^{\top}\mathbf{1}/n$, where $\mathbf{H}$ is the node-to-edge transition matrix introduced in Section 3.2. Finally, we have

$$ \mathbf{s}^{t+1} = c\mathbf{M}^{\top}\mathbf{s}^{t} + (1-c)\mathbf{H}^{\top}\mathbf{1}/n . $$

Theorem 1. In the second-order PageRank, if the out-degree of every node is non-zero, there is a unique edge stationary distribution, i.e., $\lim_{t \to \infty} s^t_u = s_u$, where $s_u$ is a constant.

Proof. Since the probability distribution vector $\mathbf{s}^t$ sums to 1, i.e., $\mathbf{1}^{\top}\mathbf{s}^t = 1$, we have

$$ \mathbf{s}^{t+1} = \left[ c\mathbf{M}^{\top} + \frac{1-c}{n}\,\mathbf{H}^{\top}\mathbf{1}_{n \times m} \right] \mathbf{s}^{t}, $$

where $\mathbf{1}_{n \times m}$ is an $n \times m$ matrix of all 1's. The matrix $\mathbf{T} = c\mathbf{M}^{\top} + \frac{1-c}{n}\,\mathbf{H}^{\top}\mathbf{1}_{n \times m}$ is primitive since $\mathbf{T}$ is irreducible and has positive diagonal elements [15]. Since the out-degree of every node is non-zero, every column of $\mathbf{T}$ sums to 1. Thus, 1 is an eigenvalue of $\mathbf{T}$. By the Perron-Frobenius theorem [15], there is a unique edge stationary distribution and the power method converges.

The node stationary distribution $\mathbf{r}$ can be obtained from the edge stationary distribution $\mathbf{s}$. The stationary probability of node $i$ equals $c$ times the sum of the edge stationary probabilities on the in-edges of $i$, plus an additional jumping probability $(1-c)/n$. The formula of $\mathbf{r}$ is given in Table 2.
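The power method mentioned in the proof can be sketched as below. This is an illustrative NumPy sketch, not the paper's implementation: the three-node graph, unit edge weights, the damping value $c = 0.85$, the iteration count, and all function names are assumptions. It uses the autoregressive model to obtain $\mathbf{M}$.

```python
import numpy as np

def build(edges, n, alpha):
    """Return (P, H, E, M) for unit edge weights (autoregressive model)."""
    m = len(edges)
    B = np.zeros((n, m))
    E = np.zeros((m, n))
    for u, (src, dst) in enumerate(edges):
        B[src, u] = 1.0
        E[u, dst] = 1.0
    H = B / B.sum(axis=1, keepdims=True)  # needs non-zero out-degrees
    P = H @ E
    M_prime = (1 - alpha) * (E @ H) + alpha * (E @ B) * (B.T @ P @ E.T)
    return P, H, E, M_prime / M_prime.sum(axis=1, keepdims=True)

def second_order_pagerank(H, E, M, c=0.85, iters=200):
    n, m = H.shape
    v = H.T @ np.ones(n) / n       # jump vector v = H^T 1 / n
    s = np.full(m, 1.0 / m)        # start from a uniform edge distribution
    for _ in range(iters):
        s = c * (M.T @ s) + (1 - c) * v
    r = c * (E.T @ s) + (1 - c) * np.ones(n) / n
    return s, r

def first_order_pagerank(P, c=0.85, iters=200):
    n = P.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = c * (P.T @ r) + (1 - c) * np.ones(n) / n
    return r

edges = [(0, 1), (0, 2), (1, 0), (1, 2), (2, 0)]  # hypothetical graph
P, H, E, M = build(edges, n=3, alpha=0.0)
s, r = second_order_pagerank(H, E, M)
r_first = first_order_pagerank(P)
```

With `alpha=0.0` the edge-to-edge matrix reduces to $\mathbf{E}\mathbf{H}$, so `r` should agree with `r_first` up to numerical tolerance; rebuilding with a positive `alpha` gives node scores that account for the previous step.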

Random walk with restart is the query-biased version of PageRank. In random walk with restart, instead of jumping to every node uniformly, the surfer jumps to the given query node $q$. Thus, for random walk with restart, the jumping vector is $\mathbf{v} = \mathbf{H}^{\top}\mathbf{q}$, where $[\mathbf{q}]_q = 1$, and $[\mathbf{q}]_i = 0$ if $i \neq q$.

The developed second-order PageRank and random walk with restart degenerate to their original first-order forms when the second-order transition probability is the same as the first-order transition probability, i.e., when $p_{i,j,k} = p_{j,k}$. Please see Appendix A [1] for the proofs.

5. THE SECOND-ORDER SIMRANK

In SimRank, the random walk process involves two random surfers [11]. Next, we first give the preliminaries of SimRank and discuss its representation based on meeting paths.

5.1 SimRank and Meeting Paths

The intuition behind SimRank is that two nodes are similar if their in-neighbors are also similar. Let $r_{i,j}$ denote the SimRank proximity value between nodes $i$ and $j$. SimRank is defined as

$$ r_{i,j} = \begin{cases} 1, & \text{if } i = j, \\ \dfrac{c}{|I_i| \cdot |I_j|} \sum_{k \in I_i} \sum_{l \in I_j} r_{k,l}, & \text{if } i \neq j, \end{cases} $$

where $I_i$ denotes the set of in-neighbors of node $i$, and $c \in (0,1)$ is a constant.

The SimRank value $r_{i,j}$ measures the expected number of steps required before two surfers, one starting at node $i$ and the other at node $j$, meet at the same node if they randomly walk backward, i.e., from a node to one of its in-neighbor nodes, in lock-step [11].

Since walking backward is counter-intuitive, in the following, we study SimRank in the reverse graph, which is obtained by reversing the direction of every edge in the original graph. In the reverse graph, the two random surfers walk forward to a meeting node, and SimRank can be defined as

$$ r_{i,j} = \begin{cases} 1, & \text{if } i = j, \\ \dfrac{c}{|O_i| \cdot |O_j|} \sum_{k \in O_i} \sum_{l \in O_j} r_{k,l}, & \text{if } i \neq j, \end{cases} $$

where $O_i$ denotes the set of out-neighbors of node $i$ in the reverse graph.

In matrix form, the above recursive definition can be denoted as $\mathbf{R} = c\mathbf{P}\mathbf{R}\mathbf{P}^{\top} + (1-c)\mathbf{I}$, where matrix $\mathbf{R}$ records proximity values for all node pairs with $[\mathbf{R}]_{i,j} = r_{i,j}$ [18, 27, 13].

SimRank values can also be represented as the weighted sum of probabilities of visiting all meeting paths [27].

Definition 1. [Meeting Path] A meeting path $\phi$ of length $\{a,b\}$ between nodes $i$ and $j$ in a graph $G(V,E)$ is a sequence of nodes, denoted as $z_0 \to z_1 \to \cdots \to z_a \leftarrow \cdots \leftarrow z_{b-1} \leftarrow z_b$, such that $i = z_0$, $j = z_b$, $(z_{t-1}, z_t) \in E$ for $t = 1, 2, \cdots, a$, and $(z_t, z_{t-1}) \in E$ for $t = a+1, a+2, \cdots, b$.

A meeting path of length $\{a,b\}$ is symmetric if $b = 2a$, such as the meeting path $4 \to 2 \to 1 \leftarrow 3 \leftarrow 5$ in Figure 4(a).

A meeting path $\phi: i = z_0 \to z_1 \to \cdots \to z_a \leftarrow \cdots \leftarrow z_{b-1} \leftarrow z_b = j$ can be decomposed into two paths $\rho_1: i = z_0 \to z_1 \to \cdots \to z_a$ and $\rho_2: j = z_b \to z_{b-1} \to \cdots \to z_a$. In the first-order random walk, starting from $i$, the probability to visit path $\rho_1$ is $P[\rho_1] = \prod_{t=1}^{a} p_{z_{t-1}, z_t}$. Similarly, starting from $j$, the probability to visit path $\rho_2$ is $P[\rho_2] = \prod_{t=a+1}^{b} p_{z_t, z_{t-1}}$. The probability for the two surfers to visit $\phi$ and meet at node $z_a$ is thus $P[\phi] = P[\rho_1] \cdot P[\rho_2]$.

Let $\Phi^{a,b}_{i,j}$ denote the set of all meeting paths of length $\{a,b\}$ between nodes $i$ and $j$, and $P[\Phi^{a,b}_{i,j}]$ be the sum of probabilities of visiting the meeting paths in $\Phi^{a,b}_{i,j}$. We have the following lemma [27].

Lemma 1. $P[\Phi^{a,b}_{i,j}] = [\mathbf{P}^{a}(\mathbf{P}^{\top})^{b-a}]_{i,j}$

Thus the SimRank value $r_{i,j}$ can be represented as

$$ r_{i,j} = (1-c)\sum_{t=0}^{\infty} c^t\, P[\Phi^{t,2t}_{i,j}] = (1-c)\sum_{t=0}^{\infty} c^t\, [\mathbf{P}^{t}(\mathbf{P}^{\top})^{t}]_{i,j} \qquad (1) $$

That is, $r_{i,j}$ is the weighted sum of probabilities of visiting all symmetric meeting paths between nodes $i$ and $j$. In matrix form, we have

$$ \mathbf{R} = (1-c)\sum_{t=0}^{\infty} c^t\, \mathbf{P}^{t}(\mathbf{P}^{\top})^{t}. $$
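The series form can be evaluated by truncation and compared against the recursive matrix form $\mathbf{R} = c\mathbf{P}\mathbf{R}\mathbf{P}^{\top} + (1-c)\mathbf{I}$. The sketch below is illustrative only: the three-node graph, the decay value $c = 0.6$, the truncation length, and the function names are assumptions rather than values from the paper.

```python
import numpy as np

def transition_matrix(edges, n):
    """Row-stochastic P for unit edge weights (needs non-zero out-degrees)."""
    A = np.zeros((n, n))
    for src, dst in edges:
        A[src, dst] = 1.0
    return A / A.sum(axis=1, keepdims=True)

def simrank_series(P, c=0.6, terms=60):
    """Truncated series R = (1 - c) * sum_t c^t P^t (P^T)^t."""
    n = P.shape[0]
    R = np.zeros((n, n))
    Pt = np.eye(n)                  # holds P^t, starting at t = 0
    for t in range(terms):
        R += (1 - c) * c**t * (Pt @ Pt.T)   # P^t (P^T)^t = P^t (P^t)^T
        Pt = Pt @ P
    return R

edges = [(0, 1), (0, 2), (1, 0), (1, 2), (2, 0)]  # hypothetical graph
P = transition_matrix(edges, n=3)
c = 0.6
R = simrank_series(P, c=c)
# Residual of the recursive form; small when enough terms are kept.
residual = np.abs(R - (c * P @ R @ P.T + (1 - c) * np.eye(3))).max()
```

Multiplying the series by $c\mathbf{P}(\cdot)\mathbf{P}^{\top}$ shifts every term by one step, which is why the truncated sum satisfies the recursion up to the size of the first dropped term, roughly $c^{\text{terms}}$.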

5.2 Visiting Meeting Paths in the Second-Order

To develop the second-order SimRank, we need to know the probability of visiting the meeting paths in the second-order. Consider the second-order visiting probability of path $\rho_1: i = z_0 \to z_1 \to \cdots \to z_a$. Starting from node $i$, in the first step, the surfer follows the first-order transition probability, and then in the subsequent steps, the surfer follows the second-order transition probabilities. Thus in the second-order random walk, starting from $i$, the probability to visit path $\rho_1$ is $M[\rho_1] = p_{z_0, z_1} \prod_{t=1}^{a-1} p_{z_{t-1}, z_t, z_{t+1}}$. Similarly, the probability to visit path $\rho_2$ is $M[\rho_2] = p_{z_b, z_{b-1}} \prod_{t=a+1}^{b-1} p_{z_{t+1}, z_t, z_{t-1}}$. The probability for the two surfers to visit the meeting path $\phi$ and meet at node $z_a$ is $M[\phi] = M[\rho_1] \cdot M[\rho_2]$.

Let $M[\Phi^{a,b}_{i,j}] = \sum_{\phi \in \Phi^{a,b}_{i,j}} M[\phi]$ be the sum of probabilities of visiting the meeting paths in $\Phi^{a,b}_{i,j}$ in the second-order. The following lemma shows how to compute $M[\Phi^{a,b}_{i,j}]$ for different cases.

Lemma 2.

$$ M[\Phi^{a,b}_{i,j}] = \begin{cases} [\mathbf{I}]_{i,j}, & \text{if } 0 = a = b, \\ [\mathbf{H}\mathbf{M}^{a-1}\mathbf{E}]_{i,j}, & \text{if } 0 < a = b, \\ [\mathbf{E}^{\top}(\mathbf{M}^{\top})^{b-1}\mathbf{H}^{\top}]_{i,j}, & \text{if } 0 = a < b, \\ [\mathbf{H}\mathbf{M}^{a-1}\mathbf{E}\mathbf{E}^{\top}(\mathbf{M}^{\top})^{b-a-1}\mathbf{H}^{\top}]_{i,j}, & \text{if } 0 < a < b. \end{cases} $$

Please see Appendix B [1] for the proof.

Lemma 1 for the first-order random walk is a special case of Lemma 2 when the second-order transition probability is the same as the first-order transition probability.

Lemma 3. If $p_{i,j,k} = p_{j,k}$, we have that $M[\Phi^{a,b}_{i,j}] = P[\Phi^{a,b}_{i,j}]$.

Please see Appendix B [1] for the proof.

Replacing $P[\Phi^{t,2t}_{i,j}]$ by $M[\Phi^{t,2t}_{i,j}]$ in Equation (1), we have the second-order SimRank proximity

$$ r_{i,j} = (1-c)\sum_{t=0}^{\infty} c^t\, M[\Phi^{t,2t}_{i,j}]. $$
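Combining this series with Lemma 2 (the cases $a = b = 0$ and $0 < a < b$ with $b = 2a$) gives a direct way to evaluate the second-order SimRank by truncation. The following NumPy sketch is a hedged illustration: the three-node graph, unit edge weights, the autoregressive model for $\mathbf{M}$, $c = 0.6$, the truncation length, and the function names are all assumptions.

```python
import numpy as np

def build(edges, n, alpha):
    """Return (P, H, E, M) for unit edge weights (autoregressive model)."""
    m = len(edges)
    B = np.zeros((n, m))
    E = np.zeros((m, n))
    for u, (src, dst) in enumerate(edges):
        B[src, u] = 1.0
        E[u, dst] = 1.0
    H = B / B.sum(axis=1, keepdims=True)  # needs non-zero out-degrees
    P = H @ E
    M_prime = (1 - alpha) * (E @ H) + alpha * (E @ B) * (B.T @ P @ E.T)
    return P, H, E, M_prime / M_prime.sum(axis=1, keepdims=True)

def second_order_simrank(H, E, M, c=0.6, terms=60):
    """Truncated series (1 - c) * sum_t c^t M[Phi^{t,2t}] via Lemma 2."""
    n = H.shape[0]
    R = (1 - c) * np.eye(n)          # t = 0 term: the identity matrix
    Mt = np.eye(M.shape[0])          # holds M^{t-1}, starting at t = 1
    for t in range(1, terms):
        G = H @ Mt @ E               # second-order t-step path probabilities
        R += (1 - c) * c**t * (G @ G.T)   # H M^{t-1} E E^T (M^T)^{t-1} H^T
        Mt = Mt @ M
    return R

edges = [(0, 1), (0, 2), (1, 0), (1, 2), (2, 0)]  # hypothetical graph
P, H, E, M = build(edges, n=3, alpha=0.5)
R2 = second_order_simrank(H, E, M)
```

Calling `build` with `alpha=0.0` makes $\mathbf{H}\mathbf{M}^{t-1}\mathbf{E}$ equal to $\mathbf{P}^t$, so the result coincides with the first-order series, in line with Lemma 3.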
