Open Access Book Chapter (DOI)

Ranking of Closeness Centrality for Large-Scale Social Networks


Kazuya Okamoto¹, Wei Chen², and Xiang-Yang Li³

¹ Kyoto University, okia@kuis.kyoto-u.ac.jp
² Microsoft Research Asia, weic@microsoft.com
³ Illinois Institute of Technology and Microsoft Research Asia, xli@cs.iit.edu
Abstract. Closeness centrality is an important concept in social network analysis. In a graph representing a social network, closeness centrality measures how close a vertex is to all other vertices in the graph. In this paper, we combine existing methods on calculating exact values and approximate values of closeness centrality and present new algorithms to rank the top-k vertices with the highest closeness centrality. We show that under certain conditions, our algorithm is more efficient than the algorithm that calculates the closeness centralities of all vertices.
1 Introduction
Social networks have been the subject of study for many decades in social science research. In recent years, with the rapid growth of the Internet and the World Wide Web, many large-scale online social networks such as Facebook and Friendster have appeared, and many large-scale social network data sets, such as coauthorship networks, have become easily available online for analysis [Ne04a, Ne04b, EL05, PP02]. A social network is typically represented as a graph, with individual persons represented as vertices, the relationships between pairs of individuals as edges, and the strengths of the relationships represented as the weights on edges (for the purpose of finding the shortest weighted distance, we can treat lower-weight edges as stronger relationships). Centrality is an important concept in studying social networks [Fr79, NP03]. Conceptually, centrality measures how centrally an individual is positioned in a social network. Within graph theory and network analysis, various measures (see [KL05] for details) of the centrality of a vertex within a graph have been proposed to determine the relative importance of a vertex within the graph. Four measures of centrality that are widely used in network analysis are degree centrality, betweenness centrality⁴, closeness centrality, and eigenvector centrality⁵. In this paper, we focus on shortest-path closeness
⁴ For a graph $G = (V, E)$, the betweenness centrality $C_B(v)$ of a vertex $v$ is $C_B(v) = \sum_{s,t:\, s \neq t \neq v} \frac{\sigma_v(s,t)}{\sigma(s,t)}$, where $\sigma(s,t)$ is the number of shortest paths from $s$ to $t$, and $\sigma_v(s,t)$ is the number of shortest paths from $s$ to $t$ that pass through $v$.
⁵ Given a graph $G = (V, E)$ with adjacency matrix $A$, let $x_i$ be the eigenvector centrality of the $i$th node $v_i$. Then the vector $x = (x_1, x_2, \dots, x_n)^T$ is the solution of the equation $Ax = \lambda x$, where $\lambda$ is the greatest eigenvalue of $A$.

centrality (or closeness centrality for short) [Ba50, Be65]. The closeness centrality of a vertex in a graph is the inverse of the average shortest-path distance from the vertex to any other vertex in the graph. It can be viewed as the efficiency of each vertex (individual) in spreading information to all other vertices. The larger the closeness centrality of a vertex, the shorter the average distance from the vertex to any other vertex, and thus the better positioned the vertex is in spreading information to other vertices.
The closeness centralities of all vertices can be calculated by solving the all-pairs shortest-paths problem, which can be solved by various algorithms taking $O(nm + n^2 \log n)$ time [Jo77, FT87], where $n$ is the number of vertices and $m$ is the number of edges of the graph. However, these algorithms are not efficient enough for large-scale social networks with millions or more vertices. In [EW04], Eppstein and Wang developed an approximation algorithm to calculate the closeness centrality in $O(\frac{\log n}{\epsilon^2}(n \log n + m))$ time within an additive error of $\epsilon\Delta$ for the inverse of the closeness centrality (with probability at least $1 - \frac{1}{n}$), where $\epsilon > 0$ and $\Delta$ is the diameter of the graph.
However, applications may be more interested in ranking vertices with high closeness centralities than in the actual values of the closeness centralities of all vertices. Suppose we want to use the approximation algorithm of [EW04] to rank the closeness centralities of all vertices. Since the average shortest-path distances are bounded above by $\Delta$, the average difference in average distance (the inverse of closeness centrality) between the $i$th-ranked vertex and the $(i+1)$th-ranked vertex (for any $i = 1, \dots, n-1$) is $O(\frac{\Delta}{n})$. To obtain a reasonable ranking result, we would like to control the additive error of each estimate of closeness centrality to within $O(\frac{\Delta}{n})$, which means we set $\epsilon$ to $\Theta(\frac{1}{n})$. Then the algorithm takes $\Theta(n^2 \log n \,(n \log n + m))$ time, which is worse than the exact algorithm.
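To see where this last bound comes from, substitute $\epsilon = \Theta(\frac{1}{n})$ into the sample size $\ell = \Theta(\frac{\log n}{\epsilon^2})$ used by the approximation algorithm:

```latex
\ell \;=\; \Theta\!\left(\frac{\log n}{\epsilon^{2}}\right)
     \;=\; \Theta\!\left(\frac{\log n}{(1/n)^{2}}\right)
     \;=\; \Theta\!\left(n^{2}\log n\right)
\quad\Longrightarrow\quad
\ell\,(n\log n + m) \;=\; \Theta\!\left(n^{2}\log n\,(n\log n + m)\right).
```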
Therefore, we cannot use either purely the exact algorithm or purely the approximation algorithm to rank the closeness centralities of vertices. In this paper, we show a method of ranking the top-k highest closeness centrality vertices, combining the approximation algorithm and the exact algorithm. We first provide a basic ranking algorithm TOPRANK(k), and show that under certain conditions, the algorithm ranks all top-k highest closeness centrality vertices (with high probability) in $O((k + n^{2/3} \cdot \log^{1/3} n)(n \log n + m))$ time, which is better than $O(n(n \log n + m))$ (when $k = o(n)$), the time needed by a brute-force algorithm that simply computes all average shortest distances and then ranks them. We then use a heuristic to further improve the algorithm. Our work can be viewed as the first step toward designing and evaluating efficient algorithms for finding top-ranking vertices with the highest closeness centralities. We discuss at the end several open problems and future directions of this work.

⁵ (continued) $Ax = \lambda x$, where $\lambda$ is the greatest eigenvalue of $A$, which by the Perron–Frobenius theorem ensures that all values $x_i$ are positive. Google's PageRank [BP98] is a variant of the eigenvector centrality measure. The PageRank vector $R = (r_1, r_2, \dots, r_n)^T$, where $r_i$ is the PageRank of webpage $i$ and $n$ is the total number of webpages, is the solution of the equation $R = \frac{1-d}{n} \cdot \mathbf{1} + dLR$. Here $d$ is a damping factor set around 0.85, and $L$ is a modified webpage-adjacency matrix: $l_{i,j} = 0$ if page $j$ does not link to page $i$, normalised such that, for each $j$, $\sum_{i=1}^{n} l_{i,j} = 1$, i.e., $l_{i,j} = \frac{a_{i,j}}{d_j}$, where $a_{i,j} = 1$ only if page $j$ has a link to page $i$, and $d_j = \sum_{i=1}^{n} a_{i,j}$ is the out-degree of page $j$.
2 Preliminary
We consider a connected weighted undirected graph $G = (V, E)$ with $n$ vertices and $m$ edges ($|V| = n$, $|E| = m$). We use $d(v, u)$ to denote the length of a shortest path between $v$ and $u$, and $\Delta$ to denote the diameter of graph $G$, i.e., $\Delta = \max_{v,u \in V} d(v, u)$. The closeness centrality $c_v$ of vertex $v$ [Be65] is defined as

$$c_v = \frac{n - 1}{\sum_{u \in V} d(v, u)}. \qquad (2.1)$$
In other words, the closeness centrality of $v$ is the inverse of the average (shortest-path) distance from $v$ to any other vertex in the graph. The higher the $c_v$, the shorter the average distance from $v$ to other vertices, and the more important $v$ is by this measure. Other definitions of closeness centrality exist. For example, some define the closeness centrality of a vertex $v$ as $\frac{1}{\sum_{u \in V} d(v, u)}$ [Sa66], and some define the closeness centrality as the mean geodesic distance (i.e., the shortest-path distance) between a vertex $v$ and all other vertices reachable from it, i.e., $\frac{\sum_{u \in V} d(v, u)}{n - 1}$, where $n \ge 2$ is the size of the network's connected component $V$ reachable from $v$. In this paper, we will focus on the closeness centrality defined in Equation (2.1).
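To make Equation (2.1) concrete, the sketch below (our own illustrative code, not the paper's; the graph representation and function names are assumptions) computes $c_v$ for one vertex with a single run of Dijkstra's algorithm:

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from source in a weighted undirected graph
    given as {vertex: [(neighbor, weight), ...]}."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float('inf')):
            continue  # stale heap entry
        for u, w in graph[v]:
            if d + w < dist.get(u, float('inf')):
                dist[u] = d + w
                heapq.heappush(heap, (d + w, u))
    return dist

def closeness(graph, v):
    """Closeness centrality per Equation (2.1): (n - 1) / sum of distances."""
    dist = dijkstra(graph, v)
    n = len(graph)
    return (n - 1) / sum(dist[u] for u in graph if u != v)

# Path graph a - b - c with unit weights: b is the most central vertex.
g = {'a': [('b', 1)], 'b': [('a', 1), ('c', 1)], 'c': [('b', 1)]}
print(closeness(g, 'b'))  # 2 / (1 + 1) = 1.0
print(closeness(g, 'a'))  # 2 / (1 + 2) ≈ 0.667
```

On the path graph, the middle vertex has the smaller distance sum and hence the larger closeness, matching the intuition above.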
The problem to solve in this paper is to find the top-k vertices with the highest closeness centralities and rank them, where $k$ is a parameter of the algorithm. To solve this problem, we combine the exact algorithm [Jo77, FT87] and the approximation algorithm [EW04] for computing average shortest-path distances to rank vertices by closeness centrality. The exact algorithm iterates Dijkstra's single-source shortest-paths (SSSP for short) algorithm $n$ times, once for each of the $n$ vertices, to compute the average shortest-path distances. The original Dijkstra's SSSP algorithm [Di59] computes all shortest-path distances from one vertex, and it can be efficiently implemented in $O(n \log n + m)$ time [Jo77, FT87].

The approximation algorithm RAND given in [EW04] also uses Dijkstra's SSSP algorithm. RAND samples $\ell$ vertices uniformly at random and computes SSSP from each sampled vertex. RAND estimates the closeness centrality of a vertex using the average of the $\ell$ shortest-path distances from the vertex to the $\ell$ sampled vertices instead of to all $n$ vertices. The following bound on the accuracy of the approximation is given in [EW04], which utilizes Hoeffding's theorem [Ho63]:
$$\Pr\left\{\left|\frac{1}{\hat{c}_v} - \frac{1}{c_v}\right| \ge \epsilon\Delta\right\} \le \frac{2}{n^{\frac{2\ell\epsilon^2}{\log n}\left(\frac{n-1}{n}\right)^2}}, \qquad (2.2)$$

for any small positive value $\epsilon$, where $\hat{c}_v$ is the estimated closeness centrality of vertex $v$. Let $a_v$ be the average shortest-path distance of vertex $v$, i.e.,

$$a_v = \frac{\sum_{u \in V} d(v, u)}{n - 1} = \frac{1}{c_v}.$$
Using the average distance, inequality (2.2) can be rewritten as

$$\Pr\{|\hat{a}_v - a_v| \ge \epsilon\Delta\} \le \frac{2}{n^{\frac{2\ell\epsilon^2}{\log n}\left(\frac{n-1}{n}\right)^2}}, \qquad (2.3)$$
where $\hat{a}_v$ is the estimated average distance of vertex $v$ to all other vertices. If the algorithm uses $\ell = \alpha \frac{\log n}{\epsilon^2}$ samples (where $\alpha > 1$ is a constant), which bounds the probability of an $\epsilon\Delta$ error at each vertex above by $\frac{1}{n^2}$, then the probability of an $\epsilon\Delta$ error anywhere in the graph is bounded above by $\frac{1}{n}$ ($\ge 1 - (1 - \frac{1}{n^2})^n$). This means that the approximation algorithm calculates the average shortest-path lengths of all vertices in $O(\frac{\log n}{\epsilon^2}(n \log n + m))$ time within an additive error of $\epsilon\Delta$ with probability at least $1 - \frac{1}{n}$, i.e., with high probability (w.h.p.).
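As a concrete illustration, the RAND estimator described above can be sketched as follows (our own code, not the paper's; the scaling factor $\frac{n}{\ell(n-1)}$, which makes the estimate unbiased for $a_v$, follows the estimator of [EW04]):

```python
import heapq, random

def dijkstra(graph, source):
    """Shortest-path distances from source; graph is {v: [(u, w), ...]}."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float('inf')):
            continue
        for u, w in graph[v]:
            if d + w < dist.get(u, float('inf')):
                dist[u] = d + w
                heapq.heappush(heap, (d + w, u))
    return dist

def rand_estimate(graph, num_samples, rng=random):
    """Estimate the average shortest-path distance a_v of every vertex
    from the SSSP trees of num_samples uniformly sampled vertices."""
    n = len(graph)
    vertices = list(graph)
    samples = [rng.choice(vertices) for _ in range(num_samples)]
    total = {v: 0.0 for v in graph}
    for s in samples:
        dist = dijkstra(graph, s)          # one SSSP per sample
        for v in graph:
            total[v] += dist[v]
    # Scale so that the expected value of the estimate equals a_v.
    return {v: n * total[v] / (num_samples * (n - 1)) for v in graph}

# Path graph a - b - c: true average distances are 1.5, 1.0, 1.5.
g = {'a': [('b', 1)], 'b': [('a', 1), ('c', 1)], 'c': [('b', 1)]}
est = rand_estimate(g, 400, rng=random.Random(7))
```

With enough samples the estimates concentrate around the true averages, at the cost of one SSSP computation per sample rather than per vertex.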
3 Ranking algorithms
Our top-k ranking algorithm is based on the approximation algorithm as well as the exact algorithm. The idea is to first use the approximation algorithm with $\ell$ samples to obtain estimated average distances of all vertices, and find a candidate set $E$ of the top-$k'$ vertices with the smallest estimated average distances. We need to guarantee that all final top-k vertices with the exact average shortest distances are included in set $E$ with high probability. Thus, we need to carefully choose a number $k' > k$ using the bound given in formula (2.3). Once we find set $E$, we can use the exact algorithm to compute the exact average distances for all vertices in $E$ and rank them accordingly to find the final top-k vertices with the highest closeness centralities. The key of the algorithm is to find the right balance between the sample size $\ell$ and the candidate set size $k'$: if we use too small a sample size $\ell$, the candidate set size $k'$ could be too large, but if we try to make $k'$ small, the sample size $\ell$ may be too large. Ideally, we want an optimal $\ell$ that minimizes $\ell + k'$, so that the total time of both the approximation algorithm and the computation of exact closeness centralities of vertices in the candidate set is minimized. In this section we will show the basic algorithm first, and then provide a further improvement of the algorithm with a heuristic.
3.1 Basic ranking algorithm
We name the vertices in $V$ as $v_1, v_2, \dots, v_n$ such that $a_{v_1} \le a_{v_2} \le \cdots \le a_{v_n}$. Let $\hat{a}_v$ be the estimated average distance of vertex $v$ using the sampling-based approximation algorithm. Figure 1 shows our basic ranking algorithm TOPRANK(k), where $k$ is the input parameter specifying the number of top-ranking vertices the algorithm should extract. The algorithm also has a configuration parameter $\ell$, which is the number of samples used by the RAND algorithm in the first step. We will specify the value of $\ell$ in Lemma 2. The function $f(\ell)$ in step 4 is defined as $f(\ell) = \alpha' \sqrt{\frac{\log n}{\ell}}$ (where $\alpha' > 1$ is a constant), such that the probability of the estimation error for any vertex being at least $f(\ell) \cdot \Delta$ is bounded above by $\frac{1}{2n^2}$, based on inequality (2.3) (setting $\epsilon = f(\ell)$).
Algorithm TOPRANK(k)
1. Use the approximation algorithm RAND with a set $S$ of $\ell$ sampled vertices to obtain the estimated average distance $\hat{a}_v$ for every vertex $v$.
   // Rename all vertices to $\hat{v}_1, \hat{v}_2, \dots, \hat{v}_n$ such that $\hat{a}_{\hat{v}_1} \le \hat{a}_{\hat{v}_2} \le \cdots \le \hat{a}_{\hat{v}_n}$.
2. Find $\hat{v}_k$.
3. Let $\hat{\Delta} = 2 \min_{u \in S} \max_{v \in V} d(u, v)$.
   // $d(u, v)$ for all $u \in S$, $v \in V$ have been calculated at step 1, so $\hat{\Delta}$ is determined in $O(\ell n)$ time.
4. Compute the candidate set $E$ as the set of vertices whose estimated average distances are less than or equal to $\hat{a}_{\hat{v}_k} + 2f(\ell) \cdot \hat{\Delta}$.
5. Calculate the exact average shortest-path distances of all vertices in $E$.
6. Sort the exact average distances and output the top-k vertices.

Fig. 1. Algorithm for ranking the top-k vertices with the highest closeness centralities.
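The six steps of Figure 1 can be rendered as the following sketch (an illustrative, unoptimized rendering, not the paper's code; the helper names and the concrete choices of `num_samples` and `alpha` are our own):

```python
import heapq, math, random

def dijkstra(graph, source):
    """Shortest-path distances from source; graph is {v: [(u, w), ...]}."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float('inf')):
            continue
        for u, w in graph[v]:
            if d + w < dist.get(u, float('inf')):
                dist[u] = d + w
                heapq.heappush(heap, (d + w, u))
    return dist

def toprank(graph, k, num_samples, alpha=1.0, rng=random):
    """Sketch of TOPRANK(k) from Figure 1."""
    n = len(graph)
    # Step 1: RAND with num_samples sampled vertices; keep each SSSP tree.
    samples = [rng.choice(list(graph)) for _ in range(num_samples)]
    sssp = {s: dijkstra(graph, s) for s in set(samples)}
    a_hat = {v: n * sum(sssp[s][v] for s in samples) / (num_samples * (n - 1))
             for v in graph}
    # Step 2: the k-th smallest estimated average distance.
    order = sorted(graph, key=lambda v: a_hat[v])
    kth = a_hat[order[k - 1]]
    # Step 3: diameter upper bound from the sampled SSSP trees.
    delta_hat = 2 * min(max(d.values()) for d in sssp.values())
    # Step 4: candidate set E, using f(l) = alpha * sqrt(log n / l).
    f = alpha * math.sqrt(math.log(n) / num_samples)
    E = [v for v in graph if a_hat[v] <= kth + 2 * f * delta_hat]
    # Steps 5-6: exact average distances for candidates only, then rank.
    exact = {v: sum(dijkstra(graph, v).values()) / (n - 1) for v in E}
    return sorted(E, key=lambda v: exact[v])[:k]

# Star graph: the center has the smallest average distance.
star = {'c': [('l1', 1), ('l2', 1), ('l3', 1), ('l4', 1)],
        'l1': [('c', 1)], 'l2': [('c', 1)],
        'l3': [('c', 1)], 'l4': [('c', 1)]}
print(toprank(star, 1, num_samples=5, rng=random.Random(0)))  # ['c']
```

Note that the final ranking uses exact average distances, so the sampling error only affects which vertices enter the candidate set, not their final order.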
Lemma 1. Algorithm TOPRANK(k) given in Figure 1 ranks the top-k vertices with the highest closeness centralities correctly w.h.p., for any configuration parameter $\ell$.

Proof. We show that the set $E$ computed at step 4 of algorithm TOPRANK(k) contains all top-k vertices with the exact average shortest distances w.h.p. Let $T = \{v_1, \dots, v_k\}$ and $\hat{T} = \{\hat{v}_1, \dots, \hat{v}_k\}$. Since for any vertex $v$ the probability of the estimate $\hat{a}_v$ exceeding the error range of $f(\ell) \cdot \Delta$ is bounded above by $\frac{1}{2n^2}$, i.e., $\Pr(\neg\{a_v - f(\ell) \cdot \Delta \le \hat{a}_v \le a_v + f(\ell) \cdot \Delta\}) \le \frac{1}{2n^2}$, we have

$$\Pr\Big(\neg\Big\{\bigwedge_{v \in T} \hat{a}_v \le a_v + f(\ell) \cdot \Delta \le a_{v_k} + f(\ell) \cdot \Delta\Big\}\Big) \le \frac{k}{2n^2}; \text{ and}$$

$$\Pr\Big(\neg\Big\{\bigwedge_{\hat{v} \in \hat{T}} a_{\hat{v}} \le \hat{a}_{\hat{v}} + f(\ell) \cdot \Delta \le \hat{a}_{\hat{v}_k} + f(\ell) \cdot \Delta\Big\}\Big) \le \frac{k}{2n^2}.$$

The latter inequality means that, with error probability at most $\frac{k}{2n^2}$, there are at least $k$ vertices whose real average distances are less than or equal to $\hat{a}_{\hat{v}_k} + f(\ell) \cdot \Delta$, which means $a_{v_k} \le \hat{a}_{\hat{v}_k} + f(\ell) \cdot \Delta$ with error probability bounded above by $\frac{k}{2n^2}$. Then $\hat{a}_v \le a_{v_k} + f(\ell) \cdot \Delta \le \hat{a}_{\hat{v}_k} + 2f(\ell) \cdot \Delta$ for all $v \in T$ with error probability bounded above by $\frac{k}{n^2}$. Moreover, we have $\Delta \le \hat{\Delta}$, because for any $u \in S$,

$$\Delta = \max_{v,v' \in V} d(v, v') \le \max_{v,v' \in V} \big(d(u, v) + d(u, v')\big) = \max_{v \in V} d(u, v) + \max_{v' \in V} d(u, v') = 2 \max_{v \in V} d(u, v).$$
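The chain of inequalities above is simply the triangle inequality applied through the sampled vertex $u$. A quick numerical sanity check on an arbitrary small weighted graph (our own example, not from the paper) confirms that twice the eccentricity of any vertex bounds the diameter:

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from source; graph is {v: [(u, w), ...]}."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float('inf')):
            continue
        for u, w in graph[v]:
            if d + w < dist.get(u, float('inf')):
                dist[u] = d + w
                heapq.heappush(heap, (d + w, u))
    return dist

# Triangle with weights 1, 2, 4: the heavy edge is bypassed via b.
g = {'a': [('b', 1), ('c', 4)],
     'b': [('a', 1), ('c', 2)],
     'c': [('a', 4), ('b', 2)]}
dist = {v: dijkstra(g, v) for v in g}
diameter = max(max(d.values()) for d in dist.values())
for u in g:  # any vertex could play the role of the sampled vertex u
    assert diameter <= 2 * max(dist[u].values())
print(diameter)  # 3
```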

References
[BP98] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. 1998.
[Di59] E. W. Dijkstra. A note on two problems in connexion with graphs. 1959.
[Fr79] L. C. Freeman. Centrality in social networks: conceptual clarification. 1979.
[Ho63] W. Hoeffding. Probability inequalities for sums of bounded random variables. 1963.