Open Access Book Chapter (DOI)

Ranking of Closeness Centrality for Large-Scale Social Networks


Kazuya Okamoto¹, Wei Chen², and Xiang-Yang Li³

¹ Kyoto University, okia@kuis.kyoto-u.ac.jp
² Microsoft Research Asia, weic@microsoft.com
³ Illinois Institute of Technology and Microsoft Research Asia, xli@cs.iit.edu
Abstract. Closeness centrality is an important concept in social network analysis. In a graph representing a social network, closeness centrality measures how close a vertex is to all other vertices in the graph. In this paper, we combine existing methods on calculating exact values and approximate values of closeness centrality and present new algorithms to rank the top-k vertices with the highest closeness centrality. We show that under certain conditions, our algorithm is more efficient than the algorithm that calculates the closeness centralities of all vertices.
1 Introduction
Social networks have been the subject of study for many decades in social science research. In recent years, with the rapid growth of the Internet and the World Wide Web, many large-scale online social networks such as Facebook and Friendster have appeared, and many large-scale social network data sets, such as coauthorship networks, have become easily available online for analysis [Ne04a, Ne04b, EL05, PP02]. A social network is typically represented as a graph, with individual persons represented as vertices, the relationships between pairs of individuals as edges, and the strengths of the relationships represented as the weights on edges (for the purpose of finding the shortest weighted distance, we can treat lower-weight edges as stronger relationships). Centrality is an important concept in studying social networks [Fr79, NP03]. Conceptually, centrality measures how centrally an individual is positioned in a social network. Within graph theory and network analysis, various measures (see [KL05] for details) of the centrality of a vertex within a graph have been proposed to determine the relative importance of a vertex within the graph. Four measures of centrality that are widely used in network analysis are degree centrality, betweenness centrality⁴, closeness centrality, and eigenvector centrality⁵. In this paper, we focus on shortest-path closeness
⁴ For a graph $G = (V, E)$, the betweenness centrality $C_B(v)$ of a vertex $v$ is $C_B(v) = \sum_{s,t:\, s \neq t \neq v} \frac{\sigma_v(s,t)}{\sigma(s,t)}$, where $\sigma(s,t)$ is the number of shortest paths from $s$ to $t$, and $\sigma_v(s,t)$ is the number of shortest paths from $s$ to $t$ that pass through $v$.
⁵ Given a graph $G = (V, E)$ with adjacency matrix $A$, let $x_i$ be the eigenvector centrality of the $i$th node $v_i$. Then the vector $x = (x_1, x_2, \dots, x_n)^T$ is the solution of the equation $Ax = \lambda x$, where $\lambda$ is the greatest eigenvalue of $A$.

centrality (or closeness centrality for short) [Ba50, Be65]. The closeness centrality of a vertex in a graph is the inverse of the average shortest-path distance from the vertex to any other vertex in the graph. It can be viewed as the efficiency of each vertex (individual) in spreading information to all other vertices. The larger the closeness centrality of a vertex, the shorter the average distance from the vertex to any other vertex, and thus the better positioned the vertex is in spreading information to other vertices.
The closeness centralities of all vertices can be calculated by solving the all-pairs shortest-paths problem, which can be solved by various algorithms taking $O(nm + n^2 \log n)$ time [Jo77, FT87], where $n$ is the number of vertices and $m$ is the number of edges of the graph. However, these algorithms are not efficient enough for large-scale social networks with millions or more vertices. In [EW04], Eppstein and Wang developed an approximation algorithm to calculate the closeness centrality in $O(\frac{\log n}{\epsilon^2}(n \log n + m))$ time within an additive error of $\epsilon\Delta$ for the inverse of the closeness centrality (with probability at least $1 - \frac{1}{n}$), where $\epsilon > 0$ and $\Delta$ is the diameter of the graph.
However, applications may be more interested in ranking vertices with high closeness centralities than in the actual values of the closeness centralities of all vertices. Suppose we want to use the approximation algorithm of [EW04] to rank the closeness centralities of all vertices. Since the average shortest-path distances are bounded above by $\Delta$, the average difference in average distance (the inverse of closeness centrality) between the $i$th-ranked vertex and the $(i+1)$th-ranked vertex (for any $i = 1, \dots, n-1$) is $O(\frac{\Delta}{n})$. To obtain a reasonable ranking result, we would like to control the additive error of each estimate of closeness centrality to within $O(\frac{\Delta}{n})$, which means we set $\epsilon$ to $\Theta(\frac{1}{n})$. Then the algorithm takes $\Theta(n^2 \log n \,(n \log n + m))$ time, which is worse than the exact algorithm.
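To see where this last bound comes from, substitute $\epsilon = \Theta(\frac{1}{n})$ into the sample size $\ell = \Theta(\frac{\log n}{\epsilon^2})$ used by the approximation algorithm:

```latex
\ell \;=\; \Theta\!\left(\frac{\log n}{\epsilon^{2}}\right)
     \;=\; \Theta\!\left(\frac{\log n}{(1/n)^{2}}\right)
     \;=\; \Theta\!\left(n^{2}\log n\right)
\quad\Longrightarrow\quad
\ell\,(n\log n + m) \;=\; \Theta\!\left(n^{2}\log n\,(n\log n + m)\right).
```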
Therefore, we cannot use either purely the exact algorithm or purely the approximation algorithm to rank the closeness centralities of vertices. In this paper, we show a method of ranking the top-k highest closeness centrality vertices, combining the approximation algorithm and the exact algorithm. We first provide a basic ranking algorithm TOPRANK(k), and show that under certain conditions, the algorithm ranks all top-k highest closeness centrality vertices (with high probability) in $O((k + n^{2/3} \cdot \log^{1/3} n)(n \log n + m))$ time, which is better than $O(n(n \log n + m))$ (when $k = o(n)$), the time needed by a brute-force algorithm that simply computes all average shortest distances and then ranks them. We then use a heuristic to further improve the algorithm. Our work can be viewed as the first step toward designing and evaluating efficient algorithms for finding top-ranking vertices with the highest closeness centralities. We discuss at the end several open problems and future directions of this work.

⁵ (continued) $Ax = \lambda x$, where $\lambda$ is the greatest eigenvalue of $A$, which by the Perron–Frobenius theorem ensures that all values $x_i$ are positive. Google's PageRank [BP98] is a variant of the eigenvector centrality measure. The PageRank vector $R = (r_1, r_2, \dots, r_n)^T$, where $r_i$ is the PageRank of webpage $i$ and $n$ is the total number of webpages, is the solution of the equation $R = \frac{1-d}{n} \cdot \mathbf{1} + dLR$. Here $d$ is a damping factor set around 0.85, and $L$ is a modified webpage-adjacency matrix: $l_{i,j} = 0$ if page $j$ does not link to page $i$, normalised such that, for each $j$, $\sum_{i=1}^{n} l_{i,j} = 1$, i.e., $l_{i,j} = \frac{a_{i,j}}{d_j}$, where $a_{i,j} = 1$ only if page $j$ has a link to page $i$, and $d_j = \sum_{i=1}^{n} a_{i,j}$ is the out-degree of page $j$.
2 Preliminary
We consider a connected weighted undirected graph $G = (V, E)$ with $n$ vertices and $m$ edges ($|V| = n$, $|E| = m$). We use $d(v, u)$ to denote the length of a shortest path between $v$ and $u$, and $\Delta$ to denote the diameter of graph $G$, i.e., $\Delta = \max_{v,u \in V} d(v, u)$. The closeness centrality $c_v$ of vertex $v$ [Be65] is defined as

$$c_v = \frac{n - 1}{\sum_{u \in V} d(v, u)}. \qquad (2.1)$$
In other words, the closeness centrality of $v$ is the inverse of the average (shortest-path) distance from $v$ to any other vertex in the graph. The higher the $c_v$, the shorter the average distance from $v$ to other vertices, and the more important $v$ is by this measure. Other definitions of closeness centrality exist. For example, some define the closeness centrality of a vertex $v$ as $\frac{1}{\sum_{u \in V} d(v, u)}$ [Sa66], and some define the closeness centrality as the mean geodesic distance (i.e., the shortest-path distance) between a vertex $v$ and all other vertices reachable from it, i.e., $\frac{\sum_{u \in V} d(v, u)}{n - 1}$, where $n \ge 2$ is the size of the network's connected component $V$ reachable from $v$. In this paper, we will focus on the closeness centrality defined in Equation (2.1).
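To make Equation (2.1) concrete, the sketch below (our own illustrative code, not the paper's; the graph representation and function names are assumptions) computes $c_v$ for one vertex with a single run of Dijkstra's algorithm:

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from source in a weighted undirected graph
    given as {vertex: [(neighbor, weight), ...]}."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float('inf')):
            continue  # stale heap entry
        for u, w in graph[v]:
            if d + w < dist.get(u, float('inf')):
                dist[u] = d + w
                heapq.heappush(heap, (d + w, u))
    return dist

def closeness(graph, v):
    """Closeness centrality per Equation (2.1): (n - 1) / sum of distances."""
    dist = dijkstra(graph, v)
    n = len(graph)
    return (n - 1) / sum(dist[u] for u in graph if u != v)

# Path graph a - b - c with unit weights: b is the most central vertex.
g = {'a': [('b', 1)], 'b': [('a', 1), ('c', 1)], 'c': [('b', 1)]}
print(closeness(g, 'b'))  # 2 / (1 + 1) = 1.0
print(closeness(g, 'a'))  # 2 / (1 + 2) ≈ 0.667
```

On the path graph, the middle vertex has the smaller distance sum and hence the larger closeness, matching the intuition above.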
The problem to solve in this paper is to find the top-k vertices with the highest closeness centralities and rank them, where $k$ is a parameter of the algorithm. To solve this problem, we combine the exact algorithm [Jo77, FT87] and the approximation algorithm [EW04] for computing average shortest-path distances to rank vertices by closeness centrality. The exact algorithm iterates Dijkstra's single-source shortest-paths (SSSP for short) algorithm $n$ times, once for each of the $n$ vertices, to compute the average shortest-path distances. The original Dijkstra's SSSP algorithm [Di59] computes all shortest-path distances from one vertex, and it can be efficiently implemented in $O(n \log n + m)$ time [Jo77, FT87].

The approximation algorithm RAND given in [EW04] also uses Dijkstra's SSSP algorithm. RAND samples $\ell$ vertices uniformly at random and computes SSSP from each sampled vertex. RAND estimates the closeness centrality of a vertex using the average of the $\ell$ shortest-path distances from the vertex to the $\ell$ sampled vertices instead of to all $n$ vertices. The following bound on the accuracy of the approximation is given in [EW04], which utilizes Hoeffding's theorem [Ho63]:
$$\Pr\left\{\left|\frac{1}{\hat{c}_v} - \frac{1}{c_v}\right| \ge \epsilon\Delta\right\} \le \frac{2}{n^{\frac{2\ell\epsilon^2}{\log n}\left(\frac{n-1}{n}\right)^2}}, \qquad (2.2)$$

for any small positive value $\epsilon$, where $\hat{c}_v$ is the estimated closeness centrality of vertex $v$. Let $a_v$ be the average shortest-path distance of vertex $v$, i.e.,

$$a_v = \frac{\sum_{u \in V} d(v, u)}{n - 1} = \frac{1}{c_v}.$$
Using the average distance, inequality (2.2) can be rewritten as

$$\Pr\{|\hat{a}_v - a_v| \ge \epsilon\Delta\} \le \frac{2}{n^{\frac{2\ell\epsilon^2}{\log n}\left(\frac{n-1}{n}\right)^2}}, \qquad (2.3)$$
where $\hat{a}_v$ is the estimated average distance of vertex $v$ to all other vertices. If the algorithm uses $\ell = \alpha \frac{\log n}{\epsilon^2}$ samples (where $\alpha > 1$ is a constant), which bounds the probability of an $\epsilon\Delta$ error at each vertex above by $\frac{1}{n^2}$, then the probability of an $\epsilon\Delta$ error anywhere in the graph is bounded above by $\frac{1}{n}$ ($\ge 1 - (1 - \frac{1}{n^2})^n$). This means that the approximation algorithm calculates the average shortest-path lengths of all vertices in $O(\frac{\log n}{\epsilon^2}(n \log n + m))$ time within an additive error of $\epsilon\Delta$ with probability at least $1 - \frac{1}{n}$, i.e., with high probability (w.h.p.).
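As a concrete illustration, the RAND estimator described above can be sketched as follows (our own code, not the paper's; the scaling factor $\frac{n}{\ell(n-1)}$, which makes the estimate unbiased for $a_v$, follows the estimator of [EW04]):

```python
import heapq, random

def dijkstra(graph, source):
    """Shortest-path distances from source; graph is {v: [(u, w), ...]}."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float('inf')):
            continue
        for u, w in graph[v]:
            if d + w < dist.get(u, float('inf')):
                dist[u] = d + w
                heapq.heappush(heap, (d + w, u))
    return dist

def rand_estimate(graph, num_samples, rng=random):
    """Estimate the average shortest-path distance a_v of every vertex
    from the SSSP trees of num_samples uniformly sampled vertices."""
    n = len(graph)
    vertices = list(graph)
    samples = [rng.choice(vertices) for _ in range(num_samples)]
    total = {v: 0.0 for v in graph}
    for s in samples:
        dist = dijkstra(graph, s)          # one SSSP per sample
        for v in graph:
            total[v] += dist[v]
    # Scale so that the expected value of the estimate equals a_v.
    return {v: n * total[v] / (num_samples * (n - 1)) for v in graph}

# Path graph a - b - c: true average distances are 1.5, 1.0, 1.5.
g = {'a': [('b', 1)], 'b': [('a', 1), ('c', 1)], 'c': [('b', 1)]}
est = rand_estimate(g, 400, rng=random.Random(7))
```

With enough samples the estimates concentrate around the true averages, at the cost of one SSSP computation per sample rather than per vertex.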
3 Ranking algorithms
Our top-k ranking algorithm is based on the approximation algorithm as well as the exact algorithm. The idea is to first use the approximation algorithm with $\ell$ samples to obtain estimated average distances of all vertices, and find a candidate set $E$ of the top-$k'$ vertices with the smallest estimated average distances. We need to guarantee that all final top-k vertices with the exact average shortest distances are included in set $E$ with high probability. Thus, we need to carefully choose a number $k' > k$ using the bound given in formula (2.3). Once we find set $E$, we can use the exact algorithm to compute the exact average distances for all vertices in $E$ and rank them accordingly to find the final top-k vertices with the highest closeness centralities. The key of the algorithm is to find the right balance between the sample size $\ell$ and the candidate set size $k'$: if we use too small a sample size $\ell$, the candidate set size $k'$ could be too large, but if we try to make $k'$ small, the sample size $\ell$ may be too large. Ideally, we want an optimal $\ell$ that minimizes $\ell + k'$, so that the total time of both the approximation algorithm and the computation of exact closeness centralities of vertices in the candidate set is minimized. In this section we will show the basic algorithm first, and then provide a further improvement of the algorithm with a heuristic.
3.1 Basic ranking algorithm
We name the vertices in $V$ as $v_1, v_2, \dots, v_n$ such that $a_{v_1} \le a_{v_2} \le \cdots \le a_{v_n}$. Let $\hat{a}_v$ be the estimated average distance of vertex $v$ using the sampling-based approximation algorithm. Figure 1 shows our basic ranking algorithm TOPRANK(k), where $k$ is the input parameter specifying the number of top-ranking vertices the algorithm should extract. The algorithm also has a configuration parameter $\ell$, which is the number of samples used by the RAND algorithm in the first step. We will specify the value of $\ell$ in Lemma 2. The function $f(\ell)$ in step 4 is defined as $f(\ell) = \alpha' \sqrt{\frac{\log n}{\ell}}$ (where $\alpha' > 1$ is a constant), such that the probability of the estimation error for any vertex being at least $f(\ell) \cdot \Delta$ is bounded above by $\frac{1}{2n^2}$, based on inequality (2.3) (setting $\epsilon = f(\ell)$).
Algorithm TOPRANK(k)
1. Use the approximation algorithm RAND with a set $S$ of $\ell$ sampled vertices to obtain the estimated average distance $\hat{a}_v$ for every vertex $v$.
   // Rename all vertices to $\hat{v}_1, \hat{v}_2, \dots, \hat{v}_n$ such that $\hat{a}_{\hat{v}_1} \le \hat{a}_{\hat{v}_2} \le \cdots \le \hat{a}_{\hat{v}_n}$.
2. Find $\hat{v}_k$.
3. Let $\hat{\Delta} = 2 \min_{u \in S} \max_{v \in V} d(u, v)$.
   // $d(u, v)$ for all $u \in S$, $v \in V$ have been calculated at step 1, so $\hat{\Delta}$ is determined in $O(\ell n)$ time.
4. Compute the candidate set $E$ as the set of vertices whose estimated average distances are less than or equal to $\hat{a}_{\hat{v}_k} + 2f(\ell) \cdot \hat{\Delta}$.
5. Calculate the exact average shortest-path distances of all vertices in $E$.
6. Sort the exact average distances and output the top-k vertices.

Fig. 1. Algorithm for ranking the top-k vertices with the highest closeness centralities.
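The six steps of Figure 1 can be rendered as the following sketch (an illustrative, unoptimized rendering, not the paper's code; the helper names and the concrete choices of `num_samples` and `alpha` are our own):

```python
import heapq, math, random

def dijkstra(graph, source):
    """Shortest-path distances from source; graph is {v: [(u, w), ...]}."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float('inf')):
            continue
        for u, w in graph[v]:
            if d + w < dist.get(u, float('inf')):
                dist[u] = d + w
                heapq.heappush(heap, (d + w, u))
    return dist

def toprank(graph, k, num_samples, alpha=1.0, rng=random):
    """Sketch of TOPRANK(k) from Figure 1."""
    n = len(graph)
    # Step 1: RAND with num_samples sampled vertices; keep each SSSP tree.
    samples = [rng.choice(list(graph)) for _ in range(num_samples)]
    sssp = {s: dijkstra(graph, s) for s in set(samples)}
    a_hat = {v: n * sum(sssp[s][v] for s in samples) / (num_samples * (n - 1))
             for v in graph}
    # Step 2: the k-th smallest estimated average distance.
    order = sorted(graph, key=lambda v: a_hat[v])
    kth = a_hat[order[k - 1]]
    # Step 3: diameter upper bound from the sampled SSSP trees.
    delta_hat = 2 * min(max(d.values()) for d in sssp.values())
    # Step 4: candidate set E, using f(l) = alpha * sqrt(log n / l).
    f = alpha * math.sqrt(math.log(n) / num_samples)
    E = [v for v in graph if a_hat[v] <= kth + 2 * f * delta_hat]
    # Steps 5-6: exact average distances for candidates only, then rank.
    exact = {v: sum(dijkstra(graph, v).values()) / (n - 1) for v in E}
    return sorted(E, key=lambda v: exact[v])[:k]

# Star graph: the center has the smallest average distance.
star = {'c': [('l1', 1), ('l2', 1), ('l3', 1), ('l4', 1)],
        'l1': [('c', 1)], 'l2': [('c', 1)],
        'l3': [('c', 1)], 'l4': [('c', 1)]}
print(toprank(star, 1, num_samples=5, rng=random.Random(0)))  # ['c']
```

Note that the final ranking uses exact average distances, so the sampling error only affects which vertices enter the candidate set, not their final order.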
Lemma 1. Algorithm TOPRANK(k) given in Figure 1 ranks the top-k vertices with the highest closeness centralities correctly w.h.p., for any configuration parameter $\ell$.

Proof. We show that the set $E$ computed at step 4 of algorithm TOPRANK(k) contains all top-k vertices with the exact average shortest distances w.h.p. Let $T = \{v_1, \dots, v_k\}$ and $\hat{T} = \{\hat{v}_1, \dots, \hat{v}_k\}$. Since for any vertex $v$ the probability of the estimate $\hat{a}_v$ exceeding the error range of $f(\ell) \cdot \Delta$ is bounded above by $\frac{1}{2n^2}$, i.e., $\Pr(\neg\{a_v - f(\ell) \cdot \Delta \le \hat{a}_v \le a_v + f(\ell) \cdot \Delta\}) \le \frac{1}{2n^2}$, we have

$$\Pr\Big(\neg\Big\{\bigwedge_{v \in T} \hat{a}_v \le a_v + f(\ell) \cdot \Delta \le a_{v_k} + f(\ell) \cdot \Delta\Big\}\Big) \le \frac{k}{2n^2}; \text{ and}$$

$$\Pr\Big(\neg\Big\{\bigwedge_{\hat{v} \in \hat{T}} a_{\hat{v}} \le \hat{a}_{\hat{v}} + f(\ell) \cdot \Delta \le \hat{a}_{\hat{v}_k} + f(\ell) \cdot \Delta\Big\}\Big) \le \frac{k}{2n^2}.$$

The latter inequality means that, with error probability at most $\frac{k}{2n^2}$, there are at least $k$ vertices whose real average distances are less than or equal to $\hat{a}_{\hat{v}_k} + f(\ell) \cdot \Delta$, which means $a_{v_k} \le \hat{a}_{\hat{v}_k} + f(\ell) \cdot \Delta$ with error probability bounded above by $\frac{k}{2n^2}$. Then $\hat{a}_v \le a_{v_k} + f(\ell) \cdot \Delta \le \hat{a}_{\hat{v}_k} + 2f(\ell) \cdot \Delta$ for all $v \in T$ with error probability bounded above by $\frac{k}{n^2}$. Moreover, we have $\Delta \le \hat{\Delta}$, because for any $u \in S$,

$$\Delta = \max_{v,v' \in V} d(v, v') \le \max_{v,v' \in V} \big(d(u, v) + d(u, v')\big) = \max_{v \in V} d(u, v) + \max_{v' \in V} d(u, v') = 2 \max_{v \in V} d(u, v).$$
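The chain of inequalities above is simply the triangle inequality applied through the sampled vertex $u$. A quick numerical sanity check on an arbitrary small weighted graph (our own example, not from the paper) confirms that twice the eccentricity of any vertex bounds the diameter:

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from source; graph is {v: [(u, w), ...]}."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float('inf')):
            continue
        for u, w in graph[v]:
            if d + w < dist.get(u, float('inf')):
                dist[u] = d + w
                heapq.heappush(heap, (d + w, u))
    return dist

# Triangle with weights 1, 2, 4: the heavy edge is bypassed via b.
g = {'a': [('b', 1), ('c', 4)],
     'b': [('a', 1), ('c', 2)],
     'c': [('a', 4), ('b', 2)]}
dist = {v: dijkstra(g, v) for v in g}
diameter = max(max(d.values()) for d in dist.values())
for u in g:  # any vertex could play the role of the sampled vertex u
    assert diameter <= 2 * max(dist[u].values())
print(diameter)  # 3
```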

References
[BP98] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. 1998.
[Di59] E. W. Dijkstra. A note on two problems in connexion with graphs. 1959.
[Fr79] L. C. Freeman. Centrality in social networks: conceptual clarification. 1979.
[Ho63] W. Hoeffding. Probability inequalities for sums of bounded random variables. 1963.