An Efficient Algorithm for Approximate Betweenness
Centrality Computation
Mostafa Haghir Chehreghani
Department of Computer Science, KU Leuven
Celestijnenlaan 200a - box 2402
3001 Leuven, Belgium
mostafa.haghirchehreghani@cs.kuleuven.be
ABSTRACT
Betweenness centrality is an important centrality measure widely used in social network analysis, route planning, etc. However, even for mid-size networks, it is practically intractable to compute exact betweenness scores. In this paper, we propose a generic randomized framework for unbiased approximation of betweenness centrality. The proposed framework can be adapted with different sampling techniques to give diverse methods. We discuss the conditions a promising sampling technique should satisfy to minimize the approximation error and present a sampling method partially satisfying the conditions. We perform extensive experiments and show the high efficiency and accuracy of the proposed method.
Categories and Subject Descriptors
G.2.2 [Discrete Mathematics]: Graph Theory—Graph algorithms, Path and circuit problems; E.1 [Data]: Data structures—Graphs and networks
General Terms
Theory
Keywords
Centrality, betweenness centrality, social network analysis, approx-
imate algorithms.
1. INTRODUCTION
Betweenness centrality of a vertex, introduced by Linton Free-
man [6], is defined as the number of shortest paths (geodesic paths)
from all (source) vertices to all others that pass through that vertex.
He used it as a measure for quantifying the control of a human on the communication between other humans in a social network [6].
Betweenness centrality is also used in some well-known algorithms
for clustering and community detection in social and information
networks [8].
Although betweenness centrality computation is tractable in the-
ory in the sense that there exist polynomial time and space algo-
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
CIKM'13, October 27-November 1, 2013, Burlingame, CA, USA.
Copyright 2013 ACM 978-1-4503-2263-8/13/10 ...$15.00.
http://dx.doi.org/10.1145/2505515.2507826.
rithms, the most efficient existing exact method is Brandes's algorithm [3], which requires O(nm) time for unweighted graphs and O(nm + n^2 log n) time for weighted graphs (n is the number of vertices and m is the number of edges in the network). Therefore, exact betweenness centrality computation is not practically applicable, even for mid-size networks. The next piece of bad news is that computing the exact betweenness centrality of a single vertex is not easier than computing the betweenness centrality of all vertices, so the above-mentioned worst-case bounds hold even if one only wants the betweenness centrality of one or a few vertices. Yet in many applications it is required to compute the betweenness centrality of only one or a few vertices. For example, the index might be computed only for core vertices of communities in social/information networks [12], or hubs in communication networks.
To make betweenness centrality practically computable, in re-
cent years, several algorithms have been proposed for approximate
betweenness centrality computation [4], [1] and [7]. Existing algorithms fall into one of the following categories.
1. Some algorithms like [4] and [7] try to approximate between-
ness centrality of all vertices in the network. For such meth-
ods the value computed for every vertex is not of high im-
portance. Instead, the main goal is to correctly estimate the
relative rank of all vertices.
2. Some others, like the method presented in [1], aim to ap-
proximate betweenness centrality of a single vertex (or a few
vertices) in time faster than computing betweenness central-
ity of all vertices. For such methods, the accuracy of the
estimated betweenness centrality is important.
Our focus in this paper is the second category of algorithms, i.e.
we aim at developing an efficient and accurate algorithm for be-
tweenness centrality computation of a single vertex. In this pa-
per, we propose a generic randomized framework for unbiased ap-
proximation of betweenness centrality. In the proposed framework,
a source vertex i is selected by some strategy, single-source be-
tweenness scores of all vertices on i are computed, and the scores
are scaled as estimations of betweenness centralities. While our
method might seem similar to the method of Brandes and Pich [4],
they have a key difference. In the method of [4], for a few source
vertices, single-source betweenness scores are computed and for
the rest, they are extrapolated. Betweenness centralities are the sum
of all single-source betweenness scores (which are either computed
or extrapolated). In our method, single-source betweenness scores
are computed for one single source chosen randomly, and the ob-
tained scores are scaled as estimations of betweenness centralities.
We discuss the condition a promising sampling technique should
satisfy to minimize the approximation error for a single vertex.
Since it might be computationally expensive to find such a sam-
pling, we propose a sampling technique which partially satisfies
the condition. While the algorithm of [1] is intuitively presented
for high centrality vertices, in our method, the sampling technique
can be revised to optimize itself for both high centrality vertices and
low centrality vertices. The proposed method can be used to com-
pute similar centrality notions like stress centrality, which is also
based on counting shortest paths. We perform extensive experiments on real-world networks from different domains, and show
that compared to existing exact and inexact algorithms, our method
works with higher accuracy or it gives significant speedups.
The rest of this paper is organized as follows. In Section 2, pre-
liminaries and definitions related to betweenness centrality com-
putation are given. In Section 3, we present a generic random-
ized algorithm for betweenness centrality computation. In Section
4, we discuss the sampling methods. We empirically evaluate the
proposed method in Section 5 and show its efficiency and high ac-
curacy. Finally, the paper is concluded in Section 6.
2. PRELIMINARIES
Throughout the paper, G refers to a graph (network). For simplicity, we suppose G is a connected and loop-free graph without multi-edges. Throughout the paper, we assume G is an unweighted graph, unless it is explicitly mentioned that G is weighted. V(G) and E(G) refer to the set of vertices and the set of edges of G, respectively. Throughout the paper, n denotes |V(G)| and m denotes |E(G)|. For an edge e = (u, v) ∈ E(G), u and v are the two endpoints of e. A shortest path (also called a geodesic path) between two vertices u, v ∈ V(G) is a path whose size is minimum among all paths between u and v. For two vertices u, v ∈ V(G), we use d(u, v) to denote the size (the number of edges) of a shortest path connecting u and v. By definition, d(u, u) = 0 and d(u, v) = d(v, u). For s, t ∈ V(G), σ_st denotes the number of shortest paths between s and t, and σ_st(v) denotes the number of shortest paths between s and t that also pass through v. We have σ_s(v) = Σ_{t ∈ V(G)\{s,v}} σ_st(v). Betweenness centrality of a vertex v is defined as:

BC(v) = Σ_{s,t ∈ V(G)\{v}} σ_st(v) / σ_st    (1)
A notion which is widely used for counting the number of shortest paths in a graph is the directed acyclic graph (DAG) containing all shortest paths starting from a vertex s (see e.g. [3]). In this paper, we refer to it as the shortest-path-DAG, or SPD in short, rooted at s. For every vertex s in a graph G, the SPD rooted at s is unique, and it can be computed in O(m) time for unweighted graphs and in O(m + n log n) time for weighted graphs [3]. In [3], the author introduced the notion of the dependency score of a vertex s ∈ V(G) on a vertex v ∈ V(G)\{s}, which is defined as:

δ_s•(v) = Σ_{t ∈ V(G)\{v,s}} σ_st(v) / σ_st    (2)

We have BC(v) = Σ_{s ∈ V(G)\{v}} δ_s•(v). As mentioned in [3], given the SPD rooted at s, the dependency scores of s on all other vertices can be computed in O(m) time.
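As a concrete illustration, the O(m) computation of single-source dependency scores described above (a BFS that builds the SPD, followed by a reverse sweep in the accumulation style of [3]) can be sketched as follows for unweighted graphs; the adjacency-dict representation and the function name are our own choices for this sketch:

```python
from collections import deque

def single_source_dependencies(adj, s):
    """Dependency scores delta_{s.}(v) of source s on every vertex v,
    for an unweighted graph given as a dict: vertex -> list of neighbours.
    One BFS builds the SPD rooted at s; a reverse sweep accumulates
    the dependency scores, as in Brandes's algorithm [3]."""
    sigma = {v: 0 for v in adj}   # number of shortest s-v paths
    dist = {v: -1 for v in adj}   # BFS distance from s (-1 = unvisited)
    preds = {v: [] for v in adj}  # predecessors of v in the SPD
    sigma[s], dist[s] = 1, 0
    order, queue = [], deque([s])
    while queue:
        u = queue.popleft()
        order.append(u)           # vertices in non-decreasing distance
        for w in adj[u]:
            if dist[w] < 0:                # first time w is reached
                dist[w] = dist[u] + 1
                queue.append(w)
            if dist[w] == dist[u] + 1:     # edge (u, w) lies on the SPD
                sigma[w] += sigma[u]
                preds[w].append(u)
    delta = {v: 0.0 for v in adj}
    for w in reversed(order):              # deepest SPD vertices first
        for u in preds[w]:
            delta[u] += (sigma[u] / sigma[w]) * (1.0 + delta[w])
    delta[s] = 0.0                         # the source itself is excluded
    return delta
```

Summing the returned scores over all sources s ≠ v yields BC(v) exactly, which is what the exact algorithm of [3] does in O(nm) total time.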
3. APPROXIMATE BETWEENNESS CEN-
TRALITY COMPUTATION
Algorithm 1 shows the high-level pseudocode of the algorithm proposed for approximate betweenness centrality computation. First, probabilities

p_1, p_2, ..., p_n ≥ 0 such that Σ_{i=1}^{n} p_i = 1    (3)

are computed. Then, at every iteration t of the loop in Lines 8-15 of Algorithm 1: (I) an i ∈ {1, ..., n} is selected with probability p_i, (II) the SPD rooted at i is computed, (III) the dependency score of i on vertex v, δ_i•(v), is computed, and (IV) δ_i•(v)/p_i is the estimation of BC(v) in iteration t. The average of the betweenness centralities calculated in different iterations is returned as the estimation of betweenness centrality.
Algorithm 1 High-level pseudocode of the algorithm for approximate betweenness centrality computation.
1: APPROXIMATEBETWEENNESS
2: Require: A network (graph) G, the number of samples T.
3: Ensure: Betweenness centrality of vertices of G.
4: Compute probabilities p_1, ..., p_n
5: for all vertices v ∈ V(G) do
6:   B[v] ← 0
7: end for
8: for t = 1 to T do
9:   Select a vertex i with probability p_i
10:   Form the SPD D rooted at i
11:   Compute the dependency score of i on every vertex v
12:   for all vertices v ∈ V(G) do
13:     B[v] ← B[v] + δ_i•(v) / p_i
14:   end for
15: end for
16: for all i ∈ {1, ..., n} do
17:   B[i] ← B[i] / T
18: end for
19: return B
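In Python, the framework of Algorithm 1 can be sketched as below: the graph is a plain adjacency dict, sources are drawn with `random.choices`, and the single-source dependency scores come from a compact BFS-plus-reverse-sweep in the style of [3]. The function names are our own, not the paper's:

```python
import random
from collections import deque

def dependencies(adj, s):
    # Single-source dependency scores of s (SPD via BFS + reverse sweep).
    sigma = {v: 0 for v in adj}
    dist = {v: -1 for v in adj}
    preds = {v: [] for v in adj}
    sigma[s], dist[s] = 1, 0
    order, queue = [], deque([s])
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in adj[u]:
            if dist[w] < 0:                # first visit
                dist[w] = dist[u] + 1
                queue.append(w)
            if dist[w] == dist[u] + 1:     # SPD edge (u, w)
                sigma[w] += sigma[u]
                preds[w].append(u)
    delta = {v: 0.0 for v in adj}
    for w in reversed(order):
        for u in preds[w]:
            delta[u] += (sigma[u] / sigma[w]) * (1.0 + delta[w])
    delta[s] = 0.0
    return delta

def approximate_bc(adj, T, p):
    """Algorithm 1: draw T sources i with probability p[i], scale the
    dependency vector of each sampled source by 1/p[i], and average."""
    vertices = list(adj)
    weights = [p[v] for v in vertices]
    B = {v: 0.0 for v in adj}
    for _ in range(T):
        i = random.choices(vertices, weights=weights)[0]
        delta = dependencies(adj, i)
        for v in adj:
            B[v] += delta[v] / p[i]
    return {v: B[v] / T for v in adj}
```

With p_i = 1/n this specializes to the uniform sampler of [1]; since each per-sample estimate δ_i•(v)/p_i has expectation Σ_i δ_i•(v) = BC(v), the average is unbiased (Lemma 1 below).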
LEMMA 1. In Algorithm 1, for every vertex v we have

E(B[v]) = BC(v)    (4)

and

Var(B[v]) = (1/T) Σ_{i=1}^{n} δ_i•(v)^2 / p_i − BC(v)^2 / T    (5)
PROOF. We have:

E(B[v]) = (1/T) Σ_{t=1}^{T} Σ_{i=1}^{n} p_i · (δ_i•(v) / p_i) = Σ_{i=1}^{n} δ_i•(v) = BC(v)

and, writing B_t[v] for the estimate of a single iteration t,

Var(B_t[v]) = E(B_t[v]^2) − E(B_t[v])^2 = Σ_{i=1}^{n} δ_i•(v)^2 / p_i − BC(v)^2

Since B[v] is the average of T independent copies of B_t[v], it follows that

Var(B[v]) = (1/T) Σ_{i=1}^{n} δ_i•(v)^2 / p_i − BC(v)^2 / T    (6)
Some existing algorithms can be described as adaptations of Algorithm 1 with specific sampling methods. For example, if vertices i are selected uniformly at random (i.e. p_i = 1/n for 1 ≤ i ≤ n), then it gives the randomized algorithm presented in [1]. We note that instead of taking T samples, we can define a criterion for the termination of the loop in Lines 8-15 of Algorithm 1. For example, similar to the algorithm of [1], we can stop when B[v] ≥ cn for some constant c.
4. SAMPLING METHODS
Suppose that we want to approximate the betweenness centrality of a vertex v. The following lemma defines the probabilities minimizing the approximation error.
LEMMA 2. If in Algorithm 1 source vertices i are selected with probabilities

p_i = δ_i•(v) / Σ_{j=1}^{n} δ_j•(v)    (7)

the approximation error (i.e. the variance of B[v]) is minimized. In this case, the variance of B[v] will be 0.¹
PROOF. Omitted due to lack of space.
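Although the proof is omitted, the zero-variance claim is easy to check numerically: with the probabilities of Equation 7, every per-sample estimate δ_i•(v)/p_i collapses to the same constant Σ_j δ_j•(v) = BC(v). The brute-force sketch below (helper names are ours; as noted next, computing these probabilities costs as much as the exact computation) illustrates this:

```python
from collections import deque

def dependencies(adj, s):
    # Single-source dependency scores of s (SPD via BFS + reverse sweep).
    sigma = {v: 0 for v in adj}
    dist = {v: -1 for v in adj}
    preds = {v: [] for v in adj}
    sigma[s], dist[s] = 1, 0
    order, queue = [], deque([s])
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in adj[u]:
            if dist[w] < 0:
                dist[w] = dist[u] + 1
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
                preds[w].append(u)
    delta = {v: 0.0 for v in adj}
    for w in reversed(order):
        for u in preds[w]:
            delta[u] += (sigma[u] / sigma[w]) * (1.0 + delta[w])
    delta[s] = 0.0
    return delta

def optimal_probabilities(adj, v):
    """Equation 7: p_i proportional to delta_{i.}(v). Computing these
    exactly needs the dependency score of every source on v, so this
    is illustrative only, not a practical sampler."""
    delta = {i: dependencies(adj, i)[v] for i in adj}
    total = sum(delta.values())  # this sum equals BC(v)
    return {i: d / total for i, d in delta.items()}
```

For every source i with p_i > 0, the per-sample estimate δ_i•(v)/p_i equals the normalizing sum, i.e. BC(v) itself, so the estimator has variance 0.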
Therefore, using the probabilities p_i defined in Equation 7 gives an exact method, in the sense that it makes the approximation error 0. However, the time complexity of computing the optimal p_i's is the same as that of exact betweenness centrality computation. Although it is not practically efficient to use the probabilities p_i defined in Equation 7, they can help us to define the properties of a good sampling. From Equation 7, we can conclude that in a good sampling, for every two vertices i and i′, the following must hold:

p_i ≥ p_i′ ⇔ δ_i•(v) ≥ δ_i′•(v)    (8)

which means vertices with higher dependency scores on v must be selected with a higher probability.
However, finding probabilities p_1, p_2, ..., p_n which satisfy Equation 8 might be computationally expensive, since it requires computing the dependency scores of all vertices on v, which is as bad as computing the dependency scores of every source vertex on all vertices. In order to design practically efficient sampling techniques, we consider relaxations of Equation 8. Consider two vertices i and i′ such that d(i′, v) > d(i, v). If in the SPD rooted at v there exists an ancestor-descendant relationship between i and i′ and i is the only ancestor of i′ at level d(i, v), then it can be shown that for k ∈ {i, i′}, the probability p_k defined as

p_k = (1/d(k, v)) / Σ_{j=1}^{n} 1/d(j, v)    (9)

satisfies Equation 8.
The positive aspect of the sampling technique presented in Equation 9 is that it only needs to compute the distance between vertex v and every vertex in the graph: the single-source shortest path, or SSSP in short, problem. For unweighted graphs, this problem can be solved in O(m) time, and for weighted graphs, using a Fibonacci heap, it is solvable in O(m + n log n) time [5]. This means that the sampling method presented in Equation 9 is practically efficient.
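A sketch of the distance-based sampler of Equation 9 for the unweighted case, using one BFS from v. One caveat we must flag as our own assumption: the formula sums 1/d(j, v) over all j, which is undefined for j = v itself, so this sketch assigns v probability 0:

```python
from collections import deque

def distance_based_probabilities(adj, v):
    """Equation 9: p_k proportional to 1/d(k, v), computed from a
    single BFS (the SSSP distances from v) in O(m) time for an
    unweighted, connected graph given as an adjacency dict."""
    dist = {v: 0}
    queue = deque([v])
    while queue:                      # plain BFS from v
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    # inverse distances; v itself (distance 0) gets weight 0
    inv = {k: 1.0 / dist[k] if dist[k] > 0 else 0.0 for k in adj}
    total = sum(inv.values())
    return {k: x / total for k, x in inv.items()}
```

Vertices closer to v, whose dependency scores on v tend to be larger, are thus sampled more often, in line with Equation 8.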
¹ What this lemma suggests somewhat contradicts the source vertex selection procedure presented in [7]. In the method of [7], the scheme for aggregating dependency scores is changed so that vertices do not profit from being near the selected source vertices. However, Lemma 2 says it is better to select source vertices based on their dependency scores on v, which, as we will see later, may result in preferring source vertices that are closer to v. The reason for this contradiction is that while here we aim at precisely approximating the betweenness centrality of some specific vertex v, the method of [7] aims to rank all vertices based on their betweenness scores.
Therefore, with the probabilities p_i defined in Equation 9, a vertex i is selected, the dependency score of i on v is computed, and the result is scaled. For unweighted graphs, this gives an O(Tm) time algorithm for approximate betweenness centrality computation. For weighted graphs (with positive weights), the time complexity of the approximation algorithm will be O(Tm + Tn log n).
5. EXPERIMENTAL RESULTS
Our experiments were done on one core of a single AMD Processor 270 clocked at 2.0 GHz with 8 GB main memory and 2×1 MB L2 cache, running Ubuntu Linux 12.0. The program was compiled by the GNU C++ compiler 4.0.2 using optimization level 3. We
compare our proposed method with the algorithm presented in [1].
As mentioned earlier, methods like [4] and [7] aim to rank vertices
based on betweenness scores (and the betweenness score of an in-
dividual vertex is not very important for them). Therefore, they are
not suitable for our comparisons. We refer to the algorithm of [1]
as uniform sampling, since it chooses source vertices uniformly at
random. We refer to our proposed method as distance-based sam-
pling. We also compare the methods with Brandes’s algorithm for
exact betweenness centrality computation [3].
We performed extensive experiments on different real-world net-
works to assess the quantitative and qualitative behavior of the pro-
posed algorithm. We used two DBLP citation networks dblp0305
and dblp0507 [2], the Wiki-Vote social network [9], the Enron-
Email communication network [10], and the CA-CondMat collabo-
ration network [11]. Table 1 summarizes specifications of our real-
world networks.
Table 1: Summary of real-world networks.
Dataset # vertices # edges Avg. degree
dblp-0305 109,044 233,961 4.29
dblp-0507 135,116 290,363 4.28
Enron-Email 36,692 367,662 20.04
Wiki-Vote 7,115 103,689 29.14
CA-CondMat 23,133 93,497 8.08
For a vertex v, the empirical betweenness approximation error (which is reported in the experiments) is defined as:

err(v) = |App(v) − BC(v)| / BC(v) × 100    (10)

where App(v) is the computed approximate score.
In our experiments, we consider several vertices of a dataset
and for every vertex, we compute distance-based probabilities, ex-
act betweenness scores, and approximate betweenness scores using
distance-based and uniform samplings. Table 2 summarizes the av-
erage results (i.e. the sum of results of all vertices divided by their
number) obtained for different datasets. In all experiments, for
both uniform and distance-based samplings, the number of sam-
ples (the number of source vertices selected) is 10% of the number
of vertices in the network. Therefore, the running time of the ap-
proximate methods is around 10% of the running time of the exact
method.
The first dataset is Wiki-Vote, which is dense (its average degree is 29.14). For most vertices of Wiki-Vote, distance-based sampling gives a better approximation. The extra time needed by distance-based sampling to compute the required shortest path distances is quite tiny and negligible compared to the running time of the whole process. Since for all datasets it is a tiny value varying across different runs, we report an upper bound for it. For the Wiki-Vote
Table 2: Comparing average approximation error and average running time of uniform sampling, distance-based sampling, and the exact method, for single vertices in different datasets.

Dataset      | Exact BC:      | Exact BC:       | Dist.-based:   | Dist.-based: avg.      | Uniform:       | Approx.:        | # iterations
             | avg. BC score  | avg. time (sec) | avg. error (%) | dist. comp. time (sec) | avg. error (%) | avg. time (sec) | (% of n)
Wiki-Vote    | 76056.85       | 515.09          | 37.0           | < 1                    | 41.13          | 46.05           | 10
Email-Enron  | 2775100.8      | 9033.11         | 15.75          | < 1                    | 25.28          | 925.80          | 10
dblp0305     | 564246.41      | 19149.8         | 7.59           | < 2                    | 64.73          | 1747.15         | 10
dblp0507     | 798125.00      | 35140           | 7.19           | < 2                    | 50.17          | 2863.82         | 10
CA-CondMat   | 691667         | 3026.9          | 10.8           | < 1                    | 20.81          | 315.3           | 10
dataset, it is always smaller than 1 second. The next dataset is
Email-Enron. Compared to Wiki-Vote, it is less dense (but still
dense) and larger. Over this dataset, the approximation error of
distance-based sampling is better than the uniform sampling.
Figure 1: In sparse graphs, distance-based sampling is closer to optimal sampling. The graph on the left shows a SPD in a dense graph, and the graph on the right shows a SPD in a sparse graph.
Dblp0305 and dblp0507 are large and relatively sparse datasets. As reflected in Table 2, over these datasets, distance-based sampling works much better than uniform sampling. This means that on sparse datasets, the difference between the approximation quality of the two methods is more considerable. This has several reasons. The first reason is that in very dense datasets, many vertices have the same (and small) distance from v (v is the vertex whose betweenness centrality is approximated). Therefore, distance-based sampling becomes closer to uniform sampling. The second reason is that in sparse networks, in the SPD rooted at v, the probability that a vertex i has only one ancestor at some level k is higher than this probability in dense graphs. Figure 1 compares these two situations. It means that in sparse networks, distance-based sampling is closer to the optimal sampling, because with distance-based sampling, a larger number of vertices will satisfy the condition expressed in Equation 8. As a result, on sparse networks, distance-based sampling becomes much more effective than uniform sampling.
Finally, the methods were compared on the CA-CondMat dataset, which contains scientific collaborations between authors of papers submitted to the Condensed Matter category [11]. The average degree in this dataset is 8.08, which means it is denser than dblp0305 and dblp0507, but less dense than Wiki-Vote and Email-Enron. Over this dataset, the approximation error of uniform sampling is almost twice that of distance-based sampling.
6. CONCLUSION
In this paper, we presented a generic randomized framework for
unbiased approximation of betweenness centrality. In the proposed
framework, a source vertex i is selected by some strategy, single-
source betweenness scores of all vertices on i are computed, and
the scores are scaled as estimations of betweenness centralities.
Our proposed framework can be adapted with different sampling
techniques to give diverse methods for approximating betweenness
centrality. We discussed the conditions a promising sampling tech-
nique should satisfy to minimize the approximation error, and pro-
posed a sampling technique which partially satisfies the conditions.
Our experiments show the high efficiency and quality of the pro-
posed method.
7. ACKNOWLEDGMENTS
This work was partially supported by ERC Starting Grant 240186 "MiGraNT: Mining Graphs and Networks: a Theory-based approach".
8. REFERENCES
[1] D. A. Bader, S. Kintali, K. Madduri and M. Mihail,
Approximating Betweenness Centrality, WAW, 124-137,
2007.
[2] M. Berlingerio, F. Bonchi, B. Bringmann and A. Gionis,
Mining Graph Evolution Rules, ECML/PKDD (1), 115-130,
2009.
[3] U. Brandes. A Faster Algorithm for Betweenness Centrality.
Journal of Mathematical Sociology, 25(2), 163-177, 2001.
[4] U. Brandes, C. Pich, Centrality estimation in large networks, Intl. Journal of Bifurcation and Chaos, 17(7), 2303-2318, 2007.
[5] M. L. Fredman and R. E. Tarjan, Fibonacci heaps and their
uses in improved network optimization algorithms, Journal of
the Association for Computing Machinery, 34(3), 596-615,
1987.
[6] L. C. Freeman, A set of measures of centrality based upon
betweenness, Sociometry, 40, 35-41, 1977.
[7] R. Geisberger, P. Sanders, D. Schultes, Better Approximation
of Betweenness Centrality, ALENEX, 90-100, 2008.
[8] M. Girvan and M. E. J. Newman, Community structure in
social and biological networks, Proc. Natl. Acad. Sci. USA
99, 7821-7826, 2002.
[9] J. Leskovec, D. Huttenlocher, J. Kleinberg, Signed Networks
in Social Media, CHI 2010.
[10] J. Leskovec, K. Lang, A. Dasgupta, M. Mahoney,
Community Structure in Large Networks: Natural Cluster
Sizes and the Absence of Large Well-Defined Clusters,
Internet Mathematics, 6(1), 29-123, 2009.
[11] J. Leskovec, J. Kleinberg, C. Faloutsos, Graph Evolution:
Densification and Shrinking Diameters. ACM Transactions on
Knowledge Discovery from Data (ACM TKDD), 1(1), 2007.
[12] Y. Wang, Z. Di, Y. Fan, Identifying and Characterizing
Nodes Important to Community Structure Using the Spectrum
of the Graph, PLoS ONE 6(11), e27418, 2011.