Estimating the Similarity of Community Detection Methods Based on Cluster Size Distribution

Q: What are the contributions mentioned in the paper "Estimating the similarity of community detection methods based on cluster size distribution" ?

In this paper, the authors propose a novel approach to estimate the similarity between community detection methods using the size density distributions of communities that they detect. This distinction plays an important role in assisting network analysts to identify their rule-of-thumb solutions.

Q: How many edges do the authors have in each category?

The number of edges increases linearly with the number of nodes, at equivalent rates across categories, as can be deduced from the gradients of the linear estimates.

Q: What is the purpose of this paper?

In this paper, the authors are interested in estimating pairwise similarity of community detection methods based on expected community size.

Q: What are the popular techniques for detecting inter-community edges?

Popular techniques include using edge betweenness centrality (GN in Table 1) or edge clustering coefficient, which could be based on triangular (RCCLP-3) or quadrangular (RCCLP-4) patterns.

Vinh-Loc Dao¹, Cécile Bothorel¹, Philippe Lenca¹•Institutions (1)

Centre national de la recherche scientifique¹

11 Dec 2018-Vol. 812, pp 183-194

TL;DR: This paper proposes a novel approach to estimate the similarity between community detection methods using the size density distributions of the communities that they detect, and shows that there is a very clear distinction between the partitioning strategies of different community detection methods.

Abstract: Detecting community structure discloses tremendous information about complex networks and unlocks promising applied perspectives. Accordingly, a large number of community detection methods have been proposed in the last two decades, with many rewarding discoveries. Nevertheless, it is still very challenging to determine a suitable method for gaining more insight into the mesoscopic structure of a network given an expected quality, especially on large-scale networks. Many recent efforts have also been devoted to investigating various qualities of community structure associated with detection methods, but the answer to this question is still far from straightforward. In this paper, we propose a novel approach to estimate the similarity between community detection methods using the size density distributions of the communities that they detect. We verify our solution on a very large corpus consisting of more than a hundred networks from five different categories and deliver pairwise similarities of 16 state-of-the-art and well-known methods. Interestingly, our result shows that there is a very clear distinction between the partitioning strategies of different community detection methods. This distinction plays an important role in assisting network analysts to identify their rule-of-thumb solutions.

Summary (2 min read)

1 Introduction

Community detection discloses interesting information about the heterogeneous structure of complex networks and opens promising perspectives in many theoretical as well as applied domains [8,23,24].
Indeed, the notion of goodness varies according to the contextual objective and to the assumptions about the underlying network model.
Evaluations based on variants of Mutual Information [27] or on modularity work well to validate the functionality of proposed methods on ad hoc networks but are not directly interpretable in a comparative evaluation of clustering quality.
The authors are interested in estimating the pairwise similarity of community detection methods based on expected community size.
The authors will show that this taxonomy exposes very useful information for proposing appropriate methods according to the expected analysis strategy.

2 Estimating the similarity of community detection methods

The authors present a novel approach to determine the similarity of community detection methods using the size distribution of communities that they discover.
Nonetheless, it gives more insight into how methods differ in terms of partitioning strategy.
As such, two methods can be considered similar if their corresponding density distributions expose a large intersection area, as shown in Fig. 1(a).
The premise behind this estimation is that two similar methods need not produce a large portion of exactly same-size communities, but rather a large portion of comparable-size ones.
A large and representative corpus helps to reduce the dependence of the estimate on the particular networks analyzed.

3 Community detection methods

A community is roughly described as a group of nodes in a graph with more edges connecting them to one another than edges connecting the community with the rest of the graph [5].
The authors select a representative set of state-of-the-art and widely studied detection methods whose approaches span those most commonly used in the literature.
These methods are summarized in Table 1 with corresponding information.
∗The total number of networks, nodes and edges in the whole dataset respectively.
In many cases, they can be classified into one theoretical family or another.

4 Network dataset

The authors' experiment requires a large number of networks in order to reduce the impact of irregularities that could be present in a small set of ad hoc networks.
Hence, presuming that networks in different domains possess different structural particularities [4], the authors collect networks spanning a variety of categories that are widely studied in the research community.
Table 2 summarizes the principal information about the networks involved in their experiment.
As can be seen, the marginal distributions on top imply that inside each category, networks also span a relatively wide range of sizes, with slight differences from one category to another.
Additionally, the networks in this dataset are quite sparse.

5 Experimental results

The authors apply each method presented in Section 3 in turn to discover community structures on the whole set of networks summarized in Table 2.
Finally, the authors use the similarity function defined by Equation (3) to estimate the closeness between each pair of methods.
As can be seen, there is a clear difference in the densities of community size, showing that these methods have different partitioning strategies.
This phenomenon is less distinguishable on RCCLP-4 since there are far fewer quadrangular than triangular connections in networks.
LPA, SPLA (both based on label propagation) and Conclude display nearly identical distributions.

6 Discussion and Conclusion

The authors' experimental taxonomy discloses a new source of information that some traditional evaluation methods could not directly expose.
The authors demonstrate the similarity between partitions detected by these methods in Fig. 5 using the Normalized Mutual Information (NMI) metric [27].
Meanwhile, using only the community size distribution to deduce the similarity of methods could lead to unexpected results.
In fact, it seems that the distributions illustrated in Fig. 3 tend to move to the left-hand side if more small-scale networks are involved and, inversely, to the right-hand side if large-scale networks predominate.
Further investigation is deemed necessary and is a promising perspective.
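
For readers unfamiliar with the NMI comparison mentioned above, the following is a minimal Python sketch of how two partitions of the same node set can be compared. It is only an illustration: it assumes the arithmetic-mean normalization, NMI = 2 I(X;Y) / (H(X) + H(Y)), whereas the source cites several variants [27] without stating which one is used. The function name and the label lists are hypothetical.

```python
import math
from collections import Counter


def normalized_mutual_information(labels_x, labels_y):
    """NMI of two partitions given as per-node community labels.

    Assumption: arithmetic-mean normalization, 2 * I(X;Y) / (H(X) + H(Y)).
    """
    n = len(labels_x)
    if n == 0 or n != len(labels_y):
        raise ValueError("label lists must be non-empty and of equal length")
    count_x = Counter(labels_x)
    count_y = Counter(labels_y)
    count_xy = Counter(zip(labels_x, labels_y))

    def entropy(counts):
        # Shannon entropy of the community-size proportions.
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    h_x, h_y = entropy(count_x), entropy(count_y)
    # I(X;Y) = sum over label pairs of p(x,y) * log(p(x,y) / (p(x) * p(y))).
    mutual_info = sum(
        (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )
    if h_x + h_y == 0.0:
        return 1.0  # both partitions put every node in a single community
    return 2.0 * mutual_info / (h_x + h_y)


# Hypothetical partitions of six nodes: identical up to a relabeling.
same = normalized_mutual_information([0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0])
# Hypothetical partitions of four nodes that are statistically independent.
independent = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
print(same, independent)
```

Note that such a score compares node memberships, whereas the size-based similarity summarized in Section 2 compares only how large the detected communities are, so the two measures need not agree.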

Figures (6)

Table 1. Community detection methods and associated implementations involved in our analyses grouped by different methodological approach. The label column denotes the corresponding abbreviations used in our paper.

Fig. 5. The similarity of partitions average NMI

Fig. 4. The estimated proximity between detection methods. Similar methods share a large fraction of same-size communities. Methods are ordered using hierarchical clustering. The dendrogram proposes a hierarchical structure of the fitting closeness. Blue colors mean high similarity.

Fig. 2. Structural information of the networks employed in the experiment. The solid lines represent the estimated relations between the number of nodes and the number of edges in each network category, using a linear regression model. The translucent colored backgrounds represent the corresponding 95% confidence intervals.

Fig. 1. The size distributions of communities detected by two different methods.

Fig. 3. The distribution of community sizes in the partitions detected on the networks of the dataset. The distributions are smoothed using a Gaussian kernel estimator. The gradient coloring is for ease of viewing only.

HAL Id: hal-01911077

https://hal.archives-ouvertes.fr/hal-01911077

Submitted on 2 Nov 2018

To cite this version:

Vinh-Loc Dao, Cécile Bothorel, Philippe Lenca. Estimating the similarity of community detection methods based on cluster size distribution. COMPLEX NETWORKS 2018: The 7th International Conference on Complex Networks and Their Applications, Dec 2018, Cambridge, United Kingdom. pp.183-194, ⟨10.1007/978-3-030-05411-3_15⟩. ⟨hal-01911077⟩

Estimating the similarity of community detection methods based on cluster size distribution

Vinh-Loc Dao, Cécile Bothorel, Philippe Lenca

IMT Atlantique - Lab-STICC CNRS UMR 6285, F-29238 Brest, France

{vinh.dao, cecile.bothorel, philippe.lenca}@imt-atlantique.fr

Abstract. Detecting community structure discloses tremendous information about complex networks and unlocks promising applied perspectives. Accordingly, a large number of community detection methods have been proposed in the last two decades, with many rewarding discoveries. Nevertheless, it is still very challenging to determine a suitable method for gaining more insight into the mesoscopic structure of a network given an expected quality, especially on large-scale networks. Many recent efforts have also been devoted to investigating various qualities of community structure associated with detection methods, but the answer to this question is still far from straightforward. In this paper, we propose a novel approach to estimate the similarity between community detection methods using the size density distributions of the communities that they detect. We verify our solution on a very large corpus consisting of more than a hundred networks from five different categories and deliver pairwise similarities of 16 state-of-the-art and well-known methods. Interestingly, our result shows that there is a very clear distinction between the partitioning strategies of different community detection methods. This distinction plays an important role in assisting network analysts to identify their rule-of-thumb solutions.

Keywords: community detection, similarity metric, community size, comparative analysis

1 Introduction

Community detection discloses interesting information about the heterogeneous structure of complex networks and opens promising perspectives in many theoretical as well as applied domains [8,23,24]. Although closely related to traditional unsupervised data clustering, community detection techniques have flourished only in the last two decades, marked by the invention of modularity [14] and by the availability of a large volume of networks, from small scale to very large scale, thanks to the development of the Internet and notably of social platforms. Since then, a large number of detection techniques with various approaches have been proposed [3,5] to solve this network decomposition problem. Even though communities are widely assumed to be sub-graphs whose nodes are more densely connected to one another than to the rest of the network, there is no commonly accepted standard process to evaluate the accuracy of detection methods. Indeed, the notion of goodness varies according to the contextual objective and to the assumptions about the underlying network model. Consequently, it is often unclear how to find, among the available methods, the one most likely to satisfy specific requirements on outcome quality.

Meanwhile, this issue leaves room for developing theoretical and empirical techniques for comparing community detection algorithms. New methods are usually introduced together with a quality evaluation based on one of the many variants of Mutual Information [27] or on modularity. These two approaches work well to validate the functionality of proposed methods on ad hoc networks but are not directly interpretable in a comparative evaluation of clustering quality: the former do not provide structural information about detected communities, and the latter depend on hypotheses about null models. In other words, equivalent scores do not directly ensure an equivalence of partition quality.

In this paper, we are interested in estimating the pairwise similarity of community detection methods based on expected community size. These estimates also reveal information about closeness in terms of the number of communities, a very important and intuitive characteristic of clustering algorithms, which is considered an essential perspective in the community detection literature [5] and has recently been addressed by several in-depth studies [7,19]. Specifically, we conduct an empirical experiment to inspect a large number of state-of-the-art and widely used community detection methods and estimate their similarity using the size distribution of the communities that they discover on a large dataset of networks across several domains. The result of our analysis implies that community detection methods can be classified into three well discernible groups exhibiting three essential strategies of node partition. These strategies have a great impact on the outcome of community detection methods, making them very distinctive. We will show that this taxonomy exposes very useful information for proposing appropriate methods according to the expected analysis strategy.

2 Estimating the similarity of community detection methods

We present a novel approach to determine the similarity of community detection methods using the size distribution of the communities that they discover. Certainly, this is only one among several interesting quality aspects that differentiate one method from the others. Nonetheless, it gives more insight into how methods differ in terms of partitioning strategy.

Specifically, a very naive but efficient approach to evaluate the similarity of two methods is to inquire into the "closeness" of the two corresponding community size distributions. As such, two methods can be considered similar if their corresponding density distributions expose a large intersection area, as shown in Fig. 1(a). From this observation, we can define our new similarity function as follows.

[Fig. 1. The size distributions of communities detected by two different methods: (a) overlapped sizes; (b) interlaced sizes. Both panels plot density against community size for the two distributions d_A and d_B.]

First, we denote by $(A, n_a)$ and $(B, n_b)$ the two 2-tuples (multisets) representing all communities detected on a set of networks $\mathcal{G} = \{G\}$ by method $A$ and method $B$ respectively, where $A = \{x^a_1, x^a_2, \ldots, x^a_k\}$ and $B = \{x^b_1, x^b_2, \ldots, x^b_l\}$ are the ascending ordered sets of community sizes: $1 \le x^a_1 < x^a_2 < \ldots < x^a_k$ and $1 \le x^b_1 < x^b_2 < \ldots < x^b_l$. The multiplicity functions $n_a : A \to \mathbb{N}_{\ge 1}$ and $n_b : B \to \mathbb{N}_{\ge 1}$ measure the number of communities of sizes $x^a_i$ and $x^b_j$ respectively. Letting $N_a = \sum_{i=1}^{k} n_a(x^a_i)$ and $N_b = \sum_{i=1}^{l} n_b(x^b_i)$ be the total numbers of communities of all sizes detected by each method, we define a similarity function describing the closeness of $A$ and $B$ on $\mathcal{G}$ as:

$$S_{\mathcal{G}}(A, B) = \sum_{i=1}^{k} \sum_{j=1}^{l} \min\left(\frac{n_a(x^a_i)}{N_a}, \frac{n_b(x^b_j)}{N_b}\right) \delta(x^a_i, x^b_j), \tag{1}$$

where $\delta(x^a_i, x^b_j) = 1$ if $x^a_i = x^b_j$ and $0$ otherwise. Equation (1) is simply the common fraction of same-size communities detected on $\mathcal{G}$ by both $A$ and $B$: $0 \le S_{\mathcal{G}}(A, B) \le 1$. This definition seems intuitive but does not work well in practice. As illustrated in Fig. 1(b), when the sizes interlace, a low score is produced although the similarity in this case is as high as in the case of Fig. 1(a). Choosing an appropriate binning interval would mitigate the problem. This solution is, however, very inflexible and sensitive to the characteristics of the data as well as to the behavior of the methods in use. A straightforward alternative is to use a kernel density estimator to uncover the probability density functions, shown by the solid lines in Fig. 1(b). In this way, we approximate the common fraction of same-size communities of Equation (1) by the overlapping area of the two corresponding continuous distributions. The premise behind this estimation is that two similar methods need not produce a large portion of exactly same-size communities, but rather a large portion of comparable-size ones. Hence, we consider the following estimator, which takes in local information around a community size $x_0$:

$$\hat{f}_h(x_0) = \frac{1}{Nh} \sum_{i=1}^{N} K\left(\frac{x_i - x_0}{h}\right), \tag{2}$$

where $N$ is the number of observations, $h$ is the bandwidth controlling the neighborhood interval around $x_0$, and $K$ is the kernel function controlling the weight given to the observations $\{x_i\}$, chosen as Gaussian in our analysis. Using this estimator, we rewrite the similarity function defined in Equation (1) as follows:

$$S_{\mathcal{G}}(A, B) = \int \min\left\{\hat{f}^{(a)}_h(x), \hat{f}^{(b)}_h(x)\right\} dx, \tag{3}$$

where

$$\hat{f}^{(u)}_h(x) = \frac{1}{N_u h} \sum_{i} n_u(x^u_i) \, K\left(\frac{x^u_i - x}{h}\right), \tag{4}$$

with $u \in \{a, b\}$. In the estimations of this paper, the bandwidth $h$ is selected based on the normal reference rule [26] to minimize the mean integrated squared error. The only exception is the case illustrated in Fig. 3, where a higher value has been chosen to obtain smoother curves for better illustration.
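
To make the contrast between the exact-match score of Equation (1) and the kernel-based overlap of Equation (3) concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes communities are given as plain lists of sizes. The function names, the hand-picked bandwidth h = 3.0 (set manually rather than by the normal reference rule), the trapezoidal integration grid and the toy size lists are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def exact_similarity(sizes_a, sizes_b):
    """Equation (1): shared fraction of exactly same-size communities."""
    count_a, count_b = Counter(sizes_a), Counter(sizes_b)
    total_a, total_b = len(sizes_a), len(sizes_b)
    # delta(x_i, x_j) keeps only the sizes that occur in both multisets.
    shared_sizes = count_a.keys() & count_b.keys()
    return sum(
        (min(count_a[x] / total_a, count_b[x] / total_b) for x in shared_sizes),
        0.0,
    )


def gaussian_kde(sizes, h):
    """Equation (4) with a Gaussian kernel; returns the density as a function."""
    counts = Counter(sizes)  # size -> multiplicity
    norm = 1.0 / (len(sizes) * h * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            n * math.exp(-0.5 * ((xi - x) / h) ** 2) for xi, n in counts.items()
        )

    return density


def kde_similarity(sizes_a, sizes_b, h, grid_points=2000):
    """Equation (3): area under min(f_a, f_b), via the trapezoidal rule."""
    f_a, f_b = gaussian_kde(sizes_a, h), gaussian_kde(sizes_b, h)
    # Integrate over the observed range padded by 5 bandwidths on each side.
    lo = min(min(sizes_a), min(sizes_b)) - 5.0 * h
    hi = max(max(sizes_a), max(sizes_b)) + 5.0 * h
    step = (hi - lo) / grid_points
    values = [
        min(f_a(lo + i * step), f_b(lo + i * step)) for i in range(grid_points + 1)
    ]
    return step * (sum(values) - 0.5 * (values[0] + values[-1]))


# Hypothetical interlaced sizes, as in Fig. 1(b): no size occurs in both lists.
sizes_a = [10, 12, 14, 16, 18, 20]
sizes_b = [11, 13, 15, 17, 19, 21]
print("exact match:", exact_similarity(sizes_a, sizes_b))
print("kernel overlap:", kde_similarity(sizes_a, sizes_b, h=3.0))
```

On these interlaced sizes the exact-match score is zero because no size appears in both lists, while the kernel-based overlap stays high, which is the behavior that motivates replacing Equation (1) with Equation (3).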

Using Equations (3) and (4) to estimate the similarity between pairs of detection methods on a large dataset helps us discover the different behaviors of community detection methods. Since the accuracy of the estimator depends on the networks of the dataset that we analyze, the result is necessarily relative to that dataset. However, a large and representative corpus helps to reduce this dependency.

3 Community detection methods

A community is roughly described as a group of nodes in a graph with more edges connecting them to one another than edges connecting the community with the rest of the graph [5]. However, in practice, this concept is mathematically or algorithmically formulated in different ways, engendering various discovery approaches. In this paper, we select a representative set of state-of-the-art and widely studied detection methods whose approaches span those most commonly used in the literature. These methods are summarized in Table 1 with corresponding information. Their approaches can be briefly summarized as follows:

– Edge removal: In this approach, inter-community edges in a network are gradually removed in order to disconnect densely connected groups. The problem of community detection is thus translated into identifying candidates for inter-community edges based on their topological positions. Popular techniques include using edge betweenness centrality (GN in Table 1) or an edge clustering coefficient, which can be based on triangular (RCCLP-3) or quadrangular (RCCLP-4) patterns.
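
As an illustration of how a triangle-based coefficient can flag inter-community edges, the sketch below computes an edge clustering coefficient in the form commonly attributed to Radicchi et al., C_ij = (z_ij + 1) / min(k_i - 1, k_j - 1), where z_ij is the number of triangles containing edge (i, j) and k denotes node degree. This formula, the function name and the toy graph are assumptions made for illustration; the text above does not specify how RCCLP-3 computes its coefficient.

```python
import math
from collections import defaultdict


def edge_clustering_coefficients(edges):
    """Triangle-based edge clustering coefficient for every edge.

    Assumed form: C_ij = (z_ij + 1) / min(k_i - 1, k_j - 1), where z_ij counts
    the triangles through edge (i, j). Edges with an endpoint of degree 1 get
    infinity, so they are never selected for removal.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    coefficients = {}
    for u, v in edges:
        # Each common neighbor of u and v closes one triangle on edge (u, v).
        triangles = len(neighbors[u] & neighbors[v])
        possible = min(len(neighbors[u]) - 1, len(neighbors[v]) - 1)
        coefficients[(u, v)] = (
            (triangles + 1) / possible if possible > 0 else math.inf
        )
    return coefficients


# Hypothetical toy graph: two triangles joined by the bridge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
coefficients = edge_clustering_coefficients(edges)
# The edge with the lowest coefficient is the first removal candidate.
candidate = min(coefficients, key=coefficients.get)
print(candidate, coefficients[candidate])
```

In this toy graph the bridge lies on no triangle and therefore receives the lowest coefficient, so an edge-removal procedure of this kind would cut it first and separate the two triangles.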

Frequently Asked Questions (5)

Q: How do the distributions in Fig. 3 move?

It seems that the distributions illustrated in Fig. 3 tend to move to the left-hand side if more small-scale networks are involved and, inversely, to the right-hand side if large-scale networks predominate.
