Book ChapterDOI

Estimating the Similarity of Community Detection Methods Based on Cluster Size Distribution

11 Dec 2018-Vol. 812, pp 183-194
TL;DR: This paper proposes a novel approach to estimate the similarity between community detection methods using the size density distributions of communities that they detect and shows that there is a very clear distinction between the partitioning strategies of different community detection methods.
Abstract: Detecting community structure discloses tremendous information about complex networks and unlocks promising applied perspectives. Accordingly, numerous community detection methods have been proposed in the last two decades, with many rewarding discoveries. Nevertheless, it is still very challenging to determine a suitable method for gaining more insight into the mesoscopic structure of a network given an expected quality, especially on large scale networks. Many recent efforts have also been devoted to investigating various qualities of community structure associated with detection methods, but the answer to this question is still very far from being straightforward. In this paper, we propose a novel approach to estimate the similarity between community detection methods using the size density distributions of the communities that they detect. We verify our solution on a very large corpus consisting of more than a hundred networks from five different categories and deliver pairwise similarities of 16 state-of-the-art and well-known methods. Interestingly, our result shows that there is a very clear distinction between the partitioning strategies of different community detection methods. This distinction plays an important role in assisting network analysts to identify their rule-of-thumb solutions.

Summary (2 min read)

1 Introduction

  • Community detection discloses interesting information about the heterogeneous structure of complex networks and opens promising perspectives in many theoretical as well as applied domains [8,23,24].
  • Indeed, the notion of goodness varies according to the contextual objective and also to the assumption about the underlying network model.
  • These two approaches work well to validate the functionality of proposed methods on ad-hoc networks but are not directly interpretable in a comparative evaluation of clustering quality.
  • The authors are interested in estimating pairwise similarity of community detection methods based on expected community size.
  • The authors will show that this taxonomy exposes very useful information for proposing appropriate methods according to expected analysis strategy.

2 Estimating the similarity of community detection methods

  • The authors present a novel approach to determine the similarity of community detection methods using the size distribution of communities that they discover.
  • Nonetheless, it gives more insight into the difference in terms of partitioning strategy.
  • As such, two methods could be supposed to be similar if their corresponding density distributions expose a large intersection area as shown in Fig. 1(a).
  • The premise behind this estimation is that two similar methods need not produce a large portion of exactly same-size communities, but rather a large portion of comparable-size ones.
  • A large and representative corpus would help to reduce the dependency impact.

3 Community detection methods

  • A community is roughly described as a group of nodes in a graph with more edges connecting them together than edges connecting the community with the rest of the graph [5].
  • The authors select a representative set of state-of-the-art and widely studied detection methods whose approaches span those most commonly used in the literature.
  • These methods are summarized in Table 1 with corresponding information.
  • ∗The total number of networks, nodes and edges in the whole dataset respectively.
  • In many cases, they can be classified into one theoretical family or another.

4 Network dataset

  • The authors' experiment requires a large number of networks in order to reduce the impact of the irregularities that could be present in a small set of ad-hoc networks.
  • Hence, presuming that networks in different domains possess various structural particularities [4], the authors collect networks spanning a variety of categories that are widely studied in the research community.
  • Table 2 summarizes the principal information about the networks involved in their experiment.
  • As can be seen, the marginal distributions on top imply that inside each category, networks also span a relatively wide range of sizes, with some slight differences from one category to another.
  • Additionally, the networks in this dataset are quite sparse.

5 Experimental results

  • The authors gradually employ each method presented in Section 3 to discover community structures on the whole set of networks summarized in Table 2.
  • Finally, the authors use the similarity function defined by Equation (3) to estimate the closeness between each pair of methods.
  • As can be seen, there is a clear difference in the densities of community size, showing that these methods have various partitioning strategies.
  • This phenomenon is less distinguishable on RCCLP-4 since there are far fewer quadrangular than triangular connections in networks.
  • LPA, SPLA (both based on label propagation) and Conclude display nearly identical distributions.

6 Discussion and Conclusion

  • The authors' experimental taxonomy discloses a new source of information that some traditional evaluation methods could not directly expose.
  • The authors demonstrate the similarity between partitions detected by these methods in Fig. 5 using the Normalized Mutual Information (NMI) metric [27].
  • Meanwhile, using only community size distribution to deduce the similarity of methods could lead to unexpected results.
  • In fact, it seems that the distributions illustrated in Fig. 3 tend to move to the left-hand side if more small scale networks are involved and, inversely, to the right-hand side if more large scale networks are involved.
  • Further investigation is deemed necessary and promising.


HAL Id: hal-01911077
https://hal.archives-ouvertes.fr/hal-01911077
Submitted on 2 Nov 2018
HAL is a multi-disciplinary open access archive for the deposit and
dissemination of scientific research documents, whether they are published or
not. The documents may come from teaching and research institutions in France or
abroad, or from public or private research centers.
Estimating the similarity of community detection
methods based on cluster size distribution
Vinh-Loc Dao, Cécile Bothorel, Philippe Lenca
To cite this version:
Vinh-Loc Dao, Cécile Bothorel, Philippe Lenca. Estimating the similarity of community detection
methods based on cluster size distribution. COMPLEX NETWORKS 2018 : The 7th International
Conference on Complex Networks and Their Applications, Dec 2018, Cambridge, United Kingdom.
pp.183-194, ⟨10.1007/978-3-030-05411-3_15⟩. ⟨hal-01911077⟩

Estimating the similarity of community
detection methods based on cluster size
distribution
Vinh-Loc Dao, Cécile Bothorel, Philippe Lenca
IMT Atlantique - Lab-STICC CNRS
UMR 6285 F-29238 Brest, France
{vinh.dao, cecile.bothorel, philippe.lenca}@imt-atlantique.fr
Abstract. Detecting community structure discloses tremendous information about
complex networks and unlocks promising applied perspectives. Accordingly,
numerous community detection methods have been proposed in the last two decades,
with many rewarding discoveries. Nevertheless, it is still very challenging to
determine a suitable method for gaining more insight into the mesoscopic
structure of a network given an expected quality, especially on large scale
networks. Many recent efforts have also been devoted to investigating various
qualities of community structure associated with detection methods, but the
answer to this question is still very far from being straightforward. In this
paper, we propose a novel approach to estimate the similarity between community
detection methods using the size density distributions of the communities that
they detect. We verify our solution on a very large corpus consisting of more
than a hundred networks from five different categories and deliver pairwise
similarities of 16 state-of-the-art and well-known methods. Interestingly, our
result shows that there is a very clear distinction between the partitioning
strategies of different community detection methods. This distinction plays an
important role in assisting network analysts to identify their rule-of-thumb
solutions.
Keywords: community detection, similarity metric, community size,
comparative analysis
1 Introduction
Community detection discloses interesting information about the heterogeneous
structure of complex networks and opens promising perspectives in many
theoretical as well as applied domains [8,23,24]. Although it closely resembles
traditional unsupervised data clustering, community detection has flourished
only in the last two decades, marked by the invention of modularity [14] and
the availability of a large volume of networks, from small scale to very large
scale, thanks to the development of the Internet and notably of social
platforms. Since then, numerous detection techniques with various approaches
have been proposed [3,5] to solve this network decomposition problem. Even
though communities are widely assumed to be sub-graphs whose nodes are more
densely connected to each other than to the rest of the network, there is no
commonly accepted standard process to evaluate the accuracy of detection
methods. Indeed, the notion of goodness varies according to the contextual
objective and also to the assumption about the underlying network model.
Consequently, confusion usually arises when one needs to find, among the
available methods, the one most likely to satisfy specific requirements on
outcome quality.
Meanwhile, this issue leaves room for developing theoretical and empirical
techniques for comparing community detection algorithms. New methods are
usually introduced together with a quality evaluation based on one of the many
variants of Mutual Information [27] or on modularity. These two approaches work
well to validate the functionality of proposed methods on ad-hoc networks but
are not directly interpretable in a comparative evaluation of clustering
quality: the former do not provide structural information about the detected
communities, and the latter depend on hypotheses about null models. In other
words, equivalent scores do not directly ensure an equivalence of partition
quality.
In this paper, we are interested in estimating the pairwise similarity of
community detection methods based on expected community size. These estimates
also reveal information about closeness in terms of the number of communities,
a very important and intuitive characteristic of clustering algorithms, which is
considered an essential perspective in the community detection literature [5]
and has recently been addressed by many in-depth studies [7,19]. Specifically,
we conduct an empirical experiment to inspect a large number of state-of-the-art
and widely used community detection methods and estimate their similarity using
the size distribution of the communities that they discover on a large dataset
of networks across several domains. The result of our analysis indicates that
community detection methods can be classified into three well discernible groups
exhibiting three essential strategies of node partition. These strategies have a
great impact on the outcome of community detection methods, making them very
distinctive. We will show that this taxonomy exposes very useful information for
proposing appropriate methods according to the expected analysis strategy.
2 Estimating the similarity of community detection
methods
We present a novel approach to determine the similarity of community detection
methods using the size distribution of communities that they discover. Certainly,
this is only one of several interesting quality aspects that differentiate one
method from the others. Nonetheless, it gives more insight into the difference
in terms of partitioning strategy.
Fig. 1. The size distributions of communities detected by two different methods (density against community size for d_A and d_B): (a) overlapped sizes, (b) interlaced sizes.
Specifically, a very naive but efficient approach to evaluate the similarity of
two methods is to inquire into the “closeness” of the two corresponding
community size distributions. As such, two methods could be supposed to be
similar if their corresponding density distributions expose a large intersection
area, as shown in Fig. 1(a). From this observation, we can define our new
similarity function as follows.
First, we denote by the two 2-tuples $(A, n_a)$ and $(B, n_b)$ the multisets
representing all communities detected on a set of networks $\mathcal{G} = \{G\}$
by method A and method B respectively, where $A = \{x^a_1, x^a_2, \dots, x^a_r\}$
and $B = \{x^b_1, x^b_2, \dots, x^b_s\}$ are the ascending ordered sets of sizes
of communities: $1 \le x^a_1 < x^a_2 < \dots < x^a_r$ and
$1 \le x^b_1 < x^b_2 < \dots < x^b_s$. The multiplicity functions
$n_a : A \to \mathbb{N}_{\ge 1}$ and $n_b : B \to \mathbb{N}_{\ge 1}$ measure the
number of communities of sizes $x^a_i$ and $x^b_i$ respectively. Letting
$N_a = \sum_{i=1}^{r} n_a(x^a_i)$ and $N_b = \sum_{i=1}^{s} n_b(x^b_i)$ be the
total number of communities of all sizes detected by each method, we define a
similarity function describing the closeness of A and B on $\mathcal{G}$ as:
$$S_{\mathcal{G}}(A, B) = \frac{1}{2} \sum_{i=1}^{r} \sum_{j=1}^{s} \min\left\{ \frac{n_a(x^a_i)}{N_a}, \frac{n_b(x^b_j)}{N_b} \right\} \delta(x^a_i, x^b_j), \quad (1)$$
where $\delta(x^a_i, x^b_j) = 1$ if $x^a_i = x^b_j$ and 0 otherwise. Equation (1)
is simply the common fraction of same-size communities detected on $\mathcal{G}$
by both A and B: $0 \le S_{\mathcal{G}}(A, B) \le 1$. This definition seems to be
intuitive but does not work well
in practice. As illustrated in Fig. 1(b), when the sizes interlace, a low score
is produced although the similarity in this case is as high as in the case of
Fig. 1(a). Choosing an appropriate binning interval would mitigate the problem.
This solution is, however, very inflexible and sensitive to the characteristics
of the data as well as to the functionality of the methods in use. A
straightforward alternative is to use a kernel density estimator to uncover the
probability density functions, as shown by the solid lines in Fig. 1(b). In this
way, we approximate the common fraction of same-size communities of Equation (1)
by the overlapping area of the two corresponding continuous distributions. The
premise behind this estimation is that two similar methods need not produce a
large portion of exactly same-size communities, but rather a large portion of
comparable-size ones. Hence, we consider the following estimator to take in
local information around community size $x_0$:
$$\hat{f}(x_0) = \frac{1}{hn} \sum_{i} K\left(\frac{x_i - x_0}{h}\right), \quad (2)$$
where $h$ is the bandwidth controlling the neighborhood interval around $x_0$ and
$K$ is the kernel function controlling the weight given to the observations
$\{x_i\}$, chosen as Gaussian in our analysis. Using this estimator, we rewrite
the similarity function defined in Equation (1) as follows:
$$S_{\mathcal{G}}(A, B) = \int \min\{\hat{f}^{(a)}(x), \hat{f}^{(b)}(x)\}\,dx, \quad (3)$$
where
$$\hat{f}^{(u)}(x) = \frac{1}{hN_u} \sum_{i}^{N_u} n_u(x^u_i)\, K\left(\frac{x^u_i - x}{h}\right), \quad (4)$$
with $u \in \{a, b\}$. In the estimations of this paper, the bandwidth $h$ is
selected based on the normal reference rule [26] to minimize the mean integrated
squared error. The only exception is the cases illustrated in Fig. 3, where a
higher value has been chosen to obtain smoother curves for a better illustration.
Using Equations (3) and (4) to estimate the similarity between pairs of
detection methods on a large dataset helps us discover different behaviors of
community detection methods. Since the accuracy of the estimator depends on the
networks of the dataset that we analyze, the result is obviously relative to
that dataset. However, a large and representative corpus would help to reduce
this dependency.
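As an illustration of the two similarity functions, the following minimal Python sketch contrasts the exact-size overlap of Equation (1) with the kernel-based overlap of Equations (3) and (4) on interlaced sizes. Several choices here are assumptions and are not taken from the paper: the function names are hypothetical, the leading constant of Equation (1) is exposed as the parameter `prefactor`, the normal reference rule is taken in its common Gaussian form h = 1.06 σ n^(-1/5), each method gets its own bandwidth, and the integral is approximated by the trapezoidal rule on a grid padded by four bandwidths. The bandwidth rule also assumes at least two distinct sizes per method.

```python
import math
from collections import Counter


def exact_size_overlap(sizes_a, sizes_b, prefactor=0.5):
    """Double sum of Equation (1); `prefactor` is the leading constant (assumed 1/2)."""
    n_a, n_b = Counter(sizes_a), Counter(sizes_b)
    total_a, total_b = len(sizes_a), len(sizes_b)
    shared = set(n_a) & set(n_b)  # delta(x_i, x_j) = 1 only when the sizes are equal
    return prefactor * sum(min(n_a[x] / total_a, n_b[x] / total_b) for x in shared)


def normal_reference_bandwidth(sizes):
    """Assumed rule of thumb h = 1.06 * sigma * n^(-1/5) for a Gaussian kernel."""
    n = len(sizes)
    mean = sum(sizes) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in sizes) / (n - 1))
    return 1.06 * sigma * n ** (-0.2)


def gaussian_kde(sizes, h):
    """Density estimate in the form of Equation (4): multiplicity-weighted kernels."""
    counts = Counter(sizes)  # n_u(x_i) for each distinct size x_i
    norm = 1.0 / (h * len(sizes) * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(n * math.exp(-0.5 * ((xi - x) / h) ** 2)
                          for xi, n in counts.items())

    return density


def kde_overlap(sizes_a, sizes_b, grid_points=2000):
    """Equation (3): area under min(f_a, f_b), integrated with the trapezoidal rule."""
    h_a = normal_reference_bandwidth(sizes_a)
    h_b = normal_reference_bandwidth(sizes_b)
    f_a, f_b = gaussian_kde(sizes_a, h_a), gaussian_kde(sizes_b, h_b)
    # Integrate over the observed range padded by four bandwidths on each side.
    lo = min(min(sizes_a), min(sizes_b)) - 4.0 * max(h_a, h_b)
    hi = max(max(sizes_a), max(sizes_b)) + 4.0 * max(h_a, h_b)
    step = (hi - lo) / (grid_points - 1)
    values = [min(f_a(lo + k * step), f_b(lo + k * step)) for k in range(grid_points)]
    return step * (sum(values) - 0.5 * (values[0] + values[-1]))


# Toy data: method A only finds even sizes, method B only odd sizes (interlaced).
sizes_a = list(range(10, 60, 2))
sizes_b = list(range(11, 61, 2))
print(exact_size_overlap(sizes_a, sizes_b))  # no size is shared by both methods
print(kde_overlap(sizes_a, sizes_b))         # the smoothed densities still overlap
```

On this toy input the exact-size score is zero because no size appears in both lists, whereas the smoothed score stays close to one, which is the behavior that motivates replacing Equation (1) by Equation (3).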
3 Community detection methods
A community is roughly described as a group of nodes in a graph with more
edges connecting them together than edges connecting the community with the
rest of the graph [5]. However, in practice, this concept is formulated
mathematically or algorithmically in different ways, engendering various
discovery approaches. In this paper, we select a representative set of
state-of-the-art and widely studied detection methods whose approaches span
those most commonly used in the literature. These methods are summarized in
Table 1 with corresponding information. Their approaches can be briefly
summarized as follows:
– Edge removal: In this approach, inter-community edges in a network are
gradually removed in order to disconnect densely connected groups. The
problem of community detection is translated into identifying candidates for
inter-community edges based on their topological positions. Popular techniques
include using edge betweenness centrality (GN in Table 1) or the edge clustering
coefficient, which can be based on triangular (RCCLP-3) or quadrangular
(RCCLP-4) patterns.
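The edge-removal idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation evaluated in the paper: it uses the triangle-based edge clustering coefficient in the form (z + 1) / min(k_u - 1, k_v - 1), where z is the number of triangles containing the edge, removes the lowest-scoring edge, recomputes the scores, and simply stops at the first split instead of applying a proper community criterion. All function names are hypothetical, and nodes are assumed to be mutually comparable (for example integers).

```python
from collections import deque


def connected_components(adj):
    """Connected components of an undirected graph given as {node: set_of_neighbours}."""
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, part = deque([start]), {start}
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    part.add(v)
                    queue.append(v)
        parts.append(part)
    return parts


def edge_clustering_coefficient(adj, u, v):
    """Triangle-based coefficient (z + 1) / min(k_u - 1, k_v - 1) of the edge (u, v)."""
    triangles = len(adj[u] & adj[v])  # common neighbours close a triangle on (u, v)
    possible = min(len(adj[u]) - 1, len(adj[v]) - 1)
    return float("inf") if possible == 0 else (triangles + 1) / possible


def remove_edges_until_split(edges):
    """Remove the lowest-coefficient edge repeatedly until the graph gains a component."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    initial = len(connected_components(adj))
    while len(connected_components(adj)) == initial:
        candidates = [(u, v) for u in adj for v in adj[u] if u < v]
        if not candidates:  # no edge left to remove
            break
        # Scores are recomputed on the current graph at every iteration.
        u, v = min(candidates, key=lambda e: edge_clustering_coefficient(adj, *e))
        adj[u].discard(v)
        adj[v].discard(u)
    return connected_components(adj)


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
parts = remove_edges_until_split(edges)
print(sorted(sorted(part) for part in parts))  # → [[0, 1, 2], [3, 4, 5]]
```

The bridge belongs to no triangle, so it receives the lowest coefficient (0.5 against 2 for every triangle edge) and is removed first, which separates the two dense groups.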

Citations
Journal ArticleDOI
TL;DR: The aim of CDlib is to allow easy and standardized access to a wide variety of network clustering algorithms, to evaluate and compare the results they provide, and to visualize them.
Abstract: Community Discovery is among the most studied problems in complex network analysis. During the last decade, many algorithms have been proposed to address such task; however, only a few of them have been integrated into a common framework, making it hard to use and compare different solutions. To support developers, researchers and practitioners, in this paper we introduce a python library - namely CDlib - designed to serve this need. The aim of CDlib is to allow easy and standardized access to a wide variety of network clustering algorithms, to evaluate and compare the results they provide, and to visualize them. It notably provides the largest available collection of community detection implementations, with a total of 39 algorithms.

83 citations

Journal ArticleDOI
TL;DR: Experiments show that the proposed community detection algorithm based on influential nodes (LGIEM) is able to detect communities efficiently, and achieves better performance compared to other recent methods.

77 citations

Journal ArticleDOI
TL;DR: This paper provides comprehensive analyses on computation time, community size distribution, a comparative evaluation of methods according to their optimization schemes as well as a comparison of their partitioning strategy through validation metrics, and proposes ways to classify community detection methods.
Abstract: Discovering community structure in complex networks is a mature field since a tremendous number of community detection methods have been introduced in the literature. Nevertheless, it is still very challenging for practitioners to determine which method would be suitable to get insights into the structural information of the networks they study. Many recent efforts have been devoted to investigating various quality scores of the community structure, but the problem of distinguishing between different types of communities is still open. In this paper, we propose a comparative, extensive, and empirical study to investigate what types of communities many state-of-the-art and well-known community detection methods are producing. Specifically, we provide comprehensive analyses on computation time, community size distribution, a comparative evaluation of methods according to their optimization schemes as well as a comparison of their partitioning strategy through validation metrics. We process our analyses on a very large corpus of hundreds of networks from five different network categories and propose ways to classify community detection methods, helping a potential user to navigate the complex landscape of community detection.

38 citations


Cites background from "Estimating the Similarity of Commun..."

  • ...A very naive but efficient approach to evaluate the similarity of two methods is to inquire into the “closeness” of the two corresponding community size distributions (Dao et al., 2018b)....

    [...]

  • ...This is due to the fact that in some highly local centralized networks having star-like structures (Dao et al., 2018a), they have a tendency to remove edges connecting hub and peripheral nodes and create singletons (single node community)....

    [...]

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a community strength index based on ECG results to quantify the presence of community structure in a graph and applied ECG to community-aware anomaly detection.
Abstract: We recently proposed a new ensemble clustering algorithm for graphs (ECG) based on the concept of consensus clustering. In this paper, we provide experimental evidence to the claim that ECG alleviates the well-known resolution limit issue, and that it leads to better stability of the partitions. We propose a community strength index based on ECG results to help quantify the presence of community structure in a graph. We perform a wide range of experiments both over synthetic and real graphs, showing the usefulness of ECG over a variety of problems. In particular, we consider measures based on node partitions as well as topological structure of the communities, and we apply ECG to community-aware anomaly detection. Finally, we show that ECG can be used in a semi-supervised context to zoom in on the sub-graph most closely associated with seed nodes.

15 citations

Journal ArticleDOI
TL;DR: In this article, a comparative, extensive and empirical study to investigate what types of communities many state-of-the-art and well-known community detection methods are producing is presented.
Abstract: Discovering community structure in complex networks is a mature field since a tremendous number of community detection methods have been introduced in the literature. Nevertheless, it is still very challenging for practitioners to determine which method would be suitable to get insights into the structural information of the networks they study. Many recent efforts have been devoted to investigating various quality scores of the community structure, but the problem of distinguishing between different types of communities is still open. In this paper, we propose a comparative, extensive and empirical study to investigate what types of communities many state-of-the-art and well-known community detection methods are producing. Specifically, we provide comprehensive analyses on computation time, community size distribution, a comparative evaluation of methods according to their optimisation schemes as well as a comparison of their partitioning strategy through validation metrics. We process our analyses on a very large corpus of hundreds of networks from five different network categories and propose ways to classify community detection methods, helping a potential user to navigate the complex landscape of community detection.

13 citations

References
BookDOI
01 Jan 1986
TL;DR: The Kernel Method for Multivariate Data: Three Important Methods and Density Estimation in Action.
Abstract: Introduction. Survey of Existing Methods. The Kernel Method for Univariate Data. The Kernel Method for Multivariate Data. Three Important Methods. Density Estimation in Action.

15,499 citations


"Estimating the Similarity of Commun..." refers methods in this paper

  • ...In the estimations of this paper, the bandwidth h is selected based on the normal reference rule [26] to minimize the mean integrated squared error....

    [...]

Journal ArticleDOI
TL;DR: This work proposes a heuristic method that is shown to outperform all other known community detection methods in terms of computation time and the quality of the communities detected is very good, as measured by the so-called modularity.
Abstract: We propose a simple method to extract the community structure of large networks. Our method is a heuristic method that is based on modularity optimization. It is shown to outperform all other known community detection methods in terms of computation time. Moreover, the quality of the communities detected is very good, as measured by the so-called modularity. This is shown first by identifying language communities in a Belgian mobile phone network of 2.6 million customers and by analyzing a web graph of 118 million nodes and more than one billion links. The accuracy of our algorithm is also verified on ad-hoc modular networks.

13,519 citations


Additional excerpts

  • ...[1] Louvain Authors(3) Newman [13] SN igraph(1)...

    [...]

Journal ArticleDOI
TL;DR: It is demonstrated that the algorithms proposed are highly effective at discovering community structure in both computer-generated and real-world network data, and can be used to shed light on the sometimes dauntingly complex structure of networked systems.
Abstract: We propose and study a set of algorithms for discovering community structure in networks-natural divisions of network nodes into densely connected subgroups. Our algorithms all share two definitive features: first, they involve iterative removal of edges from the network to split it into communities, the edges removed being identified using any one of a number of possible "betweenness" measures, and second, these measures are, crucially, recalculated after each removal. We also propose a measure for the strength of the community structure found by our algorithms, which gives us an objective metric for choosing the number of communities into which a network should be divided. We demonstrate that our algorithms are highly effective at discovering community structure in both computer-generated and real-world network data, and show how they can be used to shed light on the sometimes dauntingly complex structure of networked systems.

12,882 citations


"Estimating the Similarity of Commun..." refers background or methods in this paper

  • ...Although showing a high similarity with traditional unsupervised data clustering, community detection techniques have just been becoming prosperous in the last two decades remarked by the invention of modularity [14] and the availability of a large volume of networks from small scale to very large scale thanks to the development of Internet and notably social platforms....

    [...]

  • ...Edge removal Girvan-Newman [14] GN igraph(1) Radicchi et al....

    [...]

  • ...– Modularity optimization: Methods in this approach use a common objective function called modularity [14], but have different optimization strategies....

    [...]

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a simple method to extract the community structure of large networks based on modularity optimization, which is shown to outperform all other known community detection methods in terms of computation time.
Abstract: We propose a simple method to extract the community structure of large networks. Our method is a heuristic method that is based on modularity optimization. It is shown to outperform all other known community detection methods in terms of computation time. Moreover, the quality of the communities detected is very good, as measured by the so-called modularity. This is shown first by identifying language communities in a Belgian mobile phone network of 2 million customers and by analysing a web graph of 118 million nodes and more than one billion links. The accuracy of our algorithm is also verified on ad hoc modular networks.

11,078 citations

Journal ArticleDOI
TL;DR: In this article, the modularity of a network is expressed in terms of the eigenvectors of a characteristic matrix for the network, which is then used for community detection.
Abstract: Many networks of interest in the sciences, including social networks, computer networks, and metabolic and regulatory networks, are found to divide naturally into communities or modules. The problem of detecting and characterizing this community structure is one of the outstanding issues in the study of networked systems. One highly effective approach is the optimization of the quality function known as “modularity” over the possible divisions of a network. Here I show that the modularity can be expressed in terms of the eigenvectors of a characteristic matrix for the network, which I call the modularity matrix, and that this expression leads to a spectral algorithm for community detection that returns results of demonstrably higher quality than competing methods in shorter running times. I illustrate the method with applications to several published network data sets.

10,137 citations

Frequently Asked Questions (5)
Q1. What are the contributions mentioned in the paper "Estimating the similarity of community detection methods based on cluster size distribution" ?

In this paper, the authors propose a novel approach to estimate the similarity between community detection methods using the size density distributions of communities that they detect. This distinction plays an important role in assisting network analysts to identify their rule-of-thumb solutions. 

The number of edges increases linearly with the number of nodes, with equivalent rates among categories, as can be deduced from the gradients of the linear estimates.

In this paper, the authors are interested in estimating pairwise similarity of community detection methods based on expected community size. 

Popular techniques include using edge betweenness centrality (GN in Table 1) or edge clustering coefficient, which could be based on triangular (RCCLP-3) or quadrangular (RCCLP-4) patterns. 

In fact, it seems that the distributions illustrated in Fig. 3 have a tendency to move to the left hand side if more small scale networks are involved and inversely, to the right hand side if large scale networks are more implicated.