
Statistical clustering of temporal networks through a dynamic stochastic block model

Q: What are the contributions in "Statistical clustering of temporal networks through a dynamic stochastic block model" ?

The authors' approach, motivated by the importance of controlling for label switching issues across the different time steps, focuses on detecting groups characterized by a stable within-group connectivity behavior. The authors study identifiability of the model parameters and propose an inference procedure based on a variational expectation maximization algorithm, as well as a model selection criterion for the number of groups. They carefully discuss their initialization strategy, which plays an important role in the method, and compare their procedure with existing ones on synthetic datasets. They also illustrate their approach on dynamic contact networks: one of encounters among high school students and two others on animal interactions.

Catherine Matias¹, Vincent Miele¹

Centre national de la recherche scientifique¹

01 Sep 2017-Journal of The Royal Statistical Society Series B-statistical Methodology (Wiley/Blackwell (10.1111))-Vol. 79, Iss: 4, pp 1119-1141

TL;DR: This work explores statistical properties and frequentist inference in a model that combines a stochastic block model for its static part with independent Markov chains for the evolution of the nodes groups through time and proposes an inference procedure based on a variational expectation–maximization algorithm.


Abstract: Statistical node clustering in discrete time dynamic networks is an emerging field that raises many challenges. Here, we explore statistical properties and frequentist inference in a model that combines a stochastic block model (SBM) for its static part with independent Markov chains for the evolution of the nodes groups through time. We model binary data as well as weighted dynamic random graphs (with discrete or continuous edges values). Our approach, motivated by the importance of controlling for label switching issues across the different time steps, focuses on detecting groups characterized by a stable within group connectivity behavior. We study identifiability of the model parameters, propose an inference procedure based on a variational expectation maximization algorithm as well as a model selection criterion to select for the number of groups. We carefully discuss our initialization strategy which plays an important role in the method and compare our procedure with existing ones on synthetic datasets. We also illustrate our approach on dynamic contact networks, one of encounters among high school students and two others on animal interactions. An implementation of the method is available as an R package called dynsbm.


Summary


1. Introduction

Statistical network analysis has become a major field of research, with applications as diverse as sociology, ecology, biology, internet, etc.
While static approaches have been developed as early as in the 60’s (mostly in the field of sociology), the literature concerning dynamic models is much more recent.
An important part of the literature on static network analysis is dedicated to clustering methods, with both aims of taking into account the intrinsic heterogeneity of the data and summarizing this data through node classification.
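Label switching between two successive time steps cannot be solved without an extra assumption, for example that most of the nodes do not change group between them; to the authors' knowledge, this kind of assumption has never been discussed in the literature.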
The authors are interested in statistical models for discrete time dynamic random graphs, with the aim of providing a node classification varying with time, while controlling for label switching issues across the different time steps.

2.2. Varying connectivity parameters vs varying group membership

The authors give some intuition on why it is not possible to let both connectivity parameters and group membership vary through time without entering into label switching issues between time steps.
Thus it is not always possible to label the groups so that between two successive time steps, estimation of the transition parameters would be constrained to have large diagonal elements.
In the toy example of Figure 2, this corresponds to the first interpretation rather than the second.
Other choices could be made and the authors believe that this one is particularly suited to model social networks or contact data where the groups are defined as structures exhibiting a stable within group connectivity behavior and individuals may change groups through time (see Section 5 for applications on real datasets).

2.3. Parameters identifiability

This property is not satisfactory: clustering in a model whose SBM part is only locally identifiable does not give a picture of how the groups evolve across time.
The authors stress that this example implies that dynamic affiliation SBM (or planted partition model) does not have identifiable parameters and groups may not be recovered consistently across time.
This is an important point as previous authors have tried to recover groups from this type of synthetic datasets and evaluated their estimated classification in a non natural way.
The authors prove below that these constraints, combined with the same conditions used for identifiability in the static case, are sufficient to ensure identifiability of the parametrization in their dynamic setup.
In particular for the binary case, assuming that the matrix of Bernoulli parameters β has distinct rows is a generic constraint (meaning that it removes a subset of zero Lebesgue measure of the parameter set).

3.1. General description

As usual with latent variables, the log-likelihood logPθ(Y) contains a sum over all possible latent configurations Z and thus may not be computed except for small values of N and T .
A classical solution is to rely on expectation-maximisation (EM) algorithm (Dempster et al., 1977), an iterative procedure that finds local maxima of the log-likelihood.
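The E-step of EM requires the conditional distribution of the latent variables Z given the observations Y. This distribution does not have a factored form and thus may not be computed efficiently, which is why a variational approximation (VEM) is used instead.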
The authors refer to the review by Matias and Robin (2014) for more details about VEM algorithm (in particular a presentation of EM viewed as a special instance of VEM) and its comparison to other estimation procedures in SBM.
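As a rough illustration of why the exact log-likelihood is out of reach, the sketch below evaluates it by brute force for a binary dynamic SBM, enumerating all Q^(NT) latent configurations. It is only a sketch under stated assumptions: the function name `exact_log_likelihood` and the nested-list layout of `Y`, `alpha`, `pi` and `beta` are hypothetical choices for this example, not part of the authors' dynsbm package.

```python
import itertools
import math


def exact_log_likelihood(Y, alpha, pi, beta):
    """Brute-force log P_theta(Y) for a binary dynamic SBM (illustration only).

    Y[t][i][j] in {0, 1} is the symmetric adjacency matrix at time t,
    alpha[q] the initial (stationary) distribution of the group chains,
    pi[q][r] their transition matrix, and beta[t][q][l] the connection
    probability between groups q and l at time t. Layout is hypothetical.
    """
    T, N, Q = len(Y), len(Y[0]), len(alpha)
    total = 0.0
    # Enumerate every latent configuration: Q ** (N * T) terms.
    for flat in itertools.product(range(Q), repeat=N * T):
        Z = [flat[t * N:(t + 1) * N] for t in range(T)]  # Z[t][i]
        p = 1.0
        # Independent Markov chains on the node memberships.
        for i in range(N):
            p *= alpha[Z[0][i]]
            for t in range(1, T):
                p *= pi[Z[t - 1][i]][Z[t][i]]
        # Conditionally independent Bernoulli edges at each time step.
        for t in range(T):
            for i in range(N):
                for j in range(i + 1, N):
                    b = beta[t][Z[t][i]][Z[t][j]]
                    p *= b if Y[t][i][j] == 1 else 1.0 - b
        total += p
    return math.log(total)


# Tiny example: N = 3 nodes, T = 2 time steps, Q = 2 groups.
alpha = [0.5, 0.5]                      # stationary for the symmetric pi below
pi = [[0.9, 0.1], [0.1, 0.9]]
beta = [[[0.8, 0.2], [0.2, 0.6]]] * 2   # same connectivity at both time steps
Y = [[[0, 1, 0], [1, 0, 1], [0, 1, 0]],
     [[0, 1, 1], [1, 0, 0], [1, 0, 0]]]
print(exact_log_likelihood(Y, alpha, pi, beta))
```

With N = 3, T = 2 and Q = 2 the loop already visits 2^6 = 64 configurations per evaluation, and the count grows as Q^(NT), which is what motivates EM-type approximations.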

3.2. Algorithm initialization

All EM based procedures look for local maxima of their objective function and careful initialization is a key in their success.
For static SBM, VEM procedures often rely on a k-means algorithm on the adjacency matrix to obtain an initial clustering of the individuals.
In their context, the dynamic aspect of the data needs to be properly handled.
As a result, their initial clustering of the individuals is constant across time (namely Zti does not depend on t).
The authors' initialization performs well in these cases.
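One possible way to obtain a time-constant initial clustering is sketched below: each node is described by its adjacency rows at all time steps placed side by side, and a plain k-means is run once on these profiles. This is an assumption-laden illustration, not the authors' exact procedure: the concatenation step, the pure-Python `kmeans` helper and the name `initial_clustering` are hypothetical.

```python
import random


def kmeans(rows, k, n_iter=50, seed=0):
    """Plain Lloyd k-means on a list of equal-length numeric rows."""
    rng = random.Random(seed)
    centers = [list(r) for r in rng.sample(rows, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest center in squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(r, centers[c])))
            for r in rows
        ]
        if new_labels == labels:  # assignments are stable: stop
            break
        labels = new_labels
        # Update step: each center becomes the mean of its assigned rows.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # keep the previous center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def initial_clustering(Y, k, seed=0):
    """Cluster nodes once from their connectivity profiles at all time steps.

    Y[t][i][j] is the adjacency matrix at time t (hypothetical layout).
    Returns Z with Z[t][i] identical for every t.
    """
    N = len(Y[0])
    # Row i concatenates node i's adjacency rows over all time steps.
    rows = [[Y[t][i][j] for t in range(len(Y)) for j in range(N)]
            for i in range(N)]
    labels = kmeans(rows, k, seed=seed)
    # The same label is used at every time step: Z_i^t does not depend on t.
    return [list(labels) for _ in Y]
```

Because k-means itself depends on its starting centers, trying several seeds and keeping the best run is a common precaution.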

4. Synthetic experiments

The methods presented in this manuscript are implemented in an R package available at http://lbbe.univ-lyon1.fr/dynsbm.
The complexity of the estimation algorithm is O(TQ²N²).

4.1. Clustering performances

The authors explore the performances of their method for clustering the nodes across the different time steps.
As for the Bernoulli parameters β, the authors explore 4 different cases (see Table 4.1) representing different difficulty levels, plus a specific example of affiliation for which they recall that parameters are not identifiable in the dynamic setting.
For each combination of (π,β), the authors generate 100 datasets, estimate their parameters, cluster their nodes and report in Figure 3 boxplots of a global and of an averaged ARI value.
The authors believe that this is due to the initialization of their procedure: with T = 10 time points, it is more likely that the group memberships differ from their initial value.
The authors note that the authors of the competing method do not discuss initialization and simply propose to start from a random partition of the nodes, which proves to be a bad strategy.
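The difference between a global and an averaged ARI can be illustrated with a small sketch. It assumes that the global ARI compares the two classifications of all N×T (node, time) pairs at once, while the averaged ARI is the mean of the T per-time-step ARI values; the helper names are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(a, b):
    """Adjusted Rand index between two labelings of the same items (n >= 2)."""
    n = len(a)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


def averaged_ari(true_Z, est_Z):
    """Mean over time steps of the ARI computed at each fixed time step."""
    scores = [adjusted_rand_index(t, e) for t, e in zip(true_Z, est_Z)]
    return sum(scores) / len(scores)


def global_ari(true_Z, est_Z):
    """ARI computed once on all (time, node) pairs pooled together."""
    flat_true = [z for row in true_Z for z in row]
    flat_est = [z for row in est_Z for z in row]
    return adjusted_rand_index(flat_true, flat_est)


# Z[t][i]: two time steps, six nodes, two groups.
true_Z = [[0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 1]]
# Same partition at each step, but labels are swapped at the second step.
est_Z = [[0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0]]
print(averaged_ari(true_Z, est_Z), global_ari(true_Z, est_Z))
```

In this toy case the estimated labels are swapped at the second time step only. Each time step taken alone is perfectly clustered, so the averaged ARI equals 1, whereas the global ARI is about -0.1 because it penalizes the label switching between time steps.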

4.2. Model selection

The authors generate 100 datasets under this model and estimate the number of groups relying on ICL criterion.
The authors observe that the correct number of groups is recovered in 88% of the cases (left panel).
Moreover, the right panel shows that when ICL selects only 3 groups, ARI of the classification with 4 groups is rather low (less than 80%).
This shows that in those cases, classification with 4 groups is not the correct one, so that VEM algorithm seems responsible for bad results (optimum has not been reached) more than the penalization term.

5.1. Encounters among high school students

Describing face-to-face contacts in a population (in their case, a classroom) can play an important role in 1/ understanding if there is a peculiar non-random mixing of individuals that would be a sign for a social organization and 2/ predicting how infectious diseases can spread, by studying the crosslink between the contacts dynamics and the disease dynamics.
Interaction times were aggregated by day to form a sequence of 4 different networks.
The method selects Q = 4 groups and the authors now present the results obtained with their model fitted with Q = 4 groups.
The authors observe that groups 2 and 3 are composed of students who are likely to interact together.
Group 4 displays a similar pattern of community structure, with much less interaction (intermediate value of β̂44) but also a significant level of interaction with group 2.

A. Counter example of identifiability when groups memberships and connectivity parameters vary freely

Here, the authors exhibit an example where the parameters are non identifiable when both groups memberships and connectivity parameters may vary across time without any constraint.
Here Id denotes the size-two identity matrix, and (βt, γt) = (β, γ) are chosen constant in t.
Finally, the across group parameter is not modified through time and the authors set φ̃t12 = φ12.
Thus the two parameters θ, θ̃ are not equal up to label switching while they produce the same distribution on the observations.

B. Non identifiability in affiliation case (planted partition)

Identifying the whole parameters from a binary affiliation SBM is a difficult task, as may be seen for instance by the many different but always partial results obtained by Allman et al. (2011).
In their Corollary 7, Allman et al. establish that when group proportions are known, the parameters βin (:= βqq for all q) and βout (:= βql for all q ≠ l) of a binary affiliation static SBM are identifiable.
This may be seen for instance from the example constructed in Section A that remains valid in the affiliation case.
While static affiliation often relies on an assumption of equal group proportions, there is no simple parallel situation for the transition matrix π in the dynamic case (the trivial assumption π = …).
Empirical evidence for label switching between time steps in the affiliation setup is given in Section 4 of the main manuscript.

D. Estimation of γ and model selection: specific examples

The M-step equations concerning γ differ depending on the specific choice of the parametric family {f(·, γ), γ ∈ Γ}. Remember that the resulting conditional distribution on the observations is a mixture between an element from this family and the Dirac mass at zero.
The authors also provide expressions for ICL criterion in these different setups.
The parameter θ reduces to (π,β) for which updating expressions at the M-step have already been given (see Proposition 2).
These equations remain valid when considering a set of disjoint bins {Im}m instead of pointwise values {am}m.

E. Extension to varying number of nodes

However in real data applications it may happen that some actors enter or leave the study during the analysis.
Note that the whole chain Zi is not stationary anymore.
As such, a node that would not be present at each time point contributes to the likelihood only through the part of the trajectory where it is present.
Generalization of their VEM algorithm easily follows.


Figures (9)

Figure 1. Dependency structures of the model. Top: general view corresponding to hidden Markov model (HMM) structure; Middle: details on latent structure organization corresponding to N different iid Markov chains Zi = (Zti )1≤t≤T across individuals; Bottom: details for fixed time point t corresponding to SBM structure.

Figure 4. Estimation of the number of groups via ICL criterion. Left panel shows the frequency of the selected number of groups. Right panel shows ARI of the classification obtained with 4 groups depending on the selected number of groups.

Figure 2. Connectivity parameters or group membership variation: a toy example.

Figure 3. Boxplots of global ARI (white, left) and averaged ARI (grey, right) in different setups. From left to right: the three panels correspond to π = πlow (panels A,D), πmedium (panels B,E) and πhigh (panels C,F), respectively. In each panel, from left to right: results corresponding to β = low−, low+,medium−,medium+ and affiliation case, respectively. First row: T = 5 time points, second row: T = 10.

Figure 7. Summary of the interaction parameters β̂ and γ̂ estimated by our model with Q = 4 groups on the dataset of sparrows (left panel, Shizuka et al., 2014) and Q = 3 groups on dataset of onagers (right panel, Rubenstein et al., 2015), respectively. Same principle as in Figure 5.

Table 1. Bernoulli parameter values in 4 different cases, plus an affiliation example.

Figure 8. Alluvial plot showing the dynamics of the group membership estimated by our model on the datasets of interactions between 69 sparrows (left, Shizuka et al. (2014)) and 23 onagers (right, Rubenstein et al. (2015)) respectively. Same principle as in Figure 6 (with Si-k and Mi-k indicating group k at Season or Month i). A fake group (group 0) gathers absent animals at a specific time step and fuzzy fluxes represent arrival/departure to/from a group from/to group 0.

Figure 6. Alluvial plot showing the dynamics of the group membership estimated by our model on the dataset of interactions in the 'PC' class (Fournet and Barrat, 2014). Each line is a flux that represents the move of one or more students from a group to another group (Di-k indicates group k for day i). The thickness of the lines is proportional to the number of students and the total height represents the 27 students.

Figure 5. Summary of the interaction parameters β̂ and γ̂ estimated by our model with Q = 4 groups on the dataset of interactions in the 'PC' class (Fournet and Barrat, 2014). In each cell (q, l) with 1 ≤ q ≤ l ≤ 4, there are T = 4 barplots corresponding to the 4 measurements (Tuesday to Friday). Each barplot represents the distribution of the parameter γtql for the three categories of interaction frequency (low, medium and high). The width of each barplot is proportional to the sparsity parameter βtql. We recall that when considering the diagonal cells (q, q), parameters do not depend on t anymore.


HAL Id: hal-01167837

https://hal.archives-ouvertes.fr/hal-01167837v2

Submitted on 5 Feb 2016



To cite this version:

Catherine Matias, Vincent Miele. Statistical clustering of temporal networks through a dynamic stochastic block model. Journal of the Royal Statistical Society: Series B, Royal Statistical Society, 2017, 79 (4), pp.1119-1141. 10.1111/rssb.12200. hal-01167837v2

Statistical clustering of temporal networks through a dynamic stochastic block model

Catherine Matias†

Sorbonne Universités, UPMC Univ Paris 06, Univ Paris Diderot, Sorbonne Paris Cité, CNRS, Laboratoire de Probabilités et Modèles Aléatoires (LPMA), 75005 Paris, France.

Vincent Miele

Université de Lyon, F-69000 Lyon; Université Lyon 1; CNRS, UMR5558, Laboratoire de Biométrie et Biologie Evolutive, F-69622 Villeurbanne, France.

Abstract. Statistical node clustering in discrete time dynamic networks is an emerging field that raises many challenges. Here, we explore statistical properties and frequentist inference in a model that combines a stochastic block model (SBM) for its static part with independent Markov chains for the evolution of the nodes groups through time. We model binary data as well as weighted dynamic random graphs (with discrete or continuous edges values). Our approach, motivated by the importance of controlling for label switching issues across the different time steps, focuses on detecting groups characterized by a stable within group connectivity behavior. We study identifiability of the model parameters, propose an inference procedure based on a variational expectation maximization algorithm as well as a model selection criterion to select for the number of groups. We carefully discuss our initialization strategy which plays an important role in the method and compare our procedure with existing ones on synthetic datasets. We also illustrate our approach on dynamic contact networks, one of encounters among high school students and two others on animal interactions. An implementation of the method is available as an R package called dynsbm.

Keywords: contact network, dynamic random graph, graph clustering, stochastic block model, variational expectation maximization

1. Introduction

Statistical network analysis has become a major field of research, with applications as diverse as sociology, ecology, biology, internet, etc. General references on statistical modeling of random graphs include the recent book by Kolaczyk (2009) and the two reviews by Goldenberg et al. (2010) and Snijders (2011). While static approaches have been developed as early as in the 60's (mostly in the field of sociology), the literature concerning dynamic models is much more recent. Modeling discrete time dynamic networks is an emerging field that raises many challenges and we refer to Holme (2015) for a most recent review.

An important part of the literature on static network analysis is dedicated to clustering methods, with both aims of taking into account the intrinsic heterogeneity of the data and summarizing this data through node classification. Among clustering approaches, community detection methods form a smaller class of methods that aim at finding groups of highly connected nodes. Our focus here is not only on community detection but more generally on node classification based on connectivity behaviors, with a particular interest on model-based approaches (see e.g. Matias and Robin, 2014). When considering a sequence of snapshots of a network at different time steps, these static clustering approaches will give rise to classifications that are difficult to compare through time and thus difficult to interpret. An important thing to note is that label switching between two successive time steps may not be solved without an extra assumption e.g. that most of the nodes do not change group across two different time steps. However to our knowledge, this kind of assumption has never been discussed in the literature. In this work, we are interested in statistical models for discrete time dynamic random graphs, with the aim of providing a node classification varying with time, while controlling for label switching issues across the different time steps. Our answer to this challenge will be to focus on the detection of groups characterized by a stable within group connectivity behavior. We believe that this is particularly suited for dynamic contact networks.

†Corresponding author.

Stochastic block models (SBM) form a widely used class of statistical (and static) random graphs models that provide a clustering of the nodes. SBM introduces latent (i.e. unobserved) random variables on the nodes of the graph, taking values in a finite set. These latent variables represent the nodes groups and interaction between two nodes is governed by these corresponding groups. The model includes (but is not restricted to) the specific case of community detection, where within groups connections have higher probability than across groups ones. Combining SBM with a Markov structure on the latent part of the process (the nodes classification) is a natural way of ensuring a smooth evolution of the groups across time and has already been considered in the literature. In Yang et al. (2011), the authors consider undirected, either binary or finitely valued, discrete time dynamic random graphs. The static aspect of the data is handled through SBM, while its dynamic aspect is as follows. For each node, its group membership forms a Markov chain, independent of the values of the other nodes memberships. There, only the group membership is allowed to vary across time while connectivity parameters among groups stay constant through time. The authors propose a method to infer these parameters (either online or offline), based on a combination of Gibbs sampling and simulated annealing. For binary random graphs, Xu and Hero (2014) propose to introduce a state-space model through time on (the logit transform of) the probability of connection between groups. Contrary to the previous work, both group membership and connectivity parameters across groups may vary through time. As such, we will see below that this model has a strong identifiability problem. Their (online) iterative estimation procedure is based on alternating two steps: a label-switching method to explore the space of node groups configuration, and the (extended) Kalman filter that optimizes the likelihood when the groups memberships are known. Note that neither Yang et al. (2011) nor Xu and Hero (2014) propose to infer the number of clusters. Bayesian variants of these dynamic SB models may be found for instance in Ishiguro et al. (2010); Herlau et al. (2013).

Surprisingly, we noticed that the above mentioned methods were evaluated on synthetic datasets in terms of averaged value over the time steps of a clustering quality index computed at fixed time step. Naturally, those indexes do not penalize for label switching and two classifications that are identical up to a permutation have the highest quality index value. Computing an index for each time step, the label switching issue between different time steps disappears and the classification task becomes easier. Indeed, such criteria do not control for a smoothed recovery of groups along different time points. It should also be noted that the synthetic experiments from these works were performed under the dynamic version of the binary affiliation SBM, which has non identifiable parameters. The affiliation SBM, also known as planted partition model, corresponds to the case where the connectivity parameter matrix has only two different values: a diagonal one that drives within groups connections and an off-diagonal one for across groups connections. In particular, the label switching issue between different time steps may not be easily removed in this particular case.

Other approaches for model-based clustering of dynamic random graphs do not rely directly on SBM but rather on variants of SBM. We mention the random subgraph model (RSM) that combines SBM with the a priori knowledge of a nodes partition (inducing subgraphs), by authorizing the groups proportions to differ in the different subgraphs. A dynamic version of RSM that builds upon the approach of Xu and Hero (2014) appears in Zreik et al. (2015). Detection of persistent communities has been proposed in Liu et al. (2014) for directed and dynamic graphs of call counts between individuals. Here the static underlying model is a time and degree-corrected SBM with Poisson distribution on the call counts. Groups memberships are independent through time instead of Markov, but smoothness in the classification is obtained by imposing that within groups expected call volumes are constant through time. Inference is performed through a heuristic greedy search in the space of groups memberships. Note that only real datasets and no synthetic experiments have been explored in this latter work.

Another very popular statistical method for analyzing static networks is based on latent space models. Each node is associated to a point in a latent space and probability of connection is higher for nodes whose latent points are closer (Hoff et al., 2002). In Sarkar and Moore (2005), a dynamic version of the latent space model is proposed, where the latent points follow a (continuous state space) Markov chain, with transition kernel given by a Gaussian perturbation on current position with zero mean and small variance. Latent position inference is performed in two steps: a first initial guess is obtained through multi dimensional scaling. Then, nonlinear optimization is used to maximize the model likelihood. The work by Xu and Zheng (2009) is very similar, adding a clustering step on the nodes. Finally, Heaukulani and Ghahramani (2013) rely on Monte Carlo Markov Chain methods to perform a Bayesian inference in a more complicated setup where the latent positions of the nodes are not independent.

Mixed membership models (Airoldi et al., 2008) are also explored in a dynamic context. The work by Xing et al. (2010) relies on a state space model for the evolution of the parameters of the priors of both the mixed membership vector of a node and the connectivity behavior. Inference is carried out through a variational Bayes expectation maximisation (VBEM) algorithm (e.g. Jordan et al., 1999).

This non exhaustive bibliography on model-based clustering methods for dynamic random graphs shows both the importance and the huge interest in the topic.

In the present work, we explore statistical properties and frequentist inference in a model that combines SBM for its static part with independent Markov chains for the evolution of the nodes groups through time. Our approach aims at achieving both interpretability and statistical accuracy. Our setup is very close to the ones of Yang et al. (2011); Xu and Hero (2014), the first and main difference being that we allow for both groups memberships and connectivity parameters to vary through time. By focusing on groups characterized by a stable within group connectivity behavior, we are able to ensure parameter identifiability and thus valid statistical inference. Indeed, while Yang et al. (2011) use the strong constraint of fixed connectivity parameters through time, Xu and Hero (2014) entirely relax this assumption at the (not acknowledged) cost of a label switching issue between time steps. Second, we model binary data as well as weighted random graphs, should they be dense or sparse, with discrete or continuous edges. Third, we propose a model selection criterion to choose the number of clusters. To simplify notation, we restrict our model to undirected random graphs with no self-loops but easy generalizations would handle directed datasets and/or including self-loops.

The manuscript is organized as follows. Section 2.1 describes the model and sets notation. Section 2.2 gives intuition on the identifiability issues raised by authorizing both groups memberships and connectivity parameters to freely vary with time. This was not pointed out by Xu and Hero (2014) although they worked in this context. The section motivates our focus on groups characterized by a stable within group connectivity behavior. Section 2.3 then establishes our identifiability results. To our knowledge, it is the first dynamic random graph model where parameters identifiability (up to label switching) is discussed and established. Then, Section 3 describes a variational expectation maximization (VEM) procedure for inferring the model parameters and clustering the nodes. The VEM procedure works with a fixed number of groups and an Integrated Classification Likelihood (ICL, Biernacki et al., 2000) criterion is proposed for estimating the number of groups. We also discuss initialization of the algorithm - an important but rarely discussed step, in Section 3.2. Synthetic experiments are presented in Section 4. There, we discuss classification performances without neglecting the label switching issue that may occur between time steps. In Section 5, we illustrate our approach with the analysis of real-life contact networks: a dataset of encounters among high school students and two other datasets of animal interactions. We believe that our model is particularly suited to handle this type of data. We mention that the methods are implemented in an R package available at http://lbbe.univ-lyon1.fr/dynsbm and will be soon available on the CRAN. Supplementary Materials (available at the end of this article) complete the main manuscript.

2. Setup and notation

2.1. Model description

We consider weighted interactions between $N$ individuals recorded through time in a set of data matrices $Y = (Y^t)_{1 \le t \le T}$. Here $T$ is the number of time points and for each value $t \in \{1, \dots, T\}$, the adjacency matrix $Y^t = (Y^t_{ij})_{1 \le i \ne j \le N}$ contains real values measuring interactions between individuals $i, j \in \{1, \dots, N\}$. Without loss of generality, we consider undirected random graphs without self-loops, so that $Y^t$ is a symmetric matrix with no diagonal elements.

We assume that the $N$ individuals are split into $Q$ latent (unobserved) groups that may vary through time, as encoded by the random variables $Z = (Z^t_i)_{1 \le t \le T, 1 \le i \le N}$ with values in $\mathcal{Q}^{NT} := \{1, \dots, Q\}^{NT}$. This process is modeled as follows. Across individuals, random variables $(Z_i)_{1 \le i \le N}$ are independent and identically distributed (iid). Now, for each individual $i \in \{1, \dots, N\}$, the process $Z_i = (Z^t_i)_{1 \le t \le T}$ is an irreducible, aperiodic stationary Markov chain with transition matrix $\pi = (\pi_{qq'})_{1 \le q, q' \le Q}$ and initial stationary distribution $\alpha = (\alpha_1, \dots, \alpha_Q)$. When no confusion occurs, we may alternatively consider $Z^t_i$ as a value in $\mathcal{Q}$ or as a random vector $Z^t_i = (Z^t_{i1}, \dots, Z^t_{iQ}) \in \{0, 1\}^Q$ constrained to $\sum_q Z^t_{iq} = 1$.

Given latent groups $Z$, the time varying random graphs $Y = (Y^t)_{1 \le t \le T}$ are independent, the conditional distribution of each $Y^t$ depending only on $Z^t$. Then, for fixed $1 \le t \le T$, random graph $Y^t$ follows a stochastic block model. In other words, for each time $t$, conditional on $Z^t$, random variables $(Y^t_{ij})_{1 \le i < j \le N}$ are independent and the distribution of each $Y^t_{ij}$ only depends on $Z^t_i, Z^t_j$. For now, we assume a very general parametric form for this distribution on $\mathbb{R}$. Following Ambroise and Matias (2012), in order to take into account possible sparse weighted graphs, we explicitly introduce a Dirac mass at 0, denoted by $\delta_0$, as a component of this distribution. More precisely, we assume

$$Y^t_{ij} \mid \{Z^t_{iq} Z^t_{jl} = 1\} \sim (1 - \beta^t_{ql})\,\delta_0(\cdot) + \beta^t_{ql}\, F(\cdot, \gamma^t_{ql}), \qquad (1)$$

where $\{F(\cdot, \gamma), \gamma \in \Gamma\}$ is a parametric family of distributions with no point mass at 0 and densities (with respect to Lebesgue or counting measure) denoted by $f(\cdot, \gamma)$. This could be the Gaussian family with unknown mean and variance, the truncated Poisson family on $\mathbb{N} \setminus \{0\}$ (leading to a 0-inflated or 0-deflated distribution on the edges of the graph), a finite space distribution on $M$ values (a case which comprises nonparametric approximations of continuous distributions through discretization into a finite number of $M$ bins), etc. Note that the binary case is encompassed in this setup with $F(\cdot, \gamma) = \delta_1(\cdot)$: the parametric family of laws is reduced to a single point, the Dirac mass at 1, and the conditional distribution of $Y^t_{ij}$ is simply a Bernoulli $\mathcal{B}(\beta^t_{ql})$. In the following, and in contrast to the 'binary case', we will call 'weighted case' any setup where the set of distributions $F$ is parametrized and not reduced to a single point. Here, the sparsity parameters $\beta^t = (\beta^t_{ql})_{1 \le q, l \le Q}$ satisfy $\beta^t_{ql} \in [0, 1]$, with $\beta^t \equiv 1$ corresponding to the particular case of a complete weighted graph. As a result of considering undirected graphs, the parameters $\beta^t_{ql}, \gamma^t_{ql}$ moreover satisfy $\beta^t_{ql} = \beta^t_{lq}$ and $\gamma^t_{ql} = \gamma^t_{lq}$ for all $1 \le q, l \le Q$. Note that for the moment, SBM parameters may be different across time points. We will go back to this point in the next sections.
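To make the emission mechanism of equation (1) concrete, the following Python sketch draws one snapshot $Y^t$ given the group labels at time $t$, taking $F$ to be Gaussian. The function name `sample_snapshot`, the 0-based group labels and the split of $\gamma$ into separate `mu` and `sigma` arrays are illustrative assumptions, and the diagonal is stored as 0.0 only as a placeholder since the model has no self-loops.

```python
import random


def sample_snapshot(z, beta, mu, sigma, seed=0):
    """Sample one symmetric weighted adjacency matrix given labels z.

    z[i] is the group of node i (0-based). An edge between groups q and l
    is present with probability beta[q][l]; a present edge receives a
    Gaussian weight with mean mu[q][l] and standard deviation sigma[q][l].
    Absent edges and the (unused) diagonal are stored as 0.0.
    """
    rng = random.Random(seed)
    N = len(z)
    Y = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(i + 1, N):
            q, l = z[i], z[j]
            # Mixture of a Dirac mass at 0 (weight 1 - beta) and F (weight beta).
            if rng.random() < beta[q][l]:
                w = rng.gauss(mu[q][l], sigma[q][l])
                Y[i][j] = Y[j][i] = w  # undirected graph: symmetric matrix
    return Y


# Hypothetical two-group setting with dense, heavier within-group edges.
z_example = [0, 0, 1, 1]
beta_example = [[0.9, 0.1], [0.1, 0.9]]
mu_example = [[5.0, 1.0], [1.0, 5.0]]
sigma_example = [[1.0, 1.0], [1.0, 1.0]]
Y_example = sample_snapshot(z_example, beta_example, mu_example, sigma_example)
```

Setting every entry of `beta` to 1 yields a complete weighted graph, which matches the particular case $\beta^t \equiv 1$ mentioned above.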

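The model is thus parameterized by

$$\theta = (\pi, \beta, \gamma) = (\pi, \{\beta^t, \gamma^t\}_{1 \le t \le T}) = (\{\pi_{qq'}\}_{1 \le q, q' \le Q}, \{\beta^t_{ql}, \gamma^t_{ql}\}_{1 \le t \le T,\, 1 \le q \le l \le Q}) \in \Theta,$$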

and we let $\mathbb{P}_\theta$ denote the probability distribution on the whole product space of $\mathcal{Q}$-valued latent variables and $\mathbb{R}$-valued observations. We also let $\phi(\cdot; \beta, \gamma)$ denote the density of the distribution given by (1), namely

$$\forall y \in \mathbb{R}, \quad \phi(y; \beta, \gamma) = (1 - \beta)\,\mathbb{1}\{y = 0\} + \beta f(y, \gamma)\,\mathbb{1}\{y \ne 0\},$$

where $\mathbb{1}\{A\}$ is the indicator function of set $A$. With some abuse of notation and when no confusion occurs, we shorten $\phi(\cdot; \beta^t_{ql}, \gamma^t_{ql})$ to $\phi^t_{ql}(\cdot)$ or $\phi^t_{ql}(\cdot; \theta)$. Directed acyclic graphs (DAGs) describing the dependency structure of the variables in the model with different levels of detail are given in Figure 1. Note that the model assumes that the individuals are present at any time in the dataset. An extension covering the case where some nodes are not present at every time point is given in Section E of the Supplementary Materials and is used to analyze the animal datasets in Section 5.2.
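As a small illustration of the density $\phi$ defined in this section, the Python sketch below evaluates $\phi(y; \beta, \gamma)$ for a pluggable density $f$. The names `phi`, `gaussian_density` and `dirac_at_one` are hypothetical, and encoding $\gamma$ as a `(mean, sd)` pair is an assumption made for the Gaussian example only.

```python
import math


def gaussian_density(y, gamma):
    """Gaussian density f(y, gamma) with gamma = (mean, sd)."""
    mean, sd = gamma
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def dirac_at_one(y, gamma=None):
    """Density of the Dirac mass at 1 with respect to counting measure."""
    return 1.0 if y == 1 else 0.0


def phi(y, beta, gamma, f=gaussian_density):
    """phi(y; beta, gamma) = (1 - beta) 1{y = 0} + beta f(y, gamma) 1{y != 0}."""
    if y == 0:
        return 1.0 - beta
    return beta * f(y, gamma)


# Weighted (Gaussian) case: an absent edge has probability 1 - beta.
p_absent = phi(0, 0.3, (0.0, 1.0))
# Binary case: with F reduced to the Dirac mass at 1, phi is Bernoulli(beta).
p_edge = phi(1, 0.3, None, f=dirac_at_one)
```

Passing `dirac_at_one` as `f` recovers the binary case, where $\phi(1) = \beta$ and $\phi(0) = 1 - \beta$; any other density with no point mass at 0 gives a weighted case.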

2.2. Varying connectivity parameters vs varying group membership

In this section, we give some intuition on why it is not possible to let both connectivity parameters and group membership vary through time without running into label switching issues between time steps. To this aim, let us consider the toy example from Figure 2.

This figure shows a graph between $N = 12$ nodes at two different time points $t_1, t_2$. Node 1 is a hub (namely a highly connected node), nodes 2 to 6 form a community at time $t_1$ (they tend to
