Efficient Discovery of Frequent Subgraph Patterns in Uncertain Graph Databases

Odysseas Papapetrou

L3S Research Center

Hannover, Germany

papapetrou@L3S.de

Ekaterini Ioannou

L3S Research Center

Hannover, Germany

ioannou@L3S.de

Dimitrios Skoutas

L3S Research Center

Hannover, Germany

skoutas@L3S.de

ABSTRACT

Mining frequent subgraph patterns in graph databases is a challenging and important problem with applications in several domains. Recently, there has been growing interest in generalizing the problem to uncertain graphs, which can model the inherent uncertainty in the data of many applications. The main difficulty in solving this problem results from the large number of candidate subgraph patterns to be examined and the large number of subgraph isomorphism tests required to find the graphs that contain a given pattern. The latter becomes even more challenging when dealing with uncertain graphs. In this paper, we propose a method that uses an index of the uncertain graph database to reduce the number of comparisons needed to find frequent subgraph patterns. The proposed algorithm relies on the apriori property for enumerating candidate subgraph patterns efficiently. Then, the index is used to reduce the number of comparisons required for computing the expected support of each candidate pattern. It also enables additional optimizations with respect to scheduling and early termination, which further increase the efficiency of the method. The evaluation of our approach on three real-world datasets as well as on synthetic uncertain graph databases demonstrates significant cost savings with respect to the state-of-the-art approach.

Categories and Subject Descriptors

G.2.2 [Discrete Mathematics]: Graph Theory—Graph algorithms

General Terms

Theory, Algorithms

1. INTRODUCTION

Graphs constitute a generic data model with wide applicability in numerous domains and applications. Consequently, mining frequent subgraph patterns in graph databases has become an important method for obtaining interesting insights and discovering useful knowledge from the data, for example in bioinformatics, where graphs are used to represent protein interactions. Given a graph database D, the support of a graph G is defined as the portion of graphs in D that contain G as a subgraph. A frequent subgraph pattern is then a graph whose support is at least a specified minimum threshold minSup. Discovering frequent subgraph patterns in a graph database is a challenging task for two main reasons. First, any subgraph of a graph in the database constitutes a candidate frequent subgraph pattern. This results in an extremely large number of candidates that have to be enumerated and examined in an efficient manner. Existing approaches typically exploit the apriori property [1] to enumerate candidate patterns more efficiently. According to it, if a subgraph is frequent, then all of its subgraphs are also frequent. This allows for a systematic way to enumerate subgraph patterns. Still, there remains a second challenge, that is, computing the support of a candidate pattern in order to decide whether it is frequent. This requires testing for subgraph isomorphism with the graphs contained in the database, which is NP-complete.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

EDBT 2011, March 22–24, 2011, Uppsala, Sweden.

Recently, there has been an increasing need for the management of uncertain graphs, necessitated by the inherent uncertainty in the data of many applications. Uncertain graphs are a generalization of exact graphs, in which each edge is associated with a probability that indicates the belief we have that this edge exists. This additional expressiveness makes uncertain graphs a useful data model in numerous applications. For instance, uncertain graphs are often used in bioinformatics to model interactions between proteins, which are derived by experiments that are inevitably noisy and error-prone [2]. Edge probabilities are then used to express this uncertainty. Graphs are also used very often to represent communities of users in social networks, where probabilities can be assigned to edges to model the belief in the existence of the link or the degree of influence between two entities [19, 17]. Furthermore, in communication networks or road networks, edge probabilities are used to quantify the connectivity between nodes [7], or to take traffic uncertainty into consideration [12], respectively.

The problem of frequent subgraph pattern mining becomes even more challenging in the case of uncertain graphs. A k-edge uncertain graph implies a set of $2^k$ exact graphs, which are derived by sampling the edges of the uncertain graph according to their probabilities. Hence, for each uncertain graph, there are now $2^k$ subgraph isomorphism tests required. Moreover, the significance of a subgraph pattern is now measured by its expected support, since its support is now a random variable over the support values of the pattern in all the exact graph databases that can be sampled from the initial uncertain graph database.

Existing approaches [16, 18, 8, 28, 21, 13, 25] solve the frequent subgraph pattern mining problem for exact graphs by performing two main operations: (a) generating candidate patterns to be examined, and (b) testing for subgraph isomorphism to determine which graphs in the database contain a given candidate pattern. An extension to the case of uncertain graphs has been recently presented in [32]. It relies on the same techniques as the previous methods for enumerating candidate patterns and focuses on the second task, where the basic idea is to trade off accuracy with efficiency when computing the expected support of a pattern. In particular, it proposes an approximate algorithm for computing the expected support of a subgraph pattern by transforming the problem to an instance of the DNF counting problem. However, although this can reduce the cost of the computation for a single uncertain graph, the overall cost still remains prohibitively high, even when dealing with moderate size databases containing a few hundred graphs.

In this paper, we propose an efficient algorithm called UGRAP to address this problem. UGRAP relies on an index of the uncertain graph database to significantly reduce the amount of computations required to determine the support of a candidate pattern. The index comprises two structures. The first is an inverted index on graph edges enhanced with edge probabilities, and the second is a structure providing summarized information regarding connectivity of graph nodes up to a specified path length. Similar to previous approaches, the algorithm efficiently enumerates candidate patterns based on the apriori property. Then, when the support of a candidate subgraph pattern needs to be computed, the index is used to identify a subset of the uncertain graphs in the database that may contain this pattern, thus avoiding a significant number of unnecessary subgraph isomorphism tests. In addition, we also propose optimizations to further increase the efficiency of the method, allowing for early termination and more effective scheduling of the graphs to be examined. Our extensive experimental evaluation on three real-world data sets from the bioinformatics domain, as well as on a synthetic uncertain graph database, demonstrates the significant reduction of the computational cost when compared to the state-of-the-art method for the same problem.

Summarizing, our main contributions are as follows:

• We introduce an index of an uncertain graph database comprising information on graph edges along with their probabilities and a summary of connectivity information between graph nodes.

• We propose an algorithm that uses the aforementioned index to efficiently solve the problem of frequent subgraph pattern mining in uncertain graphs, by pruning the search space when computing the expected support of candidate patterns.

• We further improve the efficiency of the algorithm by proposing additional optimizations for early termination and effective scheduling of graph comparisons.

• We demonstrate the efficiency of our method by conducting a comprehensive experimental evaluation on large real-world and synthetic datasets, showing that it significantly outperforms existing state-of-the-art solutions to the problem.

The rest of the paper is organized as follows. The next section presents related work. The data model and a formal problem definition are introduced in Section 3. Section 4 introduces the UGRAP index, while Section 5 explains the frequent subgraph pattern mining algorithm based on this index. Section 6 presents and discusses the results of our experimental evaluation on real and synthetic datasets. Finally, Section 7 concludes the paper.

2. RELATED WORK

In this section, we present related work for both exact and uncertain graphs.

2.1 Exact Graphs

Given the significance of the problem, many research efforts have focused on mining frequent subgraph patterns in exact graphs, with the first approaches dating back to the 1990s [26]. Existing methods are typically classified in two categories: apriori-based and pattern growth.

The approaches of the first category follow the main idea behind the Apriori algorithm [1] for mining frequent itemsets. More specifically, they rely on the apriori property, according to which all the subpatterns of a frequent subgraph pattern are also frequent. Thus, to enumerate candidate patterns, they apply breadth-first search to generate subgraphs of size (k + 1), by joining two subgraphs of the previous level.

The main representatives of this category are AGM [16], FSG [18] and PM [8]. They mainly differ in the basic building block used to enumerate candidate patterns, which can be nodes [16], edges [18], or edge-disjoint paths [8]. AGM starts the search by examining graphs comprising a single vertex, and then it proceeds by generating larger candidates, adding one extra vertex at each subsequent step. FSG uses edges, instead of vertices, as the primary building block for candidate generation. It limits the class of the frequent subgraphs to connected graphs and introduces several heuristics to increase the efficiency of computing the support of a pattern, using graph vertex invariants, such as the degree of each vertex in the graph. It also improves the efficiency of the candidate pattern generation by introducing the transaction ID method. PM also follows breadth-first enumeration for generating the candidate patterns; however, in contrast to the previous approaches, which employ single vertices or edges as basic building blocks for pattern generation, it utilizes edge-disjoint paths. This reduces the required iterations, while it is proved that completeness is maintained.

To avoid the costly breadth-first based candidate pattern generation, which incurs heavy memory requirements, the methods in the second category adopt depth-first search, where patterns are grown directly from a single graph instead of joining two previous subgraphs. The main representative of this category is gSpan [28], which also relies on canonical labeling like previous approaches, but it uses a tree representation instead of an adjacency matrix as a coding scheme for the graph. Based on the assigned codes, candidate patterns are organized lexicographically in a tree hierarchy, which is then searched in a depth-first manner. In the same direction, GASTON [21] splits the discovery process into several phases to increase efficiency by first searching for frequent paths, then for frequent free trees, and finally for cyclic graphs. Efficiency is improved since these classes of structures are contained in each other. The basic idea is to store and reuse the embeddings instead of performing subgraph isomorphism tests. However, this has high space requirements and does not scale well to large graph databases.

Another approach is FFSM [13], which proposes a vertical search scheme within an algebraic graph framework. Relying on a graph canonical form, it introduces two new operations, FFSM-Join and FFSM-Extension, to improve the efficiency of pattern enumeration. An embedding set for each frequent subgraph is also maintained to avoid expensive subgraph isomorphism tests. Furthermore, an adjacency index structure, called ADI, is proposed in [25] to deal with the cases in which the graph database is too large to fit in main memory. It is also shown how the gSpan algorithm [28] can be adapted to use ADI.

Finally, to reduce the size of the output, more recent works have focused on mining only subgraph patterns that are closed [29], maximal [14], significant [27] or representative [31], or on summarizing subgraph patterns [20].

2.2 Uncertain Graphs

Recently, there has been a growing interest in using uncertain graphs as a data model in applications that need to deal with uncertainty. Thus, various problems for mining uncertain graphs have emerged. The problem of finding reliable subgraphs in uncertain graphs is studied in [11]. Given a graph that is subject to random edge failures, the goal is to find and remove a number of edges so that the probability of connecting a set of selected nodes in the remaining subgraph is maximized. Three novel types of probabilistic path queries have been defined in [12] for uncertain graphs representing road networks, where edge probabilities capture the uncertainty in traffic conditions. Also, both exact and approximation algorithms are introduced to answer such queries. A generalization of k-Nearest Neighbor queries in uncertain graphs is presented in [22], where a framework is proposed considering alternative ways to define the distance between nodes taking edge probabilities into account. All these works clearly show the increasing need and interest in mining uncertain graphs.

However, to the best of our knowledge, up to now only one work has dealt with the problem of frequent subgraph pattern mining in uncertain graphs [32]. The proposed method is an approximation algorithm, called MUSE, which allows for a tradeoff between accuracy and efficiency when computing the expected support of candidate subgraph patterns. In particular, given a support threshold minSup and a relative error tolerance ε ∈ [0, 1], the algorithm returns all subgraph patterns with expected support at least minSup, allowing also for some false positives with expected support in the range [(1 − ε) minSup, minSup]. Similar to corresponding methods for exact graphs, the solution addresses two main subtasks: (a) a method for enumerating candidate patterns, and (b) a method to compute the expected support of a pattern. Regarding the first task, the method proposed in gSpan [28] is adopted to construct a search tree of subgraph patterns. For the second task, two algorithms are proposed, an exact one for small instances of the problem (e.g., graphs with up to 30 edges) and an approximate one for larger instances. The main idea in both algorithms is to transform the problem to an instance of the DNF counting problem [24].

Although this algorithm makes it possible to approximate the expected support of a candidate pattern for an uncertain graph with a large number of edges, the computational cost is still quite high, and therefore the method does not scale well, even for moderate size databases with up to a few hundred uncertain graphs. In our approach, we remove this limitation by constructing an index of the uncertain graph database, which significantly prunes the search space and enables additional optimizations based on early termination and efficient scheduling to avoid the expensive subgraph isomorphism tests.

3. DATA MODEL & PROBLEM DEFINITION

In this section, we formally define uncertain graphs and the problem of frequent subgraph pattern mining in uncertain graph databases. For clarity of the presentation, we first introduce the problem of frequent subgraph pattern mining in exact graphs, and then we explain how it is generalized to uncertain graphs. The data model and definitions used in this paper are in line with previous approaches for mining frequent subgraph patterns in both exact and uncertain graphs (e.g., [28, 32]).

Definition 1 (Exact Graph). An exact graph is a tuple $G = (V, E, \Sigma, L)$, where $V$ is a set of vertices, $E \subseteq V \times V$ is a set of edges, $\Sigma$ is a set of labels, and $L : V \cup E \to \Sigma$ is a function assigning labels to vertices and edges.

The vertex set of a graph G is denoted by V(G) and the edge set by E(G). The size of a graph G, denoted as |G|, is defined by the number of edges it contains, i.e., |E(G)|. For simplicity, we assume that the graph is undirected, since this is a more typical scenario in frequent subgraph pattern mining, e.g., in bioinformatics; however, it is straightforward to extend the proposed method to the case of directed graphs.

Definition 2 (Subgraph Isomorphism). Given two exact graphs, $G = (V, E, \Sigma, L)$ and $G' = (V', E', \Sigma', L')$, a subgraph isomorphism from graph $G$ to graph $G'$ is an injective function $f : V \to V'$ such that:

1. $\forall u \in V$, $f(u) \in V'$ and $L(u) = L'(f(u))$, and

2. $\forall (u, v) \in E$, $(f(u), f(v)) \in E'$ and $L(u, v) = L'(f(u), f(v))$.

If such a function $f$ exists, then $G$ is subgraph isomorphic to $G'$, denoted as $G \sqsubseteq G'$. We also say that $G'$ contains $G$. Moreover, the subgraph $G''$ of $G'$ with vertex set $V'' = \{f(u) \mid u \in V\}$ and edge set $E'' = \{(f(u), f(v)) \mid (u, v) \in E\}$ is called the embedding of $G$ in $G'$ under $f$.

Based on the above, we can define the support or frequency of a subgraph pattern S in an exact graph database D as the portion of graphs in D to which S is subgraph isomorphic. Notice that we only consider connected graphs as subgraph patterns. Furthermore, a subgraph pattern is considered to be frequent in D if its support is at least a pre-defined threshold minSup. Formally, we define frequent subgraph patterns in exact graph databases as follows.

Definition 3 (Support). Given a subgraph pattern $S$ and an exact graph database $D$, the support of $S$ in $D$ is defined by

$$sup(S, D) = \frac{|\{G \in D \mid S \sqsubseteq G\}|}{|D|} \tag{1}$$

If $sup(S, D) \geq minSup$, where $minSup$ is a given support threshold within $[0, 1]$, then $S$ is a frequent subgraph pattern in $D$.

In the following, we show how the above concepts generalize to the case of uncertain graphs. Uncertain or probabilistic graphs generalize exact graphs by associating with each edge a probability that it exists. Formally:

Definition 4 (Uncertain Graph). An uncertain graph is a tuple $\tilde{G} = (V, E, \Sigma, L, P)$, where $(V, E, \Sigma, L)$ is an exact graph defined as previously and $P : E \to (0, 1]$ is a function assigning to each edge a probability that it exists.

An uncertain graph $\tilde{G}$ implies a set of $2^{|E|}$ possible exact graphs. These are sampled from $\tilde{G}$ according to the probabilities assigned by the function $P$. As in previous approaches, we assume independence among edges, which is a realistic assumption in many real-world applications. The probability of an exact graph $G$ being implied by $\tilde{G}$, denoted as $\tilde{G} \Rightarrow G$, is computed based on the probability of each edge of $\tilde{G}$ being included or excluded from $G$:

$$P(\tilde{G} \Rightarrow G) = \prod_{e \in E(G)} P(e) \prod_{e \in E(\tilde{G}) \setminus E(G)} (1 - P(e)) \tag{2}$$
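Equation 2 can be illustrated with a short Python sketch that enumerates every exact graph implied by a small uncertain graph together with its probability. The function name `implied_graphs`, the edge names, and the probabilities below are assumptions made for illustration only; they do not come from the paper.

```python
from itertools import product

def implied_graphs(edge_probs):
    """Yield (edge_set, probability) for each exact graph implied by an
    uncertain graph, following Equation 2 under edge independence.

    edge_probs maps each edge to its existence probability P(e).
    """
    edges = list(edge_probs)
    for mask in product([True, False], repeat=len(edges)):
        kept = frozenset(e for e, keep in zip(edges, mask) if keep)
        prob = 1.0
        for e, keep in zip(edges, mask):
            # Included edges contribute P(e), excluded edges 1 - P(e).
            prob *= edge_probs[e] if keep else 1.0 - edge_probs[e]
        yield kept, prob

# Hypothetical 3-edge uncertain graph (names and probabilities are made up).
example = {"e1": 0.6, "e2": 0.8, "e3": 0.3}
worlds = list(implied_graphs(example))
```

A graph with 3 edges yields $2^3 = 8$ implied graphs, and their probabilities sum to 1, which is a quick sanity check on the formula.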

Consequently, an uncertain graph database $\tilde{D}$ implies a set of $\prod_{i=1}^{|\tilde{D}|} 2^{|E(\tilde{G}_i)|}$ exact graph databases. Assuming also independence among the uncertain graphs in the database, the probability of an exact graph database $D$ being implied by $\tilde{D}$ is:

$$P(\tilde{D} \Rightarrow D) = \prod_{i=1}^{|\tilde{D}|} P(\tilde{G}_i \Rightarrow G_i) \tag{3}$$

where $\tilde{G}_i$ and $G_i$ are the i-th graphs in $\tilde{D}$ and $D$, respectively.

In an uncertain graph database $\tilde{D}$, the support of a subgraph pattern $S$ is based on its support in the implied exact graph databases, taking also into consideration the corresponding probabilities of these databases. In particular, the support in this case is a random variable with probability distribution defined by:

$$P(s_i) = \sum_{\{D \,\mid\, \tilde{D} \Rightarrow D \text{ and } sup(S, D) = s_i\}} P(\tilde{D} \Rightarrow D) \tag{4}$$

Figure 1: An illustrative example showing (a) an uncertain graph, (b) its implied exact graphs and (c) a subgraph pattern.

Therefore, to define frequent subgraph patterns in uncertain graph databases, we use as measure the expected support, which is defined as follows.

Definition 5 (Expected Support). The expected support of a subgraph pattern $S$ in an uncertain graph database $\tilde{D}$ is defined by:

$$esup(S, \tilde{D}) = \sum_{\{D \,\mid\, \tilde{D} \Rightarrow D\}} P(\tilde{D} \Rightarrow D) \cdot sup(S, D) \tag{5}$$

We can now formally define the problem of frequent subgraph pattern mining in uncertain graph databases.

Problem Definition. Given an uncertain graph database $\tilde{D}$ and a minimum support threshold $minSup$, return all the subgraph patterns $S$ with expected support greater than or equal to $minSup$, i.e., $esup(S, \tilde{D}) \geq minSup$.

Example 1. An illustrative example is presented in Figure 1, comprising an uncertain graph $\tilde{G}$ and a candidate subgraph pattern $S$. The labels of the vertices and edges denote their type, e.g., category of a protein or type of protein interaction. The figure also depicts the 8 exact graphs implied by $\tilde{G}$, together with their probabilities, computed according to Equation 2. As shown, the subgraph pattern $S$ is contained in only some of the implied graphs. Therefore, according to Equation 5, the expected support of $S$ in $\tilde{G}$ is 0.276.
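The computation in Example 1 can be sketched by brute force in Python. By linearity of expectation, Equation 5 can be evaluated graph by graph: the expected support is the average, over the uncertain graphs in the database, of the probability that each one contains the pattern. The sketch below is only an illustration under stated assumptions: Figure 1 is not reproduced here, so the graph, the pattern, and the edge probabilities (0.6, 0.8, 0.3) are hypothetical values chosen to agree with the numbers quoted in Examples 1 and 2, and all function names are made up. The containment test tries every injective vertex mapping, so it is usable only on tiny graphs.

```python
from itertools import permutations, product

def contains(p_vlabels, p_edges, g_vlabels, g_edges):
    """Brute-force subgraph isomorphism test (Definition 2).
    p_edges maps (u, v) -> label; g_edges maps frozenset({u, v}) -> label."""
    p_nodes = list(p_vlabels)
    for image in permutations(g_vlabels, len(p_nodes)):
        f = dict(zip(p_nodes, image))  # candidate injective mapping
        if any(p_vlabels[u] != g_vlabels[f[u]] for u in p_nodes):
            continue
        if all(g_edges.get(frozenset((f[u], f[v]))) == lab
               for (u, v), lab in p_edges.items()):
            return True
    return False

def containment_probability(pattern, ugraph):
    """Sum the probabilities (Equation 2) of the implied graphs containing the pattern."""
    p_vlabels, p_edges = pattern
    g_vlabels, g_uedges = ugraph  # g_uedges: {(u, v): (label, probability)}
    items = list(g_uedges.items())
    total = 0.0
    for mask in product([True, False], repeat=len(items)):
        prob, world = 1.0, {}
        for ((u, v), (lab, p)), keep in zip(items, mask):
            prob *= p if keep else 1.0 - p
            if keep:
                world[frozenset((u, v))] = lab
        if contains(p_vlabels, p_edges, g_vlabels, world):
            total += prob
    return total

def expected_support(pattern, database):
    """Expected support (Equation 5), averaged graph by graph."""
    return sum(containment_probability(pattern, g) for g in database) / len(database)

# Hypothetical uncertain graph: one A vertex linked to three B vertices.
ugraph = ({1: "A", 2: "B", 3: "B", 4: "B"},
          {(1, 2): ("p", 0.6), (1, 3): ("p", 0.8), (1, 4): ("q", 0.3)})
# Hypothetical pattern: B -p- A -q- B.
pattern = ({"x": "B", "y": "A", "z": "B"}, {("x", "y"): "p", ("y", "z"): "q"})
esup = expected_support(pattern, [ugraph])  # approximately 0.276
```

With these assumed values the pattern needs the q edge and at least one p edge, so the result is 0.3 × (1 − 0.4 × 0.2) = 0.276. The exponential loop over edge subsets is exactly the cost that the rest of the paper sets out to avoid.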

A straightforward algorithm for solving this problem works as follows: (a) enumerate all candidate subgraph patterns; (b) for each generated candidate pattern, and for each uncertain graph in the database, generate all the implied exact graphs and compute the expected support of the pattern. The cost of the first step is the same as in the case of exact graphs. Hence, one of the existing strategies, based on the apriori property, can be applied for enumerating candidate patterns more efficiently. In our method, we use the approach of gSpan [28]. However, the cost of the second step is significantly increased compared to the corresponding one for the case of exact graph databases. Recall that each uncertain graph with k edges implies a set of $2^k$ exact graphs. Therefore, each pair of a candidate pattern and a graph requires $2^k$ subgraph isomorphism tests when the graph is uncertain, instead of a single one when the graph is exact. A first approach for dealing with this problem is proposed in [32], which replaces this computation with a more efficient but approximate algorithm that can estimate the expected support of a subgraph pattern in an uncertain graph, when dealing with large graphs (i.e., above 30 edges). However, even with this approximation, the cost remains prohibitively high even for moderate size databases (e.g., above 100 graphs). Therefore, reducing the uncertain graphs to be considered to only those that may contribute to making a pattern frequent, and especially avoiding large graphs in the computation, becomes crucial. In the next sections, we propose a solution to this problem, using an index and a summary of the uncertain graph database, with additional optimizations for early termination and effective scheduling of graph comparisons.

4. THE UGRAP INDEX

As explained above, our goal is to prune the search space when computing the expected support of candidate subgraph patterns, by limiting the number of uncertain graphs that need to be examined for containment. For this purpose, we construct an index of the uncertain graph database, containing graph edges and their probabilities. Furthermore, to achieve better pruning, taking into consideration the structure of each candidate pattern and each examined graph, we also construct a structure containing connectivity information between graph nodes. This information is summarized in order to reduce memory requirements when dealing with large databases and large graphs, especially in the case of dense graphs. In this section, we present how the UGRAP index is constructed and maintained.

4.1 Edge Index

The first component of the UGRAP index, denoted by $I_E$, is an inverted index on graph edges extended with information on edge probabilities in order to take uncertainty of edges into account. More specifically, the structure $I_E$ is a map where:

• each key is a label triple of the form $t = (L_1, L_2, L_3)$, representing graph edges, and

• the value of each key is a list containing the identifiers of the graphs in which these edges appear, as well as the corresponding occurrence probability.

An edge $(u, v)$ contained in an uncertain graph in the database is mapped to the key $T(u, v) = (L(u), L(v), L(u, v))$. The value of a key $t$ is then a list of pairs of the form $(\tilde{G}_i, p_i)$, where $p_i$ is the probability that the graph $\tilde{G}_i$ contains at least one edge $e$ mapped to the key $t$. Only those graphs with non-zero probability are stored in the index. Given the independence assumption between edges, this probability is computed by:

$$p_i = 1 - \prod_{e \in E(\tilde{G}_i) \,\wedge\, T(e) = t} (1 - P(e)) \tag{6}$$

where the product denotes the probability that no edge mapped to $t$ exists. Formally, the edge index $I_E$ can be defined as follows.

Definition 6 (Edge Index). Given an uncertain graph database $\tilde{D}$, the edge index $I_E$ is a structure that returns, for any given triple $t = (L_1, L_2, L_3)$, a list of all the pairs $(\tilde{G}_i, p_i)$, where $\tilde{G}_i$ is an uncertain graph in $\tilde{D}$ containing an edge $(u, v)$ having probability greater than 0, such that $L(u) = L_1$, $L(v) = L_2$, and $L(u, v) = L_3$.

Constructing the $I_E$ structure is straightforward. Each uncertain graph in the database can be processed independently, parsing its edges to identify the list of keys and their probabilities, using Equation 6. The results are then merged to create the map described above. The process is detailed in Algorithm 1.

Updating the index when an uncertain graph is added or removed from the database can be performed incrementally. The keys for this graph are computed and the corresponding entries in the index are updated accordingly, by appending or removing the corresponding item from the list of each of these keys. If the key is not already contained in the index, a new entry is created (or the entry is removed if the list of the key becomes empty). Finally, if an existing uncertain graph is updated, then the probabilities of all the affected keys need to be updated accordingly (which may also result in removing or adding keys).

Notice that, although more complex index structures have been proposed for querying graph databases, which aim at avoiding expensive subgraph isomorphism tests [5, 9, 30], these structures are not suitable for our problem for two reasons. First, they target exact graphs; hence, their adaptation to uncertain graph databases is an open issue. Second, and most importantly, more advanced index structures, such as the ones proposed in [5, 30], require first computing the frequent subgraphs in the database, which are then used as features for the index. Instead, since our goal is to find such frequent subgraph patterns, the index can only rely on simpler features. As shown in Section 6, our index requires negligible memory and computational resources to be built, even for large uncertain graph databases.

Example 2. The edge index $I_E$ for a database containing only the uncertain graph illustrated in Figure 1 would contain two keys, (A, B, p) and (A, B, q), pointing to the lists $\{(\tilde{G}, 0.92)\}$ and $\{(\tilde{G}, 0.3)\}$, respectively.
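The edge index described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the data layout, the function names, and the choice to sort the two vertex labels (so that an undirected edge gets one key regardless of orientation) are assumptions, as are the edge probabilities, which are chosen so that the resulting index values match Example 2. The `candidate_graphs` helper shows one simple way such an index could prune graphs, by keeping only those that list every edge key of a pattern; the paper's actual pruning procedure is given in its Section 5.

```python
from collections import defaultdict

def edge_key(label_u, label_v, edge_label):
    # Assumption: order the vertex labels so (A, B) and (B, A) share a key.
    a, b = sorted((label_u, label_v))
    return (a, b, edge_label)

def build_edge_index(database):
    """database: {graph_id: (vertex_labels, edges)},
    where edges maps (u, v) -> (edge_label, probability)."""
    index = defaultdict(list)
    for gid, (vlabels, edges) in database.items():
        # key -> probability that no edge mapped to this key exists
        none_exists = defaultdict(lambda: 1.0)
        for (u, v), (elabel, p) in edges.items():
            none_exists[edge_key(vlabels[u], vlabels[v], elabel)] *= 1.0 - p
        for t, q in none_exists.items():
            if q < 1.0:  # store only graphs with non-zero probability
                index[t].append((gid, 1.0 - q))  # Equation 6
    return index

def candidate_graphs(index, pattern_keys):
    """Graphs listed under every edge key of the pattern; all other graphs
    contain the pattern with probability zero."""
    sets = [{gid for gid, _ in index.get(t, [])} for t in pattern_keys]
    return set.intersection(*sets) if sets else set()

# Hypothetical database with a single uncertain graph.
database = {"G1": ({1: "A", 2: "B", 3: "B", 4: "B"},
                   {(1, 2): ("p", 0.6), (1, 3): ("p", 0.8), (1, 4): ("q", 0.3)})}
index = build_edge_index(database)
```

With these assumed probabilities, the key (A, B, p) maps to G1 with probability 1 − 0.4 × 0.2 = 0.92 and the key (A, B, q) maps to G1 with probability 0.3. A pattern whose keys include a triple absent from the index, such as (A, C, p), yields an empty candidate set without any subgraph isomorphism test.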

4.2 Connectivity Index

The second component of the UGRAP index, denoted by $I_C$, is a structure containing summarized information regarding connectivity of graph nodes. This additional structural information is useful when deciding which uncertain graphs may contain a candidate subgraph pattern with non-zero probability.

Intuitively, the purpose of this structure is to extend the edge index allowing paths of length $\ell > 1$. In particular, $I_C$ provides information on whether there exists a path of length $\ell$ (for values of $\ell$ up to a maximum length $\ell_{max}$) between two vertices $u$ and $v$ of a graph with labels $L(u)$ and $L(v)$, respectively. Notice that, unlike the case of single edges, the independence assumption does not hold between paths, since two paths may contain common edges. Therefore, the probability that an uncertain graph $\tilde{G}$ contains a path of length $\ell$ between two vertices with labels $L(u)$ and $L(v)$ cannot be computed in a straightforward way, i.e., similar to Equation 6 for edges. Instead, it requires applying the inclusion-exclusion principle, which involves finding all the possible paths between all pairs of vertices with labels $L(u)$ and $L(v)$ and identifying all the overlaps between any subset of these paths. Since this would make the construction and maintenance of the index an expensive and complex operation, we do not compute and store these probabilities; instead, we only store whether such a path exists with probability higher than zero or not.

Algorithm 1 Construction of the Edge Index $I_E$

Input: An uncertain graph database $\tilde{D}$

Output: The edge index $I_E$

Initialize $I_E$ to an empty map
for all $\tilde{G}_i \in \tilde{D}$ do
    Initialize $K$ to an empty map
    for all $(u, v) \in E(\tilde{G}_i)$ do
        $t \leftarrow (L(u), L(v), L(u, v))$
        $K(t) \leftarrow K(t) \cup \{(u, v)\}$
    end for
    for all $t \in K$ do
        $p_i \leftarrow 1 - \prod_{e \in K(t)} (1 - P(e))$
        $I_E(t) \leftarrow I_E(t) \cup \{(\tilde{G}_i, p_i)\}$
    end for
end for
return $I_E$

Another issue that arises by allowing for paths of length $\ell > 1$ is that the size of the index is significantly increased, due to the exponential increase of the number of possible paths. To deal with this problem, we only maintain a summary of this information using Bloom filters [3]. A Bloom filter consists of an array of $m$ bits and $k$ independent hash functions $F = \{f_1, f_2, \ldots, f_k\}$, which hash elements of a universe $U$ to an integer in the range $[1, m]$. The $m$ bits are initially set to 0 in an empty Bloom filter. An element $x \in U$ is inserted into the Bloom filter by setting all positions $f_i(x)$ of the bit array to 1, for all $f_i \in F$. Thus, an element $x$ is contained in the original set only if all the positions $f_i(x)$ of the Bloom filter are set to 1. If at least one of these positions is set to 0, we can safely conclude that $x$ is not present in the original set. However, due to hash collisions, there is also a small probability of false positives, $Pr_{fp} \approx (1 - e^{-kn/m})^k$, where $n$ denotes the number of elements hashed in the Bloom filter. In our case, a high probability of false positives decreases the pruning power of the connectivity index, since we use Bloom filters to summarize all the paths of a given length contained in each graph. Each path inserted in the Bloom filter is represented by the labels of its start node and end node, sorted lexicographically. Formally, the connectivity index $I_C$ is defined as follows.

Definition 7 (Connectivity Index). Given an uncertain graph $\tilde{G}$, an integer $\ell \leq \ell_{max}$ and two labels $L_1$ and $L_2$, the connectivity index $I_C$ is a structure such that:

• if the uncertain graph $\tilde{G}$ contains a path of length $\ell$ between two vertices $u$ and $v$ with labels $L(u) = L_1$ and $L(v) = L_2$, then $I_C(\tilde{G}, L_1, L_2, \ell) = 1$,

• otherwise, $I_C(\tilde{G}, L_1, L_2, \ell) = 0$ with probability at least $1 - \varepsilon$, and $I_C(\tilde{G}, L_1, L_2, \ell) = 1$ with probability at most $\varepsilon$, for a fixed error probability threshold $\varepsilon$.

The process of constructing the connectivity index $I_C$ is described in detail in Algorithm 2. As with the edge index, $I_C$ can also be built progressively, or maintained to reflect changes in the underlying graph database.

Note that there is no need to construct the index for $\ell = 1$, since the graphs containing single-edge paths can be efficiently retrieved from the edge index $I_E$. Therefore, we only consider $\ell \in [2, \ell_{max}]$ when constructing the connectivity index, as well as for deciding whether an uncertain graph contains a candidate subgraph pattern.
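The idea behind the connectivity index can be illustrated with a small Python sketch: a Bloom filter per uncertain graph that records, for each path length from 2 up to a maximum, the lexicographically sorted labels of the path endpoints. Everything below is an assumption made for illustration and is not the paper's Algorithm 2: the class and function names, the use of salted SHA-256 digests in place of $k$ independent hash functions, the restriction to simple paths, and the decision to store the path length inside the hashed item rather than keeping one filter per length.

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # Salted SHA-256 digests stand in for k independent hash functions.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))

def path_label_pairs(vlabels, edges, length):
    """Sorted endpoint-label pairs of all simple paths with `length` edges."""
    adj = {v: set() for v in vlabels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    pairs = set()

    def extend(path):
        if len(path) == length + 1:
            pairs.add(tuple(sorted((vlabels[path[0]], vlabels[path[-1]]))))
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:
                extend(path + [nxt])

    for start in vlabels:
        extend([start])
    return pairs

def build_connectivity_filter(vlabels, edges, max_len, m=256, k=3):
    """Summarize paths of length 2..max_len (length 1 is left to the edge index)."""
    bf = BloomFilter(m, k)
    for length in range(2, max_len + 1):
        for a, b in path_label_pairs(vlabels, edges, length):
            bf.add((a, b, length))
    return bf

# Hypothetical star graph: one A vertex linked to three B vertices.
vlabels = {1: "A", 2: "B", 3: "B", 4: "B"}
edges = [(1, 2), (1, 3), (1, 4)]
bf = build_connectivity_filter(vlabels, edges, max_len=3)
```

Because every edge of an uncertain graph has probability greater than zero, any path present in the graph exists with non-zero probability, so edge probabilities are not needed here. A lookup such as `("B", "B", 2) in bf` can never miss a path that was inserted, while a lookup for an absent pair returns `True` only in the case of a false positive, matching the one-sided guarantee in Definition 7.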
