Efficient Discovery of Frequent Subgraph Patterns
in Uncertain Graph Databases
Odysseas Papapetrou
L3S Research Center
Hannover, Germany
papapetrou@L3S.de
Ekaterini Ioannou
L3S Research Center
Hannover, Germany
ioannou@L3S.de
Dimitrios Skoutas
L3S Research Center
Hannover, Germany
skoutas@L3S.de
ABSTRACT
Mining frequent subgraph patterns in graph databases is a challenging and important problem with applications in several domains. Recently, there is a growing interest in generalizing the problem to uncertain graphs, which can model the inherent uncertainty in the data of many applications. The main difficulty in solving this problem results from the large number of candidate subgraph patterns to be examined and the large number of subgraph isomorphism tests required to find the graphs that contain a given pattern. The latter becomes even more challenging, when dealing with uncertain graphs. In this paper, we propose a method that uses an index of the uncertain graph database to reduce the number of comparisons needed to find frequent subgraph patterns. The proposed algorithm relies on the apriori property for enumerating candidate subgraph patterns efficiently. Then, the index is used to reduce the number of comparisons required for computing the expected support of each candidate pattern. It also enables additional optimizations with respect to scheduling and early termination, that further increase the efficiency of the method. The evaluation of our approach on three real-world datasets as well as on synthetic uncertain graph databases demonstrates the significant cost savings with respect to the state-of-the-art approach.
Categories and Subject Descriptors
G.2.2 [Discrete Mathematics]: Graph Theory—Graph algorithms
General Terms
Theory, Algorithms
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
EDBT 2011, March 22–24, 2011, Uppsala, Sweden.
Copyright 2011 ACM 978-1-4503-0528-0/11/0003 ...$10.00
1. INTRODUCTION
Graphs constitute a generic data model with wide applicability in numerous domains and applications. Consequently, mining frequent subgraph patterns in graph databases has become an important method for obtaining interesting insights and discovering useful knowledge from the data, for example in bioinformatics, where graphs are used to represent protein interactions. Given a graph database D, the support of a graph G is defined as the portion of graphs in D that contain G as a subgraph. A frequent subgraph pattern is then a graph with support above a minimum specified threshold minSup. Discovering frequent subgraph patterns in a graph database is a challenging task due to two main reasons. First, any subgraph of a graph in the database constitutes a candidate frequent subgraph pattern. This results in an extremely large number of candidates that have to be enumerated and examined in an efficient manner. Existing approaches typically exploit the apriori property [1] to more efficiently enumerate candidate patterns. According to it, if a subgraph is frequent, then all of its subgraphs are also frequent. This allows for a systematic way to enumerate subgraph patterns. Still, there remains a second challenge, that is, computing the support of a candidate pattern in order to decide whether it is frequent. This requires testing for subgraph isomorphism with the graphs contained in the database, which is NP-complete.
Recently, there has been an increasing need for the management of uncertain graphs, necessitated by the inherent uncertainty in the data of many applications. Uncertain graphs are a generalization of exact graphs, in which each edge is associated with a probability that indicates the belief we have that this edge exists. This additional expressiveness makes uncertain graphs a useful data model in numerous applications. For instance, uncertain graphs are often used in bioinformatics to model interactions between proteins, which are derived by experiments that are inevitably noisy and error-prone [2]. Edge probabilities are then used to express this uncertainty. Graphs are also used very often to represent communities of users in social networks, where probabilities can be assigned to edges to model the belief in the existence of the link or the degree of influence between two entities [19, 17]. Furthermore, in communication networks or road networks, edge probabilities are used to quantify the connectivity between nodes [7], or to take traffic uncertainty into consideration [12], respectively.
The problem of frequent subgraph pattern mining becomes even more challenging in the case of uncertain graphs. A $k$-edge uncertain graph implies a set of $2^k$ exact graphs, which are derived by sampling the edges of the uncertain graph according to their probabilities. Hence, for each uncertain graph, there are now $2^k$ subgraph isomorphism tests required. In particular, the significance of a subgraph pattern is now measured by its expected support, since its support is now a random variable over the support values of the pattern in all the exact graph databases that can be sampled from the initial uncertain graph database.
Existing approaches [16, 18, 8, 28, 21, 13, 25] solve the frequent subgraph pattern mining problem for exact graphs by performing two main operations: (a) generating candidate patterns to be examined, and (b) testing for subgraph isomorphism to determine which graphs in the database contain a given candidate pattern. An extension to the case of uncertain graphs has been recently presented in [32]. It relies on the same techniques as the previous methods for enumerating candidate patterns and focuses on the second task, where the basic idea is to trade off accuracy for efficiency when computing the expected support of a pattern. In particular, it proposes an approximate algorithm for computing the expected support of a subgraph pattern by transforming the problem to an instance of the DNF counting problem. However, although this can reduce the cost of the computation for a single uncertain graph, the overall cost still remains prohibitively high, even when dealing with moderate-size databases containing a few hundred graphs.
In this paper, we propose an efficient algorithm called UGRAP to address this problem. UGRAP relies on an index of the uncertain graph database to significantly reduce the amount of computations required to determine the support of a candidate pattern. The index comprises two structures. The first is an inverted index on graph edges enhanced with edge probabilities, and the second is a structure providing summarized information regarding connectivity of graph nodes up to a specified path length. Similar to previous approaches, the algorithm efficiently enumerates candidate patterns based on the apriori property. Then, when the support of a candidate subgraph pattern needs to be computed, the index is used to identify a subset of the uncertain graphs in the database that may contain this pattern, thus avoiding a significant number of unnecessary subgraph isomorphism tests. In addition, we also propose optimizations to further increase the efficiency of the method, allowing for early termination and more effective scheduling of the graphs to be examined. Our extensive experimental evaluation on three real-world data sets from the bioinformatics domain, as well as on a synthetic uncertain graph database, demonstrates the significant reduction of the computational cost when compared to the state-of-the-art method for the same problem.
Summarizing, our main contributions are as follows:
• We introduce an index of an uncertain graph database comprising information on graph edges along with their probabilities and a summary of connectivity information between graph nodes.
• We propose an algorithm that uses the aforementioned index to efficiently solve the problem of frequent subgraph pattern mining in uncertain graphs, by pruning the search space when computing the expected support of candidate patterns.
• We further improve the efficiency of the algorithm by proposing additional optimizations for early termination and effective scheduling of graph comparisons.
• We demonstrate the efficiency of our method by conducting a comprehensive experimental evaluation on large real-world and synthetic datasets, showing that it significantly outperforms existing state-of-the-art solutions to the problem.
The rest of the paper is organized as follows. The next section
presents related work. The data model and a formal problem defini-
tion are introduced in Section 3. Section 4 introduces the UGRAP
index, while Section 5 explains the frequent subgraph pattern min-
ing algorithm based on this index. Section 6 presents and discusses
the results of our experimental evaluation on real and synthetic
datasets. Finally, Section 7 concludes the paper.
2. RELATED WORK
In this section, we present related work considering both the case
of exact and uncertain graphs.
2.1 Exact Graphs
Given the significance of the problem, a lot of research efforts have focused on mining frequent subgraph patterns in exact graphs, with first approaches dating back to the 1990’s [26]. Existing methods are typically classified in two categories: apriori-based and pattern growth.
The approaches of the first category follow the main idea be-
hind the Apriori algorithm [1] for mining frequent itemsets. More
specifically, they rely on the apriori property, according to which
all the subpatterns of a frequent subgraph pattern are also frequent.
Thus, to enumerate candidate patterns, they apply breadth-first search
to generate subgraphs of size (k + 1), by joining two subgraphs of
the previous level.
The main representatives of this category are AGM [16], FSG [18] and PM [8]. They mainly differ in the basic building block used to enumerate candidate patterns, which can be nodes [16], edges [18], or edge-disjoint paths [8]. AGM starts the search by examining graphs comprising a single vertex, and then it proceeds by generating larger candidates, adding one extra vertex at each subsequent step. FSG uses edges, instead of vertices, as the primary building block for candidate generation. It limits the class of the frequent subgraphs to connected graphs and introduces several heuristics to increase the efficiency of computing the support of a pattern, using graph vertex invariants, such as the degree of each vertex in the graph. It also improves the efficiency of the candidate pattern generation by introducing the transaction ID method. PM also follows breadth-first enumeration for generating the candidate patterns; however, in contrast to the previous approaches, which employ single vertices or edges as basic building blocks for pattern generation, it utilizes edge-disjoint paths. This reduces the required iterations, while it is proved that completeness is maintained.
To avoid the costly breadth-first based candidate pattern generation, which incurs heavy memory requirements, the methods in the second category adopt depth-first search, where patterns are grown directly from a single graph instead of joining two previous subgraphs. The main representative of this category is gSpan [28], which also relies on canonical labeling like previous approaches, but it uses a tree representation instead of an adjacency matrix as a coding scheme for the graph. Based on the assigned codes, candidate patterns are organized lexicographically in a tree hierarchy, which is then searched in a depth-first manner. In the same direction, GASTON [21] splits the discovery process into several phases to increase efficiency by first searching for frequent paths, then for frequent free trees, and finally for cyclic graphs. Efficiency is improved since these classes of structures are contained in each other. The basic idea is to store and reuse the embeddings instead of performing subgraph isomorphism tests. However, this has high space requirements and does not scale well to large graph databases.
Another approach is FFSM [13], which proposes a vertical search scheme within an algebraic graph framework. Relying on a graph canonical form, it introduces two new operations, FFSM-Join and FFSM-Extension, to improve the efficiency of pattern enumeration. An embedding set for each frequent subgraph is also maintained to avoid expensive subgraph isomorphism tests. Furthermore, an adjacency index structure, called ADI, is proposed in [25] to deal with the cases in which the graph database is too large to fit in main memory. It is also shown how the gSpan algorithm [28] can be adapted to use ADI.
Finally, to reduce the size of the output, more recent works have
focused on mining only subgraph patterns that are closed [29],
maximal [14], significant [27] or representative [31], or on sum-
marizing subgraph patterns [20].
2.2 Uncertain Graphs
Recently, there has been a growing interest in using uncertain graphs as a data model in applications that need to deal with uncertainty. Thus, various problems for mining uncertain graphs have emerged. The problem of finding reliable subgraphs in uncertain graphs is studied in [11]. Given a graph that is subject to random edge failures, the goal is to find and remove a number of edges so that the probability of connecting a set of selected nodes in the remaining subgraph is maximized. Three novel types of probabilistic path queries have been defined in [12] for uncertain graphs representing road networks, where edge probabilities capture the uncertainty in traffic conditions. Also, both exact and approximation algorithms are introduced to answer such queries. A generalization of k-Nearest Neighbor queries in uncertain graphs is presented in [22], where a framework is proposed considering alternative ways to define the distance between nodes taking edge probabilities into account. All these works clearly show the increasing need and interest in mining uncertain graphs.
However, to the best of our knowledge, up to now only one work has dealt with the problem of frequent subgraph pattern mining in uncertain graphs [32]. The proposed method is an approximation algorithm, called MUSE, which allows for a tradeoff between accuracy and efficiency when computing the expected support of candidate subgraph patterns. In particular, given a support threshold minSup and a relative error tolerance $\varepsilon \in [0, 1]$, the algorithm returns all subgraph patterns with expected support at least minSup, allowing also for some false positives with expected support in the range $[(1 - \varepsilon) \cdot \mathit{minSup}, \mathit{minSup}]$. Similar to corresponding methods for exact graphs, the solution addresses two main subtasks: (a) a method for enumerating candidate patterns, and (b) a method to compute the expected support of a pattern. Regarding the first task, the method proposed in gSpan [28] is adopted to construct a search tree of subgraph patterns. For the second task, two algorithms are proposed, an exact one for small instances of the problem (e.g., graphs with up to 30 edges) and an approximate one for larger instances. The main idea in both algorithms is to transform the problem to an instance of the DNF counting problem [24].
Although this algorithm makes it possible to approximate the expected support of a candidate pattern for an uncertain graph with a large number of edges, the computational cost is still quite high, and therefore the method does not scale well, even for moderate-size databases with up to a few hundred uncertain graphs. In our approach, we remove this limitation by constructing an index of the uncertain graph database, which significantly prunes the search space and enables additional optimizations based on early termination and efficient scheduling to avoid the expensive subgraph isomorphism tests.
3. DATA MODEL & PROBLEM DEFINITION
In this section, we formally define uncertain graphs and the prob-
lem of frequent subgraph pattern mining in uncertain graph databases.
For clarity of the presentation, we first introduce the problem of
frequent subgraph pattern mining in exact graphs, and then we ex-
plain how it is generalized in uncertain graphs. The data model and
definitions used in this paper are in line with previous approaches
for mining frequent subgraph patterns in both exact and uncertain
graphs (e.g., [28, 32]).
Definition 1 (Exact Graph). An exact graph is a tuple $G = (V, E, \Sigma, L)$, where $V$ is a set of vertices, $E \subseteq V \times V$ is a set of edges, $\Sigma$ is a set of labels, and $L : V \cup E \to \Sigma$ is a function assigning labels to vertices and edges.
The vertex set of a graph $G$ is denoted by $V(G)$ and the edge set by $E(G)$. The size of a graph $G$, denoted as $|G|$, is defined by the number of edges it contains, i.e., $|E(G)|$. For simplicity, we assume that the graph is undirected, since this is a more typical scenario in frequent subgraph pattern mining, e.g., in bioinformatics; however, it is straightforward to extend the proposed method to the case of directed graphs.
Definition 2 (Subgraph Isomorphism). Given two exact graphs, $G = (V, E, \Sigma, L)$ and $G' = (V', E', \Sigma', L')$, a subgraph isomorphism from graph $G$ to graph $G'$ is an injective function $f : V \to V'$ such that:
1. $\forall u \in V$, $f(u) \in V'$ and $L(u) = L'(f(u))$, and
2. $\forall (u, v) \in E$, $(f(u), f(v)) \in E'$ and $L(u, v) = L'(f(u), f(v))$.
If such a function $f$ exists, then $G$ is subgraph isomorphic to $G'$, denoted as $G \sqsubseteq G'$. We also say that $G'$ contains $G$. Moreover, the subgraph $G''$ of $G'$ with vertex set $V'' = \{f(u) \mid u \in V\}$ and edge set $E'' = \{(f(u), f(v)) \mid (u, v) \in E\}$ is called the embedding of $G$ in $G'$ under $f$.
Based on the above, we can define the support or frequency of a subgraph pattern S in an exact graph database D as the portion of graphs in D to which S is subgraph isomorphic. Notice that we only consider connected graphs as subgraph patterns. Furthermore, a subgraph pattern is considered to be frequent in D if its support is at least a pre-defined threshold minSup. Formally, we define frequent subgraph patterns in exact graph databases as follows.
Definition 3 (Support). Given a subgraph pattern $S$ and an exact graph database $D$, the support of $S$ in $D$ is defined by
$$\mathit{sup}(S, D) = \frac{|\{G \in D \mid S \sqsubseteq G\}|}{|D|} \tag{1}$$
If $\mathit{sup}(S, D) \geq \mathit{minSup}$, where minSup is a given support threshold within $[0, 1]$, then $S$ is a frequent subgraph pattern in $D$.
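Definitions 2 and 3 can be illustrated with a minimal Python sketch. It is not the method used in the paper: the dictionary-based graph representation, the function names and the toy database are assumptions made for illustration, and the isomorphism test simply tries every injective vertex mapping, which is only viable for very small graphs.

```python
from itertools import permutations

def edge_label(graph, u, v):
    """Label of the undirected edge {u, v}, or None if the edge is absent."""
    edges = graph["edges"]
    return edges.get((u, v), edges.get((v, u)))

def is_subgraph_isomorphic(pattern, graph):
    """Brute-force check of Definition 2: try every injective vertex mapping f."""
    p_nodes = list(pattern["nodes"])
    for image in permutations(graph["nodes"], len(p_nodes)):
        f = dict(zip(p_nodes, image))
        labels_ok = all(pattern["nodes"][u] == graph["nodes"][f[u]] for u in p_nodes)
        edges_ok = all(edge_label(graph, f[u], f[v]) == lab
                       for (u, v), lab in pattern["edges"].items())
        if labels_ok and edges_ok:
            return True
    return False

def support(pattern, database):
    """Fraction of graphs in the exact database that contain the pattern (Equation 1)."""
    return sum(is_subgraph_isomorphic(pattern, g) for g in database) / len(database)

# Hypothetical toy data: vertex ids map to labels, edges map to edge labels.
pattern = {"nodes": {"x": "A", "y": "B"}, "edges": {("x", "y"): "p"}}
database = [
    {"nodes": {1: "A", 2: "B"}, "edges": {(1, 2): "p"}},
    {"nodes": {1: "A", 2: "B", 3: "B"}, "edges": {(1, 2): "q", (1, 3): "p"}},
    {"nodes": {1: "A", 2: "A"}, "edges": {(1, 2): "p"}},
]
min_sup = 0.5  # hypothetical threshold
s = support(pattern, database)
print(s, s >= min_sup)
```

In this toy database the pattern is contained in the first two graphs only, so its support is 2/3 and it is frequent for the assumed threshold of 0.5.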
In the following, we show how the above concepts generalize
in the case of uncertain graphs. Uncertain or probabilistic graphs
generalize exact graphs by associating to each edge a probability
that it exists. Formally:
Definition 4 (Uncertain Graph). An uncertain graph is a tuple $G^p = (V, E, \Sigma, L, P)$, where $(V, E, \Sigma, L)$ is an exact graph defined as previously and $P : E \to (0, 1]$ is a function assigning to each edge a probability that it exists.
An uncertain graph $G^p$ implies a set of $2^{|E|}$ possible exact graphs. These are sampled from $G^p$ according to the probabilities assigned by the function $P$. As in previous approaches, we assume independence among edges, which is a realistic assumption in many real-world applications. The probability of an exact graph $G$ being implied by $G^p$, denoted as $G^p \Rightarrow G$, is computed based on the probability of each edge of $G^p$ being included or excluded from $G$:
$$P(G^p \Rightarrow G) = \prod_{e \in E(G)} P(e) \cdot \prod_{e \in E(G^p) \setminus E(G)} (1 - P(e)) \tag{2}$$
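As a minimal sketch of Equation 2, the following Python snippet enumerates every exact graph implied by a small uncertain graph and computes its probability under the edge-independence assumption. The edge identifiers and the probabilities 0.8, 0.6 and 0.3 are hypothetical values chosen for illustration.

```python
from itertools import product

def implied_graphs(edge_probs):
    """Yield (kept_edges, probability) for every exact graph implied by an uncertain graph.

    edge_probs maps an edge identifier to its existence probability P(e).
    """
    edges = list(edge_probs)
    for mask in product([False, True], repeat=len(edges)):
        prob = 1.0
        for e, keep in zip(edges, mask):
            # Included edges contribute P(e), excluded edges contribute 1 - P(e).
            prob *= edge_probs[e] if keep else 1.0 - edge_probs[e]
        yield frozenset(e for e, keep in zip(edges, mask) if keep), prob

edge_probs = {"e1": 0.8, "e2": 0.6, "e3": 0.3}  # hypothetical uncertain graph with 3 edges
worlds = list(implied_graphs(edge_probs))
total = sum(p for _, p in worlds)
print(len(worlds), total)
```

The number of implied graphs doubles with every additional edge, and their probabilities sum to 1 (up to floating-point rounding), which is a quick sanity check for any implementation of Equation 2.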
Consequently, an uncertain graph database $D^P$ implies a set of $\prod_{i=1}^{|D^P|} 2^{|E(G^p_i)|}$ exact graph databases. Assuming also independence among the uncertain graphs in the database, the probability of an exact graph database $D$ being implied by $D^P$ is:
$$P(D^P \Rightarrow D) = \prod_{i=1}^{|D^P|} P(G^p_i \Rightarrow G_i) \tag{3}$$
where $G^p_i$ and $G_i$ are the $i$-th graphs in $D^P$ and $D$, respectively.
Figure 1: An illustrative example showing (a) an uncertain graph, (b) its implied exact graphs and (c) a subgraph pattern.
In an uncertain graph database $D^P$, the support of a subgraph pattern $S$ is based on its support in the implied exact graph databases, taking also into consideration the corresponding probabilities of these databases. In particular, the support in this case is a random variable with probability distribution defined by:
$$P(s_i) = \sum_{\{D \,\mid\, D^P \Rightarrow D \text{ and } \mathit{sup}(S, D) = s_i\}} P(D^P \Rightarrow D) \tag{4}$$
Therefore, to define frequent subgraph patterns in uncertain graph
databases, we use as measure the expected support, which is de-
fined as follows.
Definition 5 (Expected Support). The expected support of a subgraph pattern $S$ in an uncertain graph database $D^P$ is defined by:
$$\mathit{esup}(S, D^P) = \sum_{\{D \,\mid\, D^P \Rightarrow D\}} P(D^P \Rightarrow D) \cdot \mathit{sup}(S, D) \tag{5}$$
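A short derivation may help to see why the expected support can be evaluated one uncertain graph at a time. It is a sketch based on linearity of expectation and not a step taken from the paper; the notation is assumed here, with $S \sqsubseteq G$ meaning that $S$ is subgraph isomorphic to $G$ and $G^p_i \Rightarrow G$ meaning that the exact graph $G$ is implied by the $i$-th uncertain graph. Since the support is the average of one containment indicator per graph, its expectation is the average of the per-graph containment probabilities:

$$\mathit{esup}(S, D^P) = \frac{1}{|D^P|} \sum_{i=1}^{|D^P|} \Pr[S \sqsubseteq G_i] = \frac{1}{|D^P|} \sum_{i=1}^{|D^P|} \; \sum_{\{G \,\mid\, G^p_i \Rightarrow G,\; S \sqsubseteq G\}} P(G^p_i \Rightarrow G)$$

Under this view, each uncertain graph contributes a value in $[0, 1]$ to the sum, so graphs that cannot contain $S$ contribute nothing and can be skipped.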
We can now formally define the problem of frequent subgraph pat-
tern mining in uncertain graph databases.
Problem Definition. Given an uncertain graph database $D^P$ and a minimum support threshold minSup, return all the subgraph patterns $S$ with expected support greater than or equal to minSup, i.e., $\mathit{esup}(S, D^P) \geq \mathit{minSup}$.
Example 1. An illustrative example is presented in Figure 1, comprising an uncertain graph $G^P$ and a candidate subgraph pattern $S$. The labels of the vertices and edges denote their type, e.g., category of a protein or type of protein interaction. The figure also depicts the 8 exact graphs implied by $G^P$, together with their probabilities, computed according to Equation 2. As shown, the subgraph pattern $S$ is contained in the implied graphs $G^P_6$, $G^P_7$ and $G^P_8$. Therefore, according to Equation 5, the expected support of $S$ in $G^P$ is 0.276.
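The computation in Example 1 can be sketched in Python by enumerating all implied graphs of a single uncertain graph. Figure 1 is not reproduced in the text, so the graph and pattern below are hypothetical stand-ins: the structure and the edge probabilities 0.8, 0.6 and 0.3 are assumptions, chosen only because they are consistent with the values 0.276, 0.92 and 0.3 reported in Examples 1 and 2. The brute-force containment test is likewise an illustrative choice.

```python
from itertools import permutations, product

def contains(pattern, nodes, edges):
    """True if `pattern` is subgraph isomorphic to the exact graph (nodes, edges)."""
    p_nodes = list(pattern["nodes"])

    def label(u, v):
        return edges.get((u, v), edges.get((v, u)))

    for image in permutations(nodes, len(p_nodes)):
        f = dict(zip(p_nodes, image))
        if all(pattern["nodes"][u] == nodes[f[u]] for u in p_nodes) and \
           all(label(f[u], f[v]) == lab for (u, v), lab in pattern["edges"].items()):
            return True
    return False

def expected_support(pattern, ugraph):
    """Expected support of `pattern` in one uncertain graph, over all 2^k implied graphs."""
    items = list(ugraph["edges"].items())  # ((u, v), (label, probability))
    total = 0.0
    for mask in product([False, True], repeat=len(items)):
        prob, kept = 1.0, {}
        for ((u, v), (lab, p)), keep in zip(items, mask):
            prob *= p if keep else 1.0 - p  # Equation 2
            if keep:
                kept[(u, v)] = lab
        if contains(pattern, ugraph["nodes"], kept):
            total += prob
    return total

# Hypothetical uncertain graph: one A vertex joined to three B vertices.
ugraph = {
    "nodes": {"a": "A", "b1": "B", "b2": "B", "b3": "B"},
    "edges": {("a", "b1"): ("p", 0.8), ("a", "b2"): ("p", 0.6), ("a", "b3"): ("q", 0.3)},
}
# Hypothetical pattern: B -p- A -q- B.
pattern = {"nodes": {"x": "B", "y": "A", "z": "B"},
           "edges": {("x", "y"): "p", ("y", "z"): "q"}}
print(round(expected_support(pattern, ugraph), 3))
```

With these assumed values, the pattern is contained in exactly three of the eight implied graphs (those that keep the q-labeled edge and at least one p-labeled edge), and their probabilities add up to 0.3 · 0.92 = 0.276.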
A straightforward algorithm for solving this problem works as follows: (a) enumerate all candidate subgraph patterns; (b) for each generated candidate pattern, and for each uncertain graph in the database, generate all the implied exact graphs and compute the expected support of the pattern. The cost of the first step is the same as in the case of exact graphs. Hence, one of the existing strategies, based on the apriori property, can be applied for enumerating candidate patterns more efficiently. In our method, we use the approach of gSpan [28]. However, the cost of the second step is significantly increased compared to the corresponding one for the case of exact graph databases. Recall that each uncertain graph with $k$ edges implies a set of $2^k$ exact graphs. Therefore, for each pair of a candidate pattern and a graph, it requires $2^k$ subgraph isomorphism tests when the graph is uncertain instead of a single one when the graph is exact. A first approach for dealing with this problem is proposed in [32], which replaces this computation with a more efficient but approximate algorithm that can estimate the expected support of a subgraph pattern in an uncertain graph, when dealing with large graphs (i.e., above 30 edges). However, even with this approximation, the cost remains prohibitively high even for moderate size databases (e.g., above 100 graphs). Therefore, reducing the uncertain graphs to be considered to only those that may contribute to making a pattern frequent, and especially avoiding large graphs in the computation, becomes crucial. In the next sections, we propose a solution to this problem, using an index and a summary of the uncertain graph database, with additional optimizations for early termination and effective scheduling of graph comparisons.
4. THE UGRAP INDEX
As explained above, our goal is to prune the search space when
computing the expected support of candidate subgraph patterns, by
limiting the number of uncertain graphs that need to be examined
for containment. For this purpose, we construct an index of the
uncertain graph database, containing graph edges and their prob-
abilities. Furthermore, to achieve better pruning, taking into con-
sideration the structure of each candidate pattern and each exam-
ined graph, we also construct a structure containing connectivity
information between graph nodes. This information is summarized
in order to reduce memory requirements when dealing with large
databases and large graphs, especially in the case of dense graphs.
In this section, we present how the UGRAP index is constructed
and maintained.
4.1 Edge Index
The first component of the UGRAP index, denoted with $I_E$, is an inverted index on graph edges extended with information on edge probabilities in order to take uncertainty of edges into account. More specifically, the structure $I_E$ is a map where:
• each key is a label triple of the form $t = (L_u, L_v, L_e)$, representing graph edges, and
• the value of each key is a list containing the identifiers of the graphs in which these edges appear, as well as the corresponding occurrence probability.
An edge $(u, v)$ contained in an uncertain graph in the database is mapped to the key $T(u, v) = (L(u), L(v), L(u, v))$. The value of a key $t$ is then a list of pairs of the form $(G^P, p_t^{G^P})$, where $p_t^{G^P}$ is the probability that the graph $G^P$ contains at least one edge $e$ mapped to the key $t$. Only those graphs with non-zero probability are stored in the index. Given the independence assumption between edges, this probability is computed by:
$$p_t^{G^P} = 1 - \prod_{e \in G^P,\; T(e) = t} (1 - P(e)) \tag{6}$$
where the product denotes the probability that no edge mapped to $t$ exists. Formally, the edge index $I_E$ can be defined as follows.
Definition 6 (Edge Index). Given an uncertain graph database $D^P$, the edge index $I_E$ is a structure that returns, for any given triple $t = (L_u, L_v, L_e)$, a list of all the pairs $(G^P, p_t^{G^P})$, where $G^P$ is an uncertain graph in $D^P$ containing an edge $(u, v)$ having probability $p_t^{G^P} > 0$, such that $L(u) = L_u$, $L(v) = L_v$, and $L(u, v) = L_e$.
Constructing the $I_E$ structure is straightforward. Each uncertain graph in the database can be processed independently, parsing its edges to identify the list of keys and their probabilities, using Equation 6. The results are then merged to create the map described above. The process is detailed in Algorithm 1.
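The construction can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the dictionary-based graph representation and the function names are assumptions, and sorting the two endpoint labels of a key (so that an undirected edge gives the same key in both directions) is a design choice made here, not something the text specifies. Instead of first collecting the edges of each key, the sketch accumulates the product of Equation 6 directly, which yields the same values.

```python
from collections import defaultdict

def edge_key(graph, u, v, label):
    """Label triple T(u, v); endpoint labels are sorted (an assumption for undirected graphs)."""
    lu, lv = sorted((graph["nodes"][u], graph["nodes"][v]))
    return (lu, lv, label)

def build_edge_index(database):
    """database maps a graph id to {"nodes": {id: label}, "edges": {(u, v): (label, prob)}}."""
    index = defaultdict(list)
    for gid, graph in database.items():
        miss = {}  # key t -> probability that no edge mapped to t exists
        for (u, v), (label, prob) in graph["edges"].items():
            t = edge_key(graph, u, v, label)
            miss[t] = miss.get(t, 1.0) * (1.0 - prob)
        for t, q in miss.items():
            index[t].append((gid, 1.0 - q))  # Equation 6
    return dict(index)

# Hypothetical single-graph database; the probabilities 0.8, 0.6 and 0.3 are assumed values.
database = {
    "G": {
        "nodes": {"a": "A", "b1": "B", "b2": "B", "b3": "B"},
        "edges": {("a", "b1"): ("p", 0.8), ("a", "b2"): ("p", 0.6), ("a", "b3"): ("q", 0.3)},
    }
}
index = build_edge_index(database)
for key, postings in index.items():
    print(key, [(gid, round(p, 2)) for gid, p in postings])
```

For the assumed probabilities, the two p-labeled edges share one key with probability 1 − (0.2 · 0.4) = 0.92, and the q-labeled edge has probability 0.3, which agrees with the values reported in Example 2.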
Updating the index when an uncertain graph is added or removed from the database can be performed incrementally. The keys for this graph are computed and the corresponding entries in the index are updated accordingly, by appending or removing the corresponding item from the list of each of these keys. If the key is not already contained in the index, a new entry is created (or the entry is removed if the list of the key becomes empty). Finally, if an existing uncertain graph is updated, then the probabilities of all the affected keys need to be updated accordingly (which may also result in removing or adding keys).
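The incremental maintenance described above could look as follows in Python. This is a sketch under assumptions: the function names are hypothetical, and each key maps to a dictionary from graph identifier to probability (rather than a list of pairs) so that removals do not require scanning a list.

```python
def graph_keys(graph):
    """Map each label triple occurring in `graph` to its probability p_t (Equation 6)."""
    miss = {}
    for (u, v), (label, prob) in graph["edges"].items():
        lu, lv = sorted((graph["nodes"][u], graph["nodes"][v]))  # assumed key normalization
        t = (lu, lv, label)
        miss[t] = miss.get(t, 1.0) * (1.0 - prob)
    return {t: 1.0 - q for t, q in miss.items()}

def add_graph(index, gid, graph):
    """Insert the graph's keys, creating index entries that do not exist yet."""
    for t, p in graph_keys(graph).items():
        index.setdefault(t, {})[gid] = p

def remove_graph(index, gid, graph):
    """Remove the graph's postings and drop keys whose posting set becomes empty."""
    for t in graph_keys(graph):
        postings = index.get(t, {})
        postings.pop(gid, None)
        if not postings and t in index:
            del index[t]

def update_graph(index, gid, old_graph, new_graph):
    """Refresh all affected keys of a modified graph."""
    remove_graph(index, gid, old_graph)
    add_graph(index, gid, new_graph)

# Hypothetical graphs used only to exercise the functions.
g1 = {"nodes": {1: "A", 2: "B"}, "edges": {(1, 2): ("p", 0.5)}}
g2 = {"nodes": {1: "A", 2: "B", 3: "C"}, "edges": {(1, 2): ("p", 0.9), (2, 3): ("q", 0.4)}}
index = {}
add_graph(index, "g1", g1)
add_graph(index, "g2", g2)
remove_graph(index, "g2", g2)
print(index)
```

After removing `g2`, the key for its q-labeled edge disappears entirely, while the shared p-labeled key keeps only the posting of `g1`.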
Notice that, although more complex index structures have been
proposed for querying graph databases, which aim at avoiding ex-
pensive subgraph isomorphism tests [5, 9, 30], these structures are
not suitable for our problem for two reasons. First, they target ex-
act graphs; hence, their adaptation to uncertain graph databases is
an open issue. Second, and most importantly, more advanced in-
dex structures, such as the ones proposed in [5, 30], require first
to compute the frequent subgraphs in the database, which are then
used as features for the index. Instead, since our goal is to find
such frequent subgraph patterns, the index can only rely on sim-
pler features. As shown in Section 6, our index requires negligible
memory and computational resources to be built, even for large un-
certain graph databases.
Example 2. The edge index $I_E$ for a database containing only the uncertain graph illustrated in Figure 1 would contain two keys, $(A, B, p)$ and $(A, B, q)$, pointing to the lists $\{(G^P, 0.92)\}$ and $\{(G^P, 0.3)\}$, respectively.
4.2 Connectivity Index
The second component of the UGRAP index, denoted by $I_C$, is a structure containing summarized information regarding connectivity of graph nodes. This additional structural information is useful when deciding which uncertain graphs may contain a candidate subgraph pattern with non-zero probability.
Intuitively, the purpose of this structure is to extend the edge index, allowing paths of length $\ell > 1$. In particular, $I_C$ provides information on whether there exists a path of length $\ell$ (for values of $\ell$ up to a maximum length $\ell_{\max}$) between two vertices $u$ and $v$ of a graph $G^P$ with labels $L(u)$ and $L(v)$, respectively. Notice that, unlike the case of single edges, the independence assumption does not hold between paths, since two paths may contain common edges. Therefore, the probability that an uncertain graph $G^P$ contains a path of length $\ell$ between two vertices with labels $L(u)$ and $L(v)$ cannot be computed in a straightforward way, i.e., similar to Equation 6 for edges. Instead, it requires applying the inclusion-exclusion principle, which involves finding all the possible paths between all pairs of vertices with labels $L(u)$ and $L(v)$ and identifying all the overlaps between any subset of these paths. Since this would make the construction and maintenance of the index an expensive and complex operation, we do not compute and store these probabilities; instead, we only store whether such a path exists with probability higher than zero or not.
Algorithm 1 Construction of the Edge Index $I_E$
Input: An uncertain graph database $D^P$
Output: The edge index $I_E$
Initialize $I_E$ to an empty map
for all $G^P \in D^P$ do
  Initialize $K$ to an empty map
  for all $(u, v) \in E(G^P)$ do
    $t \leftarrow (L(u), L(v), L(u, v))$
    $K(t) \leftarrow K(t) \cup \{(u, v)\}$
  end for
  for all $t \in K$ do
    $p_t \leftarrow 1 - \prod_{e \in K(t)} (1 - P(e))$
    $I_E(t) \leftarrow I_E(t) \cup \{(G^P, p_t)\}$
  end for
end for
return $I_E$
Another issue that arises by allowing for paths of length $\ell > 1$ is that the size of the index is significantly increased, due to the exponential increase of the number of possible paths. To deal with this problem, we only maintain a summary of this information using Bloom filters [3]. A Bloom filter consists of an array of $m$ bits and $k$ independent hash functions $F = \{f_1, f_2, \ldots, f_k\}$, which hash elements of a universe $U$ to an integer in the range of $[1, m]$. The $m$ bits are initially set to 0 in an empty Bloom filter. An element $x \in U$ is inserted into the Bloom filter by setting all positions $f_i(x)$ of the bit array to 1, for all $f_i \in F$. Thus, an element $x$ is contained in the original set only if all the positions $f_i(x)$ of the Bloom filter are set to 1. If at least one of these positions is set to 0, we can safely conclude that $x$ is not present in the original set. However, due to hash collisions, there is also a small probability of false positives, $\Pr_{fp} \approx (1 - e^{-kn/m})^k$, where $n$ denotes the number of elements hashed in the Bloom filter. In our case, a high probability of false positives decreases the pruning power of the connectivity index, since we use Bloom filters to summarize all the paths of a given length contained in each graph. Each path inserted in the Bloom filter is represented by the labels of its start node and end node, sorted lexicographically. Formally, the connectivity index $I_C$ is defined as follows.
Definition 7 (Connectivity Index). Given an uncertain graph $G^P$, an integer $\ell \le \ell_{max}$ and two labels $L_u$ and $L_v$, the connectivity index $I_C$ is a structure such that:
• if the uncertain graph $G^P$ contains a path of length $\ell$ between two vertices $u$ and $v$ with labels $L(u) = L_u$ and $L(v) = L_v$, then $I_C(G^P, L_u, L_v, \ell) = 1$,
• otherwise, $I_C(G^P, L_u, L_v, \ell) = 0$ with probability at least $1 - \varepsilon$, and $I_C(G^P, L_u, L_v, \ell) = 1$ with probability at most $\varepsilon$, for a fixed error probability threshold $\varepsilon$.
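The Bloom filter summary that underlies this definition can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the paper's implementation: the class `BloomFilter`, the helper `path_key`, and the use of salted SHA-256 digests as the $k$ hash functions are all choices made for the example. Positions are 0-based here rather than in $[1, m]$, and `n` counts insertions, so re-inserting the same key inflates the estimate.

```python
import hashlib
import math


class BloomFilter:
    """Bit array of m positions with k hash functions (hypothetical helper)."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [False] * m
        self.n = 0  # number of insertions performed so far

    def _positions(self, item):
        # Assumption: the i-th hash function is SHA-256 salted with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True
        self.n += 1

    def __contains__(self, item):
        # A single unset position proves absence; all set means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))

    def false_positive_rate(self):
        # Standard estimate (1 - e^{-kn/m})^k.
        return (1.0 - math.exp(-self.k * self.n / self.m)) ** self.k


def path_key(label_u, label_v):
    """Represent a path by its two endpoint labels, sorted lexicographically."""
    a, b = sorted((label_u, label_v))
    return f"{a}|{b}"
```

A membership test that returns `False` corresponds to the index value 0 and is always correct, while a `True` answer may be a false positive. Choosing `m` and `k` so that `false_positive_rate()` stays below a target value is one way to meet the threshold $\varepsilon$.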
The process of constructing the connectivity index $I_C$ is described in detail in Algorithm 2. As with the edge index, $I_C$ can also be built progressively, or maintained to reflect changes in the underlying graph database.
Note that there is no need to construct the index for $\ell = 1$, since the graphs containing single-edge paths can be efficiently retrieved from the edge index $I_E$. Therefore, we only consider $\ell \in [2, \ell_{max}]$ when constructing the connectivity index, as well as for deciding whether an uncertain graph contains a candidate subgraph pattern.
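To make the restriction to lengths of at least 2 concrete, the sketch below records, for one graph, which sorted label pairs are connected by a path of each length from 2 up to a maximum. It is only an illustration and does not reproduce Algorithm 2. It assumes that paths are simple (no repeated vertices), that every listed edge has non-zero probability, and it uses an exact Python set in place of a Bloom filter, so the error probability is zero. The names `build_connectivity_index` and `may_contain_path` are hypothetical.

```python
from collections import defaultdict


def build_connectivity_index(vertex_labels, edges, max_len):
    """Return {length: set of sorted label pairs} for lengths 2..max_len.

    vertex_labels: dict vertex -> label
    edges: iterable of (u, v) pairs, each assumed to exist with probability > 0
    """
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    index = {length: set() for length in range(2, max_len + 1)}

    def extend(path, visited):
        length = len(path) - 1  # number of edges on the current path
        if length >= 2:
            pair = tuple(sorted((vertex_labels[path[0]], vertex_labels[path[-1]])))
            index[length].add(pair)
        if length >= max_len:
            return
        for nxt in adjacency[path[-1]]:
            if nxt not in visited:  # assumption: simple paths only
                visited.add(nxt)
                path.append(nxt)
                extend(path, visited)
                path.pop()
                visited.remove(nxt)

    for start in vertex_labels:
        extend([start], {start})
    return index


def may_contain_path(index, label_u, label_v, length):
    """Look up whether a path of the given length joins the two labels."""
    return tuple(sorted((label_u, label_v))) in index.get(length, set())
```

Replacing each set with a Bloom filter over the same sorted label pairs would trade the exact answer for a bounded false-positive rate and a fixed memory footprint per graph and length.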
