Expressive Languages for Path Queries over Graph-Structured Data

doi:10.1145/2389241.2389250


Pablo Barceló

Dept. of Computer Science, Univ. of Chile

pbarcelo@dcc.uchile.cl

Carlos Hurtado

Fac. Ingeniería y Ciencias, Univ. A. Ibañez

carlos.hurtado@uai.cl

Leonid Libkin

Sch. of Informatics, Univ. of Edinburgh

libkin@inf.ed.ac.uk

Peter Wood

Dept. of CS and Inf. Syst., Birkbeck, U. London

ptw@dcs.bbk.ac.uk

ABSTRACT

For many problems arising in the setting of graph querying (such as finding semantic associations in RDF graphs, exact and approximate pattern matching, sequence alignment, etc.), the power of standard languages such as the widely studied conjunctive regular path queries (CRPQs) is insufficient in at least two ways. First, they cannot output paths and second, more crucially, they cannot express relations among paths. We thus propose a class of extended CRPQs, called ECRPQs, which add regular relations on tuples of paths, and allow path variables in the heads of queries. We provide several examples of their usefulness in querying graph-structured data, and study their properties. We analyze query evaluation and representation of tuples of paths in the output by means of automata. We present a detailed analysis of data and combined complexity of queries, and consider restrictions that lower the complexity of ECRPQs to that of relational conjunctive queries. We study the containment problem, and look at further extensions with first-order features, and with non-regular relations that express arithmetic properties of paths, based on the lengths and numbers of occurrences of labels.

Categories and Subject Descriptors. H.2.1 [Database

Management]: Logical Design—Data Models; F.1.1

[Computation by abstract devices]: Models of

Computation—Automata

General Terms. Theory, Languages, Algorithms

Keywords. Graph databases, conjunctive queries, reg-

ular relations, regular path queries


PODS’10, June 6–11, 2010, Indianapolis, Indiana, USA.

1. Introduction

For graph-structured data, queries that allow users to specify the types of paths in which they are interested have always played a central role. Most commonly, the specification of such paths has been by means of regular expressions over the alphabet of edge labels [2, 10, 13, 16, 29]. The output of a query is typically a set of tuples of nodes that are connected in some way by the paths specified. The canonical class of queries with this functionality are the conjunctive regular path queries (CRPQs), which have been the subject of much investigation, e.g. [10, 14, 16].

However, the rapid increase in the size and complexity of graph-structured data (e.g. in the Semantic Web, or in biological applications) has raised the need for additional functionality in query languages. Specifically, in many examples, the minimum requirements of sufficiently expressive queries are: (a) the ability to define complex semantic relationships between paths and (b) the ability to include paths in the output of the query. Neither of these is supported by CRPQs.

There are multiple examples of queries that require these new capabilities. For example, [5] introduces a query language for RDF/S in which paths can be compared based on specific semantic associations. In handling biological sequences one often needs to compare paths based on similarity (e.g., edit distance) [20]. Paths can be compared with respect to other parameters, e.g., lengths or numbers of occurrences of labels, which can be useful in route-finding applications [6]. As for the ability to output paths, this has been proposed, for example, as an extension to the SPARQL query language, the standard for retrieving RDF data [24]. However, [24] only proposed a declarative language, and left most basic questions unexplored (e.g., what should an output be if there are infinitely many paths between nodes?). Other applications for this new functionality include determining the provenance of data or artifacts [21], finding associations in linked data [27], biological data [26] or social (or criminal) networks [32], as well as performing semantic searches over web-derived knowledge [36].

While the need for the extended functionality of graph query languages is well-documented (and sometimes is even incorporated into a programming syntax), the basic theoretical properties of such languages are completely unexplored. We do not know whether queries can be meaningfully evaluated, what their complexity is, whether they can be optimized, etc.

Our main goals, therefore, are to formally define extensions of graph queries that can express complex semantic associations between paths and output paths to the user, and to study them, concentrating on query evaluation and its complexity, as well as some static analysis problems.

We work with the class of extended conjunctive regular path queries (ECRPQs), which generalize CRPQs by allowing them to express the kind of semantic association properties we explained above. That is, we allow (i) $n$-tuples of path labels to be checked for conformity to $n$-ary path languages, and (ii) paths, rather than simply nodes, to be output. Conformity with respect to $n$-ary languages is given, following the idea behind CRPQs, with respect to $n$-ary regular relations.

As an example, consider a graph $G$ with a single edge label, defining the student-advisor relationship. Using CRPQs, one can express many queries, such as finding academic ancestors, or people whose sets of academic parents and grandparents intersect, or checking whether Van Gucht and Tannen have a common academic ancestor (and if so, who that person is). However, with CRPQs one cannot express queries asking for pairs of scientists who have the same-length path to Tarski, for example, nor can one ask for the precise paths by which Van Gucht and Tannen are related to their common academic ancestor. With ECRPQs, we can express such queries.

While leaving the above queries to the reader as an exercise, we now outline a few examples of problems where the power of ECRPQs is required. They will be fully developed in Section 3, after we have presented the syntax and semantics of ECRPQs.

(i) Pattern matching. Given an alphabet $\Sigma$ and a set of variables $\mathcal{V}$, a pattern is a string over $\Sigma \cup \mathcal{V}$. A pattern defines a pattern language by instantiating variables with strings in $\Sigma^*$. Pattern languages need not be context-free: e.g., the language of squared words over $\Sigma$ can be expressed by the pattern $XX$, where $X \in \mathcal{V}$. But finding nodes $x$ and $y$ connected by a path whose label is in the language of squared words can be expressed by the ECRPQ:

$$\mathit{Ans}(x, y) \leftarrow (x, \pi_1, z), (z, \pi_2, y), \pi_1 = \pi_2$$

where $x$, $y$ and $z$ are node variables and $\pi_1$ and $\pi_2$ are path variables. Variables $z$, $\pi_1$, and $\pi_2$ are meant to be existentially quantified. What makes this different from CRPQs is the binary relation $\pi_1 = \pi_2$ on paths: it states that the labels of the paths between $x$ and an intermediate node $z$, and between $z$ and $y$, are the same.

(ii) Semantic web associations. In RDF/S, properties can be declared to be subproperties of other properties. This is used in [5] to define a notion of semantic association based on $\rho$-isomorphic property sequences: two sequences are $\rho$-isomorphic if they are of the same length and the properties at the same position in each sequence are subproperties of one another. Such pairs of sequences can be found by a modification of the previous query with a different binary relation expressing the fact that the paths are $\rho$-isomorphic.

(iii) Approximate matching. Approximate string matching [19, 23] and (biological) sequence alignment [20] are both based on the notion of edit distance. The relation representing pairs of sequences that have edit distance at most $k$ from one another, for some fixed $k$, is regular [18]. So given a graph representing a pair of sequences, an ECRPQ can determine whether they have edit distance at most $k$. We show in Section 3.1 that we can also output the actual gaps and mismatches in the sequences using an ECRPQ.

Outline of the results. After we formally define ECRPQs, we present an algorithm for query evaluation. It turns out that the sets of labels of paths satisfying a query are regular, and thus the evaluation algorithm constructs automata to represent such sets.

We then investigate the complexity of query evaluation. As yardsticks, we consider relational languages as well as CRPQs. For conjunctive queries, combined complexity is NP-complete, while it jumps to PSPACE-complete for relational calculus. Hence we cannot hope to get anything below NP for ECRPQs, and we hope not to exceed the complexity of relational queries in a reasonable class. As for data complexity, it is known to be NLOGSPACE-complete for CRPQs, so this will serve as another benchmark.

It turns out that the data complexity of ECRPQs matches that of CRPQs, but combined complexity goes up from NP to PSPACE, matching relational calculus instead. In this case it is natural to look for restrictions. A standard one for CQs is a restriction to acyclicity. This works for CRPQs (combined complexity becomes tractable) but does not work for ECRPQs, as the combined complexity remains PSPACE-complete. However, if our regular relations can only talk about lengths of paths, then the complexity of ECRPQs drops to NP, matching the complexity of the usual relational CQs.

We then look at extensions of CRPQs and ECRPQs: with negation and universal quantification, and with some non-regular relations. For the former, we get surprisingly reasonable bounds for CRPQs, but the complexity becomes too high when both negation and relations on paths are allowed. For the latter, we look at extensions with linear constraints on path lengths, and prove some good complexity bounds (tractable data complexity and NP combined complexity). We also look at relations that compare numbers of occurrences of labels in paths, and prove some low complexity bounds for queries with such relations.

While query containment is known to be decidable for CRPQs, we show that ECRPQs share more properties with full relational calculus: containment for them becomes undecidable. We recover decidability in one important subcase though.

Organization. In the next section, we present background material on graphs, regular relations and CRPQs. Section 3 introduces ECRPQs and looks at their applications in more detail. In Section 4, we consider the evaluation of ECRPQs. Section 5 deals with the data and combined complexity of ECRPQs. In Section 6 we look at query containment, and in Section 7 we consider extensions with negation, and with non-regular features.

2. Preliminaries

Labeled graphs and paths. Queries in our setting will be evaluated over labeled database graphs (db-graphs), which naturally model semistructured data. Formally, if $\Sigma$ is a finite alphabet, then a $\Sigma$-labeled db-graph $G$ (or simply db-graph if $\Sigma$ is clear from the context) is a pair $(V, E)$, such that $V$ is a finite set of nodes and $E \subseteq V \times \Sigma \times V$ is a set of directed edges labeled in $\Sigma$.

A path $\rho$ between nodes $v_0$ and $v_m$ in $G$ is a sequence $v_0 a_0 v_1 a_1 v_2 \cdots v_{m-1} a_{m-1} v_m$, where $m \geq 0$, so that all the $v_i$'s are in $V$, all the $a_j$'s are letters of $\Sigma$, and $(v_i, a_i, v_{i+1})$ is in $E$ for each $i < m$. The label of such a path $\rho$, denoted by $\lambda(\rho)$, is the string $a_0 \cdots a_{m-1} \in \Sigma^*$. We also define the empty path as $(v, \varepsilon, v)$ for each $v \in V$; the label of such a path is the empty string $\varepsilon$.

Note that a $\Sigma$-labeled db-graph $G$ can be naturally viewed as a nondeterministic finite automaton (NFA) over alphabet $\Sigma$ without initial and final states. Its states are nodes in $V$, and its transitions are edges in $E$. We use this equivalence in several constructions in the paper.
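The definitions above can be illustrated with a small Python sketch. It is not part of the paper: the class name `DbGraph`, the flat-list encoding of a path, and the assumption that every label is a single character are illustrative choices made here.

```python
from typing import List, Set, Tuple

Node = str
Edge = Tuple[Node, str, Node]  # (source, label, target)


class DbGraph:
    """A Sigma-labeled db-graph (V, E) with E a subset of V x Sigma x V."""

    def __init__(self, nodes: Set[Node], edges: Set[Edge]) -> None:
        for (u, _a, v) in edges:
            if u not in nodes or v not in nodes:
                raise ValueError("edge endpoint is not a node")
        self.nodes = nodes
        self.edges = edges

    def is_path(self, rho: List[str]) -> bool:
        """Check that rho = [v0, a0, v1, ..., a_{m-1}, vm] is a path in G."""
        # A path alternates nodes and labels, so its encoding has odd length.
        if len(rho) % 2 == 0 or rho[0] not in self.nodes:
            return False
        return all(
            (rho[i], rho[i + 1], rho[i + 2]) in self.edges
            for i in range(0, len(rho) - 2, 2)
        )


def label(rho: List[str]) -> str:
    """lambda(rho): concatenation of the labels at the odd positions of rho."""
    return "".join(rho[1::2])
```

In this encoding the empty path at a node `v` is the one-element list `[v]`, whose label is the empty string.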

Regular relations. As our plan is to extend the notion of recognizability from string languages to $n$-ary string relations, we now give the standard definition of regular relations over $\Sigma$ [15, 18, 8]. Let $\bot$ be a symbol not in $\Sigma$. We denote the extended alphabet $(\Sigma \cup \{\bot\})$ by $\Sigma_\bot$. Let $\bar{s} = (s_1, \ldots, s_n)$ be an $n$-tuple of strings over alphabet $\Sigma$. We construct a string $[\bar{s}]$ over alphabet $(\Sigma_\bot)^n$, whose length is the maximum of the lengths of the $s_j$'s, and whose $i$-th symbol is a tuple $(c_1, \ldots, c_n)$, where each $c_k$ is the $i$-th symbol of $s_k$, if the length of $s_k$ is at least $i$, or $\bot$ otherwise. In other words, we pad shorter strings with the symbol $\bot$, and then view the $n$ strings as one string over the alphabet of $n$-tuples of letters. An $n$-ary relation $S$ on $\Sigma^*$ is regular, if the set $\{[\bar{s}] \mid \bar{s} \in S\}$ of strings over alphabet $(\Sigma_\bot)^n$ is regular (i.e., accepted by an automaton over $(\Sigma_\bot)^n$, or given by a regular expression over $(\Sigma_\bot)^n$). We shall often use the same letter for both a regular expression over $(\Sigma_\bot)^n$ and the relation over $\Sigma^*$ it denotes, as doing so will not lead to any ambiguity.
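The padding construction of $[\bar{s}]$ can be sketched in a few lines of Python. The function name `convolve` and the constant `BOT` are names chosen here for illustration; the sketch assumes that the padding character does not occur in $\Sigma$.

```python
from typing import List, Sequence, Tuple

BOT = "⊥"  # padding symbol, assumed not to be a letter of Sigma


def convolve(strings: Sequence[str]) -> List[Tuple[str, ...]]:
    """Return [s-bar] as a list of n-tuples over Sigma extended with BOT."""
    # The length of [s-bar] is the length of the longest component string.
    width = max((len(s) for s in strings), default=0)
    return [
        # The i-th symbol takes the i-th letter of each string, or BOT
        # for strings that have already ended.
        tuple(s[i] if i < len(s) else BOT for s in strings)
        for i in range(width)
    ]
```

For instance, applying `convolve` to the pair `("ab", "abcd")` yields four letters, the last two of which carry `BOT` in the first component.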

As an example, consider a binary relation $s \preceq s'$, saying that $s$ is a prefix of $s'$. The automaton recognizing this relation accepts if it reads a sequence of letters of the form $(a, a)$, for $a \in \Sigma$, possibly followed by a sequence of letters of the form $(\bot, b)$, for $b \in \Sigma$. As another example, consider a binary relation $\mathrm{el}(s, s')$ (equal length) saying that $|s| = |s'|$. This relation is recognized by an automaton that accepts if it does not see any letters involving the $\bot$ symbol.
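The two automata just described can be simulated directly on padded pairs. The following Python sketch is an illustration only: the function names, the state names `"match"` and `"tail"`, and the helper `convolve` (repeated here to keep the block self-contained) are assumptions of this example rather than notation from the paper.

```python
from typing import List, Sequence, Tuple

BOT = "⊥"  # padding symbol, assumed not to be a letter of Sigma


def convolve(strings: Sequence[str]) -> List[Tuple[str, ...]]:
    """Pad the strings with BOT and read them as one string of tuples."""
    width = max((len(s) for s in strings), default=0)
    return [tuple(s[i] if i < len(s) else BOT for s in strings)
            for i in range(width)]


def accepts_prefix(word: Sequence[Tuple[str, ...]]) -> bool:
    """Two-state automaton for the convolutions of pairs (s, s') with s a prefix of s'."""
    state = "match"  # still reading letters of the form (a, a)
    for (c1, c2) in word:
        if state == "match" and c1 == c2 and c1 != BOT:
            continue
        if c1 == BOT and c2 != BOT:
            state = "tail"  # from now on only letters (BOT, b) are allowed
            continue
        return False
    return True


def accepts_el(word: Sequence[Tuple[str, ...]]) -> bool:
    """Automaton for equal length: reject as soon as a letter contains BOT."""
    return all(BOT not in letter for letter in word)
```

Both functions read their input once from left to right with a constant amount of state, which is what makes the corresponding relations regular.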

To understand which relations on strings are regular, it is often useful to provide a model-theoretic characterization of this class. In the following we assume familiarity with first-order logic (FO). Consider the FO-structure $M_{\mathrm{univ}} = \langle \Sigma^*, \preceq, \mathrm{el}, (P_a)_{a \in \Sigma} \rangle$ with domain $\Sigma^*$, where $\preceq$ and $\mathrm{el}$ are as above, and $P_a(s)$ is true iff the last letter of $s$ is $a$. This is known as a universal automatic structure due to the following [8, 9]: an $n$-ary relation $S$ on $\Sigma^*$ is regular iff there exists an FO formula $\varphi_S(x_1, \ldots, x_n)$ over $M_{\mathrm{univ}}$ such that

$$S = \{\bar{s} \in (\Sigma^*)^n \mid M_{\mathrm{univ}} \models \varphi_S(\bar{s})\}.$$

In particular, regular relations are closed under all Boolean combinations, product, and projection. Furthermore, using the above result it is quite easy to show that an $n$-ary relation is regular, by exhibiting an FO formula defining it (see [8, 9, 7] for examples). For example, $|s| < |s'|$ is a regular relation definable by $\varphi(x, y) = \exists y' \, (y' \preceq y \wedge y' \neq y \wedge \mathrm{el}(y', x))$. On the other hand, more elaborate techniques have to be used to prove that an $n$-ary relation on $\Sigma^*$ is not regular. Examples of this kind include the binary relation $\preceq_{ss}$, that consists of all pairs $(s_1, s_2)$ such that $s_1$ is a subsequence of $s_2$, and the ternary relation that contains all tuples $(s_1, s_2, s_3)$ such that $s_1 s_2 = s_3$.
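As a sanity check on the FO definition of $|s| < |s'|$ given above, the formula can be evaluated by brute force on short strings. This Python sketch is only an illustration; the function names are chosen here, and the existential quantifier is restricted to proper prefixes of $y$, which is exact because the formula requires its witness to be a prefix of $y$ different from $y$.

```python
from itertools import product


def el(s: str, t: str) -> bool:
    """The equal-length predicate el(s, t)."""
    return len(s) == len(t)


def phi(x: str, y: str) -> bool:
    """Evaluate: exists y' such that y' is a proper prefix of y and el(y', x)."""
    # y[:i] for i < len(y) enumerates exactly the proper prefixes of y.
    return any(el(y[:i], x) for i in range(len(y)))


def all_strings(alphabet: str, max_len: int):
    """Enumerate all strings over the alphabet of length at most max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)
```

Comparing `phi(x, y)` with `len(x) < len(y)` over all strings produced by `all_strings("ab", 3)` confirms the equivalence on that finite sample; it is of course not a proof.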

Conjunctive regular path queries. A basic querying mechanism for graph databases is the class of regular path queries [3, 11] that retrieve all pairs of objects in a db-graph connected by a path conforming to some regular expression. However, it has been argued (e.g. [30]) that in order to make regular path queries useful in practice, they should be extended with several features, one of them being the possibility of using conjunctions of atoms. This extension yields the class of conjunctive regular path queries, which we formally define below (see also [13, 29, 16, 10]).
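A single regular path query can be evaluated by searching the product of the db-graph with an automaton for the regular expression. The Python sketch below shows this standard product construction under simplifying assumptions made here: the regular expression is assumed to be already compiled into a deterministic automaton given as a transition dictionary `delta`, and the function name `eval_rpq` is illustrative rather than taken from the paper.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str, str]  # (source, label, target)


def eval_rpq(
    nodes: Iterable[str],
    edges: Iterable[Edge],
    delta: Dict[Tuple[int, str], int],
    start: int,
    finals: Set[int],
) -> Set[Tuple[str, str]]:
    """Return all pairs (u, v) joined by a path whose label the DFA accepts."""
    out: Dict[str, list] = {}
    for (u, a, v) in edges:
        out.setdefault(u, []).append((a, v))

    answers: Set[Tuple[str, str]] = set()
    for src in nodes:
        # Breadth-first search over pairs (graph node, automaton state).
        seen = {(src, start)}
        queue = deque(seen)
        while queue:
            node, q = queue.popleft()
            if q in finals:
                answers.add((src, node))
            for (a, nxt) in out.get(node, []):
                q2 = delta.get((q, a))
                if q2 is not None and (nxt, q2) not in seen:
                    seen.add((nxt, q2))
                    queue.append((nxt, q2))
    return answers
```

Because the search explores pairs of a node and a state rather than paths, it terminates even when the db-graph is cyclic and has infinitely many paths.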

Fix a countable set of node variables (typically denoted by $x, y, z, \ldots$), and a countable set of path variables (denoted by $\pi, \omega, \chi, \ldots$). A conjunctive regular path query (CRPQ) $Q$ over a finite alphabet $\Sigma$ is an expression of the form:

$$\mathit{Ans}(\bar{z}) \leftarrow \bigwedge_{1 \leq i \leq m} (x_i, \pi_i, y_i), \; \bigwedge_{1 \leq j \leq t} L_j(\omega_j), \qquad (1)$$

such that

(i) $m > 0$, $t \geq 0$,

(ii) each $L_j$ is a regular expression over $\Sigma$,

(iii) $\bar{x} = (x_1, \ldots, x_m)$, $\bar{y} = (y_1, \ldots, y_m)$ and $\bar{z}$ are tuples of node variables,

(iv) $\{\pi_1, \ldots, \pi_m\}$ are distinct path variables,

(v) $\{\omega_1, \ldots, \omega_t\}$ are distinct path variables and each $\omega_j$ is among the $\pi_i$'s, and

(vi) $\bar{z}$ is a tuple of node variables among $\bar{x}$ and $\bar{y}$.

The atom $\mathit{Ans}(\bar{z})$ is the head of the query, the expression on the right of the $\leftarrow$ is its body. The query $Q$ is Boolean if its head is of the form $\mathit{Ans}()$, i.e. $\bar{z}$ is the empty tuple.

Intuitively, such a query $Q$ selects tuples $\bar{z}$ for which there exist values of the remaining node variables from $\bar{x}$ and $\bar{y}$ and paths $\pi_i$ between $x_i$ and $y_i$ whose labels satisfy the regular expressions $L_1$ to $L_t$. Formally, to define the semantics of CRPQs $Q$ of the form (1), we first introduce a relation $(G, \sigma, \mu) \models Q$, where $\sigma$ is a mapping from $\bar{x}, \bar{y}$ to the set of nodes of a db-graph $G = (V, E)$, and $\mu$ is a mapping from $\{\pi_1, \ldots, \pi_m\}$ to paths in $G$. This relation holds iff $\mu(\pi_i)$ is a path in $G$ from $\sigma(x_i)$ to $\sigma(y_i)$, for $1 \leq i \leq m$, and the label of each path $\mu(\omega_j)$ is in the language of $L_j$, for $1 \leq j \leq t$. We now define $Q(G)$ to be the set of tuples $\sigma(\bar{z})$ such that $(G, \sigma, \mu) \models Q$ for some $\mu$. If $Q$ is Boolean, we let $Q(G)$ be true if $(G, \sigma, \mu) \models Q$ for some $\sigma$ and $\mu$ (that is, as usual, the empty tuple models the Boolean constant true, and the empty set models the Boolean constant false).

Remark: Our syntax differs slightly from the usual CRPQ syntax in the literature (see e.g. [16, 10]). The reason is that we make explicit use of path variables in the queries, in order to treat CRPQs and ECRPQs in a uniform manner, while the standard approach is to refer to paths only implicitly.

3. Extended Conjunctive Regular Path Queries

Our goal is to extend the class of CRPQs in two ways. First, we want to allow free path variables in the heads of queries. Second, we want the bodies of queries to permit checking relations on sets of paths rather than just conformance of individual paths to regular languages. This leads to the definition of a class of extended CRPQs.

An extended conjunctive regular path query (ECRPQ) $Q$ over $\Sigma$ is an expression of the form:

$$\mathit{Ans}(\bar{z}, \bar{\chi}) \leftarrow \bigwedge_{1 \leq i \leq m} (x_i, \pi_i, y_i), \; \bigwedge_{1 \leq j \leq t} R_j(\bar{\omega}_j), \qquad (2)$$

such that

(i) $m > 0$, $t \geq 0$,

(ii) each $R_j$ is a regular expression that defines a regular relation over $\Sigma$,

(iii) $\bar{x} = (x_1, \ldots, x_m)$ and $\bar{y} = (y_1, \ldots, y_m)$ are tuples of node variables,

(iv) $\bar{\pi} = (\pi_1, \ldots, \pi_m)$ is a tuple of distinct path variables,

(v) $\{\bar{\omega}_1, \ldots, \bar{\omega}_t\}$ are distinct tuples of path variables, such that each $\bar{\omega}_j$ is a tuple of variables from $\bar{\pi}$, of the same arity as $R_j$,

(vi) $\bar{z}$ is a tuple of node variables among $\bar{x}$, $\bar{y}$, and

(vii) $\bar{\chi}$ is a tuple of path variables among those in $\bar{\pi}$.

Note that this is similar to the definition of CRPQs; the main differences between (1) and (2) are:

• ECRPQs can check whether a tuple of paths belongs to a regular relation, rather than just checking whether a path belongs to a regular language; and

• outputs of ECRPQs may contain both nodes and paths, while outputs of CRPQs contain only nodes.
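The syntactic conditions (i) to (vii) can be made concrete with a small data structure. The Python sketch below is an illustration under assumptions made here: the class names `RelAtom` and `ECRPQ`, the field names, and the choice to represent each regular relation only by a name and an arity are all hypothetical, and the regular expressions themselves are not modeled.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RelAtom:
    """An atom R_j(omega_j): a named relation applied to path variables."""
    name: str
    arity: int
    paths: Tuple[str, ...]


@dataclass
class ECRPQ:
    head_nodes: Tuple[str, ...]         # the tuple z
    head_paths: Tuple[str, ...]         # the tuple chi
    atoms: List[Tuple[str, str, str]]   # the atoms (x_i, pi_i, y_i)
    relations: List[RelAtom]            # the atoms R_j(omega_j)

    def is_well_formed(self) -> bool:
        path_vars = [pi for (_x, pi, _y) in self.atoms]
        node_vars = {v for (x, _pi, y) in self.atoms for v in (x, y)}
        omegas = [r.paths for r in self.relations]
        return (
            len(self.atoms) > 0                          # (i) m > 0
            and len(set(path_vars)) == len(path_vars)    # (iv) distinct pi_i
            and len(set(omegas)) == len(omegas)          # (v) distinct tuples
            and all(
                len(r.paths) == r.arity and set(r.paths) <= set(path_vars)
                for r in self.relations
            )                                            # (v) arity and scope
            and set(self.head_nodes) <= node_vars        # (vi)
            and set(self.head_paths) <= set(path_vars)   # (vii)
        )
```

Under this encoding, the squared-words query from the introduction has two relational atoms and a single binary relation atom, and passes the check.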

The head, the body, and the notion of Boolean ECRPQs are defined in the standard way. The relational part of an ECRPQ $Q$ of the form (2) is $\bigwedge_{1 \leq i \leq m} (x_i, \pi_i, y_i)$.

The semantics of ECRPQs is defined by a natural extension of the semantics of CRPQs. For an ECRPQ $Q$ of the form (2), a db-graph $G$ and mappings $\sigma$ from node variables to nodes and $\mu$ from path variables to paths, we write $(G, \sigma, \mu) \models Q$ if

• $\mu(\pi_i)$ is a path in $G$ from $\sigma(x_i)$ to $\sigma(y_i)$, for $1 \leq i \leq m$, and

• for each $\bar{\omega}_j = (\pi_{j_1}, \ldots, \pi_{j_k})$, the tuple of strings consisting of labels of $\mu(\pi_{j_1}), \ldots, \mu(\pi_{j_k})$ belongs to the relation $R_j$.

The output of $Q$ on $G$ (where the head of $Q$ is $\mathit{Ans}(\bar{z}, \bar{\chi})$) is defined as

$$Q(G) = \{(\sigma(\bar{z}), \mu(\bar{\chi})) \mid (G, \sigma, \mu) \models Q\}.$$

Note that the implicit existential quantification over path variables that appear in the body but not in the head is quantification over a potentially infinite set, as there are infinitely many paths in any cyclic db-graph.

From now on, we identify the class of CRPQs with the restriction of the class of ECRPQs to queries that do not use regular relations of arity $\geq 2$. This is more general than the definition of the previous section, since we now allow CRPQs to output paths.

It is easy to prove that the class of ECRPQs is strictly more expressive than the class of CRPQs. Formally,

Proposition 3.1. There is a Boolean ECRPQ $Q$ that is not equivalent to any CRPQ $Q'$.

3.1 Applications of ECRPQs

In this section, we show that ECRPQs can express queries found in a wide variety of application areas, including finding associations in semantic web (or linked) data, pattern matching, approximate string matching, and biological sequence alignment.

Finding semantic web associations. In a query language for RDF/S introduced in [5], paths can be compared based on specific semantic associations. Edges correspond to RDF properties and paths to property sequences. A property $a$ can be declared to be a subproperty of property $b$, which we denote by $a \prec b$. Two property sequences $u$ and $v$ are called $\rho$-isomorphic iff $u = u_1, \ldots, u_n$ and $v = v_1, \ldots, v_n$, for some $n$, and $u_i \prec v_i$ or $v_i \prec u_i$ for every $i \leq n$ [5]. Nodes $x$ and $y$ are called $\rho$-isoAssociated iff $x$ and $y$ are the origins of two $\rho$-isomorphic property sequences.

Finding nodes which are $\rho$-isoAssociated cannot be done in a query language supporting only conventional regular expressions, not least because doing so requires checking that two paths are of equal length. However, pairs of $\rho$-isomorphic sequences can be expressed using the regular relation $R$ given by the following regular expression: $\left( \bigcup_{a, b \in \Sigma : (a \prec b \vee b \prec a)} (a, b) \right)^*$. Then an ECRPQ returning pairs of nodes $x$ and $y$ that are $\rho$-isoAssociated can be written as follows:

$$\mathit{Ans}(x, y) \leftarrow (x, \pi_1, z_1), (y, \pi_2, z_2), R(\pi_1, \pi_2)$$
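Membership of a pair of property sequences in the relation $R$ can be checked position by position, mirroring the regular expression above. In this Python sketch the function name `rho_isomorphic` and the encoding of the subproperty declarations as a set `sub` of pairs $(a, b)$ with $a \prec b$ are assumptions made for illustration; whether $\prec$ is reflexive or transitive is left to whoever supplies `sub`.

```python
from typing import Sequence, Set, Tuple


def rho_isomorphic(
    u: Sequence[str],
    v: Sequence[str],
    sub: Set[Tuple[str, str]],
) -> bool:
    """Check that (u, v) is in R: equal length and pairwise related properties."""
    # Equal length corresponds to the absence of padding symbols.
    if len(u) != len(v):
        return False
    # Each letter (a, b) of the convolution must satisfy a < b or b < a.
    return all((a, b) in sub or (b, a) in sub for a, b in zip(u, v))
```

The property names used when calling this function (for example `"paints"` and `"creates"`) would come from the RDF/S schema at hand; none are prescribed by the paper.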

Path variables in an ECRPQ can also be used to return the actual paths found by the query, a mechanism found in the query languages proposed in [2, 5, 21, 24]. For example, in [5] a $\rho$-query can take a pair of nodes $u$, $v$ and return the property sequences relating them. This too can be expressed by an ECRPQ:

$$\mathit{Ans}(\pi_1, \pi_2) \leftarrow (u, \pi_1, z_1), (v, \pi_2, z_2), R(\pi_1, \pi_2)$$

where the regular relation $R$ is defined as above.

Pattern matching. Let $\Sigma$ be a finite alphabet and $\mathcal{V}$ be a countable set of variables such that $\Sigma \cap \mathcal{V} = \emptyset$. A pattern $\alpha$ is a string over $\Sigma \cup \mathcal{V}$. It denotes the language $L_\Sigma(\alpha)$ obtained by applying substitutions $\sigma : \mathcal{V} \to \Sigma^*$ to $\alpha$. As we remarked already, such languages need not even be context-free.

However, for each pattern $\alpha = \alpha_1 \cdots \alpha_n$, where every $\alpha_i \in \Sigma \cup \mathcal{V}$, we can define an ECRPQ $Q_\alpha(x, y)$ which finds pairs of nodes connected by a path whose label is in $L_\Sigma(\alpha)$ (note that this property is not definable by a CRPQ).

Indeed, the relational part of $Q_\alpha$ is $(x_0, \pi_1, x_1), \ldots, (x_{n-1}, \pi_n, x_n)$. If $\alpha_i$ is a letter $a \in \Sigma$, then $Q_\alpha$ contains the atom $a(\pi_i)$, and if $\alpha_i$ is a variable, then it contains $\Sigma^*(\pi_i)$. Finally, to ensure equality of variables, for every two $\alpha_i, \alpha_j$ which are the same variable, the query $Q_\alpha$ contains a conjunct $\pi_i = \pi_j$. It is clear that $Q_\alpha$ indeed finds nodes connected by paths with labels from $L_\Sigma(\alpha)$.
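For the simplest instance, the pattern $XX$ of squared words, the query has relational part $(x, \pi_1, z), (z, \pi_2, y)$ with the conjunct $\pi_1 = \pi_2$. One way to evaluate it is to walk both paths in lockstep through the graph, as in the Python sketch below. This is an illustration written for this example, not the evaluation algorithm of Section 4; the function name `squared_word_pairs` is hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str, str]  # (source, label, target)


def squared_word_pairs(
    nodes: Iterable[str], edges: Iterable[Edge]
) -> Set[Tuple[str, str]]:
    """Pairs (x, y) joined by a path whose label is ww for some word w."""
    out: Dict[str, list] = {}
    for (u, a, v) in edges:
        out.setdefault(u, []).append((a, v))

    node_list = list(nodes)
    answers: Set[Tuple[str, str]] = set()
    for x in node_list:
        for z in node_list:
            # State (u, v): the first path has gone from x to u, the second
            # from z to v, and both have read the same label so far.
            seen = {(x, z)}
            queue = deque(seen)
            while queue:
                u, v = queue.popleft()
                if u == z:
                    # The first path ends at the midpoint z, so v is an
                    # endpoint y of a path labeled ww.
                    answers.add((x, v))
                for (a, u2) in out.get(u, []):
                    for (b, v2) in out.get(v, []):
                        if a == b and (u2, v2) not in seen:
                            seen.add((u2, v2))
                            queue.append((u2, v2))
    return answers
```

Since empty paths are allowed and the empty word is a square, every pair of the form $(v, v)$ is among the answers.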

In fact, ECRPQs can express queries corresponding to a larger class of languages than the pattern languages. Regular expressions with backreferencing [4], as provided by egrep and Perl, for example, are in some sense a generalization of patterns in that substitutions of variables are restricted by regular expressions: the syntax $(e)\%X$, where $e$ is a regular expression and $X$ is a variable, binds a string $w \in L(e)$ to $X$. Subsequent uses of $X$ in the expression then match $w$. It should be clear that we can easily extend the above construction of an ECRPQ for patterns to one that corresponds to a regular expression with backreferencing.

Moreover, ECRPQs can match patterns, such as $a^n b^n c^n$, where $a, b, c \in \Sigma$ and $n \in \mathbb{N}$, that cannot be denoted by regular expressions with backreferencing, with the help of the equal length predicate:

$$\mathit{Ans}(x, y) \leftarrow (x, \pi_1, z_1), (z_1, \pi_2, z_2), (z_2, \pi_3, y), \; a^*(\pi_1), b^*(\pi_2), c^*(\pi_3), \; \mathrm{el}(\pi_1, \pi_2), \mathrm{el}(\pi_2, \pi_3),$$

where $\mathrm{el}(\pi, \pi')$ is a shorthand for $\left( \bigcup_{a, b \in \Sigma} (a, b) \right)^* (\pi, \pi')$.

Approximate matching and sequence alignment. We treat approximate string matching and (biological) sequence alignment together because both are based on the notion of edit distance between strings. We consider the three edit operations of insertion, deletion and substitution, defined as follows. Applying an edit operation to a string $x \in \Sigma^*$ yielding $y \in \Sigma^*$ can be modeled as a binary relation $\rightsquigarrow$ over $\Sigma^*$ such that $x \rightsquigarrow y$ holds iff there exist $u, v \in \Sigma^*$, $a, b \in \Sigma$, with $a \neq b$, such that one of the following is satisfied:

$x = uav$, $y = ubv$ (substitution)

$x = uav$, $y = uv$ (deletion)

$x = uv$, $y = ubv$ (insertion)

Let $\rightsquigarrow^k$ stand for the composition of $\rightsquigarrow$ with itself $k$ times. The edit distance $d_e(x, y)$ between $x$ and $y$ is the minimum number $k$ of edit operations such that $x \rightsquigarrow^k y$. We define a relation $D_{\leq k}$ between strings as follows: $(x, y) \in D_{\leq k}$ iff $d_e(x, y) \leq k$. This relation is regular (indeed, it is easy to see that it is accepted by a two-tape transducer, and the difference between the lengths of $x$ and $y$ is bounded by $k$; then it follows from the fact that rational relations of such bounded distance are regular [18]).
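For concrete strings, membership in $D_{\leq k}$ can be tested with the textbook dynamic program for edit distance under unit-cost insertion, deletion and substitution. The Python sketch below uses that standard algorithm, not the automaton construction alluded to above; the function names `edit_distance` and `in_d_le_k` are chosen here for illustration.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning x into y."""
    # prev[j] holds the distance between the current prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        cur = [i]  # distance from x[:i] to the empty prefix of y
        for j, b in enumerate(y, start=1):
            cur.append(
                min(
                    prev[j] + 1,             # delete a from x
                    cur[j - 1] + 1,          # insert b into x
                    prev[j - 1] + (a != b),  # substitute a by b, or keep a
                )
            )
        prev = cur
    return prev[-1]


def in_d_le_k(x: str, y: str, k: int) -> bool:
    """Membership of the pair (x, y) in the relation D_{<=k}."""
    return edit_distance(x, y) <= k
```

The program runs in time proportional to $|x| \cdot |y|$ and keeps only two rows of the table in memory.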

We now consider the use of edit distance in ﬁnding
