Reasoning about pattern-based XML queries

Abstract
We survey results about static analysis of pattern-based queries over XML documents. These queries are analogs of conjunctive queries, their unions and Boolean combinations, in which tree patterns play the role of atomic formulae. As in the relational case, they can be viewed as both queries and incomplete documents, and thus static analysis problems can also be viewed as finding certain answers of queries over such documents. We look at satisfiability of patterns under schemas, containment of queries for various features of XML used in queries, finding certain answers, and applications of pattern-based queries in reasoning about schema mappings for data exchange.

Amélie Gheerbrant¹, Leonid Libkin¹, and Cristina Sirangelo²
¹ School of Informatics, University of Edinburgh
² LSV, ENS-Cachan INRIA & CNRS
1 Introduction
Due to the complicated hierarchical structure of XML documents and the many ways in
which it can interact with data, reasoning about XML data has become an active area of
research, and many papers dealing with various aspects of static analysis of XML have
appeared; see, e.g., [1, 6, 12, 16–18, 24, 26, 27, 29].
As most querying tasks for XML have to do with navigation through documents,
reasoning/static analysis tasks deal with mechanisms for specifying interaction between
navigation, data, as well as schemas of documents. Navigation mechanisms that are
studied are largely of two kinds: they either describe paths through documents (most
commonly using the navigational language XPath), or they describe tree patterns.
A tree pattern presents a partial description of a tree, along with some variables that
can be assigned values as a pattern is matched to a complete document. For instance, a
pattern a(x)[b(x), c(y)] describes a tree with the root labeled a and two children labeled
b and c; these carry data values, so that those in the a-node and the b-node are the same.
This pattern matches a tree with root a and children b and c with all of them having data
value 1, for instance; not only that, such a match produces the tuple (1, 1) of data values
witnessing the match. On the other hand, if in the tree the b and the c nodes carry value
2, there is no longer a match.
We deal with patterns that are naturally tree-shaped. This is in contrast with some
of the patterns appearing in the literature [9, 10] that can take the shape of arbitrary
graphs (for instance, such a pattern can say that we have an a-node that has b and c
descendants, which in turn have the same d-descendant: this describes a directed acyclic
graph rather than a tree). In many XML applications it is quite natural to use tree-shaped
patterns though. For example, patterns used in specifying mappings between schemas
(as needed in data integration and exchange applications) are such [3, 5, 7]. It is also
natural to use them for defining queries [4, 26] as well as for specifying incomplete
XML data [8].
In database theory, there is a well-known duality between partial descriptions of
databases (or databases with incomplete information), and conjunctive queries. Like-
wise for us, patterns can also be viewed as basic queries: in the above example, the
pattern returns pairs (x, y) of data values. Viewing patterns as atomic formulas, we
can close them under conjunction, disjunction, and quantification, obtaining analogs of
relational conjunctive queries and their unions, for instance.
The main reasoning task we deal with is containment of queries. There are three
main reasons for studying this question.
- Containment is the most basic query optimization task. Indeed, the goal of query
  optimization is to replace a given query with a more efficient but equivalent one;
  testing equivalence of course amounts to testing two containment statements.
- Containment can be viewed as finding certain answers over incomplete databases,
  using the duality between queries and patterns. A pattern π describes an incomplete
  database; if, viewed as a query, it is contained in a query Q, then the certain answer
  to Q over π is true, and the converse also holds. This correspondence is well known
  for both relations and XML.
- Finally, containment is the critical task in data integration, specifically in query
  rewriting using views [22]. When a query needs to be rewritten over the source
  database, the correctness of a rewriting is verified by checking query containment.
The plan of the survey is as follows. We first explain the basic relevant notions in
the relational case, particularly the pattern/query duality and the connection with incom-
plete information. We then define tree patterns, present their classification, and explain
the notion of satisfaction in data trees, i.e., labeled trees in which nodes can carry data
values. After that we deal with the basic pattern analysis problem: their satisfiability.
Given that patterns are tree-shaped, satisfiability per se is trivial, but we handle it in the
presence of a schema (typically given by an automaton).
We then introduce pattern-based queries, specifically analogs of conjunctive
queries, their unions, and Boolean combinations, and survey results on their
containment. Using those results, we derive bounds on finding certain answers for queries over
incomplete documents. Finally, we deal with reasoning tasks for pattern-based schema
mappings, which also rely on a form of containment statement.
2 Relational patterns and pattern-based queries
Tableaux and naïve databases. Relational patterns are known under the name of
tableaux if one views them as queries, and as naïve tables if one views them as data. The
instance below on the left is a usual relation, and the one on the right is a tableau/naïve
table:

    1 2 3 4        1 x 3 z
    5 6 7 8        5 y 7 x

Some of the constant entries in relations can be replaced by variables in tableaux.
Formally, we have two domains, $C$ of constants and $V$ of variables, and a relational
vocabulary $\sigma$. A relational instance is an instance of $\sigma$ over $C$, and a naïve database is an
instance over $C \cup V$. In the case of a single relation, we talk about naïve tables rather than
naïve databases.
A tableau has a list of variables, among those used in it, selected as 'distinguished'
variables; that is, formally it is a pair $(D, \bar{x})$, where $D$ is a naïve database and $\bar{x}$ is a
tuple of variables among those mentioned in $D$.
As we already mentioned, there is a natural duality between incomplete databases
and conjunctive queries. Each tableau $(D, \bar{x})$ can be viewed as a query
$Q_D(\bar{x}) = \exists \bar{y} \bigwedge D$, where $\bar{y}$ is the list of variables in $D$ except $\bar{x}$, and
$\bigwedge D$ is the conjunction of all the facts in $D$. For instance, if $D$ is the naïve table
in the above picture, the query associated with $(D, x)$ is
$Q(x) = \exists y \exists z \; D(1, x, 3, z) \wedge D(5, y, 7, x)$. Likewise, every
conjunctive query $Q$ has a tableau $\mathrm{tab}(Q)$ which is obtained by viewing conjuncts in it
as a database, and making the list of free variables its distinguished variables.
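As a rough illustration of this duality, the following Python sketch renders the query associated with a tableau as text. The encoding (strings for variables, integers for constants) and the helper name tableau_to_query are assumptions made for the example; they are not notation from the survey.

```python
def is_var(term):
    # Assumption of this sketch: variables are strings, constants are integers.
    return isinstance(term, str)

def tableau_to_query(rel_name, rows, distinguished):
    """Render the conjunctive query associated with the tableau (rows, distinguished)."""
    variables = []
    for row in rows:
        for term in row:
            if is_var(term) and term not in variables:
                variables.append(term)
    # Every variable that is not distinguished gets existentially quantified.
    bound = sorted(v for v in variables if v not in distinguished)
    atoms = [f"{rel_name}({', '.join(str(t) for t in row)})" for row in rows]
    prefix = "".join(f"exists {v} " for v in bound)
    return f"Q({', '.join(distinguished)}) = {prefix}{' AND '.join(atoms)}"

# The naive table from the picture, with x as the only distinguished variable.
naive_table = [(1, "x", 3, "z"), (5, "y", 7, "x")]
query_text = tableau_to_query("D", naive_table, ["x"])
print(query_text)
```

Going in the other direction (from a query to its tableau) would simply collect the atoms back into rows and keep the free variables as the distinguished ones.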
Homomorphisms. A key notion for naïve databases and tableaux is that of a homomorphism.
Given two naïve databases $D_1$ and $D_2$, a homomorphism $h$ between them is a
mapping $h$ from $V$ to $C \cup V$ defined on all the variables in $D_1$ so that, if $R$ is a relation
symbol in the vocabulary and $\bar{a}$ is a tuple in the relation $R$ in $D_1$, then $h(\bar{a})$ is a tuple
in the relation $R$ in $D_2$. Of course $h(a_1, \dots, a_n)$ stands for $(h(a_1), \dots, h(a_n))$, and we
assume $h(c) = c$ whenever $c \in C$.
If $h$ is a homomorphism from $D_1$ to $D_2$, we write $h : D_1 \to D_2$. Such a map is a
homomorphism of two tableaux $(D_1, \bar{x}_1)$ and $(D_2, \bar{x}_2)$ if, in addition, $h(\bar{x}_1) = \bar{x}_2$. If
we need to state that there is a homomorphism, but it is not important to name it, we
will simply write $D_1 \to D_2$.
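The definition can be made concrete with a small brute-force search, sketched below in Python. The representation (a database as a dict from relation names to sets of tuples, variables as strings, constants as integers) and the function name find_homomorphism are assumptions of this sketch; the exhaustive search is exponential and is meant only to mirror the definition.

```python
from itertools import product

def is_var(term):
    # Assumption of this sketch: variables are strings, constants are integers.
    return isinstance(term, str)

def find_homomorphism(d1, d2):
    """Return a dict h witnessing D1 -> D2, or None if no homomorphism exists."""
    variables = sorted({t for rows in d1.values() for row in rows
                        for t in row if is_var(t)})
    # Images can be restricted to terms occurring in D2, since h(row) must lie in D2.
    targets = sorted({t for rows in d2.values() for row in rows for t in row}, key=str)
    for image in product(targets, repeat=len(variables)):
        h = dict(zip(variables, image))

        def apply(t):
            return h[t] if is_var(t) else t   # h(c) = c for constants

        if all(tuple(apply(t) for t in row) in d2.get(rel, set())
               for rel, rows in d1.items() for row in rows):
            return h
    return None

naive = {"D": {(1, "x", 3, "z"), (5, "y", 7, "x")}}
complete_a = {"D": {(1, 2, 3, 4), (5, 6, 7, 2)}}
complete_b = {"D": {(1, 2, 3, 4), (5, 6, 7, 8)}}   # the relation from the picture

h_a = find_homomorphism(naive, complete_a)
h_b = find_homomorphism(naive, complete_b)
```

In this example a homomorphism into complete_a exists, while none exists into complete_b, because x would have to be mapped to both 2 and 8.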
Homomorphisms can also be used to give semantics of incomplete databases. It is
assumed that a naïve database $D$ represents all complete databases $D'$ (i.e., databases
over $C$) such that there is a homomorphism $h : D \to D'$. The set of all such $D'$ is
denoted by $[\![D]\!]$.
Note that the satisfiability problem for relational patterns expressed via naïve
databases (whether the set $[\![D]\!]$ is not empty) is trivial: the answer is always yes. In
the presence of constraints on the schema it can become a fairly complicated problem,
sometimes even undecidable.
Containment. Containment asks if for two queries, $Q_1$ and $Q_2$, the result of $Q_1$ is
contained in the result of $Q_2$ on every input; equivalence asks if the results are always
the same. We write $Q_1 \subseteq Q_2$ and $Q_1 = Q_2$ to denote containment and equivalence. Of
course equivalence is just a special case of containment: $Q_1 = Q_2$ iff $Q_1 \subseteq Q_2$ and
$Q_2 \subseteq Q_1$.
The containment problem for conjunctive queries is solved via homomorphisms.
Given two conjunctive queries $Q_1$ and $Q_2$, we have $Q_1 \subseteq Q_2$ iff there is a
homomorphism $h : \mathrm{tab}(Q_2) \to \mathrm{tab}(Q_1)$; this makes the problem NP-complete [13].
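The homomorphism criterion translates directly into a (worst-case exponential) decision procedure. The Python sketch below is one possible rendering under stated assumptions: a query is a pair (head, atoms), variables are strings, constants are integers, and every head variable also occurs in the body. The name contained_in is hypothetical.

```python
from itertools import product

def is_var(term):
    # Assumption of this sketch: variables are strings, constants are integers.
    return isinstance(term, str)

def contained_in(q1, q2):
    """Decide Q1 ⊆ Q2 by brute-force search for a homomorphism tab(Q2) -> tab(Q1).

    A query is a pair (head, atoms): head is a tuple of distinguished variables,
    atoms is a list of (relation_name, argument_tuple).
    """
    head1, atoms1 = q1
    head2, atoms2 = q2
    if len(head1) != len(head2):
        return False
    body1 = set(atoms1)
    # Terms of tab(Q1) are the candidate images; its variables act as plain values.
    targets = sorted({t for _, args in atoms1 for t in args}, key=str)
    vars2 = sorted({t for _, args in atoms2 for t in args if is_var(t)})
    for image in product(targets, repeat=len(vars2)):
        h = dict(zip(vars2, image))

        def rename(t):
            return h.get(t, t)   # constants are mapped to themselves

        head_ok = tuple(rename(t) for t in head2) == tuple(head1)
        body_ok = all((rel, tuple(rename(t) for t in args)) in body1
                      for rel, args in atoms2)
        if head_ok and body_ok:
            return True
    return False

# Q1(x) = exists y E(x, y) AND E(y, x);  Q2(x) = exists y E(x, y)
q1 = (("x",), [("E", ("x", "y")), ("E", ("y", "x"))])
q2 = (("x",), [("E", ("x", "y"))])
```

Here Q1 is contained in Q2 but not conversely: the atom E(y, x) of tab(Q1) has no image in tab(Q2) once x is fixed.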
In addition to conjunctive queries (sometimes abbreviated as CQs), we shall
consider their unions and Boolean combinations. The former class, sometimes denoted
by UCQs, is obtained by closing CQs under union (i.e., if $Q_1, Q_2$ are UCQs producing
relations of the same arity, then $Q_1 \cup Q_2$ is a UCQ). For Boolean combinations of
conjunctive queries (abbreviated BCCQs), the additional closure rules are that $Q_1 \cap Q_2$,
$Q_1 \cup Q_2$, and $Q_1 - Q_2$ are BCCQs.
For these classes containment is still decidable, and the complexity stays in NP for
UCQs given explicitly as unions of CQs, and goes up to $\Pi^p_2$-complete for BCCQs [28].
Certain answers and naïve evaluation. Now suppose we have a naïve database $D$ and
a query $Q$; assume that $Q$ is Boolean. The standard notion of answering a query on an
incomplete database is that of certain answers:
\[ \mathrm{certain}(Q, D) = \bigwedge \{ Q(D') \mid D' \in [\![D]\!] \} \]
Let $Q$ be a conjunctive query. Then, for an arbitrary database $D'$, we have $D' \models Q$ iff
there is a homomorphism $h : \mathrm{tab}(Q) \to D'$. Thus, for an incomplete database $D$, we
have the following easy equivalences:
\[ \mathrm{certain}(Q, D) = \mathrm{true} \;\Leftrightarrow\; \forall D' \in [\![D]\!] : \mathrm{tab}(Q) \to D' \;\Leftrightarrow\; \mathrm{tab}(Q) \to D \;\Leftrightarrow\; D \models Q \]
Thus, to compute certain answers, all one needs to do is to run a query on the incomplete
database itself. This is referred to as naïve evaluation. Note that the data complexity of
finding certain answers is tractable, as it is the same as evaluation of conjunctive queries.
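A minimal Python sketch of naïve evaluation for Boolean conjunctive queries follows. It treats the variables (nulls) of the naïve database as ordinary values and looks for a match of the query atoms by backtracking. The conventions are assumptions of the sketch: query variables are strings starting with "?", nulls are other strings, constants are integers, and naive_eval is a hypothetical name.

```python
def is_query_var(term):
    # Assumption of this sketch: query variables are strings starting with "?".
    return isinstance(term, str) and term.startswith("?")

def naive_eval(query_atoms, db, binding=None):
    """True iff the Boolean CQ holds in db when nulls are treated as plain values."""
    binding = binding or {}
    if not query_atoms:
        return True
    (rel, args), rest = query_atoms[0], query_atoms[1:]
    for row in db.get(rel, ()):
        if len(row) != len(args):
            continue
        extended = dict(binding)
        consistent = True
        for arg, value in zip(args, row):
            if is_query_var(arg):
                # Bind the variable, or check it against its earlier binding.
                if extended.setdefault(arg, value) != value:
                    consistent = False
                    break
            elif arg != value:
                consistent = False
                break
        if consistent and naive_eval(rest, db, extended):
            return True
    return False

# The naive table from Section 2; "x", "y", "z" are nulls.
db = {"D": [(1, "x", 3, "z"), (5, "y", 7, "x")]}
q_certain = [("D", (1, "?u", 3, "?v")), ("D", (5, "?w", 7, "?u"))]
q_not_certain = [("D", (1, "?u", 3, "?u"))]
```

By the equivalences above, naive_eval(q_certain, db) being true means the certain answer is true, while the second query fails because the nulls x and z are distinct values in the naïve table (and indeed some complete database in its semantics assigns them different constants).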
The fact that naïve evaluation works for Boolean conjunctive queries extends in two
ways: to UCQs, and to queries with free variables [21]. In some way (for the semantics
we considered) the result is optimal within the class of relational algebra queries [23].
In particular, naïve evaluation does not work for BCCQs (even though it was shown
recently that data complexity of finding certain answers for BCCQs remains polynomial
[19]).
3 Trees, patterns
3.1 Data trees
Data trees provide a standard abstraction of XML documents with data. First we define
their structural part, namely unranked trees. A finite unranked tree domain is a
non-empty, prefix-closed finite subset $D$ of $\mathbb{N}^*$ (words over $\mathbb{N}$) such that $s \cdot i \in D$ implies
$s \cdot j \in D$ for all $j < i$ and $s \in \mathbb{N}^*$. Elements of unranked tree domains are called nodes.
We assume a countably infinite set $L$ of possible labels that can be used to label tree
nodes. An unranked tree is a structure $\langle D, \downarrow, \to, \lambda \rangle$, where
- $D$ is a finite unranked tree domain,
- $\downarrow$ is the child relation: $s \downarrow s \cdot i$ for $s \cdot i \in D$,
- $\to$ is the next-sibling relation: $s \cdot i \to s \cdot (i + 1)$ for $s \cdot (i + 1) \in D$, and
- $\lambda : D \to L$ is the labeling function assigning a label to each node.
We denote the reflexive-transitive closure of $\downarrow$ by $\downarrow^*$ (descendant-or-self), and the
reflexive-transitive closure of $\to$ by $\to^*$ (following-sibling-or-self).
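To make the definition of tree domains and the two basic relations concrete, here is a small Python sketch. It assumes nodes are encoded as tuples of natural numbers (the root being the empty tuple) and that children are numbered from 0; the function names are hypothetical.

```python
def is_tree_domain(nodes):
    """Check that a set of integer tuples is a non-empty unranked tree domain."""
    nodes = set(nodes)
    if not nodes:
        return False
    for s in nodes:
        if not s:
            continue                      # the root, i.e. the empty word
        parent, i = s[:-1], s[-1]
        if parent not in nodes:           # prefix-closed (checking parents suffices)
            return False
        if i > 0 and parent + (i - 1,) not in nodes:
            return False                  # s.i in D requires s.j in D for all j < i
    return True

def children(nodes, s):
    """All t with s (child) t, in sibling order."""
    return sorted(n for n in nodes if len(n) == len(s) + 1 and n[:len(s)] == s)

def next_sibling(nodes, s):
    """The node t with s (next-sibling) t, or None if s is a last child or the root."""
    if not s:
        return None
    candidate = s[:-1] + (s[-1] + 1,)
    return candidate if candidate in nodes else None

def is_descendant_or_self(s, d):
    """Reflexive-transitive closure of the child relation: s is a prefix of d."""
    return d[:len(s)] == s

domain = {(), (0,), (1,), (1, 0)}
```

With this encoding, the descendant-or-self relation reduces to a prefix test, which is why no explicit closure computation is needed.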
In data trees, nodes can carry not only labels but also data values. Given a domain $C$
of data values (e.g., strings, numbers, etc.), a data tree is a structure $t = \langle D, \downarrow, \to, \lambda, \rho \rangle$,
where $\langle D, \downarrow, \to, \lambda \rangle$ is an unranked tree, and $\rho : D \to C$ assigns each node a data value.
Note that in XML documents, nodes may have multiple attributes, but this is easily
modeled with data trees.

3.2 Patterns
To explain our approach to defining tree-shaped patterns, consider first data trees
restricted just to the child relation, i.e., structures $\langle D, \downarrow, \lambda, \rho \rangle$. They can be defined
recursively: a node labeled with $a \in L$ and carrying a data value $v \in C$ is a data tree, and
if $t_1, \dots, t_n$ are trees, we can form a new tree by making them children of a node with
label $a$ and data value $v$.
Just like in the relational case, patterns can also use variables from $V$. So our
simplest case of patterns is defined as:
\[ \pi := a(x)[\pi, \dots, \pi] \tag{1} \]
with $a \in L$ and $x \in C \cup V$. Here the sequence in $[\dots]$ could be empty. In other words,
if $\pi_1, \dots, \pi_n$ is a sequence of patterns (perhaps empty), $a \in L$ and $x \in C \cup V$, then
$a(x)[\pi_1, \dots, \pi_n]$ is a pattern. If $\bar{x}$ is the list of all the variables used in a pattern $\pi$, we
write $\pi(\bar{x})$.
We denote patterns from this class by $\mathrm{PAT}(\downarrow)$. As with conjunctive queries, the
semantics can be defined via homomorphisms of their tree representations [8, 19], but
here we give it in a different, direct way. The semantics of $\pi(\bar{x})$ is defined with respect
to a data tree $t = \langle D, \downarrow, \to, \lambda, \rho \rangle$, a node $s \in D$, and a valuation $\nu : \bar{x} \to C$ as follows:
$(t, s, \nu) \models a(x)[\pi_1(\bar{x}_1), \dots, \pi_n(\bar{x}_n)]$ iff
- $\lambda(s) = a$ (the label of $s$ is $a$);
- $\rho(s) = \begin{cases} \nu(x) & \text{if } x \text{ is a variable} \\ x & \text{if } x \text{ is a data value;} \end{cases}$
- there exist not necessarily distinct children $s \cdot i_1, \dots, s \cdot i_n$ of $s$ so that
  $(t, s \cdot i_j, \nu) \models \pi_j(\bar{x}_j)$ for each $j \le n$ (if $n = 0$, this last item is not needed).
We write $(t, \nu) \models \pi(\bar{x})$ if there is a node $s$ so that $(t, s, \nu) \models \pi(\bar{x})$ (i.e., a pattern
is matched somewhere in the tree). Also if $\bar{v} = \nu(\bar{x})$, we write $t \models \pi(\bar{v})$ instead of
$(t, \nu) \models \pi(\bar{x})$. We also write $\pi(t)$ for the set $\{\bar{v} \mid t \models \pi(\bar{v})\}$.
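The direct semantics of child-only patterns lends itself to a short recursive matcher. The Python sketch below computes the set of answer tuples for a pattern over a data tree, using the example pattern a(x)[b(x), c(y)] from the introduction. The classes Node and Pat, the convention that strings are variables and integers are data values, and the function names are all assumptions of this sketch rather than constructs from the survey.

```python
from dataclasses import dataclass, field

@dataclass
class Node:                      # a data tree node: label, data value, children
    label: str
    value: int
    children: list = field(default_factory=list)

@dataclass
class Pat:                       # a pattern a(x)[pi_1, ..., pi_n]
    label: str
    term: object                 # str = variable, anything else = data value
    subs: list = field(default_factory=list)

def match_at(node, pat, nu):
    """Yield every extension of valuation nu under which pat is satisfied at node."""
    if node.label != pat.label:
        return
    if isinstance(pat.term, str):
        if nu.get(pat.term, node.value) != node.value:
            return
        nu = {**nu, pat.term: node.value}
    elif pat.term != node.value:
        return
    yield from match_subs(node, pat.subs, nu)

def match_subs(node, subs, nu):
    if not subs:
        yield nu
        return
    for child in node.children:  # witnesses for different subpatterns may coincide
        for nu1 in match_at(child, subs[0], nu):
            yield from match_subs(node, subs[1:], nu1)

def all_nodes(node):
    yield node
    for child in node.children:
        yield from all_nodes(child)

def variables(pat, seen=None):
    seen = [] if seen is None else seen
    if isinstance(pat.term, str) and pat.term not in seen:
        seen.append(pat.term)
    for sub in pat.subs:
        variables(sub, seen)
    return seen

def answers(tree, pat):
    """The set of tuples nu(x-bar) such that the pattern matches somewhere in tree."""
    xs = variables(pat)
    return {tuple(nu[x] for x in xs)
            for s in all_nodes(tree) for nu in match_at(s, pat, {})}

pattern = Pat("a", "x", [Pat("b", "x"), Pat("c", "y")])
tree_ok = Node("a", 1, [Node("b", 1), Node("c", 1)])
tree_bad = Node("a", 1, [Node("b", 2), Node("c", 2)])
```

On tree_ok the pattern produces the single tuple (1, 1), and on tree_bad it produces no tuple, in line with the example discussed in the introduction.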
A natural extension for these simple patterns is to include both vertical and horizontal
navigation, resulting in the class $\mathrm{PAT}(\downarrow, \to)$:
\[ \begin{aligned} \pi &:= a(x)[\mu, \dots, \mu] \\ \mu &:= \pi \to \dots \to \pi \end{aligned} \tag{2} \]
with $a \in L$ and $x \in C \cup V$ (and the sequences, as before, could be empty). The semantics
is given by:
- $(t, s, \nu) \models a(x)[\mu_1(\bar{x}_1), \dots, \mu_n(\bar{x}_n)]$ if $a(x)$ is satisfied in $s$ by $\nu$ as before and
  there exist not necessarily distinct children $s \cdot i_1, \dots, s \cdot i_n$ of $s$ so that
  $(t, s \cdot i_j, \nu) \models \mu_j(\bar{x}_j)$ for each $j \le n$.
- $(t, s, \nu) \models \pi_1(\bar{x}_1) \to \dots \to \pi_m(\bar{x}_m)$ if there exist consecutive siblings
  $s_1 \to s_2 \to \dots \to s_m$, with $s_1 = s$, so that $(t, s_i, \nu) \models \pi_i(\bar{x}_i)$ for each $i \le m$.
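The effect of the next-sibling connective can be seen in a small Python sketch that extends the recursive matching idea to sequences of consecutive siblings. The encoding is an assumption made for illustration: a tree node is a triple (label, value, children), a pattern is a triple (label, term, sequences) where each sequence is a non-empty list of patterns joined by next-sibling, strings are variables, and integers are data values. All function names are hypothetical.

```python
def match_node(node, pat, nu):
    """Yield extensions of nu under which pat is satisfied at node."""
    label, value, kids = node
    p_label, term, seqs = pat
    if label != p_label:
        return
    if isinstance(term, str):                 # variable
        if nu.get(term, value) != value:
            return
        nu = {**nu, term: value}
    elif term != value:                       # data value
        return
    yield from match_seqs(kids, seqs, nu)

def match_seqs(kids, seqs, nu):
    """Each sequence mu_j must be matched starting at some child (not necessarily distinct)."""
    if not seqs:
        yield nu
        return
    first, rest = seqs[0], seqs[1:]
    for start in range(len(kids) - len(first) + 1):
        for nu1 in match_run(kids, start, first, nu):
            yield from match_seqs(kids, rest, nu1)

def match_run(kids, start, seq, nu):
    """Match pi_1 -> ... -> pi_m against consecutive siblings beginning at index start."""
    if not seq:
        yield nu
        return
    for nu1 in match_node(kids[start], seq[0], nu):
        yield from match_run(kids, start + 1, seq[1:], nu1)

def satisfied_somewhere(tree, pat):
    """True iff the pattern is satisfied at some node of the tree, for some valuation."""
    stack = [tree]
    while stack:
        node = stack.pop()
        if next(match_node(node, pat, {}), None) is not None:
            return True
        stack.extend(node[2])
    return False

ordered = ("a", "x", [[("b", "x", []), ("c", "y", [])]])       # a(x)[b(x) -> c(y)]
unordered = ("a", "x", [[("b", "x", [])], [("c", "y", [])]])   # a(x)[b(x), c(y)]

t1 = ("a", 1, [("b", 1, []), ("c", 2, [])])
t2 = ("a", 1, [("c", 2, []), ("b", 1, [])])
t3 = ("a", 1, [("b", 1, []), ("d", 0, []), ("c", 2, [])])
```

In this example the pattern with the next-sibling connective is satisfied only in t1, where b and c are adjacent and in that order, whereas the child-only pattern is satisfied in all three trees.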
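Next we consider more expressive versions with transitive closure axes $\downarrow^*$
(descendant) and $\to^*$ (following sibling). As in [3, 19], we define general patterns by the rules:
\[ \begin{aligned} \pi &:= a(x)[\mu, \dots, \mu]//[\mu, \dots, \mu] \\ \mu &:= \pi \rightsquigarrow \dots \rightsquigarrow \pi \end{aligned} \tag{3} \]
where each $\rightsquigarrow$ is either $\to$ or $\to^*$.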
