(Open Access) Reasoning about pattern-based XML queries (2013) | Amélie Gheerbrant

Reasoning About Pattern-Based XML Queries

Amélie Gheerbrant, Leonid Libkin, and Cristina Sirangelo

School of Informatics, University of Edinburgh

LSV, ENS-Cachan INRIA & CNRS

Abstract. We survey results about static analysis of pattern-based queries over XML documents. These queries are analogs of conjunctive queries, their unions and Boolean combinations, in which tree patterns play the role of atomic formulae. As in the relational case, they can be viewed as both queries and incomplete documents, and thus static analysis problems can also be viewed as finding certain answers of queries over such documents. We look at satisfiability of patterns under schemas, containment of queries for various features of XML used in queries, finding certain answers, and applications of pattern-based queries in reasoning about schema mappings for data exchange.

1 Introduction

Due to the complicated hierarchical structure of XML documents and the many ways in which it can interact with data, reasoning about XML data has become an active area of research, and many papers dealing with various aspects of static analysis of XML have appeared; see, e.g., [1, 6, 12, 16–18, 24, 26, 27, 29].

As most querying tasks for XML have to do with navigation through documents, reasoning/static analysis tasks deal with mechanisms for specifying interaction between navigation, data, as well as schemas of documents. Navigation mechanisms that are studied are largely of two kinds: they either describe paths through documents (most commonly using the navigational language XPath), or they describe tree patterns.

A tree pattern presents a partial description of a tree, along with some variables that can be assigned values as a pattern is matched to a complete document. For instance, a pattern a(x)[b(x), c(y)] describes a tree with the root labeled a and two children labeled b and c; these carry data values, so that those in the a-node and the b-node are the same. This pattern matches a tree with root a and children b and c with all of them having data value 1, for instance; not only that, such a match produces the tuple (1, 1) of data values witnessing the match. On the other hand, if in that tree the b and the c nodes carry value 2 instead (while the a-node still carries 1), there is no longer a match, since the a-node and the b-node must agree.

We deal with patterns that are naturally tree-shaped. This is in contrast with some of the patterns appearing in the literature [9, 10] that can take the shape of arbitrary graphs (for instance, such a pattern can say that we have an a-node that has b and c descendants, which in turn have the same d-descendant: this describes a directed acyclic graph rather than a tree). In many XML applications it is quite natural to use tree-shaped patterns though. For example, patterns used in specifying mappings between schemas (as needed in data integration and exchange applications) are such [3, 5, 7]. It is also natural to use them for defining queries [4, 26] as well as for specifying incomplete XML data [8].

In database theory, there is a well-known duality between partial descriptions of databases (or databases with incomplete information), and conjunctive queries. Likewise for us, patterns can also be viewed as basic queries: in the above example, the pattern returns pairs (x, y) of data values. Viewing patterns as atomic formulas, we can close them under conjunction, disjunction, and quantification, obtaining analogs of relational conjunctive queries and their unions, for instance.

The main reasoning task we deal with is containment of queries. There are three main reasons for studying this question.

– Containment is the most basic query optimization task. Indeed, the goal of query optimization is to replace a given query with a more efficient but equivalent one; testing equivalence of course amounts to testing two containment statements.

– Containment can be viewed as finding certain answers over incomplete databases, using the duality between queries and patterns. A pattern π describes an incomplete database; if, viewed as a query, it is contained in a query Q, then the certain answer to Q over π is true, and the converse also holds. This correspondence is well known for both relations and XML.

– Finally, containment is the critical task in data integration, specifically in query rewriting using views [22]. When a query needs to be rewritten over the source database, the correctness of a rewriting is verified by checking query containment.

The plan of the survey is as follows. We first explain the basic relevant notions in the relational case, particularly the pattern/query duality and the connection with incomplete information. We then define tree patterns, present their classification, and explain the notion of satisfaction in data trees, i.e., labeled trees in which nodes can carry data values. After that we deal with the basic pattern analysis problem: their satisfiability. Given that patterns are tree-shaped, satisfiability per se is trivial, but we handle it in the presence of a schema (typically given by an automaton).

We then introduce pattern-based queries, specifically analogs of conjunctive queries, their unions, and Boolean combinations, and survey results on their containment. Using those results, we derive bounds on finding certain answers for queries over incomplete documents. Finally, we deal with reasoning tasks for pattern-based schema mappings, which also rely on a form of containment statement.

2 Relational patterns and pattern-based queries

Tableaux and naïve databases. Relational patterns are known under the name of tableaux if one views them as queries, and as naïve tables if one views them as data. The instance below on the left is a usual relation, and the one on the right is a tableau/naïve table:

    1 2 3 4        1 x 3 z
    5 6 7 8        5 y 7 x

Some of the constant entries in relations can be replaced by variables in tableaux. Formally, we have two domains, $C$ of constants and $V$ of variables, and a relational vocabulary $\sigma$. A relational instance is an instance of $\sigma$ over $C$, and a naïve database is an instance over $C \cup V$. In case of a single relation, we talk about naïve tables rather than naïve databases.

A tableau has a list of variables, among those used in it, selected as ‘distinguished’ variables; that is, formally it is a pair $(D, \bar{x})$, where $D$ is a naïve database and $\bar{x}$ is a tuple of variables among those mentioned in $D$.

As we already mentioned, there is a natural duality between incomplete databases and conjunctive queries. Each tableau $(D, \bar{x})$ can be viewed as a query $Q(\bar{x}) = \exists \bar{y} \bigwedge D$, where $\bar{y}$ is the list of variables in $D$ except $\bar{x}$, and $\bigwedge D$ is the conjunction of all the facts in $D$. For instance, if $D$ is the naïve table in the above picture, the query associated with $(D, x)$ is $Q(x) = \exists y \exists z\, D(1, x, 3, z) \wedge D(5, y, 7, x)$. Likewise, every conjunctive query $Q$ has a tableau $\mathrm{tab}(Q)$ which is obtained by viewing conjuncts in it as a database, and making the list of free variables its distinguished variables.
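
The tableau-to-query direction of this duality can be illustrated with a minimal Python sketch. The representation is an assumption made only for this example: rows are tuples, variables are strings, constants are integers, and the function name `tableau_to_query` is hypothetical.

```python
def tableau_to_query(relation, rows, distinguished):
    """Render the conjunctive query associated with a tableau (rows, distinguished).

    Assumption for this sketch: variables are strings, constants are ints.
    """
    variables = {term for row in rows for term in row if isinstance(term, str)}
    # Every variable that is not distinguished is existentially quantified.
    bound = sorted(variables - set(distinguished))
    prefix = "".join(f"∃{v}" for v in bound)
    # One conjunct per fact of the naive table.
    body = " ∧ ".join(
        f"{relation}({', '.join(str(term) for term in row)})" for row in rows
    )
    head = f"Q({', '.join(distinguished)})"
    return f"{head} = {prefix} {body}" if bound else f"{head} = {body}"


naive_table = [(1, "x", 3, "z"), (5, "y", 7, "x")]
print(tableau_to_query("D", naive_table, ["x"]))
```

On the naïve table from the picture with `x` distinguished, this prints the same query as in the text, `Q(x) = ∃y∃z D(1, x, 3, z) ∧ D(5, y, 7, x)`.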

Homomorphisms. A key notion for naïve databases and tableaux is that of a homomorphism. Given two naïve databases $D_1$ and $D_2$, a homomorphism $h$ between them is a mapping $h$ from $V$ to $C \cup V$ defined on all the variables in $D_1$ so that, if $R$ is a relation symbol in the vocabulary and $\bar{a}$ is a tuple in the relation $R$ in $D_1$, then $h(\bar{a})$ is a tuple in the relation $R$ in $D_2$. Of course $h(a_1, \ldots, a_n)$ stands for $(h(a_1), \ldots, h(a_n))$, and we assume $h(c) = c$ whenever $c \in C$.

If $h$ is a homomorphism from $D_1$ to $D_2$, we write $h : D_1 \to D_2$. Such a map is a homomorphism of two tableaux $(D_1, \bar{x}_1)$ and $(D_2, \bar{x}_2)$ if, in addition, $h(\bar{x}_1) = \bar{x}_2$. If we need to state that there is a homomorphism, but it is not important to name it, we will simply write $D_1 \to D_2$.
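
As an illustration of the definition, the following Python sketch checks whether a given mapping is a homomorphism between two naïve databases. The encoding is an assumption of this example only: a database is a dict from relation names to sets of tuples, variables are strings, constants are integers, and `is_homomorphism` is a hypothetical helper name.

```python
def is_var(term):
    # Assumption for this sketch: variables are strings, constants are ints.
    return isinstance(term, str)


def is_homomorphism(h, d1, d2):
    """Check that h maps every fact of d1 to a fact of d2.

    h is a dict defined on variables only; constants are mapped to themselves.
    """
    for relation, tuples in d1.items():
        for tup in tuples:
            # h must be defined on all variables occurring in d1.
            if any(is_var(a) and a not in h for a in tup):
                return False
            image = tuple(h[a] if is_var(a) else a for a in tup)
            if image not in d2.get(relation, set()):
                return False
    return True


naive = {"D": {(1, "x", 3, "z"), (5, "y", 7, "x")}}
h = {"x": 2, "y": 6, "z": 4}

print(is_homomorphism(h, naive, {"D": {(1, 2, 3, 4), (5, 6, 7, 2)}}))
print(is_homomorphism(h, naive, {"D": {(1, 2, 3, 4), (5, 6, 7, 8)}}))
```

The first call succeeds, while the second fails because the variable `x` occurs in both rows of the naïve table and would have to be mapped to 2 and to 8 at once.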

Homomorphisms can also be used to give semantics of incomplete databases. It is assumed that a naïve database $D$ represents all complete databases $D'$ (i.e., databases over $C$) such that there is a homomorphism $h : D \to D'$. The set of all such $D'$ is denoted by $[\![D]\!]$.

Note that the satisfiability problem for relational patterns expressed via naïve databases (whether the set $[\![D]\!]$ is not empty) is trivial: the answer is always yes. In the presence of constraints on the schema it can become a fairly complicated problem, sometimes even undecidable.

Containment. Containment asks if for two queries, $Q_1$ and $Q_2$, the result of $Q_1$ is contained in the result of $Q_2$ on every input; equivalence asks if the results are always the same. We write $Q_1 \subseteq Q_2$ and $Q_1 = Q_2$ to denote containment and equivalence. Of course equivalence is just a special case of containment: $Q_1 = Q_2$ iff $Q_1 \subseteq Q_2$ and $Q_2 \subseteq Q_1$.

The containment problem for conjunctive queries is solved via homomorphisms. Given two conjunctive queries $Q_1$ and $Q_2$, we have $Q_1 \subseteq Q_2$ iff there is a homomorphism $h : \mathrm{tab}(Q_2) \to \mathrm{tab}(Q_1)$; this makes the problem NP-complete [13].
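
The homomorphism criterion can be turned into a brute-force containment test. The Python sketch below is only meant to make the criterion concrete, under assumptions of this example: a query is a pair (list of facts, tuple of distinguished variables), variables are strings, constants are integers, and the names `find_homomorphism` and `contained` are hypothetical. It enumerates all candidate mappings, so it runs in exponential time, in line with NP-completeness.

```python
from itertools import product


def is_var(term):
    # Assumption for this sketch: variables are strings, constants are ints.
    return isinstance(term, str)


def find_homomorphism(src, dst, fixed=None):
    """Search for a homomorphism from the facts in src to the facts in dst.

    fixed pre-assigns images to some variables (used for distinguished ones).
    """
    fixed = dict(fixed or {})
    variables = sorted({a for _, args in src for a in args if is_var(a)} - set(fixed))
    targets = sorted({a for _, args in dst for a in args}, key=repr)
    dst_facts = set(dst)
    for image in product(targets, repeat=len(variables)):
        h = {**fixed, **dict(zip(variables, image))}
        if all((rel, tuple(h.get(a, a) for a in args)) in dst_facts for rel, args in src):
            return h
    return None


def contained(q1, q2):
    """Q1 ⊆ Q2 iff tab(Q2) maps homomorphically into tab(Q1)."""
    (facts1, head1), (facts2, head2) = q1, q2
    if len(head1) != len(head2):
        return False
    fixed = {}
    for x2, x1 in zip(head2, head1):
        # Distinguished variables of Q2 must be sent to those of Q1.
        if fixed.setdefault(x2, x1) != x1:
            return False
    return find_homomorphism(facts2, facts1, fixed) is not None


# Q1(x) = ∃y E(x, y) ∧ E(y, x)   and   Q2(x) = ∃y E(x, y)
q1 = ([("E", ("x", "y")), ("E", ("y", "x"))], ("x",))
q2 = ([("E", ("x", "y"))], ("x",))
print(contained(q1, q2), contained(q2, q1))
```

Here the first query asks for an edge in both directions and the second for an outgoing edge only, so the first is contained in the second but not conversely.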

In addition to conjunctive queries (sometimes abbreviated as CQs), we shall consider their unions and Boolean combinations. The former class, sometimes denoted by UCQs, is obtained by closing CQs under union (i.e., if $Q_1, Q_2$ are UCQs producing relations of the same arity, then $Q_1 \cup Q_2$ is a UCQ). For Boolean combinations of conjunctive queries (abbreviated BCCQs), the additional closure rules are that $Q_1 \cap Q_2$, $Q_1 \cup Q_2$, and $Q_1 - Q_2$ are BCCQs.

For these classes containment is still decidable, and the complexity stays in NP for UCQs given explicitly as unions of CQs, and goes up to $\Pi^p_2$-complete for BCCQs [28].

Certain answers and naïve evaluation. Now suppose we have a naïve database $D$ and a query $Q$; assume that $Q$ is Boolean. The standard notion of answering a query on an incomplete database is that of certain answers:

$$\mathrm{certain}(Q, D) = \bigwedge \{ Q(D') \mid D' \in [\![D]\!] \}$$

Let $Q$ be a conjunctive query. Then, for an arbitrary database $D'$, we have $D' \models Q$ iff there is a homomorphism $h : \mathrm{tab}(Q) \to D'$. Thus, for an incomplete database $D$, we have the following easy equivalences:

$$\mathrm{certain}(Q, D) = \mathrm{true} \;\Leftrightarrow\; \forall D' \in [\![D]\!] : \mathrm{tab}(Q) \to D' \;\Leftrightarrow\; \mathrm{tab}(Q) \to D \;\Leftrightarrow\; D \models Q$$

Thus, to compute certain answers, all one needs to do is to run a query on the incomplete database itself. This is referred to as naïve evaluation. Note that the data complexity of finding certain answers is tractable, as it is the same as evaluation of conjunctive queries.
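
The equivalences above can be checked on small instances. The Python sketch below compares naïve evaluation with an enumeration of valuations of the nulls over a finite sample domain; this is a sanity check rather than a proof, since $[\![D]\!]$ is infinite. The encoding is an assumption of this example: query variables are strings starting with `?`, nulls of the naïve database are strings starting with `_`, constants are integers, and the function names are hypothetical.

```python
from itertools import product


def is_qvar(term):
    return isinstance(term, str) and term.startswith("?")


def is_null(term):
    return isinstance(term, str) and term.startswith("_")


def holds(query, db):
    """Evaluate a Boolean CQ on db, treating nulls as ordinary values."""
    adom = sorted({a for _, args in db for a in args}, key=repr)
    qvars = sorted({a for _, args in query for a in args if is_qvar(a)})
    for image in product(adom, repeat=len(qvars)):
        h = dict(zip(qvars, image))
        if all((rel, tuple(h.get(a, a) for a in args)) in db for rel, args in query):
            return True
    return False


def certain_on_sample(query, db, domain):
    """Check the query on every completion of db with nulls drawn from domain."""
    nulls = sorted({a for _, args in db for a in args if is_null(a)})
    for image in product(domain, repeat=len(nulls)):
        v = dict(zip(nulls, image))
        complete = {(rel, tuple(v.get(a, a) for a in args)) for rel, args in db}
        if not holds(query, complete):
            return False
    return True


db = {("R", (1, "_x")), ("R", ("_x", 2))}
path = [("R", ("?u", "?v")), ("R", ("?v", "?w"))]   # ∃u∃v∃w R(u, v) ∧ R(v, w)
loop = [("R", ("?u", "?u"))]                         # ∃u R(u, u)

for q in (path, loop):
    print(holds(q, db), certain_on_sample(q, db, [1, 2, 8, 9]))
```

For both queries the two results agree: the path query is true under naïve evaluation and on every sampled completion, while the loop query is false under naïve evaluation and fails on the completion that maps the null to 8.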

The fact that naïve evaluation works for Boolean conjunctive queries extends in two ways: to UCQs, and to queries with free variables [21]. In some way (for the semantics we considered) the result is optimal within the class of relational algebra queries [23]. In particular, naïve evaluation does not work for BCCQs (even though it was shown recently that data complexity of finding certain answers for BCCQs remains polynomial [19]).

3 Trees, patterns

3.1 Data trees

Data trees provide a standard abstraction of XML documents with data. First we define their structural part, namely unranked trees. A finite unranked tree domain is a non-empty, prefix-closed finite subset $D$ of $\mathbb{N}^*$ (words over $\mathbb{N}$) such that $s \cdot i \in D$ implies $s \cdot j \in D$ for all $j < i$ and $s \in \mathbb{N}^*$. Elements of unranked tree domains are called nodes.
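
The two closure conditions can be checked mechanically. In the following Python sketch, a node is encoded as a tuple of natural numbers with `()` as the root; this encoding and the name `is_tree_domain` are assumptions of the example.

```python
def is_tree_domain(nodes):
    """Check that a finite set of tuples over the naturals is an unranked tree domain."""
    nodes = set(nodes)
    if not nodes:
        return False  # a tree domain is non-empty
    for s in nodes:
        if not s:
            continue  # the root has no parent and no siblings to check
        parent, i = s[:-1], s[-1]
        # Prefix-closed: checking the parent of every node suffices by induction.
        if parent not in nodes:
            return False
        # s·i in D implies s·j in D for all j < i: checking j = i - 1 suffices.
        if i > 0 and parent + (i - 1,) not in nodes:
            return False
    return True


print(is_tree_domain({(), (0,), (1,), (1, 0)}))  # root with two children
print(is_tree_domain({(), (1,)}))                # child 1 without child 0
```

The first set is a tree domain; the second is not, because the node `(1,)` is present without its left sibling `(0,)`.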

We assume a countably infinite set $L$ of possible labels that can be used to label tree nodes. An unranked tree is a structure $\langle D, \downarrow, \rightarrow, \lambda \rangle$, where

– $D$ is a finite unranked tree domain,

– $\downarrow$ is the child relation: $s \downarrow s \cdot i$ for $s \cdot i \in D$,

– $\rightarrow$ is the next-sibling relation: $s \cdot i \rightarrow s \cdot (i + 1)$ for $s \cdot (i + 1) \in D$, and

– $\lambda : D \to L$ is the labeling function assigning a label to each node.

We denote the reflexive-transitive closure of $\downarrow$ by $\downarrow^*$ (descendant-or-self), and the reflexive-transitive closure of $\rightarrow$ by $\rightarrow^*$ (following-sibling-or-self).

In data trees, nodes can carry not only labels but also data values. Given a domain $C$ of data values (e.g., strings, numbers, etc.), a data tree is a structure $t = \langle D, \downarrow, \rightarrow, \lambda, \rho \rangle$, where $\langle D, \downarrow, \rightarrow, \lambda \rangle$ is an unranked tree, and $\rho : D \to C$ assigns each node a data value. Note that in XML documents, nodes may have multiple attributes, but this is easily modeled with data trees.

3.2 Patterns

To explain our approach to defining tree-shaped patterns, consider first data trees restricted just to the child relation, i.e., structures $\langle D, \downarrow, \lambda, \rho \rangle$. They can be defined recursively: a node labeled with $a \in L$ and carrying a data value $v \in C$ is a data tree, and if $t_1, \ldots, t_n$ are trees, we can form a new tree by making them children of a node with label $a$ and data value $v$.

Just like in the relational case, patterns can also use variables from $V$. So our simplest case of patterns is defined as:

$$\pi := a(x)[\pi, \ldots, \pi] \qquad (1)$$

with $a \in L$ and $x \in C \cup V$. Here the sequence in $[\ldots]$ could be empty. In other words, if $\pi_1, \ldots, \pi_n$ is a sequence of patterns (perhaps empty), $a \in L$ and $x \in C \cup V$, then $a(x)[\pi_1, \ldots, \pi_n]$ is a pattern. If $\bar{x}$ is the list of all the variables used in a pattern $\pi$, we write $\pi(\bar{x})$.

We denote patterns from this class by PAT($\downarrow$). As with conjunctive queries, the semantics can be defined via homomorphisms of their tree representations [8, 19], but here we give it in a different, direct way. The semantics of $\pi(\bar{x})$ is defined with respect to a data tree $t = \langle D, \downarrow, \rightarrow, \lambda, \rho \rangle$, a node $s \in D$, and a valuation $\nu : \bar{x} \to C$ as follows: $(t, s, \nu) \models a(x)[\pi_1(\bar{x}_1), \ldots, \pi_n(\bar{x}_n)]$ iff

– $\lambda(s) = a$ (the label of $s$ is $a$);

– $\rho(s) = \nu(x)$ if $x$ is a variable, and $\rho(s) = x$ if $x$ is a data value;

– there exist not necessarily distinct children $s \cdot i_1, \ldots, s \cdot i_n$ of $s$ so that $(t, s \cdot i_j, \nu) \models \pi_j(\bar{x}_j)$ for each $j \le n$ (if $n = 0$, this last item is not needed).

We write $(t, \nu) \models \pi(\bar{x})$ if there is a node $s$ so that $(t, s, \nu) \models \pi(\bar{x})$ (i.e., a pattern is matched somewhere in the tree). Also if $\bar{v} = \nu(\bar{x})$, we write $t \models \pi(\bar{v})$ instead of $(t, \nu) \models \pi(\bar{x})$. We also write $\pi(t)$ for the set $\{\bar{v} \mid t \models \pi(\bar{v})\}$.
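
The semantics of PAT(↓) translates quite directly into a recursive matcher. The Python sketch below is one possible rendering under assumptions of this example: a data tree node is a triple (label, data value, list of children), a pattern is a triple (label, term, list of subpatterns), variables are strings starting with `?`, and the function names are hypothetical. It computes the set $\pi(t)$ for a chosen order of the variables.

```python
def is_var(term):
    # Assumption for this sketch: variables are strings starting with "?".
    return isinstance(term, str) and term.startswith("?")


def match_at(pattern, node, nu):
    """Yield every extension of the valuation nu under which pattern holds at node."""
    label, term, subpatterns = pattern
    node_label, value, children = node
    if label != node_label:
        return
    if is_var(term):
        if nu.get(term, value) != value:
            return  # the variable is already bound to a different data value
        nu = {**nu, term: value}
    elif term != value:
        return
    yield from match_children(subpatterns, children, nu)


def match_children(subpatterns, children, nu):
    if not subpatterns:
        yield nu
        return
    first, rest = subpatterns[0], subpatterns[1:]
    # Witnessing children need not be distinct, so every child is tried each time.
    for child in children:
        for extended in match_at(first, child, nu):
            yield from match_children(rest, children, extended)


def match_anywhere(pattern, node):
    """A pattern may be matched at any node of the tree."""
    yield from match_at(pattern, node, {})
    for child in node[2]:
        yield from match_anywhere(pattern, child)


def answers(pattern, tree, variables):
    return {tuple(nu[v] for v in variables) for nu in match_anywhere(pattern, tree)}


pattern = ("a", "?x", [("b", "?x", []), ("c", "?y", [])])   # a(x)[b(x), c(y)]
tree1 = ("a", 1, [("b", 1, []), ("c", 1, [])])
tree2 = ("a", 1, [("b", 2, []), ("c", 2, [])])
print(answers(pattern, tree1, ["?x", "?y"]))
print(answers(pattern, tree2, ["?x", "?y"]))
```

On the first tree the result is the single tuple (1, 1); on the second tree the result is empty, because the a-node and the b-node carry different values.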

A natural extension for these simple patterns is to include both vertical and horizontal navigation, resulting in the class PAT($\downarrow, \rightarrow$):

$$\begin{array}{l} \pi := a(x)[\mu, \ldots, \mu] \\ \mu := \pi \rightarrow \ldots \rightarrow \pi \end{array} \qquad (2)$$

with $a \in L$ and $x \in C \cup V$ (and the sequences, as before, could be empty). The semantics is given by:

– $(t, s, \nu) \models a(x)[\mu_1(\bar{x}_1), \ldots, \mu_n(\bar{x}_n)]$ if $a(x)$ is satisfied in $s$ by $\nu$ as before and there exist not necessarily distinct children $s \cdot i_1, \ldots, s \cdot i_n$ of $s$ so that $(t, s \cdot i_j, \nu) \models \mu_j(\bar{x}_j)$ for each $j \le n$.

– $(t, s, \nu) \models \pi_1(\bar{x}_1) \rightarrow \ldots \rightarrow \pi_m(\bar{x}_m)$ if there exist consecutive siblings $s_1 \rightarrow s_2 \rightarrow \ldots \rightarrow s_m$, with $s_1 = s$, so that $(t, s_i, \nu) \models \pi_i(\bar{x}_i)$ for each $i \le m$.

Next we consider more expressive versions with transitive closure axes $\downarrow^*$ (descendant) and $\rightarrow^*$ (following sibling). As in [3, 19], we define general patterns by the rules:

$$\begin{array}{l} \pi := a(x)[\mu, \ldots, \mu]//[\mu, \ldots, \mu] \\ \mu := \pi \leadsto \ldots \leadsto \pi \end{array} \qquad (3)$$

References

Data integration: a theoretical perspective

Optimal implementation of conjunctive queries in relational data bases

Incomplete Information in Relational Databases

Equivalences Among Relational Expressions with the Union and Difference Operators

Containment and equivalence for a fragment of XPath
