Proceedings ArticleDOI

Perm: Processing Provenance and Data on the Same Data Model through Query Rewriting

29 Mar 2009 - Proceedings of the 25th International Conference on Data Engineering (ICDE 2009), pp. 174-185
TL;DR: This paper presents an alternative approach that uses query rewriting to annotate result tuples with provenance information and formalizes the query rewriting procedures, proves their correctness, and evaluates a first implementation of the ideas using PostgreSQL.
Abstract: Data provenance is information that describes how a given data item was produced. The provenance includes source and intermediate data as well as the transformations involved in producing the concrete data item. In the context of relational databases, the source and intermediate data items are relations, tuples and attribute values. The transformations are SQL queries and/or functions on the relational data items. Existing approaches capture provenance information by extending the underlying data model. This has the intrinsic disadvantage that the provenance must be stored and accessed using a different model than the actual data. In this paper, we present an alternative approach that uses query rewriting to annotate result tuples with provenance information. The rewritten query and its result use the same model and can, thus, be queried, stored and optimized using standard relational database techniques. In the paper we formalize the query rewriting procedures, prove their correctness, and evaluate a first implementation of the ideas using PostgreSQL. As the experiments indicate, our approach efficiently provides provenance information inducing only a small overhead on normal operations.

Summary (5 min read)

Introduction

  • Data provenance is information that describes how a given data item was produced.
  • In the context of relational databases, the source and intermediate data items are relations, tuples and attribute values.
  • The rewritten query and its result use the same model and can, thus, be queried, stored and optimized using standard relational database techniques.
  • The authors' experiments also demonstrate that Perm can efficiently compute the provenance of complex queries with only a minimal overhead on normal operations.

III. THE PERM APPROACH

  • The authors introduce a pure relational provenance representation and demonstrate how provenance is computed using relational algebra exclusively.
  • Perm uses the influence-contribution semantics of Cui et al. [4], because it better captures user expectations than copy-contribution semantics.
  • For a query computing the sum of an attribute over all tuples of an input relation, the influence-contribution source consists of all input tuples, because each tuple contributed to the result of the sum (see the example after this list).
  • This seems more intuitive than copy-contribution semantics, under which none of the input tuples would qualify.
  • The authors consider queries expressed in the extended version of relational algebra presented below.
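A minimal sketch of the difference, assuming the items(id, price) relation from the paper's running example:

    SELECT sum(price) FROM items;
    -- Influence-contribution (Why-provenance): every tuple of items
    -- contributes to the single result tuple, so the provenance is the
    -- whole relation. Copy-contribution (Where-provenance) would be
    -- empty, since no input value is literally copied to the output.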

A. Underlying Algebra

  • The authors have developed the Perm approach for the variant of relational algebra depicted in Figure 1.
  • Sometimes the authors abuse this notation and use T for result relations and algebra expressions interchangeably.
  • The left outer join T1 ⟕_C T2 is the union of the tuples of the inner join T1 ⋈_C T2 and all tuples from T1 that did not have a join partner in T2, padded with null values.
  • The aggregation operator α_{G,agg} takes a list of grouping attributes G on which the input tuples are grouped before the aggregation functions agg_1, ..., agg_n from list agg are applied (see the worked instance after this list).
  • Set operations (union, intersection and difference) operate on union-compatible inputs T1 and T2 and are available as set- and bag-semantics versions.
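As a worked instance of the aggregation operator, using the running example from Section III-B (item prices 100, 10 and 10 for the shop Merdies):

    α_{name,sum(price)}(σ_{name=sName ∧ itemId=id}(shop × sales × items))

groups the joined tuples on name and computes sum(price) per group; for the group name = Merdies this yields the single result tuple (Merdies, 120)^1.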

B. Perm Provenance Representation

  • Before the authors introduce the relational provenance representation used by Perm, they demonstrate the disadvantages of non-relational provenance representations by a motivating example.
  • The authors represent provenance information as a single relation by extending the original result tuples with the contributing tuples from each base relation accessed by the original query (illustrated after this list).
  • Multiple references to a base relation are handled as separate relations.
  • This representation avoids the redundancy of duplicating result tuples, but has the same disadvantages as the representation from [4].
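A sketch of this representation for the running example (the prov_ attribute names follow the naming scheme from Section IV; numEmpl is an assumed name for the shops' employee-count attribute): the result tuple (Merdies, 120) of q_ex is extended to rows such as

    name    | sum | prov_shop_name | prov_shop_numEmpl | prov_sales_sName | prov_sales_itemId | prov_items_id | prov_items_price
    Merdies | 120 | Merdies        | 3                 | Merdies          | 1                 | 1             | 100
    Merdies | 120 | Merdies        | 3                 | Merdies          | 2                 | 2             | 10
    Merdies | 120 | Merdies        | 3                 | Merdies          | 2                 | 2             | 10

with one row per combination of contributing base-relation tuples, the whole result being an ordinary relation.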

C. Provenance Query Rewrite Mechanism

  • Having presented the structure of a provenance query result and the algebra for which rewrites should be produced, the authors now show how the Perm approach transforms a query into a query that generates the desired provenance result schema and propagates provenance according to influence-contribution semantics.
  • Perm propagates provenance from the source to the result of a query.
  • The rewrite rules produced by these equivalence rules are of the form (T1 ⊗ T2)+ = T1+ ⊗ T2+, where ⊗ is one of the defined join types. The rewrite rule R5 rewrites an aggregation operation (sketched after this list).
  • This is feasible because, according to the definition of aggregation, all tuples with the same grouping attribute values g⃗ contribute to a single result tuple with grouping attribute values g⃗.
  • For set operations the result of the operation would change if the authors add attributes or tuples to its input relations.
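As an illustration, the aggregation rule R5 keeps the original aggregation and joins it back to its rewritten input on the grouping attributes. Paraphrased (the paper's exact formulation also handles null grouping values via null-safe comparison):

    (α_{G,agg}(T1))+ = α_{G,agg}(T1) ⋈_{G = G'} Π_{G → G', P(T1+)}(T1+)

where P(T1+) denotes the list of provenance attributes of the rewritten input T1+.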

D. Example Query Rewrite

  • At this point, T+ has not been computed so P(qex+) is left undefined for now.
  • It seems that the bottom-up approach is better suited, because the P-lists of sub-expressions of a query q needed to compute q’s P-list are immediately available, but as the authors will see in section IV, the bottom-up approach has other disadvantages.
  • A major advantage of their provenance representation format is that provenance is presented as complete base relation tuples and is directly associated with the original data.

E. Correctness Proof

  • It can be shown that the influence-contribution-semantics defined by the Perm rewrite rules is equivalent to the semantics introduced in [4].
  • The equivalence is proven by first showing that the original attribute part of a rewritten query's result tuples is the same as the original result tuples except for multiplicity: Π^S_𝐓(T+) = Π^S_𝐓(T), where 𝐓 is the schema of T.
  • This can be proven by induction over the set of algebra operators.
  • Having shown that the authors do not create spurious result tuples or omit original result tuples, they prove in a second step that the provenance attached to the original result tuples represents exactly the provenance produced by Cui's approach [4].

IV. PROTOTYPE IMPLEMENTATION

  • In this section the authors present the prototype implementation of the Perm approach.

A. SQL Provenance Language Extension

  • 1) Provenance Attribute Handling: Either a new language construct has to be introduced, or provenance and normal attributes are distinguished by name.
  • A provenance attribute name consists of the fixed prefix prov_, the name of the base relation the attribute is derived from, and the original attribute name (e.g., prov_shop_name for attribute name of base relation shop).
  • To support external provenance and incremental provenance computation a user can define that a subset of a from-clause item’s attributes are provenance attributes by appending PROVENANCE to the text of the from-clause item.
  • The Perm module is instructed by the PROVENANCE clause to stop rewriting when processing the subquery and to accept it as an already rewritten subquery.
  • The BASERELATION keyword should be used if a user does not want to trace provenance down to the base relations, but is interested in the influence a view or subquery had on the query results (see the syntax sketch after this list).
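A syntax sketch for these constructs on the running example (relation and view names, and the exact keyword placement, follow the description above; they are illustrative rather than authoritative):

    -- Lazy provenance computation: the result of the original query is
    -- extended with provenance attributes such as prov_shop_name.
    SELECT PROVENANCE name, sum(price)
    FROM shop, sales, items
    WHERE name = sName AND itemId = id
    GROUP BY name;

    -- External or incremental provenance: declare stored attributes of a
    -- from-clause item as provenance attributes; rewriting stops here.
    SELECT PROVENANCE *
    FROM storedResult PROVENANCE (prov_shop_name, prov_shop_numEmpl);

    -- Trace provenance only down to a view, not to its base relations.
    SELECT PROVENANCE * FROM shopView BASERELATION;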

B. Extensions to PostgreSQL

  • The authors have implemented a first prototype of Perm by extending the PostgreSQL DBMS [16].
  • To integrate the rewrite mechanism presented in the last section into PostgreSQL, the rewrite rules have to be adapted to SQL or to the internal algebraic operator tree representation of the database system.
  • For PostgreSQL, each node in the query tree is either an aggregate-select-project-join (ASPJ) segment or a set-operation segment.
  • The position of the brackets is defined by the structure of the set operation tree.
  • This means the authors have to split a set-difference set operation query node into a query node for the set-difference operator Qdiff and two new nodes for its left and right subtrees (Q1 and Q2), as sketched after this list.
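A hedged SQL-level sketch of this splitting (table names r and s are hypothetical):

    -- Before: one query node containing the set difference.
    SELECT a FROM r
    EXCEPT
    SELECT a FROM s;

    -- After splitting (conceptually): a query node Qdiff for the set
    -- difference whose children Q1 and Q2 are separate query nodes.
    SELECT a FROM (SELECT a FROM r) AS q1
    EXCEPT
    SELECT a FROM (SELECT a FROM s) AS q2;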

C. Rewrite Algorithm

  • With the results from the three cases of query node handling the authors are able to explain the rewrite algorithm used in the provenance rewrite module.
  • Node q is passed to the rewriteQueryNode procedure that computes q+.
  • In a second step, all direct child query nodes are rewritten by recursive calls to rewriteQueryNode.
  • To rewrite the aggregation, the original projection is needed (see the sketch after this list).
  • Hence the authors compute provenance top-down to avoid this additional complexity.
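To make this concrete, here is a sketch of the shape of the rewritten query q_ex+ for the running example, assuming the schema shop(name, numEmpl), sales(sName, itemId), items(id, price). Following the aggregation rule, the untouched original aggregation is joined back to its rewritten, aggregation-free input on the group-by attribute:

    SELECT agg.name, agg.profit,
           sub.prov_shop_name,   sub.prov_shop_numEmpl,
           sub.prov_sales_sName, sub.prov_sales_itemId,
           sub.prov_items_id,    sub.prov_items_price
    FROM (SELECT name, sum(price) AS profit         -- original aggregation
          FROM shop, sales, items
          WHERE name = sName AND itemId = id
          GROUP BY name) AS agg
         JOIN
         (SELECT name,                              -- rewritten input:
                 name    AS prov_shop_name,         -- the join result,
                 numEmpl AS prov_shop_numEmpl,      -- extended with
                 sName   AS prov_sales_sName,       -- provenance attributes
                 itemId  AS prov_sales_itemId,
                 id      AS prov_items_id,
                 price   AS prov_items_price
          FROM shop, sales, items
          WHERE name = sName AND itemId = id) AS sub
         ON (agg.name = sub.name);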

D. Rewrite Example

  • The query tree representation of qex is depicted in Figure 8.1.
  • The traverseQueryTree procedure recognizes that the top query node Q is marked for rewrite and calls rewriteQueryNode.
  • In the next step, aggregation target entries are removed from QD, hasAggs is set to false, and group-by- and having-clauses are removed (8.3).
  • Figure 8.4 captures the state after the rewrite of shop.
  • After the rewrite of all base relations there are three attribute lists on pStack.

E. Subqueries in Expressions

  • So far the authors have not explained how subqueries in conditions are handled by the provenance rewrite module.
  • To distinguish between from-clause subqueries and expression subqueries, the authors refer to the latter as sublinks.
  • According to the contribution definition of Cui et al., the provenance of a tuple includes all tuples accessed by the sublink query if the expression C is true independent of the truth-value of the sublink condition Csub (see the sketch after this list).
  • For the tuple t = (Merdies, 3) from the shop relation the where condition C is true whether or not the sublink condition Csub is fulfilled.
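A sketch with the example schema (the attribute name numEmpl is assumed; C and Csub refer to the conditions as in the text):

    SELECT PROVENANCE *
    FROM shop
    WHERE numEmpl > 2                           -- C: disjunction with a sublink
       OR name IN (SELECT sName FROM sales);    -- sublink condition Csub

    -- For the shop tuple (Merdies, 3), numEmpl > 2 already holds, so C is
    -- true independent of Csub; the provenance of this result tuple then
    -- includes all tuples accessed by the sublink query, i.e. all of sales.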

V. EXPERIMENTS

  • Before explaining the experimental evaluation of the Perm approach, the authors summarize the functionality of their system.
  • The authors used the TPC-H benchmark [17] to evaluate the cost of provenance computation with the Perm prototype.
  • In the last configuration the queries are executed with provenance computation.
  • The TPC-H benchmark is provided with a query generator that randomly sets the parameters of a query and adapts the queries to a certain database size.
  • The same sets were used in all configurations.

A. TPC-H Benchmark Results

  • The average absolute and relative overhead introduced by Perm for normal query execution is depicted in Figure 9.
  • This experiment clearly shows that the overhead of Perm for normal operations is negligible.
  • 2) Costs of Provenance Computation: A comparison between the run times of normal and provenance queries is shown in Figure 10.
  • For query 1 and database size 10MB the number of result tuples increases approximately by a factor of 15000, but the execution time only increases by a factor of 4.
  • In the rewritten form, this query includes a theta-join on the rewritten sublink, which itself is a join of its aggregation operation with the accessed base tables.

B. Results for Artificial Queries

  • The authors generated queries consisting only of set operations, with simple selections on the TPC-H base table part (restricted to a random range of primary key attribute values) as inputs for the set operations; see the sketch after this list.
  • The authors expect a significant speedup using the other set rewrite variant (3.a), because it omits the creation of unnecessary intermediate results.
  • Each query is built using numSub leaf subqueries.
  • The results indicate that provenance computation of SPJ queries is a very efficient operation.
  • Each aggregation operates on the result of its child aggregation.
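A generated set-operation query of this kind might look like the following sketch (numSub = 3; the key ranges are chosen at random):

    SELECT * FROM part WHERE p_partkey BETWEEN 1000 AND 2000
    UNION
    SELECT * FROM part WHERE p_partkey BETWEEN 1500 AND 3500
    UNION
    SELECT * FROM part WHERE p_partkey BETWEEN  200 AND  900;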

C. Comparison with the Trio Approach

  • As a final experiment, the authors compared the execution of queries between Perm and the Trio system.
  • For these experiments the authors had to use simple SPJ queries and one level set operations, because other query types are not supported by Trio.
  • Figure 15 presents the overall execution times in seconds for the complete set of queries.
  • Note that Trio does not support lazy provenance computation, so the provenance was computed beforehand.
  • The authors did not compare the performance of their system to the approach presented in [12] because this approach uses different contribution semantics.

VI. CONCLUSION

  • The authors presented Perm, a provenance management system supporting an almost complete subset of the SQL language.
  • In addition to providing lazy and eager computation of provenance, SQL query facilities, and support for external provenance, the experiments described in the paper indicate that Perm can outperform existing approaches by a factor of 30.
  • Further areas of interest include different contribution semantics, different data item granularity and more efficient set-operation, aggregation and sublink computation using specialized plan operators.


University of Zurich
Zurich Open Repository and Archive
Winterthurerstr. 190
CH-8057 Zurich
http://www.zora.uzh.ch
Year: 2009
Perm: Processing provenance and data on the same data model
through query rewriting
Glavic, B; Alonso, G
Glavic, B; Alonso, G (2009). Perm: Processing provenance and data on the same data model through query
rewriting. In: Proceedings of the 25th International Conference on Data Engineering. Shanghai, China, 174-185.
Postprint available at:
http://www.zora.uzh.ch
Posted at the Zurich Open Repository and Archive, University of Zurich.
http://www.zora.uzh.ch
Originally published at:
2009. Proceedings of the 25th International Conference on Data Engineering. Shanghai, China, 174-185.


Perm: Processing provenance and data on the same
data model through query rewriting
Boris Glavic, Database Technology Research Group, University of Zurich (glavic@ifi.uzh.ch)
Gustavo Alonso, Systems Group, Department of Computer Science, ETH Zurich (alonso@inf.ethz.ch)
Abstract: Data provenance is information that describes how
a given data item was produced. The provenance includes source
and intermediate data as well as the transformations involved in
producing the concrete data item. In the context of relational
databases, the source and intermediate data items are relations,
tuples and attribute values. The transformations are SQL queries
and/or functions on the relational data items. Existing approaches
capture provenance information by extending the underlying data
model. This has the intrinsic disadvantage that the provenance
must be stored and accessed using a different model than the
actual data. In this paper, we present an alternative approach that
uses query rewriting to annotate result tuples with provenance
information. The rewritten query and its result use the same
model and can, thus, be queried, stored and optimized using
standard relational database techniques. In the paper we formal-
ize the query rewriting procedures, prove their correctness, and
evaluate a first implementation of the ideas using PostgreSQL.
As the experiments indicate, our approach efficiently provides
provenance information inducing only a small overhead on
normal operations.
I. INTRODUCTION
A large portion of data generated and stored by scientific
databases, data warehouses, and workflow management sys-
tems is not entered manually by a user, but is derived from
existing data using complex transformations. Understanding
the semantics of such data and estimating its quality is not
possible without extensive knowledge about the data’s origin
and the transformations that were used to create it. For these
application domains and, in general, for every application
domain where data is heavily transformed, data provenance
is of essential importance.
Mere storage of provenance information is not very useful
if no query facilities for provenance are provided. Ideally, it
should be possible to use provenance and normal data in the
same query. We use the term provenance management system
(PMS) to refer to a system that is capable of computing
and storing provenance and supports querying the provenance
information [1]. A PMS should also be able to handle prove-
nance imported from external sources (external provenance).
In this paper we describe a novel provenance management system called Perm (Provenance Extension of the Relational Model) that is capable of computing, storing and querying provenance for relational databases. Perm generates provenance by rewriting transformations (queries). For a given query q, Perm generates a single query q+ that produces the same result as q, but extended with additional attributes used to store provenance data. An important advantage of the approach used in Perm is that q+ is also a regular relational algebra statement. Thus, we can use the full expressive power of SQL to, e.g., query the provenance of data items from the result of q, store q+ as a materialized view, and apply standard query optimization techniques to the execution of q+. Perm can be used both to compute provenance on the fly (i.e., at query time) and to store provenance persistently for future access. Perm also supports external provenance and incremental provenance computation reusing stored provenance information.
An important contribution of Perm is that, although not yet
complete, it already covers a far wider range of relational alge-
bra than existing systems. As shown in our experiments, Perm
can run almost all of the queries of the TPC-H benchmark
and efficiently compute their provenance. This is a significant
improvement over existing systems, which typically can only
run a few, if any at all, of the queries in the benchmark.
Our experiments also demonstrate that Perm can efficiently
compute the provenance of complex queries with only a
minimal overhead on normal operations. As an example, on
the few queries that are also supported by the Trio system,
Perm outperforms Trio at least by a factor of 30.
The paper is organized as follows. In the next section we
present related work. Section III covers the formal founda-
tions of the Perm approach. The prototype implementation of
Perm is presented in section IV. An experimental evaluation
of the prototype is discussed in section V. We conclude in
section VI.
II. RELATED WORK
Two types of information are generally considered to form
the provenance of a data item: information about the source
data items that contributed to a result, and the transformations
used in the creation process of the data item. These types
were first introduced in [2] as provenance of data and
provenance of a data product.
Source provenance can be distinguished by the definition of
contribution used to estimate if an input data item influenced
a result data item. Buneman et al. [3] describe two definitions
of contribution semantics: Why- and Where-provenance. Why-
Provenance includes all data items that influence the creation
of a data item. Where-Provenance includes only parts of
the input literally copied to the output. In practice, these
definitions are only two possibilities from a broader solution

space. For example, some systems consider the whole input
of a transformation as the provenance of its output. To have
a consistent terminology, we refer to these definitions as
input-, influence- (for Why-provenance) and copy-contribution
(for Where-provenance). There are basically two approaches
to compute a mapping between sources and results of a
transformation. One approach is to compute some kind of
inverse of the transformation that maps a result (or a set of
results) to the sources contributing to it. The other approach
generates the mapping by propagating identifying annotations
from the source to the result during the execution of the
transformation.
Source provenance can be examined at different levels of
granularity. A composite data item can be seen as a single
data item or each of its components examined separately. For
example, for the relational model, the provenance of data items
could be at the granularity of relations, tuples or attribute
values.
Some approaches compute provenance at the time new data
is derived by a transformation, other approaches reconstruct
provenance when it is requested by a user. In [2], Tan
referred to these strategies as eager and lazy. A provenance
management system that supports both strategies leaves this
choice to the user.
In the area of data warehouses, Cui et al.[4] studied the
problem of finding the portions of the input of a relational
query that influenced a certain part of its output (influence-
contribution). This is achieved by generating inverse queries
for an original query. The result of a provenance computation
is a list of subsets of the base relations accessed by the original
query that contributed to a specific subset of the result. A
disadvantage of this approach is that the result of a provenance
computation is a list of relations and the computation is not
expressed in relational algebra. Hence, provenance queries
cannot be used as subqueries of normal relational queries
and only partially benefit from the query optimization of
the underlying DBMS. Depending on the type of algebraic
operators that are used in a query, the provenance computation
may also require the instantiation of intermediate results.
Trio is a system extending the relational model with prove-
nance and uncertainty [5]. Trio computes the provenance
of a query during execution and stores the results in so-
called lineage relations (eager provenance computation). To
trace the provenance of a tuple produced by a sequence of
transformations the system iteratively computes the contribut-
ing input tuples of each transformation using the lineage
relations. To our knowledge, Trio currently supports only a
small subset of the features present in SQL (i.e., it supports
neither aggregation nor subqueries, and supports only single
set operations and a limited set of datatypes).
DBNotes [6] extends the relational model with attribute level
annotations. Every attribute value can be annotated with a
set of textual annotations. The pSQL language (an extension
of a subset of SQL) allows users to specify how annotations are
propagated from the source to the results of a query. Two
standard propagation schemes are defined, one of which is
invariant under query rewriting. Base relations are extended
with additional annotation attributes. Each annotation attribute
stores the annotations of one of the original attributes. pSQL
currently does not support aggregation, disjunctive queries and
set operations.
Theoretical aspects of provenance have been studied by
Buneman, Tan and Cheney in [7], [8]. Newer approaches use
the nested relational algebra and calculus [9] instead of the
simpler hierarchical language used in [10]. In [11], [12] the
relationships between provenance and program dependency
analysis are studied. While stating interesting theoretical re-
sults, their work has not yet been implemented so it is not
possible to ascertain its impact on practical systems.
Provenance methods developed in the context of workflow
systems and grid computing such as [13], [14] and [15] are
more focused on transformation provenance and only analyze
input-contribution source provenance. In general, these sys-
tems provide more extensive transformation provenance and
some allow re-execution of transformations in case of changed
source data.
The aforementioned systems represent provenance informa-
tion in a data model that is either an extension or completely
different from the data model for which the provenance is
computed. This has the disadvantage that provenance data
cannot be stored directly in the data repositories or queried
using the query language used for the normal data. The
DBNotes and Trio systems provide provenance computation
and query facilities only for a small subset of SQL. We
take a different approach by using solely relational algebra
for provenance computation and representing provenance in
the relational data model. At first sight this seems to be a
strong restriction, but it in fact allows us to achieve a broad
range of functionality by reusing most of the query, storage,
and optimization methods provided by relational databases.
The Perm prototype supports provenance computation for the
complete SQL language except correlated subqueries. For
normal queries and queries on provenance data we support
the complete SQL functionality implemented by PostgreSQL.
III. THE PERM APPROACH
In this section we present the underlying algebra of the
Perm approach, introduce a pure relational provenance rep-
resentation, and demonstrate how provenance is computed
using relational algebra exclusively. Perm uses the influence-
contribution-semantics of Cui et al. [4], because it better
captures user expectations than copy-contribution semantics.
For example, for a query computing the sum of an attribute for
all tuples of an input relation the influence-contribution source
consists of all input tuples, because each tuple contributed to
the result of the sum. This seems to be more intuitive than none
of the inputs which would be the result for copy-contribution
semantics. We consider queries expressed in the extended
version of relational algebra presented below. How the rewrite
mechanism developed for this algebra can be applied to SQL
statements is covered in section IV.

(1a) Set operators

  Π^S_A(T)   = { a = (a_1, ..., a_n)^1 | ∃ t ∈ T : t.A = a }
  T1 ∪^S T2  = { t^1 | t ∈ T1 ∨ t ∈ T2 }
  T1 ∩^S T2  = { t^1 | t ∈ T1 ∧ t ∈ T2 }
  T1 −^S T2  = { t^1 | t ∈ T1 ∧ t ∉ T2 }

(1b) Bag operators

  Π^B_A(T)   = { a = (a_1, ..., a_n)^sum | sum = Σ_{t^n ∈ T ∧ t.A = a} n }
  T1 ∪^B T2  = { t^(n+m) | t^n ∈ T1 ∧ t^m ∈ T2 }
  T1 ∩^B T2  = { t^min(n,m) | t^n ∈ T1 ∧ t^m ∈ T2 }
  T1 −^B T2  = { t^(n−m) | t^n ∈ T1 ∧ t^m ∈ T2 }

(1c) Set/Bag operators

  σ_C(T)     = { t^n | t^n ∈ T ∧ t ⊨ C }
  T1 × T2    = { (t1, t2)^(n×m) | t1^n ∈ T1 ∧ t2^m ∈ T2 }
  T1 ⋈_C T2  = { t = (t1, t2)^(n×m) | t1^n ∈ T1 ∧ t2^m ∈ T2 ∧ t ⊨ C }
  T1 ⟕_C T2  = (T1 ⋈_C T2) ∪ { (t1, t2^null)^n | t1^n ∈ T1 ∧ ¬∃ t2 ∈ T2 : (t1, t2) ⊨ C }
  α_{G,agg}(T) = { (t.G, res_1, ..., res_n)^1 | t ∈ T ∧ ∀ i ∈ {1, ..., n} :
                   res_i = agg_i(Π^B_{B_i}(σ_{G=t.G}(T))) }

Fig. 1: The Perm algebra

A. Underlying Algebra

We have developed the Perm approach for the variant of relational algebra depicted in Figure 1. The algebra operates on bag-semantic relations. For a tuple t^n, n is the multiplicity of t. If no multiplicity is given, an arbitrary multiplicity is assumed. For convenience, a multiplicity of 0 or below is an alternative representation for a tuple that is not included in a relation. t ∈ R is a shortcut for t^n ∈ R with n > 0. We use R to refer to a base relation and T to refer to a base relation or the result of an algebra statement. Sometimes we abuse this notation and use T for result relations and algebra expressions interchangeably. 𝐑 is the schema of a relation R.

Some of the algebra operators are provided in a set- and a bag-semantics version, denoted by the superscripts S and B (Figures 1a and 1b). Projection Π_A projects the tuples of T on the list A and is available in a set- and a bag-semantics version. A is a list of attributes, constants, attribute renamings and deterministic functions on one or more attributes or constants. Renaming attribute a to attribute b is written as a → b. The selection operator's condition C can contain boolean operations, comparison operators, attribute and constant values and functions on values, but is restricted to yield a boolean result. Cross product × and inner join ⋈_C are defined as in standard relational algebra. The left outer join T1 ⟕_C T2 is the union of the tuples of T1 ⋈_C T2 and all tuples from T1 that did not have a join partner in T2 (Figure 1c). t2^null, used in the definition of the left outer join, is a single-tuple relation with schema 𝐓2 and all attribute values set to null. Right outer join and full outer join are defined analogously and are therefore not included in the figure. The aggregation operator α_{G,agg} takes a list of grouping attributes G on which the input tuples are grouped before the aggregation functions agg_1, ..., agg_n from the list agg are applied (Figure 1c). An aggregation function computes a single result value for a set of tuples with schema 𝐁. The result of the aggregation operator includes the results of the aggregation functions and the grouping attribute values. Set operations (union, intersection and difference) operate on union-compatible inputs T1 and T2 and are available in set- and bag-semantics versions. The result schema of a set operation is 𝐓1.

For convenience, renaming all attributes from a list A into B is written A → B, and applying a comparison operator comp to all attributes from lists A and B is written A comp B. For a tuple t, t.A is the projection of t on a list A of attributes, functions and renamings.

B. Perm Provenance Representation

Before we introduce the relational provenance representation used by Perm, we demonstrate the disadvantages of non-relational provenance representations by a motivating example (Figure 2). The example is a database of shops (with name and number of employees), items they are selling, and purchases (sales relation). Assume a user wants to query the total profits for each shop. This query can be expressed in relational algebra as:

  q_ex = α_{name,sum(price)}(σ_{name=sName ∧ itemId=id}(prod))
  prod = shop × sales × items

If the user wants to investigate which source tuples contributed to the result tuple (Merdies, 120) using influence-contribution semantics (Why-provenance), the result should include tuple (Merdies, 3) from relation shop, tuples (Merdies, 1), (Merdies, 2) and (Merdies, 2) from relation sales, and tuples (1, 100) and (2, 10) from relation items. The approach from [4] would present this provenance data as the
Citations
Book
01 Jan 2010
TL;DR: This monograph contends that provenance can and should reliably be tracked and exploited on the Web, and investigates the necessary foundations to achieve such a vision, as well as identifying an open approach and a model for provenance.
Abstract: Provenance, i.e., the origin or source of something, is becoming an important concern, since it offers the means to verify data products, to infer their quality, to analyse the processes that led to them, and to decide whether they can be trusted. For instance, provenance enables the reproducibility of scientific results; provenance is necessary to track attribution and credit in curated databases; and, it is essential for reasoners to make trust judgements about the information they use over the Semantic Web. As the Web allows information sharing, discovery, aggregation, filtering and flow in an unprecedented manner, it also becomes very difficult to identify, reliably, the original source that produced an information item on the Web. Since the emerging use of provenance in niche applications is undoubtedly demonstrating the benefits of provenance, this monograph contends that provenance can and should reliably be tracked and exploited on the Web, and investigates the necessary foundations to achieve such a vision. Multiple data sources have been used to compile the largest bibliographical database on provenance so far. This large corpus permits the analysis of emerging trends in the research community. Specifically, the CiteSpace tool identifies clusters of papers that constitute research fronts, from which characteristics are extracted to structure a foundational framework for provenance on the Web. Such an endeavour requires a multi-disciplinary approach, since it requires contributions from many computer science sub-disciplines, but also other non-technical fields given the human challenge that is anticipated. To develop such a vision, it is necessary to provide a definition of provenance that applies to the Web context. A conceptual definition of provenance is expressed in terms of processes, and is shown to generalise various definitions of provenance commonly encountered. Furthermore, by bringing realistic distributed systems assumptions, this definition is refined as a query over assertions made by applications. Given that the majority of work on provenance has been undertaken by the database, workflow and e-science communities, some of their work is reviewed, contrasting approaches, and focusing on important topics believed to be crucial for bringing provenance to the Web, such as abstraction, collections, storage, queries, workflow evolution, semantics and activities involving human interactions. However, provenance approaches developed in the context of databases and workflows essentially deal with closed systems. By that, it is meant that workflow or database management systems are in full control of the data they manage, and track their provenance within their own scope, but not beyond. In the context of the Web, a broader approach is required by which chunks of provenance representation can be brought together to describe the provenance of information flowing across multiple systems. For this purpose, this monograph puts forward the Open Provenance Vision, which is an approach that consists of controlled vocabulary, serialisation formats and interfaces to allow the provenance of individual systems to be expressed, connected in a coherent fashion, and queried seamlessly. In this context, the Open Provenance Model is an emerging community-driven representation of provenance, which has been actively used by some 20 teams to exchange provenance information, in line with the Open Provenance Vision. 
After identifying an open approach and a model for provenance, techniques to expose provenance over the Web are investigated. In particular, Semantic Web technologies are discussed since they have been successfully exploited to express, query and reason over provenance. Symmetrically, Semantic Web technologies such as RDF, underpinning the Linked Data effort, are analysed since they offer their own difficulties with respect to provenance. A powerful argument for provenance is that it can help make systems transparent, so that it becomes possible to determine whether a particular use of information is appropriate under a set of rules. Such capability helps make systems and information accountable. To offer accountability, provenance itself must be authentic, and rely on security approaches, which are described in the monograph. This is then followed by systems where provenance is the basis of an auditing mechanism to check past processes against rules or regulations. In practice, not all users want to check and audit provenance, instead, they may rely on measures of quality or trust; hence, emerging provenance-based approaches to compute trust and quality of data are reviewed.

248 citations


Cites background from "Perm: Processing Provenance and Dat..."

  • ...databases), whereas [55] deals with the provenance of cell contents in the presence of updates, and [107, 176] deal with tuples in SQL databases....


  • ...Glavic [176] introduces an alternative terminology for these notions: copy-...


  • ...Others are most prescriptive, and identify data items that are the original raw value as it entered a system [55], data items that were the cause of a given data [107, 176, 59], and variants where a summary of how the data were used is incorporated [188]....


  • ...Examples of such systems include operating system [218][159], desktop [265], statistical packages [143, 21], databases [59, 406, 176], data warehouses [104], and workflow systems (Kepler [39, 41], Taverna [419], VisTrails [156], VDS [424], Pegasus [278])....


  • ...In the case of databases, such information may be referred to as annotations [93] or simply provenance data [176]....


Book ChapterDOI
03 Dec 2012
TL;DR: The system has been designed to decouple the collection, storage, and querying of provenance metadata, with a novel provenance kernel that mediates between the producers and consumers ofprovenance information, and handles the persistent storage of records.
Abstract: SPADE is an open source software infrastructure for data provenance collection and management. The underlying data model used throughout the system is graph-based, consisting of vertices and directed edges that are modeled after the node and relationship types described in the Open Provenance Model. The system has been designed to decouple the collection, storage, and querying of provenance metadata. At its core is a novel provenance kernel that mediates between the producers and consumers of provenance information, and handles the persistent storage of records. It operates as a service, peering with remote instances to enable distributed provenance queries. The provenance kernel on each host handles the buffering, filtering, and multiplexing of incoming metadata from multiple sources, including the operating system, applications, and manual curation. Provenance elements can be located locally with queries that use wildcard, fuzzy, proximity, range, and Boolean operators. Ancestor and descendant queries are transparently propagated across hosts until a terminating expression is satisfied, while distributed path queries are accelerated with provenance sketches.

237 citations


Cites background from "Perm: Processing Provenance and Dat..."

  • ...A number of systems track the provenance of database elements, including Trio [65], DBNotes [6], and Perm [20]....


Proceedings ArticleDOI
06 Jun 2010
TL;DR: A query language for provenance is developed, which can express all of the aforementioned types of queries, as well as many more, and the feasibility of provenance querying and the benefits of the indexing techniques across a variety of application classes and queries are experimentally validated.
Abstract: Many advanced data management operations (e.g., incremental maintenance, trust assessment, debugging schema mappings, keyword search over databases, or query answering in probabilistic databases), involve computations that look at how a tuple was produced, e.g., to determine its score or existence. This requires answers to queries such as, "Is this data derivable from trusted tuples?"; "What tuples are derived from this relation?"; or "What score should this answer receive, given initial scores of the base tuples?". Such questions can be answered by consulting the provenance of query results. In recent years there has been significant progress on formal models for provenance. However, the issues of provenance storage, maintenance, and querying have not yet been addressed in an application-independent way. In this paper, we adopt the most general formalism for tuple-based provenance, semiring provenance. We develop a query language for provenance, which can express all of the aforementioned types of queries, as well as many more; we propose storage, processing and indexing schemes for data provenance in support of these queries; and we experimentally validate the feasibility of provenance querying and the benefits of our indexing techniques across a variety of application classes and queries.

215 citations


Cites background from "Perm: Processing Provenance and Dat..."

  • ...In recent years there has been significant progress on formal models for provenance....


Proceedings ArticleDOI
06 Jun 2010
TL;DR: The design and implementation of ExSPAN is presented, a generic and extensible framework that achieves efficient network provenance in a distributed environment and demonstrates that the system supports a wide range of distributed provenance computations efficiently, resulting in significant reductions in bandwidth costs compared to traditional approaches.
Abstract: Network accountability, forensic analysis, and failure diagnosis are becoming increasingly important for network management and security. Such capabilities often utilize network provenance - the ability to issue queries over network meta-data. For example, network provenance may be used to trace the path a message traverses on the network as well as to determine how message data were derived and which parties were involved in its derivation. This paper presents the design and implementation of ExSPAN, a generic and extensible framework that achieves efficient network provenance in a distributed environment. We utilize the database notion of data provenance to "explain" the existence of any network state, providing a versatile mechanism for network provenance. To achieve such flexibility at Internet-scale, ExSPAN uses declarative networking in which network protocols can be modeled as continuous queries over distributed streams and specified concisely in a declarative query language. We extend existing data models for provenance developed in database literature to enable distribution at Internet-scale, and investigate numerous optimization techniques to maintain and query distributed network provenance efficiently. The ExSPAN prototype is developed using RapidNet, a declarative networking platform based on the emerging ns-3 toolkit. Experiments over a simulated network and an actual deployment in a testbed environment demonstrate that our system supports a wide range of distributed provenance computations efficiently, resulting in significant reductions in bandwidth costs compared to traditional approaches.

150 citations

Journal ArticleDOI
01 Nov 2015
TL;DR: Titian is a library that enables data provenance (tracking data through transformations) in Apache Spark while minimally impacting Spark job performance; observed overheads for capturing data lineage rarely exceed 30% above the baseline job execution time.
Abstract: Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time consuming effort. Today's DISC systems offer very little tooling for debugging programs, and as a result programmers spend countless hours collecting evidence (e.g., from log files) and performing trial and error debugging. To aid this effort, we built Titian, a library that enables data provenance---tracking data through transformations---in Apache Spark. Data scientists using the Titian Spark extension will be able to quickly identify the input data at the root cause of a potential bug or outlier result. Titian is built directly into the Spark platform and offers data provenance support at interactive speeds---orders-of-magnitude faster than alternative solutions---while minimally impacting Spark job performance; observed overheads for capturing data lineage rarely exceed 30% above the baseline job execution time.

116 citations

References
Book ChapterDOI
04 Jan 2001
TL;DR: An approach to computing provenance when the data of interest has been created by a database query is described, adopting a syntactic approach and present results for a general data model that applies to relational databases as well as to hierarchical data such as XML.
Abstract: With the proliferation of database views and curated databases, the issue of data provenance - where a piece of data came from and the process by which it arrived in the database - is becoming increasingly important, especially in scientific databases where understanding provenance is crucial to the accuracy and currency of data. In this paper we describe an approach to computing provenance when the data of interest has been created by a database query. We adopt a syntactic approach and present results for a general data model that applies to relational databases as well as to hierarchical data such as XML. A novel aspect of our work is a distinction between "why" provenance (refers to the source data that had some influence on the existence of the data) and "where" provenance (refers to the location(s) in the source databases from which the data was extracted).

1,338 citations

Proceedings ArticleDOI
27 Jun 2006
TL;DR: This paper describes an approach in which the user's actions are tracked while browsing source databases and copying data into a curated database, in order to record the user's actions in a convenient, queryable form, and uses it to evaluate the feasibility of database support for provenance management.
Abstract: Curated databases in bioinformatics and other disciplines are the result of a great deal of manual annotation, correction and transfer of data from other sources. Provenance information concerning the creation, attribution, or version history of such data is crucial for assessing its integrity and scientific value. General purpose database systems provide little support for tracking provenance, especially when data moves among databases. This paper investigates general-purpose techniques for recording provenance for data that is copied among databases. We describe an approach in which we track the user's actions while browsing source databases and copying data into a curated database, in order to record the user's actions in a convenient, queryable form. We present an implementation of this technique and use it to evaluate the feasibility of database support for provenance management. Our experiments show that although the overhead of a naive approach is fairly high, it can be decreased to an acceptable level using simple optimizations.

317 citations


"Perm: Processing Provenance and Dat..." refers background in this paper

  • ...Theoretical aspects of provenance have been studied by Buneman, Tan and Cheney in [7], [8]....


Book
15 Dec 2000
TL;DR: The history of POSTGRESQL and its applications, including its development at the University of California at Berkeley, and its use in Java and other programming languages, is reviewed.

197 citations


"Perm: Processing Provenance and Dat..." refers methods in this paper

  • ...We have implemented a first prototype of Perm by extending the PostgreSQL DBMS [16]....


  • ...Hence, provenance queries cannot be used as subqueries of normal relational queries and only partially benefit from the query optimization of the underlying DBMS. Depending on the type of algebraic operators that are used in a query, the provenance computation may also require the instantiation of intermediate results....


Proceedings ArticleDOI
14 Jun 2005
TL;DR: DBNotes, a Post-It note system for relational databases where every piece of data may be associated with zero or more notes (or annotations), is demonstrated, which can easily determine the provenance of data through a sequence of transformation steps simply by examining the annotations.
Abstract: We demonstrate DBNotes, a Post-It note system for relational databases where every piece of data may be associated with zero or more notes (or annotations). These annotations are transparently propagated along as data is being transformed. The method by which annotations are propagated is based on provenance (aka lineage): the annotations associated with a piece of data d in the result of a transformation consist of the annotations associated with each piece of data in the source where d is copied from. One immediate application of this system is to use annotations to systematically trace the provenance and flow of data. If every piece of source data is attached with an annotation that describes its address (i.e., origins), then the annotations of a piece of data in the result of a transformation describe its provenance. Hence, one can easily determine the provenance of data through a sequence of transformation steps simply by examining the annotations. Annotations can also be used to store additional information about data. Since a database schema is often proprietary, the ability to insert new information about data without having to change the underlying schema is a useful feature. For example, an error report could be attached to an erroneous piece of data, and this error report will be propagated to other databases along transformations, thus notifying other users of the error. Overall, the annotations on the result of a transformation can also provide an estimate on the quality of the resulting database.

152 citations


"Perm: Processing Provenance and Dat..." refers background in this paper

  • ...DBNotes [6] extends the relational model with attribute level annotations....


Journal ArticleDOI
TL;DR: In this article, the authors study provenance for query and update languages that are closely related to SQL, and compare two ways in which they can manipulate provenance so that elements of the input are rearranged to elements of an output, where a query or update only provides the rearranged output, and provenance is provided implicitly by a default provenance semantics.
Abstract: Information describing the origin of data, generally referred to as provenance, is important in scientific and curated databases where it is the basis for the trust one puts in their contents. Since such databases are constructed using operations of both query and update languages, it is of paramount importance to describe the effect of these languages on provenance.In this article we study provenance for query and update languages that are closely related to SQL, and compare two ways in which they can manipulate provenance so that elements of the input are rearranged to elements of the output: implicit provenance, where a query or update only provides the rearranged output, and provenance is provided implicitly by a default provenance semantics; and explicit provenance, where a query or update provides both the output and the description of the provenance of each component of the output. Although explicit provenance is in general more expressive, we show that the classes of implicit provenance operations expressible by query and update languages correspond to natural semantic subclasses of the explicit provenance queries.One of the consequences of this study is that provenance separates the expressive power of query and update languages. The model is also relevant to annotation propagation schemes in which annotations on the input to a query or update have to be transferred to the output or vice versa.

131 citations

Frequently Asked Questions (2)
Q1. What contributions have the authors mentioned in the paper "Perm: processing provenance and data on the same data model through query rewriting" ?

In this paper, the authors present an alternative approach that uses query rewriting to annotate result tuples with provenance information. In the paper the authors formalize the query rewriting procedures, prove their correctness, and evaluate a first implementation of the ideas using PostgreSQL. As the experiments indicate, their approach efficiently provides provenance information inducing only a small overhead on normal operations.

Q2. What future works have the authors mentioned in the paper "Perm: processing provenance and data on the same data model through query rewriting" ?

The authors plan to investigate if the Perm approach can be extended to transformation provenance, data manipulation queries and correlated sublinks. Further areas of interest include different contribution semantics, different data item granularity and more efficient set-operation, aggregation and sublink computation using specialized plan operators.