Edinburgh Research Explorer
Semistructured Data
Citation for published version:
Buneman, P 1997, Semistructured Data. in PODS '97 Proceedings of the sixteenth ACM SIGACT-
SIGMOD-SIGART symposium on Principles of database systems. ACM, pp. 117-121.
https://doi.org/10.1145/263661.263675
Digital Object Identifier (DOI):
10.1145/263661.263675
Document Version:
Early version, also known as pre-print
Published In:
PODS '97 Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database
systems

Semistructured Data
Peter Buneman
Department of Computer and Information Science
University of Pennsylvania
Philadelphia, PA 19104-6389
peter@cis.upenn.edu
Abstract
In semistructured data, the information that is normally associated
with a schema is contained within the data, which is sometimes called
"self-describing". In some forms of semistructured data there is no
separate schema, in others it exists but only places loose constraints
on the data. Semistructured data has recently emerged as an important
topic of study for a variety of reasons. First, there are data sources
such as the Web, which we would like to treat as databases but which
cannot be constrained by a schema. Second, it may be desirable to have
an extremely flexible format for data exchange between disparate
databases. Third, even when dealing with structured data, it may be
helpful to view it as semistructured for the purposes of browsing. This
tutorial will cover a number of issues surrounding such data: finding a
concise formulation, building a sufficiently expressive language for
querying and transformation, and optimization problems.
1 The motivation
The topic of semistructured data (also called unstructured data) is
relatively recent, and a tutorial on the topic may well be premature.
It represents, if anything, the convergence of a number of lines of
thinking about new ways to represent and query data that do not
completely fit with conventional data models. The purpose of this
tutorial is to describe this motivation and to suggest areas in which
further research may be fruitful. For a similar exposition, the reader
is referred to Serge Abiteboul's recent survey paper [1].
The slides for this tutorial will be made available from a section of
the Penn database home page (http://www.cis.upenn.edu/~db).
This work was partly supported by the Army Research Office
(DAAH04-95-1-0169) and the National Science Foundation (CCR92-16122).
1.1 Some data really is unstructured
The most obvious motivation comes from the need to bring new forms of
data into the ambit of conventional database technology. Some of these,
such as documents with structured text [3, 2] and data formats [9, 17],
while they may call for increasingly expressive query languages and new
optimization techniques, only require mild extensions to the existing
notion of data models such as ODMG [13]. However, these extensions
still require the prior imposition of structure on the data, and there
are some forms of data for which this is genuinely difficult.
The most immediate example of data that cannot be constrained by a
schema is the World-Wide-Web. As database researchers we would like to
think of this as a database, but to what extent are database tools
available for querying or maintaining the web? Most web queries exploit
information retrieval techniques to retrieve individual pages from
their contents, but there is little available that allows us to use the
structure of the web in formulating queries, and since the web does not
obviously conform to any standard data model, we need a method of
describing its structure.
Another example, little known to the database community but
responsible for piquing the author's interest in this topic, is the
database management system ACeDB, which is popular with biologists
[36]. Superficially it looks like an object-oriented database system,
for it has a schema language that resembles that of an object-oriented
DBMS; but this schema imposes only loose constraints on the data.
Moreover, the relationship between data and schema is not easily
described in object-oriented terms, and there are structures that are
naturally expressed in ACeDB, such as trees of arbitrary depth, that
cannot be queried using conventional techniques.
1.2 Data Integration
A second motivation is that of data exchange and transformation, which
is the starting point for the Tsimmis project [33, 21] at Stanford. The
rationale here is that none of the existing data models is
all-embracing, so that it is difficult to build software that will
easily convert between two disparate models. The Object Exchange Model
(OEM) offers a highly flexible data structure that may be used to
capture most kinds of data and provides a substrate in which almost any
other data structure may be represented. In effect, OEM is an internal
data structure for exchange of data between DBMSs, but having such a
structure invites the idea of querying data in OEM format directly.

Figure 1: An example movie database.
[Diagram not reproduced. Edge labels recoverable from it: Entry, Movie,
TV Show, Title, Cast, Director, Episode, Special Guests, Credit,
Actors, References, Is referenced in; data labels: "Casablanca",
"Bogart", "Bacall", "Play it again, Sam", "Allen", 1.2E6, and the
integers 1, 2, 3.]
1.3 Browsing
A nal motivation is that of browsing. Generally sp eaking,
a user cannot write a database query without knowledge
of the schema. However, schemas may have opaque termi-
nology and the rationale for the design is often dicult to
understand. It may help in understanding the schema to be
able to query data without full knowledge of the schema.
For example the queries,
Where in the database is the string
"Casablanca"
to
be found?
Are there integers in the database greater than 2
16
?
What ob jects in the database have an attribute name
that starts with
"act"
Such questions cannot be answered in any generic fashion by standard
relational or object-oriented query languages. While languages have
been proposed that allow schema and data to be queried simultaneously
[24] in the context of relational and object-oriented database systems,
these languages do not have the flexibility to express complex
constraints on paths, and it is not clear how their implementation will
work on the structures described below.
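As a rough illustration of what "generic" means here, the three browsing queries can be sketched in Python over a tree of labeled edges (the representation introduced in Section 2). Everything in the sketch is an assumption made for illustration: the class `Sym`, the helper names, the case-insensitive reading of "starts with", the `Budget` label attached to the number, and the fragment of data, which is only loosely modeled on Figure 1.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sym:
    """A symbol label (attribute-like name), kept distinct from string data."""
    name: str

E = frozenset()  # the empty tree

def walk(tree, path=()):
    """Yield (path, label) for every edge; path is the tuple of labels above it.
    Assumes an acyclic tree built from nested frozensets."""
    for label, subtree in tree:
        yield path, label
        yield from walk(subtree, path + (label,))

def where_is(tree, text):
    """Paths leading to each edge labeled with the string `text`."""
    return [path for path, label in walk(tree) if label == text]

def ints_greater_than(tree, bound):
    """Integer labels strictly greater than `bound`."""
    return [label for _, label in walk(tree)
            if isinstance(label, int) and not isinstance(label, bool)
            and label > bound]

def symbols_starting_with(tree, prefix):
    """Names of symbol labels that start with `prefix` (case-insensitive)."""
    return sorted({label.name for _, label in walk(tree)
                   if isinstance(label, Sym)
                   and label.name.lower().startswith(prefix.lower())})

# Hypothetical data; "Budget" is an invented label for the numeric value.
db = frozenset({
    (Sym("Entry"), frozenset({(Sym("Movie"), frozenset({
        (Sym("Title"), frozenset({("Casablanca", E)})),
        (Sym("Cast"), frozenset({("Bogart", E), ("Bacall", E)})),
    }))})),
    (Sym("Entry"), frozenset({(Sym("Movie"), frozenset({
        (Sym("Title"), frozenset({("Play it again, Sam", E)})),
        (Sym("Cast"), frozenset({(Sym("Actors"), frozenset({("Allen", E)}))})),
        (Sym("Budget"), frozenset({(1200000, E)})),
    }))})),
})
```

None of the three functions mentions Movie, Title or any other schema name, which is the point of the browsing motivation: the traversal relies only on the fact that every edge carries a typed label.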
2 The Model
The unifying idea in semistructured data is the representation of data
as some kind of graph-like or tree-like structure. Although we shall
allow cycles in the data, we shall generally refer to these graphs as
trees. The example in Figure 1 is taken from [10], in which the data
model is formalized as an edge-labeled graph. The structure is taken
(with some inaccuracies) from a well-known web database [23] that
provides a good example of semistructured data. There are several
things to note about it. If one confines one's attention to the parts
of the database below Movie edges, the data appears fairly regular
except that there are two ways of representing a cast. That is, the
data does not quite fit with some relational or object-oriented
presentation. Edges are labeled with data, of types such as int and
string and possibly other base or external abstract types (video,
audio, etc.). Edges are also labeled with names such as Movie and Title
that would normally be used for attribute or class names. We shall
refer to such labels as symbols. Internally they are represented as
strings. Note that arrays may be represented by labeling internal edges
with integers. We can formulate the type of this kind of labeled tree
as:
type label = int | string | ... | symbol
type tree = set(label × tree)
The first line describes a tagged union or variant, the second says
that a tree is a set of label/tree pairs. The edges out of nodes in our
trees are assumed to be unordered.
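One way to make the two type declarations concrete is the following Python sketch. It is only an approximation under stated assumptions: the names `Symbol`, `Label`, `Tree`, `tree` and `leaf` are ours, the "..." in the label type is left out, and nested frozensets can only represent acyclic trees, whereas the model proper allows cycles.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple, Union

@dataclass(frozen=True)
class Symbol:
    """An attribute-like label such as Movie or Title (held as a string)."""
    name: str

# type label = int | string | ... | symbol
Label = Union[int, str, Symbol]

# type tree = set(label x tree); a frozenset keeps the outgoing edges
# unordered and lets one tree be nested inside another.
Tree = FrozenSet[Tuple[Label, "Tree"]]

EMPTY: Tree = frozenset()

def tree(*edges):
    """Build a tree from (label, subtree) pairs."""
    return frozenset(edges)

def leaf(value):
    """A data value is an edge labeled with the value, ending in the empty tree."""
    return tree((value, EMPTY))

# Hypothetical fragment; the integer labels show the array encoding
# mentioned in the text.
movie = tree(
    (Symbol("Title"), leaf("Casablanca")),
    (Symbol("Cast"), tree((1, leaf("Bogart")), (2, leaf("Bacall")))),
)
db = tree((Symbol("Entry"), tree((Symbol("Movie"), movie))))
```

Because a tree is a set, duplicate label/tree pairs collapse and the order in which edges are written is irrelevant, which matches the assumption that edges are unordered.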
There are a number of variations on this basic model, and it is worth
briefly reviewing them. In [5] leaf nodes are labeled with data,
internal nodes are not labeled with meaningful data, and edges are
labeled only with symbols:
type base = int | string | ...
type tree = base | set(symbol × tree)
The differences between the two models are minor and give rise to
minor differences in the query language. It is easy to define mappings
in both directions.
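The text does not spell the mappings out, so the following is one plausible pair, sketched in Python under our own assumptions: a base value at a leaf becomes a single edge carrying that value, and the reverse direction is defined only on trees of that shape. The names `Sym`, `to_edge_labeled` and `to_leaf_labeled` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sym:
    """A symbol label in the edge-labeled model."""
    name: str

EMPTY = frozenset()

def to_edge_labeled(t):
    """Leaf-labeled tree (base | set(symbol x tree)) -> edge-labeled tree."""
    if isinstance(t, (int, str)):        # a base value sitting at a leaf
        return frozenset({(t, EMPTY)})   # becomes an edge carrying the value
    return frozenset((Sym(name), to_edge_labeled(sub)) for name, sub in t)

def to_leaf_labeled(t):
    """Reverse direction; defined only on trees produced by to_edge_labeled."""
    if len(t) == 1:
        (label, sub), = t
        if not isinstance(label, Sym):
            if sub != EMPTY:
                raise ValueError("data label with children has no leaf-labeled form")
            return label
    out = set()
    for label, sub in t:
        if not isinstance(label, Sym):
            raise ValueError("data label mixed with sibling edges")
        out.add((label.name, to_leaf_labeled(sub)))
    return frozenset(out)
```

An edge-labeled tree that puts data labels on internal edges (for example the integer-labeled array encoding) falls outside the image of the first function, so a total mapping in that direction would need an extra convention; the sketch simply raises an error in that case.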
Another possibility is to allow labels on internal nodes, for example:
type label = int | string | ... | symbol
type tree = label × set(label × tree)
The problem with using this representation directly is that it makes
the operation of taking the union of two trees difficult to define.
However, by introducing extra edges, this representation can be
converted into one of the edge-labeled representations above.
A nal and more complex issue is that of ob ject identity,
by which we mean node lab els { or p ossibly edge labels
{ that, apart from an equality test, are not observable in
the query language. In OEM, ob ject identities are used as
node lab els and place-holders to dene trees. While ob ject-
identities provide an ecient way to dene and test equality

within
a database, they pose problems when comparing data
across
databases. See [10, 25, 32] discussions and related
work.
It is straightforward to encode relational and object-oriented
databases in this model, although in the latter case one must take care
to deal with the issue of object identity. However, the coding is not
unique, and the examples in [10] and [5] show some differences in how
tuples of sets are treated.
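Since the coding is not unique, the Python sketch below shows just one possible encoding of a relation, chosen for illustration and not taken from [10] or [5]: the relation name labels an edge from the root, each row hangs below an edge labeled with the hypothetical symbol `tuple`, and each attribute value becomes a data-labeled edge. The relation and attribute names are invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sym:
    """A symbol label (relation or attribute name)."""
    name: str

EMPTY = frozenset()

def encode_relation(name, rows):
    """Encode rows (dicts from attribute name to base value) as an edge-labeled tree."""
    tuples = frozenset(
        (Sym("tuple"),
         frozenset((Sym(attr), frozenset({(value, EMPTY)}))
                   for attr, value in row.items()))
        for row in rows
    )
    return frozenset({(Sym(name), tuples)})

# Hypothetical relation with a repeated row.
rows = [
    {"Title": "Casablanca", "Director": "Curtiz"},
    {"Title": "Casablanca", "Director": "Curtiz"},
    {"Title": "Play it again, Sam", "Director": "Ross"},
]
movies = encode_relation("Movies", rows)
```

Because trees are sets of label/tree pairs, the repeated row collapses into a single `tuple` edge, so this encoding gives set semantics for relations without any extra work.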
The term "self-describing" is often used to describe unstructured
data. In each of the models we have described, the data is a tagged
union type, and one can imagine a program whose behavior is dynamically
determined by "switching" on the type. For example, a program's
behavior may be altered by whether it finds an integer or string as a
label, and one would expect any language for dealing with
semistructured data to incorporate predicates that describe the type of
an edge or node. The situation is similar to that in programming
languages. Lisp and many interpreted and scripting languages are
dynamically typed. Predicates are available to determine (at run time)
the type of a value or the class of an object. Languages in the Algol
tradition (Pascal, C, ML, Modula) are statically typed. Predicates are
not needed to determine the type of a value because it is known from
the source code of the program and hence to the programmer. There is a
good analogy between dynamic type systems and semistructured data on
one hand, and static type systems and databases with schemas on the
other.
3 Query Languages
There appear to be two general approaches to devising query languages
for semistructured data. The first is to take SQL (or perhaps OQL
[14, 13]) as a starting point and add enough "features" to perform a
useful class of queries. The second approach is to start from a
language based on some formal notion of computation on semistructured
data and then to massage that language into acceptable syntax. It is
remarkable that the two approaches appear to end up with very similar
languages.
Let us start with the first approach to see what kinds of queries are
useful. The following SQL-like syntax suggests itself:
select Entry.Movie.Title
from DB
where Entry.Movie.Director ...
However, the syntax does not make clear how much of the two paths
Entry.Movie.Title and Entry.Movie.Director are to be taken as the same.
The solution is to introduce variables to indicate how paths or edges
are to be tied together. These variables can then be used in other
expressions to form new structures. Label variables, tree variables and
possibly path variables are needed to express a reasonable set of
queries.
The next problem is that one wants to specify paths of arbitrary
length to find, for example, all the strings in the database. This
requires us to be able to express arbitrary paths in our syntax. Even
this is not enough. Consider the problem of finding whether "Allen"
acted in "Casablanca". One might try this by searching for paths from a
Movie edge down to an "Allen" edge, but one would not want this path to
contain another Movie edge. These problems indicate that one would like
to have something like regular expressions to constrain paths.
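The "Allen" example can be made concrete with a small Python sketch. It is not the syntax of any of the languages discussed here; it hand-codes the one path constraint from the text (descend from a Movie edge to the target label without crossing another Movie edge). The data is a hypothetical acyclic fragment loosely based on Figure 1, and all function names are ours.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sym:
    """A symbol label, kept distinct from string data."""
    name: str

EMPTY = frozenset()
MOVIE, TITLE = Sym("Movie"), Sym("Title")

def t(*edges):
    return frozenset(edges)

def subtrees_under(tree, symbol):
    """Yield every subtree hanging below an edge labeled `symbol`, at any depth."""
    for label, sub in tree:
        if label == symbol:
            yield sub
        yield from subtrees_under(sub, symbol)

def occurs_below(tree, target, forbidden=None):
    """True if an edge labeled `target` is reachable without crossing a `forbidden` edge."""
    for label, sub in tree:
        if label == target:
            return True
        if label != forbidden and occurs_below(sub, target, forbidden):
            return True
    return False

def acted_in(db, actor, title):
    """Does `actor` occur below the Movie titled `title`, with no nested Movie edge on the path?"""
    for movie in subtrees_under(db, MOVIE):
        has_title = any(label == TITLE and (title, EMPTY) in sub
                        for label, sub in movie)
        if has_title and occurs_below(movie, actor, forbidden=MOVIE):
            return True
    return False

# Hypothetical data: Casablanca points at a movie that references it.
sam = t((TITLE, t(("Play it again, Sam", EMPTY))),
        (Sym("Cast"), t((Sym("Actors"), t(("Allen", EMPTY))))))
casablanca = t((TITLE, t(("Casablanca", EMPTY))),
               (Sym("Cast"), t(("Bogart", EMPTY), ("Bacall", EMPTY))),
               (Sym("Is referenced in"), t((MOVIE, sam))))
db = t((Sym("Entry"), t((MOVIE, casablanca))), (Sym("Entry"), t((MOVIE, sam))))
```

An unconstrained search, `occurs_below(casablanca, "Allen")`, finds "Allen" through the reference edge, whereas `acted_in(db, "Allen", "Casablanca")` does not. A regular path expression would state the same constraint declaratively instead of in traversal code.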
The "select" fragment of UnQL [10] and the Lorel query language [5]
solve these problems with very similar syntactic forms. Lorel, which is
a component of the Lore project [27], requires a rich set of
overloadings for its operators for dealing with comparisons of objects
with values and of values with sets. These are avoided in UnQL by not
having object identity and exploiting a simple form of pattern
matching. Other languages that use a SQL-like syntax include a
precursor to Lorel [34], and WebSQL [29, 7], which contains a number of
constructs specific to web queries. A language for web site management
is proposed in [18].
Having asked what the surface syntax should look like, one wants to
ask what the underlying computational strategy should be. Here there
appear to be two principled strategies. The first is to model the graph
as a relational database and then exploit a relational query language.
In our labeled graph model this is remarkably simple. We can take the
database as a large relation of type (node-id, label, node-id) and
consider the expressive power of relational languages on this
structure, but this apparently simple approach has a number of
complications:
1. Our labels are drawn from a heterogeneous collection of types, so
   it may be appropriate to use more than one relation.
2. If information also is held at nodes, one needs additional
   relations to express this.
3. The node identifiers may only be used as temporary node labels, and
   one may want to limit the way they can appear in the output of the
   query. How they are used is related to the discussion of object
   identity.
4. We are concerned with what is accessible from a given "root" by
   forward traversal of the edges, and one may want to limit the
   languages appropriately.
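The relational view, together with the first complication above, can be sketched in Python with two sets of (node-id, label, node-id) triples, one for symbol labels and one for string labels. The node numbers, the split into two relations and the data are assumptions made for illustration; the comprehension plays the role of a four-way join and is not the output of any particular query compiler.

```python
# Edges labeled with symbols: (source node, symbol, target node).
sym_edges = {
    (0, "Entry", 1), (1, "Movie", 2), (2, "Title", 3), (2, "Cast", 4),
    (0, "Entry", 5), (5, "Movie", 6), (6, "Title", 7),
}
# Edges labeled with string data: (source node, string, target node).
str_edges = {
    (3, "Casablanca", 8), (4, "Bogart", 9), (4, "Bacall", 10),
    (7, "Play it again, Sam", 11),
}

def titles(root):
    """Strings reached from `root` along the path Entry.Movie.Title."""
    return sorted(
        s
        for (r, l1, e) in sym_edges if r == root and l1 == "Entry"
        for (e2, l2, m) in sym_edges if e2 == e and l2 == "Movie"
        for (m2, l3, t) in sym_edges if m2 == m and l3 == "Title"
        for (t2, s, _) in str_edges if t2 == t
    )

print(titles(0))  # ['Casablanca', 'Play it again, Sam']
```

Only the strings are returned; the node identifiers are used to tie the joins together and never appear in the result, in line with the third complication. Starting every query at a given root reflects the fourth.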
Some forms of unbounded search will require recursive queries, i.e., a
"graph datalog", and such languages are proposed in [26, 16] for the
web and for hypertext. Theoretical treatments of queries that deal with
computation on graphs or on the web appear in [6, 30]. It should also
be mentioned that this model of computation is used in [5, 15] as a
starting point for optimization.
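To give a feel for the recursive case, the two datalog-style rules reach(root) and reach(Y) :- reach(X), edge(X, L, Y) can be evaluated by naive fixpoint iteration. The Python sketch below is a generic illustration of that evaluation strategy and is not taken from the languages proposed in [26, 16]; the edge relation is hypothetical and deliberately contains a cycle and an inaccessible edge.

```python
def reachable(edges, root):
    """Least fixpoint of: reach(root).  reach(Y) :- reach(X), edge(X, L, Y)."""
    reach = {root}
    changed = True
    while changed:
        # One round of rule application: follow every edge out of a reached node.
        new = {dst for (src, _label, dst) in edges if src in reach} - reach
        changed = bool(new)
        reach |= new
    return reach

# Hypothetical edge relation of type (node-id, label, node-id).
edges = {
    (0, "Entry", 1), (1, "Movie", 2),
    (2, "References", 3), (3, "Is referenced in", 2),  # a cycle
    (9, "Orphan", 10),                                 # not accessible from node 0
}

def labels_from(edges, root):
    """Labels on edges accessible from `root` by forward traversal."""
    reach = reachable(edges, root)
    return sorted(label for (src, label, _dst) in edges if src in reach)
```

The iteration stops because the set of reached nodes can only grow and is bounded by the number of nodes, so the cycle between nodes 2 and 3 causes no difficulty.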
The second strategy is adopted in the basis for UnQL [11, 10]. Here
the starting point is that of structural recursion, and is an extension
of a principle put forward in [12] that there are natural forms of
computation associated with the type. For semistructured data one
starts with the natural form of recursion associated with the recursive
datatype of labeled trees. However, some restrictions need to be placed
for such recursive programs to be well-defined: we want them to be
well-defined on graphs with cycles. These restrictions give rise to an
algebra that can be viewed as having two components: a "horizontal"
component that expresses computations across the edges of a given node
(and from this, computations to a fixed depth from the root); and a
"vertical" component that expresses computations that go to arbitrary
depths in the graph. A property of this algebra is that, when
restricted to input and output data that conform to a relational
(nested relational) schema, it expresses exactly the relational (nested
relational) algebra. Hence an SQL-like language is a natural fragment
of UnQL.
The SQL-like or OQL-like languages we have mentioned typically bring
information to the surface, but they are not capable of performing
complex or "deep" restructuring of the data. Simple examples of such
operations include deleting/collapsing edges with a certain property,
relabeling edges, or performing local interchanges. Both "graph
datalog" and UnQL are capable of various forms of restructuring. For
example, in UnQL one can write a query that corrects the egregious
error in the "Bacall" edge label. One can also perform a number of
global restructuring functions such as deleting edges with certain
properties or adding new edges to "short-circuit" various paths. The
relationship between the restructuring possible in UnQL and what can be
done in "graph datalog" is not understood. Some simple forms of
restructuring are also present in a view definition language proposed
in [4].
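The flavor of such restructuring can be suggested by a Python sketch in which a function is applied to each edge independently and each node is processed once, so that the computation terminates on cyclic graphs. This is a simplified stand-in written for illustration, not UnQL's structural recursion itself; the adjacency-list representation, the name `map_edges`, the use of plain strings for both symbols and data, and the replacement label "Bergman" (the text does not say what the correction should be) are all assumptions.

```python
def map_edges(graph, root, f):
    """Rebuild the part of `graph` reachable from `root`, replacing each edge
    label by f(label); an edge is deleted when f returns None."""
    out = {}
    stack = [root]
    while stack:
        node = stack.pop()
        if node in out:          # each node is handled once, so cycles terminate
            continue
        out[node] = []
        for label, target in graph.get(node, []):
            new_label = f(label)
            if new_label is not None:
                out[node].append((new_label, target))
                stack.append(target)
    return out

# Hypothetical graph: node -> list of (label, target node); nodes 1 and 3 form a cycle.
graph = {
    0: [("Movie", 1)],
    1: [("Cast", 2), ("References", 3)],
    2: [("Bogart", 4), ("Bacall", 5)],
    3: [("Is referenced in", 1)],
    4: [],
    5: [],
}

# Relabeling: correct one edge label wherever it occurs.
fixed = map_edges(graph, 0, lambda l: "Bergman" if l == "Bacall" else l)

# Deletion: drop every References edge; nodes no longer reachable disappear.
pruned = map_edges(graph, 0, lambda l: None if l == "References" else l)
```

Because the function sees one edge at a time and never inspects what lies below it, the result does not depend on the order of traversal, which is what makes this style of definition meaningful on graphs with cycles.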
4 Implementation and Optimizations
This topic is very much in its infancy and again depends on the
underlying representation of the data. Moreover, the optimization
problems differ depending on whether one is using a semistructured
model as an interface to existing data or one is building a data
structure to represent semistructured data directly [28]. In the former
case the extensions of existing techniques for optimization of
object-oriented or relational query languages mentioned above may be
exploited together with the addition of path or text indices on labels
and strings. In the second case, disk layout and clustering, together
with appropriate indexing, are also important.
In [10] a large class of computations can be shown to be translatable
into a basic graph transformation technique which, in turn, allows some
simple optimizations. Also, some of the basic optimizations of the
relational algebra can be applied to the "vertical" computations. In
[35] it is shown how an analysis of the query, combined with some
segmentation of the graph into local "sites", can be used to decompose
a query into independent, parallel sub-queries. In [5] and [15]
extensions to optimization techniques for object-oriented query
languages are exploited. In [19] a translation is specified for a
fragment of UnQL into an underlying relational structure.
5 Adding Structure
One of the main attractions of semistructured data is that it is
unconstrained. Nevertheless, it may be appropriate to impose (or to
discover) some form of structure in the data. In [8] a schema is
defined as a graph whose edges are labeled with predicates, and the
property of simulation is used to describe the relationship between
data and schema. In [31, 22] the schema is also an edge-labeled graph
and the stronger relationship of automata equivalence is used. In [20]
schemas are used for further optimization.
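The simulation idea can be illustrated with a short Python sketch: a data node d is simulated by a schema node s if every edge out of d is matched by some schema edge out of s whose predicate accepts the label and whose target simulates the edge's target. The code computes the largest such relation by repeatedly discarding violating pairs. It is our reading of the definition, not code from [8]; the graph representation, the predicates and the data are hypothetical.

```python
def conforms(data, data_root, schema, schema_root):
    """True if data_root is simulated by schema_root.
    data:   node -> list of (label, node)
    schema: node -> list of (predicate, node)"""
    rel = {(d, s) for d in data for s in schema}   # start from all pairs
    changed = True
    while changed:
        changed = False
        for d, s in list(rel):
            ok = all(
                any(pred(label) and (d2, s2) in rel for pred, s2 in schema[s])
                for label, d2 in data[d]
            )
            if not ok:
                rel.discard((d, s))
                changed = True
    return (data_root, schema_root) in rel

def is_string(label):
    return isinstance(label, str)

# Hypothetical schema: a Movie has a Title holding a string and a Cast that may
# nest string-labeled edges to any depth (a loose constraint).
schema = {
    "S0": [(lambda l: l == "Movie", "S1")],
    "S1": [(lambda l: l == "Title", "S2"), (lambda l: l == "Cast", "S3")],
    "S2": [(is_string, "S4")],
    "S3": [(is_string, "S3")],
    "S4": [],
}

good = {0: [("Movie", 1)], 1: [("Title", 2), ("Cast", 4)],
        2: [("Casablanca", 3)], 3: [], 4: [("Bogart", 5)], 5: []}

# Same data plus a Year edge that the schema does not allow below Movie.
bad = dict(good)
bad[1] = good[1] + [("Year", 6)]
bad[6] = [(1942, 7)]
bad[7] = []
```

The self-loop on S3 lets both ways of representing a cast conform to the same schema, which is the sense in which such a schema places only loose constraints on the data. In this sketch symbols and string data are both plain Python strings, a simplification of the label type.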
Schemas are useful for browsing and for providing partial answers to
queries. They will also be needed for the passage back from
semistructured to structured data, for which a richer notion of schema
is necessary. This is an area in which much further work is needed.
6 Acknowledgments
I would like to thank Susan Davidson and Dan Suciu for their
collaboration and for stimulating my interest in this area. I am
greatly indebted to Serge Abiteboul for most constructive discussions
on a number of issues.
References
[1] Serge Abiteboul. Querying semi-structured data. In Proceedings of
    ICDT, Jan 1997.
[2] Serge Abiteboul, Sophie Cluet, Vassilis Christophides, Tova Milo,
    and Jérôme Siméon. Querying documents in object databases. In
    Journal of Digital Libraries, volume 1:1, 1997.
[3] Serge Abiteboul, Sophie Cluet, and Tova Milo. Querying and updating
    the file. In Proceedings of 19th International Conference on Very
    Large Databases, pages 73-84, Dublin, Ireland, 1993.
[4] Serge Abiteboul, Roy Goldman, Jason McHugh, Vasilis Vassalos, and
    Yue Zhuge. Views for semistructured data. Technical report,
    Stanford University, 1997.
[5] Serge Abiteboul, Dallan Quass, Jason McHugh, Jennifer Widom, and
    Janet L. Weiner. The Lorel query language for semistructured data.
    In Journal of Digital Libraries, volume 1:1, 1997. To appear. See
    http://www-db.stanford.edu/pub/papers/.
[6] Serge Abiteboul and Victor Vianu. Queries and computation on the
    web. In Proceedings of ICDT, Jan 1997.
[7] Gustavo O. Arocena, Alberto O. Mendelzon, and George A. Mihaila.
    Applications of a Web query language. In Proc. 6th. Int'l. WWW
    Conf., April 1997. In press.
[8] P. Buneman, S. Davidson, Mary Fernandez, and D. Suciu. Adding
    structure to unstructured data. In Proceedings of ICDT, January
    1997.
[9] P. Buneman, S.B. Davidson, K. Hart, C. Overton, and L. Wong. A data
    transformation system for biological data sources. In Proceedings
    of VLDB, Sept 1995.
[10] Peter Buneman, Susan Davidson, Gerd Hillebrand, and Dan Suciu. A
    query language and optimization techniques for unstructured data.
    In Proceedings of ACM-SIGMOD International Conference on Management
    of Data, pages 505-516, Montreal, Canada, June 1996.
[11] Peter Buneman, Susan Davidson, and Dan Suciu. Programming
    constructs for unstructured data. In Proceedings of 5th
    International Workshop on Database Programming Languages, Gubbio,
    Italy, September 1995. To appear.
[12] Peter Buneman, Shamim Naqvi, Val Tannen, and Limsoon Wong.
    Principles of programming with complex objects and collection
    types. Theoretical Computer Science, 149(1):3-48, September 1995.
[13] R. G. G. Cattell, editor. The Object Database Standard: ODMG-93.
    Morgan Kaufmann, San Mateo, California, 1996.
[14] Sophie Cluet and Claude Delobel. A general framework for the
    optimization of object oriented queries. In M. Stonebraker, editor,
    Proceedings ACM-SIGMOD International Conference on Management of
    Data, pages 383-392, San Diego, California, June 1992.
[15] Sophie Cluet and Guido Moerkotte. Query processing in the
    schemaless and semistructured context. Technical report, INRIA,
    1997.
