Semistructured data

doi:10.1145/263661.263675

Edinburgh Research Explorer

Semistructured Data

Citation for published version:

Buneman, P 1997, Semistructured Data. in PODS '97 Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems. ACM, pp. 117-121.

https://doi.org/10.1145/263661.263675

Digital Object Identifier (DOI):

10.1145/263661.263675


Document Version:

Early version, also known as pre-print

Published In:

PODS '97 Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems


Semistructured Data



Peter Buneman

Department of Computer and Information Science

University of Pennsylvania

Philadelphia, PA 19104-6389

peter@cis.upenn.edu

Abstract

In semistructured data, the information that is normally associated with a schema is contained within the data, which is sometimes called "self-describing". In some forms of semistructured data there is no separate schema; in others it exists but only places loose constraints on the data. Semistructured data has recently emerged as an important topic of study for a variety of reasons. First, there are data sources such as the Web, which we would like to treat as databases but which cannot be constrained by a schema. Second, it may be desirable to have an extremely flexible format for data exchange between disparate databases. Third, even when dealing with structured data, it may be helpful to view it as semistructured for the purposes of browsing. This tutorial will cover a number of issues surrounding such data: finding a concise formulation, building a sufficiently expressive language for querying and transformation, and optimization problems.

1 The motivation

The topic of semistructured data (also called unstructured data) is relatively recent, and a tutorial on the topic may well be premature. It represents, if anything, the convergence of a number of lines of thinking about new ways to represent and query data that do not completely fit with conventional data models. The purpose of this tutorial is to describe this motivation and to suggest areas in which further research may be fruitful. For a similar exposition, the reader is referred to Serge Abiteboul's recent survey paper [1].

The slides for this tutorial will be made available from a section of the Penn database home page, http://www.cis.upenn.edu/~db.



This work was partly supported by the Army Research Office (DAAH04-95-1-0169) and the National Science Foundation (CCR92-16122).

1.1 Some data really is unstructured

The most obvious motivation comes from the need to bring new forms of data into the ambit of conventional database technology. Some of these, such as documents with structured text [3, 2] and data formats [9, 17], while they may call for increasingly expressive query languages and new optimization techniques, only require mild extensions to the existing notion of data models such as ODMG [13]. However, these extensions still require the prior imposition of structure on the data, and there are some forms of data for which this is genuinely difficult.

The most immediate example of data that cannot be constrained by a schema is the World-Wide-Web. As database researchers we would like to think of this as a database, but to what extent are database tools available for querying or maintaining the web? Most web queries exploit information retrieval techniques to retrieve individual pages from their contents, but there is little available that allows us to use the structure of the web in formulating queries, and since the web does not obviously conform to any standard data model, we need a method of describing its structure.

Another example, little known to the database community but responsible for piquing the author's interest in this topic, is the database management system ACeDB, which is popular with biologists [36]. Superficially it looks like an object-oriented database system, for it has a schema language that resembles that of an object-oriented DBMS; but this schema imposes only loose constraints on the data. Moreover, the relationship between data and schema is not easily described in object-oriented terms, and there are structures that are naturally expressed in ACeDB, such as trees of arbitrary depth, that cannot be queried using conventional techniques.

1.2 Data Integration

A second motivation is that of data exchange and transformation, which is the starting point for the Tsimmis project [33, 21] at Stanford. The rationale here is that none of the existing data models is all-embracing, so that it is difficult to build software that will easily convert between two disparate models. The Object Exchange Model (OEM) offers a highly flexible data structure that may be used to capture most kinds of data and provides a substrate in which almost any other data structure may be represented. In effect, OEM is an internal data structure for exchange of data between DBMSs, but having such a structure invites the idea of querying data in OEM format directly.

[Figure 1: An example movie database. An edge-labeled graph with three Entry edges leading to two Movie subtrees (Title, Cast, Director) and one TV Show subtree (Title, Cast, Episode). Other labels shown include "Casablanca", "Bogart", "Bacall", "Play it again, Sam", "Allen", Credit Actors, Special Guests, Actors, 1.2E6, the integers 1, 2, 3, References, and Is referenced in.]

1.3 Browsing

A nal motivation is that of browsing. Generally sp eaking,

a user cannot write a database query without knowledge

of the schema. However, schemas may have opaque termi-

nology and the rationale for the design is often dicult to

understand. It may help in understanding the schema to be

able to query data without full knowledge of the schema.

For example the queries,



• Where in the database is the string "Casablanca" to be found?
• Are there integers in the database greater than 2^16?
• What objects in the database have an attribute name that starts with "act"?

Such questions cannot be answered in any generic fashion by standard relational or object-oriented query languages. While languages have been proposed that allow schema and data to be queried simultaneously [24] in the context of relational and object-oriented database systems, these languages do not have the flexibility to express complex constraints on paths, and it is not clear how their implementation will work on the structures described below.

2 The Model

The unifying idea in semistructured data is the representation of data as some kind of graph-like or tree-like structure. Although we shall allow cycles in the data, we shall generally refer to these graphs as trees. The example in Figure 1 is taken from [10], in which the data model is formalized as an edge-labeled graph. The structure is taken (with some inaccuracies) from a well-known web database [23] that provides a good example of semistructured data. There are several things to note about it. If one confines one's attention to the parts of the database below Movie edges, the data appears fairly regular except that there are two ways of representing a cast. That is, the data does not quite fit with some relational or object-oriented presentation. Edges are labeled both with data, of types such as int and string and possibly other base or external abstract types (video, audio, etc.). Edges are also labeled with names such as Movie and Title that would normally be used for attribute or class names. We shall refer to such labels as symbols. Internally they are represented as strings. Note that arrays may be represented by labeling internal edges with integers. We can formulate the type of this kind of labeled tree as:

    type label = int | string | ... | symbol
    type tree = set(label × tree)

The rst line describ es a tagged union or variant, the

second says that a tree is a set of lab el/tree pairs. The edges

out of nodes in our trees are assumed to b e unordered.
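As a rough illustration, the two type definitions can be transcribed into Python. This is only a sketch under our own assumptions: the names `Symbol`, `tree`, and `leaf` are hypothetical helpers, the sample data is loosely modeled on Figure 1, and nested frozensets can only express acyclic data (cycles would need explicit node identifiers).

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple, Union


@dataclass(frozen=True)
class Symbol:
    """A label such as Movie or Title; internally it is just a string."""
    name: str


# type label = int | string | ... | symbol
Label = Union[int, str, Symbol]
# type tree = set(label x tree); a frozenset keeps the edges unordered
Tree = FrozenSet[Tuple[Label, "Tree"]]

EMPTY: Tree = frozenset()


def tree(*edges):
    """Build a tree from (label, subtree) pairs."""
    return frozenset(edges)


def leaf(value):
    """A data value is an edge labeled with that value, ending in the empty tree."""
    return tree((value, EMPTY))


# A fragment loosely modeled on Figure 1; the exact shape is illustrative.
movie = tree(
    (Symbol("Title"), leaf("Casablanca")),
    # An array of cast members, represented by integer-labeled edges.
    (Symbol("Cast"), tree((1, leaf("Bogart")), (2, leaf("Bacall")))),
)
db = tree((Symbol("Entry"), tree((Symbol("Movie"), movie))))
```

Because a tree is a set, two trees built from the same edges in a different order compare equal, which matches the assumption that edges are unordered.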

There are a number of variations on this basic model, and it is worth briefly reviewing them. In [5] leaf nodes are labeled with data, internal nodes are not labeled with meaningful data, and edges are labeled only with symbols:

    type base = int | string | ...
    type tree = base | set(symbol × tree)

The dierences b etween the two models are minor and

give rise to minor dierences in the query language. It is

easy to dene mappings in b oth directions.

Another possibility is to allow labels on internal nodes, for example:

    type label = int | string | ... | symbol
    type tree = label × set(label × tree)

The problem with using this representation directly is that it makes the operation of taking the union of two trees difficult to define. However, by introducing extra edges, this representation can be converted into one of the edge-labeled representations above.
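The conversion can be sketched as follows, assuming one particular convention that is ours and not the paper's: each node label is moved onto an extra edge carrying a reserved label (the hypothetical `NODE` marker below, which is assumed never to occur in the data). Labels are plain Python values here for brevity.

```python
# Hypothetical reserved edge label; assumed not to clash with real labels.
NODE = "@node"


def to_edge_labeled(t):
    """Convert (node_label, set of (edge_label, subtree)) to a set of edges."""
    node_label, edges = t
    converted = {(label, to_edge_labeled(sub)) for label, sub in edges}
    # The extra edge records what used to be the node label.
    converted.add((NODE, frozenset({(node_label, frozenset())})))
    return frozenset(converted)


# A node labeled "m1" with one Title edge to a node labeled "t1".
node_labeled = ("m1", frozenset({("Title", ("t1", frozenset()))}))
edge_labeled = to_edge_labeled(node_labeled)

# Once converted, the union of two trees is ordinary set union.
union = edge_labeled | frozenset({("Year", frozenset())})
```

The point of the sketch is only that, after conversion, union no longer has to decide which of two node labels survives.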

A nal and more complex issue is that of ob ject identity,

by which we mean node lab els { or p ossibly edge labels

{ that, apart from an equality test, are not observable in

the query language. In OEM, ob ject identities are used as

node lab els and place-holders to dene trees. While ob ject-

identities provide an ecient way to dene and test equality

within

a database, they pose problems when comparing data

across

databases. See [10, 25, 32] discussions and related

work.

It is straightforward to encode relational and object-oriented databases in this model, although in the latter case one must take care to deal with the issue of object identity. However, the coding is not unique, and the examples in [10] and [5] show some differences in how tuples of sets are treated.

The term "self-describing" is often used to describe unstructured data. In each of the models we have described, the data is a tagged union type, and one can imagine a program whose behavior is dynamically determined by "switching" on the type. For example, a program's behavior may be altered by whether it finds an integer or string as a label, and one would expect any language for dealing with semistructured data to incorporate predicates that describe the type of an edge or node. The situation is similar to that in programming languages. Lisp and many interpreted and scripting languages are dynamically typed. Predicates are available to determine (at run time) the type of a value or the class of an object. Languages in the Algol tradition (Pascal, C, ML, Modula) are statically typed. Predicates are not needed to determine the type of a value because it is known from the source code of the program and hence to the programmer. There is a good analogy between dynamic type systems and semistructured data on one hand, and static type systems and databases with schemas on the other.
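To make the idea of "switching" on the type concrete, the Python sketch below answers the three browsing queries of Section 1.3 over a small edge-labeled tree by testing the run-time type of each label. The helper names, the `Budget` symbol, and its integer value are hypothetical, and the prefix test is assumed to be case-insensitive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    name: str


EMPTY = frozenset()


def edges(t, path=()):
    """Yield (path, label) for every edge; path ends with that edge's label."""
    for label, sub in t:
        here = path + (label,)
        yield here, label
        yield from edges(sub, here)


def kind(label):
    """Switch on the run-time type of a label."""
    if isinstance(label, Symbol):
        return "symbol"
    if isinstance(label, int):
        return "int"
    if isinstance(label, str):
        return "string"
    return "other"


def find_string(t, s):
    """Where in the database is the string s to be found?"""
    return [p for p, l in edges(t) if kind(l) == "string" and l == s]


def big_integers(t, bound=2 ** 16):
    """Are there integers in the database greater than the bound?"""
    return [l for _, l in edges(t) if kind(l) == "int" and l > bound]


def symbols_starting(t, prefix):
    """Which attribute names start with the given prefix (ignoring case)?"""
    return {l.name for _, l in edges(t)
            if kind(l) == "symbol" and l.name.lower().startswith(prefix)}


# Illustrative data only; Budget and its value are made up.
db = frozenset({(Symbol("Entry"), frozenset({(Symbol("Movie"), frozenset({
    (Symbol("Title"), frozenset({("Casablanca", EMPTY)})),
    (Symbol("Actors"), frozenset({("Bogart", EMPTY)})),
    (Symbol("Budget"), frozenset({(1200000, EMPTY)})),
}))}))})
```

None of the three functions refers to a schema; each inspects only the labels it encounters, which is the sense in which the data is self-describing.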

3 Query Languages

There appear to be two general approaches to devising query languages for semistructured data. First, take SQL (or perhaps OQL [14, 13]) as a starting point and add enough "features" to perform a useful class of queries. The second approach is to start from a language based on some formal notion of computation on semistructured data, then to massage that language into acceptable syntax. It is remarkable that the two approaches appear to end up with very similar languages.

Let us start with the rst approach to see what what

kinds of queries are useful. The following SQL-like syntax

suggests itself:

select Entry.Movie.Title

from DB

where Entry.Movie.Director ...

However, the syntax does not make clear how much of the two paths Entry.Movie.Title and Entry.Movie.Director are to be taken as the same. The solution is to introduce variables to indicate how paths or edges are to be tied together. These variables can then be used in other expressions to form new structures. Label variables, tree variables and possibly path variables are needed to express a reasonable set of queries.

The next problem is that one wants to specify paths of arbitrary length to find, for example, all the strings in the database. This requires us to be able to express arbitrary paths in our syntax. Even this is not enough. Consider the problem of finding whether "Allen" acted in "Casablanca". One might try this by searching for paths from a Movie edge down to an "Allen" edge, but one would not want this path to contain another Movie edge. These problems indicate that one would like to have something like regular expressions to constrain paths.
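The "Allen" example can be sketched in Python as a traversal that refuses to cross a forbidden label, which corresponds roughly to a path pattern of the form Movie.(not Movie)*."Allen". The graph shape, node names, and helper functions below are our own assumptions, loosely following Figure 1; symbols and string data are both plain strings here.

```python
# node -> list of (label, target); the cross-reference edges form a cycle.
GRAPH = {
    "root": [("Entry", "e1"), ("Entry", "e2")],
    "e1": [("Movie", "m1")],
    "e2": [("Movie", "m2")],
    "m1": [("Title", "t1"), ("Cast", "c1"), ("References", "e2")],
    "t1": [("Casablanca", "leaf")],
    "c1": [("Bogart", "leaf"), ("Bacall", "leaf")],
    "m2": [("Title", "t2"), ("Cast", "c2"), ("Is referenced in", "e1")],
    "t2": [("Play it again, Sam", "leaf")],
    "c2": [("Allen", "leaf")],
    "leaf": [],
}


def labels_below(graph, start, forbidden):
    """Labels on paths from start that never use the forbidden label."""
    seen, stack, found = {start}, [start], set()
    while stack:
        node = stack.pop()
        for label, target in graph[node]:
            if label == forbidden:
                continue  # do not follow or report this edge
            found.add(label)
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return found


def movie_nodes(graph, root):
    """All nodes reached by a Movie edge somewhere below the root."""
    seen, stack, movies = {root}, [root], set()
    while stack:
        node = stack.pop()
        for label, target in graph[node]:
            if label == "Movie":
                movies.add(target)
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return movies


def has_title(graph, movie, title):
    """True if the movie node has a Title edge leading to the given title."""
    return any(label == "Title" and any(l == title for l, _ in graph[t])
               for label, t in graph[movie])


def acted_in(graph, root, actor, title):
    for movie in movie_nodes(graph, root):
        if has_title(graph, movie, title) and \
                actor in labels_below(graph, movie, "Movie"):
            return True
    return False
```

Without the constraint, the References edge would lead from the first movie into the second, and "Allen" would wrongly be reported below "Casablanca".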

The "select" fragment of UnQL [10] and the Lorel query language [5] solve these problems with very similar syntactic forms. Lorel, which is a component of the Lore project [27], requires a rich set of overloadings for its operators for dealing with comparisons of objects with values and of values with sets. These are avoided in UnQL by not having object identity and exploiting a simple form of pattern matching. Other languages that use a SQL-like syntax include a precursor to Lorel [34], and WebSQL [29, 7], which contains a number of constructs specific to web queries. A language for web site management is proposed in [18].

Having asked what the surface syntax should look like, one wants to ask what the underlying computational strategy should be. Here there appear to be two principled strategies. The first is to model the graph as a relational database and then exploit a relational query language. In our labeled graph model this is remarkably simple. We can take the database as a large relation of type (node-id, label, node-id) and consider the expressive power of relational languages on this structure, but this apparently simple approach has a number of complications:

1. Our labels are drawn from a heterogeneous collection of types, so it may be appropriate to use more than one relation.
2. If information also is held at nodes, one needs additional relations to express this.
3. The node identifiers may only be used as temporary node labels, and one may want to limit the way they can appear in the output of the query. How they are used is related to the discussion of object identity.
4. We are concerned with what is accessible from a given "root" by forward traversal of the edges, and one may want to limit the languages appropriately.
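The relational view can be sketched in Python by holding the database as a set of (node-id, label, node-id) triples and evaluating a path by repeated selection and join. The node numbers, the sample data, and the function names are assumptions made for illustration; the loop variable `m` plays the role of the variable that ties the Title and Director paths to the same movie.

```python
# One relation of (node-id, label, node-id) triples; ids are arbitrary.
EDGES = {
    (0, "Entry", 1), (1, "Movie", 2),
    (2, "Title", 3), (3, "Play it again, Sam", 4),
    (2, "Director", 5), (5, "Allen", 6),
    (0, "Entry", 7), (7, "Movie", 8),
    (8, "Title", 9), (9, "Casablanca", 10),
}


def follow(edges, nodes, path):
    """Follow a fixed path of labels from a set of nodes (one join per step)."""
    for label in path:
        nodes = {t for (s, l, t) in edges if s in nodes and l == label}
    return nodes


def labels_at(edges, nodes):
    """Labels on the edges leaving the given nodes."""
    return {l for (s, l, _) in edges if s in nodes}


def titles_directed_by(edges, root, director):
    """select Title of each Entry.Movie whose Director is the given name."""
    result = set()
    for m in follow(edges, {root}, ["Entry", "Movie"]):
        directors = labels_at(edges, follow(edges, {m}, ["Director"]))
        if director in directors:
            result |= labels_at(edges, follow(edges, {m}, ["Title"]))
    return result
```

The node identifiers are used only inside the evaluation and never appear in the result, in the spirit of complication 3 above.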

Some forms of unbounded search will require recursive queries, i.e., a "graph datalog", and such languages are proposed in [26, 16] for the web and for hypertext. Theoretical treatments of queries that deal with computation on graphs or on the web appear in [6, 30]. It should also be mentioned that this model of computation is used in [5, 15] as a starting point for optimization.
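A recursive query of this kind can be illustrated by the usual reachability rules, reach(root) and reach(Y) if reach(X) and edge(X, _, Y), evaluated bottom-up to a fixpoint. The Python below is a naive sketch over a made-up triple relation, not the evaluation strategy of any of the cited systems.

```python
def reachable(edges, root):
    """Naive fixpoint: add successors of known nodes until nothing changes."""
    reach = {root}
    while True:
        new = {t for (s, _, t) in edges if s in reach} - reach
        if not new:
            return reach
        reach |= new


def strings_below(edges, root):
    """All string labels on edges accessible from the root."""
    reach = reachable(edges, root)
    return {l for (s, l, _) in edges if s in reach and isinstance(l, str)}


# Made-up data with a cycle (2 -> 0) and an edge not accessible from node 0.
EDGES = {(0, "a", 1), (1, "b", 2), (2, "c", 0), (3, "d", 0), (1, 7, 4)}
```

The iteration terminates on cyclic data because each round must add at least one new node, and it only ever moves forward along edges from the root.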

The second strategy is adopted in the basis for UnQL [11, 10]. Here the starting point is that of structural recursion, and is an extension of a principle put forward in [12] that there are natural forms of computation associated with the type. For semistructured data one starts with the natural form of recursion associated with the recursive datatype of labeled trees. However, some restrictions need to be placed for such recursive programs to be well-defined: we want them to be well-defined on graphs with cycles. These restrictions give rise to an algebra that can be viewed as having two components: a "horizontal" component that expresses computations across the edges of a given node (and from this, computations to a fixed depth from the root); and a "vertical" component that expresses computations that go to arbitrary depths in the graph. A property of this algebra is that, when restricted to input and output data that conform to a relational (nested relational) schema, it expresses exactly the relational (nested relational) algebra. Hence an SQL-like language is a natural fragment of UnQL.

The SQL or OQL like languages we have mentioned typically bring information to the surface, but they are not capable of performing complex or "deep" restructuring of the data. Simple examples of such operations include deleting/collapsing edges with a certain property, relabeling edges, or performing local interchanges. Both "graph datalog" and UnQL are capable of various forms of restructuring. For example, in UnQL one can write a query that corrects the egregious error in the "Bacall" edge label. One can also perform a number of global restructuring functions such as deleting edges with certain properties or adding new edges to "short-circuit" various paths. The relationship between the restructuring possible in UnQL and what can be done in "graph datalog" is not understood. Some simple forms of restructuring are also present in a view definition language proposed in [4].
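The simplest kind of deep restructuring, in which every edge is rewritten independently of where it sits, can be sketched in Python over the triple representation. This is a simplification under our own assumptions and not UnQL's evaluation: `restructure`, `relabel`, and `drop` are hypothetical names, and the replacement label "Bergman" is our guess at the intended correction, which the text does not state.

```python
def restructure(edges, on_label):
    """Rewrite every edge label; returning None deletes the edge."""
    out = set()
    for source, label, target in edges:
        new_label = on_label(label)
        if new_label is not None:
            out.add((source, new_label, target))
    return out


def relabel(old, new):
    """Edge transformation that renames one label and keeps the rest."""
    return lambda label: new if label == old else label


def drop(unwanted):
    """Edge transformation that deletes edges carrying one label."""
    return lambda label: None if label == unwanted else label


# A cyclic fragment: the References edge points back to node 0.
EDGES = {
    (0, "Cast", 1), (1, "Bogart", 2), (1, "Bacall", 3), (1, "References", 0),
}

# Assumed correction; the paper does not name the replacement label.
fixed = restructure(EDGES, relabel("Bacall", "Bergman"))
pruned = restructure(EDGES, drop("References"))
```

Because each edge is handled on its own, the transformation visits every edge exactly once and is well defined even though the input contains a cycle.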

4 Implementation and Optimizations

This topic is very much in its infancy and again depends on the underlying representation of the data. Moreover, the optimization problems differ depending on whether one is using a semistructured model as an interface to existing data or one is building a data structure to represent semistructured data directly [28]. In the former case the extensions of existing techniques for optimization of object-oriented or relational query languages mentioned above may be exploited together with the addition of path or text indices on labels and strings. In the second case, disk layout and clustering, together with appropriate indexing, is also important.

In [10] a large class of computations can be shown to be translatable into a basic graph transformation technique which, in turn, allows some simple optimizations. Also some of the basic optimizations of the relational algebra can be applied to the "vertical" computations. In [35] it is shown how an analysis of the query, combined with some segmentation of the graph into local "sites", can be used to decompose a query into independent, parallel sub-queries. In [5] and [15] extensions to optimization techniques for object-oriented query languages are exploited. In [19] a translation is specified for a fragment of UnQL into an underlying relational structure.

5 Adding Structure

One of the main attractions of semistructured data is that it is unconstrained. Nevertheless, it may be appropriate to impose (or to discover) some form of structure in the data. In [8] a schema is defined as a graph whose edges are labeled with predicates and the property of simulation is used to describe the relationship between data and schema. In [31, 22] the schema is also an edge-labeled graph and the stronger relationship of automata equivalence is used. In [20] schemas are used for further optimization.
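The simulation idea can be sketched in Python as follows. We assume the standard definition: a data node u is simulated by a schema node s if every edge u -a-> v can be matched by some schema edge s -p-> t whose predicate p accepts a and whose target t simulates v. The code computes the largest such relation by starting from all pairs and discarding violations; the function names and the sample schema are hypothetical and are not taken from [8].

```python
def nodes_of(graph):
    """All nodes mentioned as a source or a target."""
    result = set(graph)
    for out_edges in graph.values():
        result.update(target for _, target in out_edges)
    return result


def conforms(data, schema, data_root, schema_root):
    """True if the data root is simulated by the schema root."""
    sim = {(u, s) for u in nodes_of(data) for s in nodes_of(schema)}
    changed = True
    while changed:
        changed = False
        for (u, s) in list(sim):
            for (a, v) in data.get(u, []):
                matched = any(pred(a) and (v, t) in sim
                              for (pred, t) in schema.get(s, []))
                if not matched:
                    sim.discard((u, s))
                    changed = True
                    break
    return (data_root, schema_root) in sim


# Schema edges carry predicates on labels instead of labels.
SCHEMA = {
    "s0": [(lambda l: l == "Movie", "s1")],
    "s1": [(lambda l: l in ("Title", "Cast", "Director"), "s2")],
    "s2": [(lambda l: isinstance(l, str), "s3")],
}

DATA = {
    "d0": [("Movie", "d1")],
    "d1": [("Title", "d2"), ("Cast", "d3")],
    "d2": [("Casablanca", "d4")],
    "d3": [("Bogart", "d4")],
}
```

Note that the schema only places an upper bound on what may appear: a movie with no Director edge still conforms, which is one sense in which such schemas are loose.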

Schemas are useful for browsing and for providing partial answers to queries. They will also be needed for the passage back from semistructured to structured data, for which a richer notion of schema is necessary. This is an area in which much further work is needed.

6 Acknowledgments

I would like to thank Susan Davidson and Dan Suciu for their collaboration and for stimulating my interest in this area. I am greatly indebted to Serge Abiteboul for most constructive discussions on a number of issues.

References

[1] Serge Abiteboul. Querying semi-structured data. In Proceedings of ICDT, Jan 1997.

[2] Serge Abiteboul, Sophie Cluet, Vassilis Christophides, Tova Milo, and Jérôme Siméon. Querying documents in object databases. In Journal of Digital Libraries, volume 1:1, 1997.

[3] Serge Abiteboul, Sophie Cluet, and Tova Milo. Querying and updating the file. In Proceedings of 19th International Conference on Very Large Databases, pages 73-84, Dublin, Ireland, 1993.

[4] Serge Abiteboul, Roy Goldman, Jason McHugh, Vasilis Vassalos, and Yue Zhuge. Views for semistructured data. Technical report, Stanford University, 1997.

[5] Serge Abiteboul, Dallan Quass, Jason McHugh, Jennifer Widom, and Janet L. Weiner. The Lorel query language for semistructured data. In Journal of Digital Libraries, volume 1:1, 1997. To appear. See http://www-db.stanford.edu/pub/papers/.

[6] Serge Abiteboul and Victor Vianu. Queries and computation on the web. In Proceedings of ICDT, Jan 1997.

[7] Gustavo O. Arocena, Alberto O. Mendelzon, and George A. Mihaila. Applications of a Web query language. In Proc. 6th. Int'l. WWW Conf., April 1997. In press.

[8] P. Buneman, S. Davidson, Mary Fernandez, and D. Suciu. Adding structure to unstructured data. In Proceedings of ICDT, January 1997.

[9] P. Buneman, S.B. Davidson, K. Hart, C. Overton, and L. Wong. A data transformation system for biological data sources. In Proceedings of VLDB, Sept 1995.

[10] Peter Buneman, Susan Davidson, Gerd Hillebrand, and Dan Suciu. A query language and optimization techniques for unstructured data. In Proceedings of ACM-SIGMOD International Conference on Management of Data, pages 505-516, Montreal, Canada, June 1996.

[11] Peter Buneman, Susan Davidson, and Dan Suciu. Programming constructs for unstructured data. In Proceedings of 5th International Workshop on Database Programming Languages, Gubbio, Italy, September 1995. To appear.

[12] Peter Buneman, Shamim Naqvi, Val Tannen, and Limsoon Wong. Principles of programming with complex objects and collection types. Theoretical Computer Science, 149(1):3-48, September 1995.

[13] R. G. G. Cattell, editor. The Object Database Standard: ODMG-93. Morgan Kaufmann, San Mateo, California, 1996.

[14] Sophie Cluet and Claude Delobel. A general framework for the optimization of object oriented queries. In M. Stonebraker, editor, Proceedings ACM-SIGMOD International Conference on Management of Data, pages 383-392, San Diego, California, June 1992.

[15] Sophie Cluet and Guido Moerkotte. Query processing in the schemaless and semistructured context. Technical report, INRIA, 1997.
