Querying Semi-Structured Data

doi:10.1007/3-540-62222-5_33

Serge Abiteboul*

INRIA-Rocquencourt

Serge.Abiteboul@inria.fr

1 Introduction

The amount of data of all kinds available electronically has increased dramatically in recent years. The data resides in different forms, ranging from unstructured data in file systems to highly structured data in relational database systems. Data is accessible through a variety of interfaces including Web browsers, database query languages, application-specific interfaces, or data exchange formats. Some of this data is raw data, e.g., images or sound. Some of it has structure even if the structure is often implicit, and not as rigid or regular as that found in standard database systems. Sometimes the structure exists but has to be extracted from the data. Sometimes also it exists but we prefer to ignore it for certain purposes such as browsing. We call here semi-structured data this data that is (from a particular viewpoint) neither raw data nor strictly typed, i.e., not table-oriented as in a relational model or sorted-graph as in object databases.

As will be seen later when the notion of semi-structured data is more precisely defined, the need for semi-structured data arises naturally in the context of data integration, even when the data sources are themselves well-structured. Although data integration is an old topic, the need to integrate a wider variety of data formats (e.g., SGML or ASN.1 data) and data found on the Web has brought the topic of semi-structured data to the forefront of research.

The main purpose of the paper is to isolate the essential aspects of semi-structured data. We also survey some proposals of models and query languages for semi-structured data. In particular, we consider recent works at Stanford U. and U. Penn on semi-structured data. In both cases, the motivation is found in the integration of heterogeneous data. The "lightweight" data models they use (based on labelled graphs) are very similar.

As we shall see, the topic of semi-structured data has no precise boundary. Furthermore, a theory of semi-structured data is still missing. We will try to highlight some important issues in this context.

The paper is organized as follows. In Section 2, we discuss the particularities of semi-structured data. In Section 3, we consider the issue of the data structure and in Section 4, the issue of the query language.

* Currently visiting the Computer Science Dept., Stanford U. Work supported in part by CESDIS, NASA Goddard Space Flight Center; by the Air Force Wright Laboratory Aeronautical Systems Center under ARPA Contract F33615-93-1-1339, and by the Air Force Rome Laboratories under ARPA Contract F30602-95-C-0119.

Report Documentation Page


Report date: 1997
Title: Querying Semi-Structured Data
Performing organization: Air Force Wright Patterson AFB, OH 45433
Distribution/availability: Approved for public release, distribution unlimited
Security classification (report, abstract, this page): unclassified
Limitation of abstract: UU
Number of pages: 16
Standard Form 298 (Rev. 8-98), prescribed by ANSI Std Z39-18

2 Semi-Structured Data

In this section, we make more precise what we mean by semi-structured data, how such data arises, and emphasize its main aspects.

Roughly speaking, semi-structured data is data that is neither raw data, nor very strictly typed as in conventional database systems. Clearly, this definition is imprecise. For instance, would a BibTex file be considered structured or semi-structured? Indeed, the same piece of information may be viewed as unstructured at some early processing stage, but later become very structured after some analysis has been performed. In this section, we give examples of semi-structured data, make more precise this notion and describe important issues in this context.

2.1 Examples

We will often discuss in this paper BibTex files [Lam94] that present the advantage of being more familiar to researchers than other well-accepted formats such as SGML [ISO86] or ASN.1 [ISO87]. Data in BibTex files closely resembles relational data. Such a file is composed of records. But the structure is not as regular. Some fields may be missing. (Indeed, it is customary to even find compulsory fields missing.) Other fields have some meaningful structure, e.g., author. There are complex features such as abbreviations or cross references that are not easy to describe in some database systems.
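To make this irregularity concrete, here is a minimal Python sketch that models two BibTex-like records as dictionaries. The entries, the field names and the REQUIRED table are illustrative assumptions; they are not taken from an actual bibliography or from the BibTex documentation.

```python
# Hypothetical BibTex-like records: same kind of object, different fields.
records = [
    {"type": "article", "key": "a1",
     "author": ["A. Author", "B. Writer"],      # structured author field
     "title": "On Records", "journal": "Some Journal", "year": "1990"},
    # This entry lacks 'journal' and 'year' and stores its author as a string.
    {"type": "article", "key": "a2",
     "author": "C. Scribe", "title": "Missing Fields"},
]

# Assumed list of compulsory fields per entry type (illustrative only).
REQUIRED = {"article": ["author", "title", "journal", "year"]}


def missing_fields(record):
    """Return the compulsory fields that a record does not provide."""
    return [f for f in REQUIRED.get(record["type"], []) if f not in record]


def authors(record):
    """Normalize the author field, which may be a string or a list."""
    value = record.get("author", [])
    return [value] if isinstance(value, str) else list(value)


for r in records:
    print(r["key"], missing_fields(r), authors(r))
```

A relational table would force both records into the same columns; here the second record is simply shorter, and code that reads it has to cope with the absent fields.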

The Web also provides numerous popular examples of semi-structured data. In the Web, data consists of files in a particular format, HTML, with some structuring primitives such as tags and anchors. A typical example is a data source about restaurants in the Bay Area (from the Palo Alto Weekly newspaper), that we will call Guide. It consists of an HTML file with one entry per restaurant and provides some information on prices, addresses, styles of restaurants and reviews. Data in Guide resides in files of text with some implicit structure. One can write a parser to extract the underlying structure. However, there is a large degree of irregularity in the structure since (i) restaurants are not all treated in a uniform manner (e.g., much less information is given for fast-food joints) and (ii) information is entered as plain text by human beings that do not present the standard rigidity of your favorite data loader. Therefore, the parser will have to be tolerant and accept that it will fail to parse some portions of text, which will remain as plain text.
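A tolerant parser of this kind can be sketched in a few lines of Python. The two regular expressions and the sample entries below are assumptions made for illustration; the actual layout of the Guide source is not described here, and a real parser would need richer rules.

```python
import re

# Assumed patterns (illustrative only).
PRICE = re.compile(r"\$\d+(?:\.\d{2})?")
ADDRESS = re.compile(r"\d+ (?:[A-Z][a-z]+ )+(?:St|Ave|Rd|Blvd)\.?")


def parse_entry(text):
    """Extract the fields that can be recognized; keep the rest as plain text."""
    entry = {"text": text}            # the raw text is always preserved
    price = PRICE.search(text)
    if price:
        entry["price"] = price.group()
    address = ADDRESS.search(text)
    if address:
        entry["address"] = address.group()
    return entry                      # never raises on unrecognized text


rich = parse_entry("Chez Louis, 123 Main St., French, entrees around $25. Lovely patio.")
poor = parse_entry("Burger Shack: cheap and fast.")
```

The parser never rejects an entry: when nothing is recognized, the result carries only the original text, which matches the idea that unparsed portions remain plain text.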

Also, semi-structured data arises often when integrating several (possibly structured) sources. Data integration of independent sources has been a popular topic of research since the very early days of databases. (Surveys can be found in [SL90, LMR90, Bre90], and more recent work on the integration of heterogeneous sources in, e.g., [LRO96, QRS+95, C+95].) It has gained a new vigor with the recent popularity of the Web. Consider the integration of car retailer databases. Some retailers will represent addresses as strings and others as tuples. Retailers will probably use different conventions for representing dates, prices, invoices, etc. We should expect some information to be missing from some sources. (E.g., some retailers may not record whether non-automatic transmission is available.) More generally, a wide heterogeneity in the organization of data should be expected from the car retailer data sources and not all can be resolved by the integration software.
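The car retailer scenario can be sketched as a small mediator in Python. The retailer names, field names and the (street, city, zip) tuple layout are hypothetical; the point is only that the integrated view must tolerate both address representations and missing information.

```python
def normalize_address(value):
    """Map an address given as a string or as a (street, city, zip) tuple to a dict.

    Parts that cannot be recovered are left as None rather than guessed.
    """
    if isinstance(value, tuple):
        street, city, zip_code = value
        return {"street": street, "city": city, "zip": zip_code, "raw": None}
    # A string address is kept as-is; no structure is invented for it.
    return {"street": None, "city": None, "zip": None, "raw": value}


def integrate(sources):
    """Merge car records from several retailers into one list."""
    merged = []
    for retailer, cars in sources.items():
        for car in cars:
            address = car.get("address")
            merged.append({
                "retailer": retailer,
                "model": car.get("model"),
                "address": normalize_address(address) if address is not None else None,
                # Missing information stays missing (None); it is not defaulted.
                "manual_transmission": car.get("manual_transmission"),
            })
    return merged


# Hypothetical sources with different conventions.
sources = {
    "retailer_a": [{"model": "M1", "address": "12 Elm St, Palo Alto",
                    "manual_transmission": True}],
    "retailer_b": [{"model": "M2", "address": ("5 Oak Ave", "Menlo Park", "94025")}],
}
cars = integrate(sources)
```

Note that the mediator does not resolve the heterogeneity: the string address stays unparsed and the unknown transmission stays unknown, so queries over the merged list still have to handle both cases.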

Semi-structured data arises under a variety of forms for a wide range of applications such as genome databases, scientific databases, libraries of programs and more generally, digital libraries, on-line documentations, electronic commerce. It is thus essential to better understand the issue of querying semi-structured data.

2.2 Main aspects

The structure is irregular:

This should be clear from the previous discussion. In many of these applications, the large collections that are maintained often consist of heterogeneous elements. Some elements may be incomplete. On the other hand, other elements may record extra information, e.g., annotations. Different types may be used for the same kind of information, e.g., prices may be in dollars in portions of the database and in francs in others. The same piece of information, e.g., an address, may be structured in some places as a string and in others as a tuple.

Modelling and querying such irregular structures are essential issues.

The structure is implicit:

In many applications, although a precise structuring exists, it is given implicitly. For instance, electronic documents consist often of a text and a grammar (e.g., a DTD in SGML). The parsing of the document then allows one to isolate pieces of information and detect relationships between them. However, the interpretation of these relationships (e.g., SGML exceptions) may be beyond the capabilities of standard database models and is left to the particular applications and specific tools. We view this structure as implicit (although specified explicitly by tags) since (i) some computation is required to obtain it (e.g., parsing) and (ii) the correspondence between the parse-tree and the logical representation of the data is not always immediate.

It is also sometimes the case, in particular for the Web, that the documents come as plain text. Some ad-hoc analysis is then needed to extract the structure. For instance, in the Guide data source, the description of a restaurant is in plain text. Now, clearly, it is possible to develop some analysis tools to recognize prices, addresses, etc. and then extract the structure of the file. Extracting the structure of some text (e.g., HTML) is a challenging issue.

The structure is partial:

To completely structure the data often remains an elusive goal. Parts of the data may lack structure (e.g., bitmaps); other parts may only unveil some very sketchy structure (e.g., unstructured text). Information retrieval tools may provide a limited form of structure, e.g., by computing occurrences of particular words or groups of words and by classifying documents based on their content.

An application may also decide to leave large quantities of data outside the database. This data then remains unstructured from a database viewpoint. The loading of this external data, its analysis, and its integration to the database have to be performed efficiently. We may want to also use optimization techniques to only load selective portions of this data, in the style of [ACM93]. In general, the management and access of this external data and its interoperability with the data from the database is an important issue.

Indicative structure vs. constraining structure:

In standard database applications, a strict typing policy is enforced to protect data. We are concerned here with applications where such strict policy is often viewed as too constraining. Consider for instance the Web. A person developing a personal Web site would be reluctant to accept strict typing restrictions.

In the context of the Lore Project at Stanford, the term data guide was adopted to emphasize non-conventional approaches to typing found in most semi-structured data applications. A schema (as in conventional databases) describes a strict type that is adhered to by all data managed by the system. An update not conforming is simply rejected. On the other hand, a data guide provides some information about the current type of the data. It does not have to be the most accurate. (Accuracy may be traded in for simplicity.) All new data is accepted, possibly at the cost of modifying the data guide.
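The contrast between the two policies can be illustrated with a small Python sketch. Both classes are hypothetical and deliberately simplistic: in particular, GuidedStore only records which value types were seen per field, which is our own simplification and not the data guide implementation of the Lore Project.

```python
class SchemaStore:
    """Conventional behaviour: an update that violates the schema is rejected."""

    def __init__(self, schema):
        self.schema = schema          # field name -> required Python type
        self.rows = []

    def insert(self, row):
        conforms = set(row) == set(self.schema) and all(
            isinstance(row[f], t) for f, t in self.schema.items())
        if not conforms:
            raise ValueError("update does not conform to the schema")
        self.rows.append(row)


class GuidedStore:
    """Data-guide behaviour: every update is accepted and the guide adapts."""

    def __init__(self):
        self.guide = {}               # field name -> set of type names seen so far
        self.rows = []

    def insert(self, row):
        for field, value in row.items():
            self.guide.setdefault(field, set()).add(type(value).__name__)
        self.rows.append(row)


strict = SchemaStore({"name": str, "price": int})
strict.insert({"name": "x", "price": 3})          # conforms, accepted

loose = GuidedStore()
loose.insert({"name": "x", "price": 3})
loose.insert({"name": "y", "price": "cheap"})     # accepted; the guide grows
```

Inserting the second row into the strict store would raise ValueError, whereas the guided store accepts it and afterwards reports that price has been seen both as an integer and as a string.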

A-priori schema vs. a-posteriori data guide:

Traditional database systems are based on the hypothesis of a fixed schema that has to be defined prior to introducing any data. This is not the case for semi-structured data where the notion of schema is often posterior to the existence of data.

Continuing with the Web example, when all the members of an organization have a Web page, there is usually some pressure to unify the style of these home-pages, or at least agree on some minimal structure to facilitate the design of global entry-points. Indeed, it is a general pattern for large Web sources to start with a very loose structure and then acquire some structure when the need for it is felt.

Further on, we will briefly mention issues concerning data guides.

The schema is very large:

Often as a consequence of heterogeneity, the schema would typically be quite large. This is in contrast with relational databases where the schema was expected to be orders of magnitude smaller than the data. For instance, suppose that we are interested in Californian Impressionist Painters. We may find some data about these painters in many heterogeneous information sources on the Web, so the schema is probably quite large. But the data itself is not so large. Note that as a consequence, the user is not expected to know all the details of the schema. Thus, queries over the schema are as important as standard queries over the data. Indeed, one cannot separate anymore these two aspects of queries.
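A query over the schema can be sketched in Python by treating the set of label paths of the data as a stand-in for its schema. The painter data below is invented for illustration, and the sketch assumes tree-shaped data (nested dictionaries and lists, no cycles), which is narrower than a general labelled graph.

```python
def label_paths(node, prefix=()):
    """Yield every label path from the root of a tree of nested dicts and lists."""
    if isinstance(node, dict):
        for label, child in node.items():
            path = prefix + (label,)
            yield path
            yield from label_paths(child, path)
    elif isinstance(node, list):
        for child in node:            # list positions do not contribute labels
            yield from label_paths(child, prefix)


# Hypothetical data gathered from two heterogeneous sources.
data = {"painter": [
    {"name": "P1", "born": 1870, "works": [{"title": "T1"}]},
    {"name": "P2", "museum": {"name": "M1", "city": "C1"}},
]}

schema = sorted(set(label_paths(data)))

# A query over the schema: where can a 'name' label occur?
name_paths = [p for p in schema if p[-1] == "name"]
```

Even with two small objects the schema already has eight paths, and the user who does not know it in advance can ask where a given label occurs before querying the data itself.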

The schema is ignored:

Typically, it is useful to ignore the schema for some queries that have more of a discovery nature. Such queries may consist in simply browsing through the data or searching for some string or pattern without any precise indication on where it may occur. Such searching or browsing are typically not possible with SQL-like languages. They pose new challenges: (i) the extension of the query languages; and (ii) the integration of new optimization techniques such as full-text indexing [ACC+96] or evaluation of generalized path expressions [CCM96].
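A search that ignores the schema can be sketched in Python as a traversal that reports where a string occurs, whatever the surrounding structure. The data and names are hypothetical, and the exhaustive scan stands in for what a real system would presumably delegate to a full-text index.

```python
def find_string(node, needle, path=()):
    """Yield (path, value) for every string leaf that contains `needle`."""
    if isinstance(node, dict):
        for label, child in node.items():
            yield from find_string(child, needle, path + (label,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from find_string(child, needle, path + (index,))
    elif isinstance(node, str) and needle in node:
        yield path, node


# Hypothetical restaurant data with irregular entries.
guide = {"restaurant": [
    {"name": "Chez Louis", "review": "Lovely patio, great wine list"},
    {"name": "Burger Shack", "notes": ["cheap", "no patio"]},
]}

hits = list(find_string(guide, "patio"))
```

The caller never names a field: the result tells the user that the word occurs under review in one entry and under notes in another, which is exactly the kind of information a discovery query is after.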

The schema is rapidly evolving:

In standard database systems, the schema is viewed as almost immutable, schema updates as rare, and it is well-accepted that schema updates are very expensive. Now, in contrast, consider the case of genome data [DOB95]. The schema is expected to change quite rapidly, at the same speed as experimental techniques are improved or novel techniques introduced. As a consequence, expressive formats such as ASN.1 or ACeDB [TMD92] were preferred to a relational or object database system approach. Indeed, the fact that the schema evolves very rapidly is often given as the reason for not using database systems in applications that are managing large quantities of data. (Other reasons include the cost of database systems and the interoperability with other systems, e.g., Fortran libraries.)

In the context of semi-structured data, we have to assume that the schema is very flexible and can be updated as easily as data, which poses serious challenges to database technology.

The type of data elements is eclectic:
