Querying Semi-Structured Data

Abstract
The amount of data of all kinds available electronically has increased dramatically in recent years. The data resides in different forms, ranging from unstructured data in file systems to highly structured data in relational database systems. Data is accessible through a variety of interfaces including Web browsers, database query languages, application-specific interfaces, or data exchange formats. Some of this data is raw data, e.g., images or sound. Some of it has structure even if the structure is often implicit, and not as rigid or regular as that found in standard database systems. Sometimes the structure exists but has to be extracted from the data. Sometimes also it exists but we prefer to ignore it for certain purposes such as browsing. We call here semi-structured data this data that is (from a particular viewpoint) neither raw data nor strictly typed, i.e., not table-oriented as in a relational model or sorted-graph as in object databases. As will be seen later when the notion of semi-structured data is more precisely defined, the need for semi-structured data arises naturally in the context of data integration, even when the data sources are themselves well-structured. Although data integration is an old topic, the need to integrate a wider variety of data formats (e.g., SGML or ASN.1 data) and data found on the Web has brought the topic of semi-structured data to the forefront of research. The main purpose of the paper is to isolate the essential aspects of semi-structured data. We also survey some proposals of models and query languages for semi-structured data. In particular, we consider recent works at Stanford U. and U. Penn on semi-structured data. In both cases, the motivation is found in the integration of heterogeneous data.


Querying Semi-Structured Data
Serge Abiteboul*
INRIA-Rocquencourt
Serge.Abiteboul@inria.fr
1 Introduction
The amount of data of all kinds available electronically has increased
dramatically in recent years. The data resides in different forms, ranging from
unstructured data in file systems to highly structured data in relational
database systems. Data is accessible through a variety of interfaces including
Web browsers, database query languages, application-specific interfaces, or data
exchange formats. Some of this data is raw data, e.g., images or sound. Some of
it has structure even if the structure is often implicit, and not as rigid or
regular as that found in standard database systems. Sometimes the structure
exists but has to be extracted from the data. Sometimes also it exists but we
prefer to ignore it for certain purposes such as browsing. We call here
semi-structured data this data that is (from a particular viewpoint) neither raw
data nor strictly typed, i.e., not table-oriented as in a relational model or
sorted-graph as in object databases.
As will be seen later when the notion of semi-structured data is more precisely
defined, the need for semi-structured data arises naturally in the context of
data integration, even when the data sources are themselves well-structured.
Although data integration is an old topic, the need to integrate a wider variety
of data formats (e.g., SGML or ASN.1 data) and data found on the Web has brought
the topic of semi-structured data to the forefront of research.
The main purpose of the paper is to isolate the essential aspects of
semi-structured data. We also survey some proposals of models and query
languages for semi-structured data. In particular, we consider recent works at
Stanford U. and U. Penn on semi-structured data. In both cases, the motivation
is found in the integration of heterogeneous data. The "lightweight" data models
they use (based on labelled graphs) are very similar.
As we shall see, the topic of semi-structured data has no precise boundary.
Furthermore, a theory of semi-structured data is still missing. We will try to
highlight some important issues in this context.
The paper is organized as follows. In Section 2, we discuss the particularities
of semi-structured data. In Section 3, we consider the issue of the data
structure and in Section 4, the issue of the query language.
* Currently visiting the Computer Science Dept., Stanford U. Work supported in
part by CESDIS, NASA Goddard Space Flight Center; by the Air Force Wright
Laboratory Aeronautical Systems Center under ARPA Contract F33615-93-1-1339, and
by the Air Force Rome Laboratories under ARPA Contract F30602-95-C-0119.

Report Documentation Page (Standard Form 298, Rev. 8-98; OMB No. 0704-0188)
Report date: 1997
Report type: N/A
Title and subtitle: Querying Semi-Structured Data
Performing organization: Air Force Wright Patterson AFB, OH 45433
Distribution/availability statement: Approved for public release, distribution unlimited
Security classification (report, abstract, this page): unclassified
Limitation of abstract: UU
Number of pages: 16
2 Semi-Structured Data
In this section, we make more precise what we mean by semi-structured data,
how such data arises, and emphasize its main aspects.
Roughly speaking, semi-structured data is data that is neither raw data, nor
very strictly typed as in conventional database systems. Clearly, this
definition is imprecise. For instance, would a BibTex file be considered
structured or semi-structured? Indeed, the same piece of information may be
viewed as unstructured at some early processing stage, but later become very
structured after some analysis has been performed. In this section, we give
examples of semi-structured data, make more precise this notion and describe
important issues in this context.
2.1 Examples
We will often discuss in this paper BibTex files [Lam94] that present the
advantage of being more familiar to researchers than other well-accepted formats
such as SGML [ISO86] or ASN.1 [ISO87]. Data in BibTex files closely resembles
relational data. Such a file is composed of records. But the structure is not as
regular. Some fields may be missing. (Indeed, it is customary to even find
compulsory fields missing.) Other fields have some meaningful structure, e.g.,
author. There are complex features such as abbreviations or cross references
that are not easy to describe in some database systems.
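The irregularity described above can be illustrated with a minimal Python sketch. The two records, their field values, and the helpers `get_field` and `authors` are hypothetical and are not taken from any real BibTex file; the sketch only assumes that a record can be viewed as a set of named fields, some of which may be absent and some of which (such as author) have inner structure.

```python
# Hypothetical BibTex-like records; all entries are invented for illustration.
records = [
    {"type": "article", "key": "a1", "author": "A. Smith and B. Jones",
     "title": "On Records", "year": "1995"},
    # The "year" field is missing from this record.
    {"type": "article", "key": "a2", "author": "C. Lee", "title": "On Fields"},
]

def get_field(record, name, default=None):
    """Return a field if present; irregular records may lack it."""
    return record.get(name, default)

def authors(record):
    """Split the structured 'author' field into individual names."""
    raw = get_field(record, "author", "")
    return [name.strip() for name in raw.split(" and ") if name.strip()]

years = [get_field(r, "year") for r in records]
author_lists = [authors(r) for r in records]
```

A query over such records cannot assume that every field exists, which is why `get_field` returns a default instead of failing.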
The Web also provides numerous popular examples of semi-structured data.
In the Web, data consists of files in a particular format, HTML, with some
structuring primitives such as tags and anchors. A typical example is a data
source about restaurants in the Bay Area (from the Palo Alto Weekly newspaper),
that we will call Guide. It consists of an HTML file with one entry per
restaurant and provides some information on prices, addresses, styles of
restaurants and reviews. Data in Guide resides in files of text with some
implicit structure. One can write a parser to extract the underlying structure.
However, there is a large degree of irregularity in the structure since (i)
restaurants are not all treated in a uniform manner (e.g., much less information
is given for fast-food joints) and (ii) information is entered as plain text by
human beings that do not present the standard rigidity of your favorite data
loader. Therefore, the parser will have to be tolerant and accept to fail
parsing portions of text that will remain as plain text.
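A tolerant parser of this kind might be sketched as follows. This is only an illustration under strong assumptions: the "Field: value" line format, the field names, and the sample entry are invented here, whereas the actual Guide source is an HTML file whose layout is not given in the text.

```python
import re

# Assumed, invented line format "Field: value"; the real Guide file is HTML.
FIELD = re.compile(r"^(name|address|price|style)\s*:\s*(.+)$", re.IGNORECASE)

def parse_entry(entry):
    """Extract recognized fields; keep everything else as plain text."""
    parsed = {"text": []}
    for line in entry.splitlines():
        line = line.strip()
        if not line:
            continue
        match = FIELD.match(line)
        if match:
            parsed[match.group(1).lower()] = match.group(2).strip()
        else:
            # Tolerant: an unparsable line is not an error, it stays as text.
            parsed["text"].append(line)
    return parsed

entry = """Name: Chez Exemple
Price: under $20
Great patio, cash only (call ahead!)
"""
result = parse_entry(entry)
```

The design choice is that a parsing failure is never fatal: whatever is not recognized is kept under `text` and can still be reached by string search.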
Also, semi-structured data arises often when integrating several (possibly
structured) sources. Data integration of independent sources has been a popular
topic of research since the very early days of databases. (Surveys can be found
in [SL90, LMR90, Bre90], and more recent work on the integration of
heterogeneous sources in, e.g., [LRO96, QRS+95, C+95].) It has gained a new
vigor with the recent popularity of the Web. Consider the integration of car
retailer databases. Some retailers will represent addresses as strings and
others as tuples. Retailers will probably use different conventions for
representing dates, prices, invoices, etc. We should expect some information to
be missing from some sources. (E.g., some retailers may not record whether
non-automatic transmission is available.) More generally, a wide heterogeneity
in the organization of data should be expected from the car retailer data
sources and not all can be resolved by the integration software.
Semi-structured data arises under a variety of forms for a wide range of
applications such as genome databases, scientific databases, libraries of
programs and more generally, digital libraries, on-line documentations,
electronic commerce. It is thus essential to better understand the issue of
querying semi-structured data.
2.2 Main aspects
The structure is irregular:
This should be clear from the previous discussion. In many of these
applications, the large collections that are maintained often consist of
heterogeneous elements. Some elements may be incomplete. On the other hand,
other elements may record extra information, e.g., annotations. Different types
may be used for the same kind of information, e.g., prices may be in dollars in
portions of the database and in francs in others. The same piece of information,
e.g., an address, may be structured in some places as a string and in others as
a tuple.
Modelling and querying such irregular structures are essential issues.
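To make the string-versus-tuple case concrete, here is a minimal Python sketch. The `Address` type, the helper functions, and the three sample elements are assumptions introduced for illustration only; they are one possible way to let a program handle both representations of the same kind of information, along with elements that lack it altogether.

```python
from typing import NamedTuple, Union

class Address(NamedTuple):
    street: str
    city: str

# The same kind of information arrives either as a string or as a tuple.
AddressValue = Union[str, Address]

def city_of(address: AddressValue):
    """Return the city when the structure exposes it, else None."""
    if isinstance(address, Address):
        return address.city
    # A plain string carries no explicit structure; do not guess.
    return None

def address_text(address: AddressValue) -> str:
    """Flatten either representation to text, e.g. for string search."""
    if isinstance(address, Address):
        return f"{address.street}, {address.city}"
    return address

# Invented sample elements.
elements = [
    {"name": "r1", "address": "12 Main St, Palo Alto"},
    {"name": "r2", "address": Address("3 Oak Ave", "Menlo Park")},
    {"name": "r3"},  # incomplete element: no address at all
]
cities = [city_of(e["address"]) if "address" in e else None for e in elements]
```

Note that a structured query (`city_of`) only succeeds where the structure is explicit, while a textual view (`address_text`) is available for both representations.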
The structure is implicit:
In many applications, although a precise structuring exists, it is given
implicitly. For instance, electronic documents often consist of a text and a
grammar (e.g., a DTD in SGML). The parsing of the document then allows one to
isolate pieces of information and detect relationships between them. However,
the interpretation of these relationships (e.g., SGML exceptions) may be beyond
the capabilities of standard database models and is left to the particular
applications and specific tools. We view this structure as implicit (although
specified explicitly by tags) since (i) some computation is required to obtain
it (e.g., parsing) and (ii) the correspondence between the parse-tree and the
logical representation of the data is not always immediate.
It is also sometimes the case, in particular for the Web, that the documents
come as plain text. Some ad-hoc analysis is then needed to extract the
structure. For instance, in the Guide data source, the description of a
restaurant is in plain text. Now, clearly, it is possible to develop some
analysis tools to recognize prices, addresses, etc. and then extract the
structure of the file. Extracting the structure of some text (e.g., HTML) is a
challenging issue.
The structure is partial:
To completely structure the data often remains an elusive goal. Parts of the
data may lack structure (e.g., bitmaps); other parts may only unveil some very
sketchy structure (e.g., unstructured text). Information retrieval tools may
provide a limited form of structure, e.g., by computing occurrences of
particular words or groups of words and by classifying documents based on their
content.
An application may also decide to leave large quantities of data outside the
database. This data then remains unstructured from a database viewpoint. The
loading of this external data, its analysis, and its integration into the
database have to be performed efficiently. We may also want to use optimization
techniques to only load selective portions of this data, in the style of
[ACM93]. In general, the management and access of this external data and its
interoperability with the data from the database is an important issue.
Indicative structure vs. constraining structure:
In standard database applications, a strict typing policy is enforced to protect
data. We are concerned here with applications where such strict policy is often
viewed as too constraining. Consider for instance the Web. A person developing
a personal Web site would be reluctant to accept strict typing restrictions.
In the context of the Lore Project at Stanford, the term data guide was adopted
to emphasize non-conventional approaches to typing found in most semi-structured
data applications. A schema (as in conventional databases) describes a strict
type that is adhered to by all data managed by the system. An update not
conforming is simply rejected. On the other hand, a data guide provides some
information about the current type of the data. It does not have to be the most
accurate. (Accuracy may be traded in for simplicity.) All new data is accepted,
possibly at the cost of modifying the data guide.
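The contrast between the two policies can be sketched in a few lines of Python. This is a deliberately simplified stand-in: the classes `SchemaStore` and `DataGuideStore`, and the reduction of a "type" to a set of field names, are assumptions made for illustration and do not describe the actual data guide structure used in the Lore Project.

```python
class SchemaStore:
    """Conventional behaviour: a non-conforming update is rejected."""
    def __init__(self, schema):
        self.schema = set(schema)
        self.data = []

    def insert(self, record):
        if set(record) != self.schema:
            raise ValueError("update does not conform to the schema")
        self.data.append(record)

class DataGuideStore:
    """Indicative behaviour: all data is accepted and the guide is adjusted."""
    def __init__(self):
        self.guide = set()  # field names seen so far (a very coarse summary)
        self.data = []

    def insert(self, record):
        self.guide |= set(record)  # the guide is modified, the data is kept
        self.data.append(record)

strict = SchemaStore({"name", "price"})
guided = DataGuideStore()
odd = {"name": "r1", "review": "good"}  # invented record with an extra field

try:
    strict.insert(odd)
    rejected = False
except ValueError:
    rejected = True

guided.insert({"name": "r0", "price": 10})
guided.insert(odd)
```

In this sketch the guide is only a summary computed from the data, so it can be coarser or finer without affecting which data is stored.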
A-priori schema vs. a-posteriori data guide:
Traditional database systems are based on the hypothesis of a fixed schema that
has to be defined prior to introducing any data. This is not the case for
semi-structured data where the notion of schema is often posterior to the
existence of data.
Continuing with the Web example, when all the members of an organization
have a Web page, there is usually some pressure to unify the style of these
home-pages, or at least agree on some minimal structure to facilitate the design
of global entry-points. Indeed, it is a general pattern for large Web sources to
start with a very loose structure and then acquire some structure when the need
for it is felt.
Further on, we will briefly mention issues concerning data guides.
The schema is very large:
Often as a consequence of heterogeneity, the schema would typically be quite
large. This is in contrast with relational databases where the schema was
expected to be orders of magnitude smaller than the data. For instance, suppose
that we are interested in Californian Impressionist Painters. We may find some
data about these painters in many heterogeneous information sources on the
Web, so the schema is probably quite large. But the data itself is not so large.
Note that as a consequence, the user is not expected to know all the details of
the schema. Thus, queries over the schema are as important as standard queries
over the data. Indeed, one can no longer separate these two aspects of queries.
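As a rough illustration of a query over the schema, the following Python sketch lists the label paths that actually occur in a small nested structure. The function `label_paths`, the use of nested dictionaries and lists as a stand-in for a labelled graph, and the sample data are all assumptions made here for illustration.

```python
def label_paths(node, prefix=()):
    """Yield every label path occurring in a nested dict/list structure."""
    if isinstance(node, dict):
        for label, child in node.items():
            path = prefix + (label,)
            yield path
            yield from label_paths(child, path)
    elif isinstance(node, list):
        for child in node:
            yield from label_paths(child, prefix)

# Invented sample data: two painter elements with different structure.
data = {"painter": [
    {"name": "p1", "born": 1870},
    {"name": "p2", "works": [{"title": "t1"}]},
]}
schema_paths = sorted(set(label_paths(data)))
```

Here the "schema" is discovered from the data itself, so a user who does not know it in advance can ask which labels exist before asking for values.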
The schema is ignored:
Typically, it is useful to ignore the schema for some queries that have more of
a discovery nature. Such queries may consist in simply browsing through the data
or searching for some string or pattern without any precise indication on where
it may occur. Such searching or browsing are typically not possible with
SQL-like languages. They pose new challenges: (i) the extension of the query
languages; and (ii) the integration of new optimization techniques such as
full-text indexing [ACC+96] or evaluation of generalized path expressions
[CCM96].
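A search that ignores the schema can be sketched as a plain traversal. The function `find_string` and the sample data below are hypothetical; the sketch is a naive full scan over nested dictionaries and lists, and it is not the full-text indexing or the path expression evaluation of the cited works.

```python
def find_string(node, needle, path=()):
    """Yield the path of every atomic value whose text contains `needle`."""
    if isinstance(node, dict):
        for label, child in node.items():
            yield from find_string(child, needle, path + (label,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from find_string(child, needle, path + (index,))
    elif needle in str(node):
        yield path

# Invented sample data in the spirit of the Guide source.
guide = {"restaurant": [
    {"name": "r1", "address": "12 Main St, Palo Alto"},
    {"name": "r2", "review": "Worth the trip from Palo Alto."},
]}
hits = list(find_string(guide, "Palo Alto"))
```

The query names no attribute at all; the answer reports where the string was found, which is exactly the information the user did not have beforehand.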
The schema is rapidly evolving:
In standard database systems, the schema is viewed as almost immutable, schema
updates as rare, and it is well-accepted that schema updates are very expensive.
Now, in contrast, consider the case of genome data [DOB95]. The schema is
expected to change quite rapidly, at the same speed as experimental techniques
are improved or novel techniques introduced. As a consequence, expressive
formats such as ASN.1 or ACeDB [TMD92] were preferred to a relational or object
database system approach. Indeed, the fact that the schema evolves very rapidly
is often given as the reason for not using database systems in applications that
are managing large quantities of data. (Other reasons include the cost of
database systems and the interoperability with other systems, e.g., Fortran
libraries.)
In the context of semi-structured data, we have to assume that the schema is
very flexible and can be updated as easily as data, which poses serious
challenges to database technology.
The type of data elements is eclectic:

References
Foundations of databases (book)
Principles of database and knowledge-base systems (book)
Federated database systems for managing distributed, heterogeneous, and autonomous databases (journal article)
Querying Heterogeneous Information Sources Using Source Descriptions (proceedings article)