Regular expression types for XML

Abstract
We propose regular expression types as a foundation for XML processing languages. Regular expression types are a natural generalization of Document Type Definitions (DTDs), describing structures in XML documents using regular expression operators (i.e., *, ?, |, etc.) and supporting a simple but powerful notion of subtyping. The decision problem for the subtype relation is EXPTIME-hard, but it can be checked quite efficiently in many cases of practical interest. The subtyping algorithm developed here is a variant of Aiken and Murphy's set-inclusion constraint solver, to which are added several optimizations and two new properties: (1) our algorithm is provably complete, and (2) it allows a useful "subtagging" relation between nodes with different labels in XML trees.

*<7D3@A7BG=4&3<<AG:D/<7/*<7D3@A7BG=4&3<<AG:D/<7/
(16=:/@:G=;;=<A(16=:/@:G=;;=<A
3>/@B;3<B/:&/>3@A( 3>/@B;3<B=4=;>CB3@<4=@;/B7=<(173<13
/<C/@G
'35C:/@F>@3AA7=<)G>3A4=@-#"'35C:/@F>@3AA7=<)G>3A4=@-#"
/@C==A=G/
!G=B=*<7D3@A7BG
3@=;3+=C7::=<
$'(
3<8/;7<&73@13
*<7D3@A7BG=4&3<<AG:D/<7/
01>73@1317AC>3<<32C
=::=EB67A/<2/227B7=</:E=@9A/B6BB>A@3>=A7B=@GC>3<<32C17A.>/>3@A
'31=;;3<2327B/B7=<'31=;;3<2327B/B7=<
/@C==A=G/ 3@=;3+=C7::=</<23<8/;7<&73@13'35C:/@F>@3AA7=<)G>3A4=@-#" /<C/@G

=>G@756B#)67A7AB63/CB6=@AD3@A7=<=4B63E=@9B7A>=AB3263@30G>3@;7AA7=<=4#4=@G=C@
>3@A=</:CA3$=B4=@@327AB@70CB7=<)6323J<7B7D3D3@A7=<E/A>C0:7A6327<
#)@/<A/1B7=<A=<&@=5@/;;7<5
"/<5C/53A/<2(GAB3;A
+=:C;3AAC3 /<C/@G>/53A
&C0:7A63@*'"6BB>2=7/1;=@5
)67A>/>3@7A>=AB32/B(16=:/@:G=;;=<A6BB>A@3>=A7B=@GC>3<<32C17A.>/>3@A
=@;=@37<4=@;/B7=<>:3/A31=<B/1B@3>=A7B=@G>=0=FC>3<<32C

'35C:/@F>@3AA7=<)G>3A4=@-#"'35C:/@F>@3AA7=<)G>3A4=@-#"
0AB@/1B0AB@/1B
,3>@=>=A3
@35C:/@3F>@3AA7=<BG>3A
/A/4=C<2/B7=<4=@AB/B71/::GBG>32-#">@=13AA7<5:/<5C/53A
'35C:/@3F>@3AA7=<BG>3A:793;=ABA163;/:/<5C/53A4=@-#"7<B@=2C13@35C:/@3F>@3AA7=<<=B/B7=<A
AC16/A@3>3B7B7=</:B3@</B7=<I3B1B=23A1@703-#"2=1C;3<BA)63<=D3:BG=4=C@BG>3AGAB3;7A/
A3;/<B71>@3A3<B/B7=<=4AC0BG>7<5/A7<1:CA7=<03BE33<B63A3BA=42=1C;3<BA23<=B320GBE=BG>3A
,357D3A3D3@/:3F/;>:3A7::CAB@/B7<5B63CA34C:<3AA=4B67A4=@;=4AC0BG>7<57<-#">@=13AA7<5
)632317A7=<>@=0:3;4=@B63AC0BG>3@3:/B7=<@32C13AB=B637<1:CA7=<>@=0:3;03BE33<B@33/CB=;/B/
E67167A9<=E<B=03-&)#1=;>:3B3)=/D=72B67A67561=;>:3F7BG7<BG>71/:1/A3AE323D3:=>/
>@/1B71/:/:5=@7B6;B6/BC<:7931:/AA71/:/:5=@7B6;A0/A32=<23B3@;7<7H/B7=<=4B@33/CB=;/B/16319A
B637<1:CA7=<@3:/B7=<0G/B=>2=E<B@/D3@A/:=4B63=@757</:BG>33F>@3AA7=<A)63;/7</2D/<B/53=4B67A
/:5=@7B6;7AB6/B7B1/<3F>:=7BB63>@=>3@BGB6/BBG>33F>@3AA7=<A037<51=;>/@32=4B3<A6/@3>=@B7=<A=4
B637@@3>@3A3<B/B7=<A%C@/:5=@7B6;7A/D/@7/<B=4793</<2#C@>6GAA3B7<1:CA7=<1=<AB@/7<BA=:D3@B=
E6716/@3/2232A3D3@/:<3E7;>:3;3<B/B7=<B316<7?C3A1=@@31B<3AA>@==4A/<2>@3:7;7</@G
>3@4=@;/<13;3/AC@3;3<BA=<A=;3A;/::>@=5@/;A7<B632=;/7<=4BG>32-#">@=13AA7<5
!3GE=@2A!3GE=@2A
>@=5@/;;7<5:/<5C/53A:/<5C/531=<AB@C1BA/<243/BC@3A2/B/BG>3A/<2AB@C1BC@3A:/<5C/53AB63=@G
BG>3AGAB3;A-#"AC0BG>7<5
=;;3<BA=;;3<BA
=>G@756B#)67A7AB63/CB6=@AD3@A7=<=4B63E=@9B7A>=AB3263@30G>3@;7AA7=<=4#4=@
G=C@>3@A=</:CA3$=B4=@@327AB@70CB7=<)6323J<7B7D3D3@A7=<E/A>C0:7A6327<
#)@/<A/1B7=<A=<
&@=5@/;;7<5"/<5C/53A/<2(GAB3;A
+=:C;3AAC3 /<C/@G>/53A
&C0:7A63@*'"6BB>2=7/1;=@5
)67A8=C@</:/@B71:37A/D/7:/0:3/B(16=:/@:G=;;=<A6BB>A@3>=A7B=@GC>3<<32C17A.>/>3@A

Regular Expression Types for XML
HARUO HOSOYA
Kyoto University
hahosoya@kurims.kyoto-u.ac.jp
JÉRÔME VOUILLON
CNRS and Denis Diderot University
Jerome.Vouillon@pps.jussieu.fr
and
BENJAMIN C. PIERCE
University of Pennsylvania
bcpierce@cis.upenn.edu
We propose regular expression types as a foundation for statically typed XML processing
languages. Regular expression types, like most schema languages for XML, introduce regular
expression notations such as repetition (*), alternation (|), etc., to describe XML documents. The
novelty of our type system is a semantic presentation of subtyping, as inclusion between the sets
of documents denoted by two types. We give several examples illustrating the usefulness of this
form of subtyping in XML processing.
The decision problem for the subtype relation reduces to the inclusion problem between tree
automata, which is known to be EXPTIME-complete. To avoid this high complexity in typical cases,
we develop a practical algorithm that, unlike classical algorithms based on determinization of tree
automata, checks the inclusion relation by a top-down traversal of the original type expressions.
The main advantage of this algorithm is that it can exploit the property that type expressions
being compared often share portions of their representations. Our algorithm is a variant of Aiken
and Murphy’s set-inclusion constraint solver, to which are added several new implementation
techniques, correctness proofs, and preliminary performance measurements on some small programs
in the domain of typed XML processing.
Categories and Subject Descriptors: D.3.3 [Programming Languages]: Language Constructs
and Features—data types and structures
General Terms: Languages, Theory
Additional Key Words and Phrases: Type systems, XML, subtyping
Authors’ address: H. Hosoya, Research Institute for Mathematical Sciences, Kyoto University,
Oiwake-cho, Kitashirakawa, Sakyo-ku, Kyoto 606–8502, Japan. J. Vouillon, PPS, Université
Denis Diderot, Case 7014, 2 Place Jussieu, F-75251 PARIS Cedex 05, France. B. C. Pierce,
Department of Computer and Information Science, University of Pennsylvania, 200 South 33rd
St., Philadelphia, PA 19104, USA.
Permission to make digital/hard copy of all or part of this material without fee for personal
or classroom use provided that the copies are not made or distributed for profit or commercial
advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and
notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish,
to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.
© 1999 ACM 0164-0925/99/0100-0111 $00.75
ACM Transactions on Programming Languages and Systems, Vol. TBD, No. TDB, Month Year, Pages 1–??.

1. INTRODUCTION
XML [Bray et al. 2000] is an emerging standard format for tree-structured data.
One of the reasons for its popularity is the existence of a number of schema
languages, including DTDs [Bray et al. 2000], XML-Schema [Fallside 2001], DSD
[Klarlund et al. 2000], and RELAX [Murata 2001], that can be used to define “types”
(or “schemas”) describing structural constraints on data and thereby improve the
safety of data processing and exchange.
However, the use of types in mainstream XML processing technology is often
limited to checking only data, not programs. Typically, an XML processing program
first reads an XML document and checks that it conforms to a given type using a
validating parser. The program then uses either a generic tree manipulation library
such as DOM [DOM 2001] or a dedicated XML language such as XSLT [Clark 1999]
or XML-QL [Deutsch et al. 1998]. Since these tools make no systematic connection
between the program and the types of the documents it manipulates, they provide
no compile-time guarantee that the documents produced by the program will always
conform to an intended type.
In this article, we propose regular expression types as a foundation for statically
typed processing of XML documents. Regular expression types capture (and
generalize) the regular expression notations (*, ?, |, etc.) commonly found in schema
languages for XML, and support a natural semantic notion of subtyping.
We have used regular expression types in the design of a domain-specific language
called XDuce (“transduce”) for XML processing [Hosoya and Pierce 2000; 2001]. In
the present article, however, our focus is on the structure of the types themselves,
their role in describing transformations on XML documents, and the algorithmic
problems they pose. Interested readers are invited to visit the XDuce home page
http://xduce.sourceforge.net
for more information on the language as a whole.
As a simple example of regular expression types, consider the definitions
type Addrbook = addrbook[Person*]
type Person = person[Name,Email*,Tel?]
type Name = name[String]
type Email = email[String]
type Tel = tel[String]
corresponding to the following set of DTD declarations:
<!ELEMENT addrbook (person*)>
<!ELEMENT person (name,email*,tel?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT tel (#PCDATA)>
Type constructors of the form label[...] classify tree nodes with the tag label
(i.e., XML structures of the form <label>...</label>). Thus, the inhabitants
of the types Name, Email, and Tel are all strings with an appropriate identifying
label. Types may also involve the regular expression operators * (repetition) and ?
(optional occurrence), as well as | (alternation). Thus, the type Addrbook describes
a label addrbook whose content is zero or more repetitions of subtrees of type
Person. Likewise, the type Person describes a label person whose content is
a Name subtree, zero or more Email subtrees, and an optional Tel subtree. An
instance of the type Addrbook is the following XML document:
<addrbook>
<person> <name> Haruo Hosoya </name>
<email> hahosoya@upenn </email>
<email> haruo@u-tokyo </email> </person>
<person> <name> Jerome Vouillon </name>
<email> vouillon@upenn </email>
<tel> 123-456-789 </tel> </person>
</addrbook>
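As a rough illustration of what it means for a document to be an instance of a type, the following Python sketch checks the address book document against the five definitions above. It is not XDuce code and not the algorithm developed in this article. The `SCHEMA` table, the `conforms` function, and the encoding of content models as Python regular expressions over space-terminated child labels are all assumptions made for the example. The encoding works only because each label here has a single type, as in a DTD.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical encoding of the five type definitions: each label maps to a
# content model written as a Python regex over child labels, where every
# label is followed by one space. None stands for String content.
SCHEMA = {
    "addrbook": r"(person )*",             # addrbook[Person*]
    "person": r"name (email )*(tel )?",    # person[Name,Email*,Tel?]
    "name": None,                          # name[String]
    "email": None,                         # email[String]
    "tel": None,                           # tel[String]
}

def conforms(node):
    """Return True if node and all of its descendants match SCHEMA."""
    if node.tag not in SCHEMA:
        return False
    model = SCHEMA[node.tag]
    children = list(node)
    if model is None:
        # String content: text only, no child elements.
        return not children
    # Element content: reject non-whitespace text between child elements.
    stray = [node.text] + [child.tail for child in children]
    if any(text and text.strip() for text in stray):
        return False
    labels = "".join(child.tag + " " for child in children)
    return (re.fullmatch(model, labels) is not None
            and all(conforms(child) for child in children))

DOC = """
<addrbook>
  <person> <name> Haruo Hosoya </name>
           <email> hahosoya@upenn </email>
           <email> haruo@u-tokyo </email> </person>
  <person> <name> Jerome Vouillon </name>
           <email> vouillon@upenn </email>
           <tel> 123-456-789 </tel> </person>
</addrbook>
"""

print(conforms(ET.fromstring(DOC)))  # True
```

Swapping the order of the children of a person element (for instance, putting tel before name) makes `conforms` return False, because the child label sequence no longer matches the content model for person.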
We define subtyping between regular expression types in a semantic fashion. A
type, in general, denotes a set of documents; subtyping is simply inclusion
between the sets denoted by two types. For instance, consider again the Person type
definition from above
type Person = person[Name,Email*,Tel?]
and the following variant:
type Person2 = person[(Name | Email | Tel)*]
Elements of the Person type can have one name, zero or more emails, and zero or
one tels in this order, while the Person2 type allows any number of such nodes
in any order. Therefore Person2 describes strictly more documents, which implies
that Person is a subtype of Person2. Such subtype inclusions can be quite useful in
programming. For example, suppose that we originally have a value of type Person.
The above inclusion allows us to process this value using code that does not care
about the ordering among the name, email, and tel nodes. (Such a situation might
arise, for example, if we want to display the child nodes in a linear format, where
we would naturally write a single loop over the sequence of child nodes with a case
branch for each tag.)
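The inclusion between Person and Person2 can be checked mechanically. The Python sketch below does so for the content models only, treating them as regular expressions over the labels name, email, and tel. It assumes the Name, Email, and Tel types are the same on both sides, so that only the order and number of children matter. It uses Brzozowski derivatives, a standard textbook technique swapped in for brevity. It is not the top-down tree algorithm of this article, and all names in it (`seq`, `alt`, `star`, `deriv`, `included`) are hypothetical.

```python
EMPTY = ("empty",)   # matches no sequence at all
EPS = ("eps",)       # matches only the empty sequence

def sym(label):
    return ("sym", label)

def seq(r, s):
    if r == EMPTY or s == EMPTY:
        return EMPTY
    if r == EPS:
        return s
    if s == EPS:
        return r
    return ("seq", r, s)

def alt(*rs):
    # Alternatives are kept as a flattened set so that equal expressions
    # compare equal, which keeps the search in `included` finite.
    items = set()
    for r in rs:
        if r[0] == "alt":
            items |= r[1]
        elif r != EMPTY:
            items.add(r)
    if not items:
        return EMPTY
    if len(items) == 1:
        return next(iter(items))
    return ("alt", frozenset(items))

def star(r):
    if r in (EMPTY, EPS):
        return EPS
    return r if r[0] == "star" else ("star", r)

def opt(r):
    return alt(EPS, r)

def nullable(r):
    """True if r matches the empty sequence."""
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("empty", "sym"):
        return False
    if tag == "seq":
        return nullable(r[1]) and nullable(r[2])
    return any(nullable(x) for x in r[1])  # alt

def deriv(r, a):
    """Expression matching what may follow after r has consumed label a."""
    tag = r[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "sym":
        return EPS if r[1] == a else EMPTY
    if tag == "seq":
        d = seq(deriv(r[1], a), r[2])
        return alt(d, deriv(r[2], a)) if nullable(r[1]) else d
    if tag == "alt":
        return alt(*(deriv(x, a) for x in r[1]))
    return seq(deriv(r[1], a), r)  # star

def included(r, s, labels):
    """True if every label sequence matched by r is also matched by s."""
    seen, todo = set(), [(r, s)]
    while todo:
        pair = todo.pop()
        if pair in seen:
            continue
        seen.add(pair)
        x, y = pair
        if x == EMPTY:
            continue  # nothing left to match on the left-hand side
        if nullable(x) and not nullable(y):
            return False  # a sequence accepted by r is rejected by s
        todo.extend((deriv(x, a), deriv(y, a)) for a in labels)
    return True

LABELS = ("name", "email", "tel")
PERSON = seq(sym("name"), seq(star(sym("email")), opt(sym("tel"))))
PERSON2 = star(alt(sym("name"), sym("email"), sym("tel")))

print(included(PERSON, PERSON2, LABELS))  # True
print(included(PERSON2, PERSON, LABELS))  # False
```

The second call fails immediately: Person2 accepts the empty sequence of children, while Person requires a name.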
Note that if we replaced Person with a more conventional type
person(name(string) × email(string) list × tel(string) option)
(using sum, product, and ML-like list and option types, plus unary covariant
constructors person, name, email, and tel) and Person2, analogously, with
person((name(string) + email(string) + tel(string)) list)
the conventional subtyping of sum and product types would not yield the inclusion
above. In general, the subtype relation obtained from our definition is quite a bit
more permissive than conventional subtyping. We give some further examples in
Section 2.
Regular expression types exactly correspond to tree automata [Comon et al.
1999], a finite-state machine model for accepting trees. It is easy to construct,
from a given type, a tree automaton that accepts just the set of trees denoted by
the type (Appendix A). Therefore the subtyping problem can be reduced to the

References
Introduction to Automata Theory, Languages, and Computation. Book.
Extensible Markup Language (XML). Journal article.
Extensible markup language. Journal article.
Tree Automata Techniques and Applications. Book.
DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. Proceedings article.