What future work do the authors describe in "Regular expression types for XML"?

In the future, the authors hope to incorporate other standard features from functional programming, such as higher-order functions and parametric polymorphism. For function types, their current approach (define subtyping by inclusion of the semantics of types and reduce it to the decidability of tree automata inclusion) does not easily extend, simply because functions are not trees. For polymorphism, too, their current scheme would need to be substantially extended, since ordinary tree automata have no concept corresponding to “type variables.” A promising direction might be to incorporate ideas from tree set automata [Gilleron et al. 1999], though the authors have not gone far.

What is the way to decide subtype relations?

In particular, the authors can exploit reflexivity (T <: T) in order to decide subtype relations by looking at only a part of the whole input type expressions.

What are the main features of functional programming?

In the future, the authors hope to incorporate other standard features from functional programming, such as higher-order functions and parametric polymorphism.

What is the type system used by Buneman, Fernandez, and Suciu?

In the type system studied by Buneman, Davidson, Fernandez, and Suciu [Buneman et al. 1997], types are graph structures and their conformance and subtype relations are defined in terms of graph simulation (which is weaker than the inclusion relation).

What is the way to improve equality tests?

To further improve the speed of equality tests, the authors use hash consing, which associates each type expression with its integer hash value, so that equality can be quickly checked in most cases by comparing their hash values.
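The idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `TypeExpr`, `mk`, and `_table` are hypothetical, and the sketch uses the common variant of hash consing in which every structurally distinct expression receives a unique integer id, so that comparing ids (or object identity) decides equality exactly rather than only in most cases.

```python
from typing import Dict, Tuple


class TypeExpr:
    """An immutable type-expression node; create instances only via `mk`."""
    __slots__ = ("kind", "args", "uid")

    def __init__(self, kind: str, args: Tuple[object, ...], uid: int) -> None:
        self.kind = kind
        self.args = args
        self.uid = uid


# Maps a structural key to the single node that represents it (hypothetical table).
_table: Dict[Tuple[object, ...], TypeExpr] = {}


def mk(kind: str, *args: object) -> TypeExpr:
    """Return the unique node for (kind, args), creating it on first use."""
    # Children are already hash-consed, so their integer ids stand in for them.
    key = (kind,) + tuple(
        ("node", a.uid) if isinstance(a, TypeExpr) else ("atom", a) for a in args
    )
    node = _table.get(key)
    if node is None:
        node = TypeExpr(kind, args, len(_table))
        _table[key] = node
    return node


# Two separately built copies of name[String], email[String]* ...
name = mk("label", "name", mk("base", "String"))
email = mk("label", "email", mk("base", "String"))
t1 = mk("seq", name, mk("star", email))
t2 = mk("seq", mk("label", "name", mk("base", "String")), mk("star", email))

# ... share one node, so equality is a constant-time comparison.
print(t1 is t2, t1.uid == t2.uid)
```

Because structurally equal subexpressions are shared, a comparison never has to walk the two expressions.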

What is the main argument for the proposed regular expression types?

The authors have proposed regular expression types for XML processing, arguing that set-inclusion-based subtyping and subtagging yield useful expressive power in this domain.

What is the cost of a type system for XML?

The cost is that XML values and their corresponding schemas must somehow be “injected” into the value and type spaces of the host language; this usually involves adding more layers of tagging than were present in the original XML documents, which inhibits subtyping.

What are the schema languages for XML?

Although schema languages for XML do not treat static verification of programs, the type structures in these languages and regular expression types are worth discussing.

What is the way to extend the semantics of functions?

For function types, their current approach—define subtyping by inclusion of the semantics of types and reduce it to the decidability of tree automata inclusion—does not easily extend simply because functions are not trees.

What is the representation of equality?

Since the authors use only union and equality for the operations on such sets, a suitable representation is a sorted list, which allows them to perform these two operations in linear time.
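A minimal sketch of why sorted lists give linear-time union and equality, assuming set elements are integers (for example, identifiers of type expressions). The function names `union_sorted` and `equal_sorted` are illustrative and do not come from the paper.

```python
from typing import List


def union_sorted(xs: List[int], ys: List[int]) -> List[int]:
    """Merge two strictly increasing lists into their union in O(len(xs) + len(ys))."""
    out: List[int] = []
    i = j = 0
    while i < len(xs) and j < len(ys):
        if xs[i] < ys[j]:
            out.append(xs[i])
            i += 1
        elif xs[i] > ys[j]:
            out.append(ys[j])
            j += 1
        else:
            # Element present in both inputs: keep a single copy.
            out.append(xs[i])
            i += 1
            j += 1
    # At most one of the two tails is non-empty.
    out.extend(xs[i:])
    out.extend(ys[j:])
    return out


def equal_sorted(xs: List[int], ys: List[int]) -> bool:
    """Two sets in canonical sorted form are equal iff the lists are equal."""
    return xs == ys


print(union_sorted([1, 3, 5], [2, 3, 6]))
print(equal_sorted(union_sorted([1, 2], [3]), [1, 2, 3]))
```

The sorted order makes each representation canonical, which is what lets equality be a single left-to-right pass.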

What is the recent example of the embedding approach?

A recent example of the embedding approach is Wallace and Runciman’s proposal to use Haskell as a host language [Wallace and Runciman 1999] for XML processing.

How does the algorithm run on large types?

By incorporating several optimization techniques, their algorithm runs at acceptable speeds on several applications involving fairly large types, such as the complete DTD for HTML documents.

What is the main difference between Haskell and XML?

Although their type system stems from Haskell’s, they attain additional flexibility required in XML processing by incorporating, instead of subtyping, extensible records and variants based on row polymorphism.

(Open Access) Regular expression types for XML (2000) | Haruo Hosoya

Q: What contributions have the authors mentioned in the paper "Regular expression types for xml" ?

In this paper, Hosoya et al. propose regular expression types as a foundation for statically typed processing of XML documents.

*<7D3@A7BG=4&3<<AG:D/<7/*<7D3@A7BG=4&3<<AG:D/<7/

(16=:/@:G=;;=<A(16=:/@:G=;;=<A

3>/@B;3<B/:&/>3@A( 3>/@B;3<B=4=;>CB3@<4=@;/B7=<(173<13

/<C/@G

'35C:/@F>@3AA7=<)G>3A4=@-#"'35C:/@F>@3AA7=<)G>3A4=@-#"

/@C==A=G/

!G=B=*<7D3@A7BG

3@=;3+=C7::=<

$'(

3<8/;7<&73@13

*<7D3@A7BG=4&3<<AG:D/<7/

01>73@1317AC>3<<32C

=::=EB67A/<2/227B7=</:E=@9A/B6BB>A@3>=A7B=@GC>3<<32C17A.>/>3@A

'31=;;3<2327B/B7=<'31=;;3<2327B/B7=<

/@C==A=G/ 3@=;3+=C7::=</<23<8/;7<&73@13'35C:/@F>@3AA7=<)G>3A4=@-#" /<C/@G



=>G@756B#)67A7AB63/CB6=@AD3@A7=<=4B63E=@9B7A>=AB3263@30G>3@;7AA7=<=4#4=@G=C@

>3@A=</:CA3$=B4=@@327AB@70CB7=<)6323J<7B7D3D3@A7=<E/A>C0:7A6327<

#)@/<A/1B7=<A=<&@=5@/;;7<5

"/<5C/53A/<2(GAB3;A

+=:C;3AAC3 /<C/@G>/53A

&C0:7A63@*'"6BB>2=7/1;=@5

)67A>/>3@7A>=AB32/B(16=:/@:G=;;=<A6BB>A@3>=A7B=@GC>3<<32C17A.>/>3@A

=@;=@37<4=@;/B7=<>:3/A31=<B/1B@3>=A7B=@G>=0=FC>3<<32C

'35C:/@F>@3AA7=<)G>3A4=@-#"'35C:/@F>@3AA7=<)G>3A4=@-#"

0AB@/1B0AB@/1B

,3>@=>=A3

@35C:/@3F>@3AA7=<BG>3A

/A/4=C<2/B7=<4=@AB/B71/::GBG>32-#">@=13AA7<5:/<5C/53A

'35C:/@3F>@3AA7=<BG>3A:793;=ABA163;/:/<5C/53A4=@-#"7<B@=2C13@35C:/@3F>@3AA7=<<=B/B7=<A

AC16/A@3>3B7B7=</:B3@</B7=<I3B1B=23A1@703-#"2=1C;3<BA)63<=D3:BG=4=C@BG>3AGAB3;7A/

A3;/<B71>@3A3<B/B7=<=4AC0BG>7<5/A7<1:CA7=<03BE33<B63A3BA=42=1C;3<BA23<=B320GBE=BG>3A

,357D3A3D3@/:3F/;>:3A7::CAB@/B7<5B63CA34C:<3AA=4B67A4=@;=4AC0BG>7<57<-#">@=13AA7<5

)632317A7=<>@=0:3;4=@B63AC0BG>3@3:/B7=<@32C13AB=B637<1:CA7=<>@=0:3;03BE33<B@33/CB=;/B/

E67167A9<=E<B=03-&)#1=;>:3B3)=/D=72B67A67561=;>:3F7BG7<BG>71/:1/A3AE323D3:=>/

>@/1B71/:/:5=@7B6;B6/BC<:7931:/AA71/:/:5=@7B6;A0/A32=<23B3@;7<7H/B7=<=4B@33/CB=;/B/16319A

B637<1:CA7=<@3:/B7=<0G/B=>2=E<B@/D3@A/:=4B63=@757</:BG>33F>@3AA7=<A)63;/7</2D/<B/53=4B67A

/:5=@7B6;7AB6/B7B1/<3F>:=7BB63>@=>3@BGB6/BBG>33F>@3AA7=<A037<51=;>/@32=4B3<A6/@3>=@B7=<A=4

B637@@3>@3A3<B/B7=<A%C@/:5=@7B6;7A/D/@7/<B=4793</<2#C@>6GAA3B7<1:CA7=<1=<AB@/7<BA=:D3@B=

E6716/@3/2232A3D3@/:<3E7;>:3;3<B/B7=<B316<7?C3A1=@@31B<3AA>@==4A/<2>@3:7;7</@G

>3@4=@;/<13;3/AC@3;3<BA=<A=;3A;/::>@=5@/;A7<B632=;/7<=4BG>32-#">@=13AA7<5

!3GE=@2A!3GE=@2A

>@=5@/;;7<5:/<5C/53A:/<5C/531=<AB@C1BA/<243/BC@3A2/B/BG>3A/<2AB@C1BC@3A:/<5C/53AB63=@G

BG>3AGAB3;A-#"AC0BG>7<5

=;;3<BA=;;3<BA

=>G@756B#)67A7AB63/CB6=@AD3@A7=<=4B63E=@9B7A>=AB3263@30G>3@;7AA7=<=4#4=@

G=C@>3@A=</:CA3$=B4=@@327AB@70CB7=<)6323J<7B7D3D3@A7=<E/A>C0:7A6327<

#)@/<A/1B7=<A=<

&@=5@/;;7<5"/<5C/53A/<2(GAB3;A

+=:C;3AAC3 /<C/@G>/53A

&C0:7A63@*'"6BB>2=7/1;=@5

)67A8=C@</:/@B71:37A/D/7:/0:3/B(16=:/@:G=;;=<A6BB>A@3>=A7B=@GC>3<<32C17A.>/>3@A

Regular Expression Types for XML

HARUO HOSOYA

Kyoto University

hahosoya@kurims.kyoto-u.ac.jp

JEROME VOUILLON

CNRS and Denis Diderot University

Jerome.Vouillon@pps.jussieu.fr

and

BENJAMIN C. PIERCE

University of Pennsylvania

bcpierce@cis.upenn.edu

We propose regular expression types as a foundation for statically typed XML processing languages. Regular expression types, like most schema languages for XML, introduce regular expression notations such as repetition (*), alternation (|), etc., to describe XML documents. The novelty of our type system is a semantic presentation of subtyping, as inclusion between the sets of documents denoted by two types. We give several examples illustrating the usefulness of this form of subtyping in XML processing.

The decision problem for the subtype relation reduces to the inclusion problem between tree automata, which is known to be EXPTIME-complete. To avoid this high complexity in typical cases, we develop a practical algorithm that, unlike classical algorithms based on determinization of tree automata, checks the inclusion relation by a top-down traversal of the original type expressions. The main advantage of this algorithm is that it can exploit the property that type expressions being compared often share portions of their representations. Our algorithm is a variant of Aiken and Murphy’s set-inclusion constraint solver, to which are added several new implementation techniques, correctness proofs, and preliminary performance measurements on some small programs in the domain of typed XML processing.

Categories and Subject Descriptors: D.3.3 [Programming Languages]: Language Constructs

and Features—data types and structures

General Terms: Languages, Theory

Additional Key Words and Phrases: Type systems, XML, subtyping

Authors’ address: H. Hosoya, Research Institute for Mathematical Sciences, Kyoto University, Oiwake-cho, Kitashirakawa, Sakyo-ku, Kyoto 606–8502, Japan. J. Vouillon, PPS, Université Denis Diderot, Case 7014, 2 Place Jussieu, F-75251 PARIS Cedex 05, France. B. C. Pierce, Department of Computer and Information Science, University of Pennsylvania, 200 South 33rd St., Philadelphia, PA 19104, USA.

Permission to make digital/hard copy of all or part of this material without fee for personal or classroom use provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.

© 1999 ACM 0164-0925/99/0100-0111 $00.75

ACM Transactions on Programming Languages and Systems, Vol. TBD, No. TDB, Month Year, Pages 1–??.


1. INTRODUCTION

XML [Bray et al. 2000] is an emerging standard format for tree-structured data. One of the reasons for its popularity is the existence of a number of schema languages, including DTDs [Bray et al. 2000], XML-Schema [Fallside 2001], DSD [Klarlund et al. 2000], and RELAX [Murata 2001], that can be used to define “types” (or “schemas”) describing structural constraints on data and thereby improve the safety of data processing and exchange.

However, the use of types in mainstream XML processing technology is often limited to checking only data, not programs. Typically, an XML processing program first reads an XML document and checks that it conforms to a given type using a validating parser. The program then uses either a generic tree manipulation library such as DOM [DOM 2001] or a dedicated XML language such as XSLT [Clark 1999] or XML-QL [Deutsch et al. 1998]. Since these tools make no systematic connection between the program and the types of the documents it manipulates, they provide no compile-time guarantee that the documents produced by the program will always conform to an intended type.

In this article, we propose regular expression types as a foundation for statically typed processing of XML documents. Regular expression types capture (and generalize) the regular expression notations (*, ?, |, etc.) commonly found in schema languages for XML, and support a natural semantic notion of subtyping.

We have used regular expression types in the design of a domain-specific language called XDuce (“transduce”) for XML processing [Hosoya and Pierce 2000; 2001]. In the present article, however, our focus is on the structure of the types themselves, their role in describing transformations on XML documents, and the algorithmic problems they pose. Interested readers are invited to visit the XDuce home page

http://xduce.sourceforge.net

for more information on the language as a whole.

As a simple example of regular expression types, consider the definitions

type Addrbook = addrbook[Person*]
type Person = person[Name,Email*,Tel?]
type Name = name[String]
type Email = email[String]
type Tel = tel[String]

corresponding to the following set of DTD declarations:

<!ELEMENT addrbook (person*)>
<!ELEMENT person (name,email*,tel?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT tel (#PCDATA)>

Type constructors of the form label[...] classify tree nodes with the tag label (i.e., XML structures of the form <label>...</label>). Thus, the inhabitants of the types Name, Email, and Tel are all strings with an appropriate identifying label. Types may also involve the regular expression operators * (repetition) and ? (optional occurrence), as well as | (alternation). Thus, the type Addrbook describes a label addrbook whose content is zero or more repetitions of subtrees of type Person. Likewise, the type Person describes a label person whose content is a Name subtree, zero or more Email subtrees, and an optional Tel subtree. An instance of the type Addrbook is the following XML document:

<addrbook>
  <person> <name> Haruo Hosoya </name>
           <email> hahosoya@upenn </email>
           <email> haruo@u-tokyo </email> </person>
  <person> <name> Jerome Vouillon </name>
           <email> vouillon@upenn </email> </person>
</addrbook>
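To make the reading of these types concrete, here is a rough Python sketch that checks whether a document inhabits Addrbook. It is an illustration under stated assumptions, not XDuce's type checker: each label's content model is hand-translated into a regular expression over the sequence of child tag names, whitespace between child elements is ignored, and the names `CONTENT`, `TEXT_ONLY`, `conforms`, and `is_addrbook` are invented for this example.

```python
import re
import xml.etree.ElementTree as ET

# Content model of each label as a regex over space-terminated child tag names.
CONTENT = {
    "addrbook": r"(person )*",            # addrbook[Person*]
    "person": r"name (email )*(tel )?",   # person[Name,Email*,Tel?]
}
# Labels whose content is a String, i.e. no child elements.
TEXT_ONLY = {"name", "email", "tel"}


def conforms(elem: ET.Element) -> bool:
    """Check one element, and recursively its children, against the models above."""
    if elem.tag in TEXT_ONLY:
        return len(elem) == 0
    model = CONTENT.get(elem.tag)
    if model is None:
        return False  # unknown label
    children = "".join(child.tag + " " for child in elem)
    if re.fullmatch(model, children) is None:
        return False
    return all(conforms(child) for child in elem)


def is_addrbook(document: str) -> bool:
    root = ET.fromstring(document)
    return root.tag == "addrbook" and conforms(root)


GOOD = """<addrbook>
  <person> <name> Haruo Hosoya </name>
           <email> hahosoya@upenn </email>
           <email> haruo@u-tokyo </email> </person>
  <person> <name> Jerome Vouillon </name>
           <email> vouillon@upenn </email> </person>
</addrbook>"""
# Wrong order: the email comes before the name.
BAD = "<addrbook><person><email>x</email><name>y</name></person></addrbook>"

print(is_addrbook(GOOD), is_addrbook(BAD))
```

The second document is rejected because Person fixes the order name, then emails, then an optional tel.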

We define subtyping between regular expression types in a semantic fashion. A type, in general, denotes a set of documents; subtyping is simply inclusion between the sets denoted by two types. For instance, consider again the Person type definition from above

type Person = person[Name,Email*,Tel?]

and the following variant:

type Person2 = person[(Name | Email | Tel)*]

Elements of the Person type can have one name, zero or more emails, and zero or one tels in this order, while the Person2 type allows any number of such nodes in any order. Therefore Person2 describes strictly more documents, which implies that Person is a subtype of Person2. Such subtype inclusions can be quite useful in programming. For example, suppose that we originally have a value of type Person. The above inclusion allows us to process this value using code that does not care about the ordering among the name, email, and tel nodes. (Such a situation might arise, for example, if we want to display the child nodes in a linear format, where we would naturally write a single loop over the sequence of child nodes with a case branch for each tag.)
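The inclusion of Person in Person2 can be spot-checked with a small experiment. The sketch below is not the subtyping algorithm of this article (which works top-down on type expressions); it assumes that the two content models can be written as ordinary regular expressions over child tag names and simply enumerates every tag sequence up to a bound, so it only gives evidence up to that bound. The names `PERSON`, `PERSON2`, `sequences`, and `included_up_to` are illustrative.

```python
import re
from itertools import product
from typing import Iterator, Pattern

TAGS = ["name", "email", "tel"]

# Child-tag sequences allowed by each type, with tags written space-terminated.
PERSON = re.compile(r"name (email )*(tel )?")    # person[Name,Email*,Tel?]
PERSON2 = re.compile(r"((name|email|tel) )*")    # person[(Name | Email | Tel)*]


def sequences(max_len: int) -> Iterator[str]:
    """Yield every sequence of at most max_len tags, encoded as one string."""
    for n in range(max_len + 1):
        for seq in product(TAGS, repeat=n):
            yield "".join(tag + " " for tag in seq)


def included_up_to(sub: Pattern[str], sup: Pattern[str], max_len: int) -> bool:
    """True if every bounded sequence accepted by `sub` is also accepted by `sup`."""
    return all(
        sup.fullmatch(s) is not None
        for s in sequences(max_len)
        if sub.fullmatch(s) is not None
    )


print(included_up_to(PERSON, PERSON2, 5))   # Person within Person2, up to the bound
print(included_up_to(PERSON2, PERSON, 5))   # the converse fails
```

The converse direction fails already on the empty sequence, which Person2 accepts but Person (which requires a name) does not.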

Note that if we replaced Person with a more conventional type

person(name(string) × email(string) list × tel(string) option)

(using sum, product, and ML-like list and option types, plus unary covariant constructors person, name, email, and tel) and Person2, analogously, with

person((name(string) + email(string) + tel(string)) list)

the conventional subtyping of sum and product types would not yield the inclusion above. In general, the subtype relation obtained from our definition is quite a bit more permissive than conventional subtyping. We give some further examples in Section 2.

Regular expression types exactly correspond to tree automata [Comon et al. 1999], a finite-state machine model for accepting trees. It is easy to construct, from a given type, a tree automaton that accepts just the set of trees denoted by the type (Appendix A). Therefore the subtyping problem can be reduced to the



References

Introduction to Automata Theory, Languages, and Computation

Extensible Markup Language (XML).

Extensible markup language

Tree Automata Techniques and Applications

DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases

