What contributions have the authors mentioned in the paper "XML Document Parsing: Operational and Performance Characteristics"?

In this paper, the authors compare the operational and performance characteristics of XML processing in four stages: parsing, access, modification, and serialization.

(Open Access) XML Document Parsing: Operational and Performance Characteristics (2008) | Tak Cheung Lam

XML Document Parsing: Operational and Performance Characteristics

Tak Cheung Lam and Jianxun Jason Ding, Cisco Systems

Jyh-Charn Liu, Texas A&M University

Parsing is an expensive operation that can degrade XML processing performance. A survey of four representative XML parsing models (DOM, SAX, StAX, and VTD) reveals their suitability for different types of applications.

Broadly used in database and networking applications, the Extensible Markup Language is the de facto standard for the interoperable document format. As XML becomes widespread, it is critical for application developers to understand the operational and performance characteristics of XML processing.

As Figure 1 shows, XML processing occurs in four stages: parsing, access, modification, and serialization. Although parsing is the most expensive operation, there are no detailed studies that compare the processing steps and associated overhead costs of different parsing models, the tradeoffs in accessing and modifying parsed data, and XML-based applications’ access and modification requirements.

Figure 1 also illustrates the three-step parsing process. The first two steps, character conversion and lexical analysis, are usually invariant among different parsing models, while the third step, syntactic analysis, creates data representations based on the parsing model used.

To help developers make sensible choices for their target applications, we compared the data representations of four representative parsing models: document object model (DOM; www.w3.org/DOM), simple API for XML (SAX; www.saxproject.org), streaming API for XML (StAX; http://jcp.org/en/jsr/detail?id=173), and virtual token descriptor (VTD; http://vtd-xml.sourceforge.net). These data representations result in different operational and performance characteristics.

XML-based database and networking applications have unique requirements with respect to access and modification of parsed data. Database applications must be able to access and modify the document structure back and forth; the parsed document resides in the database server to receive multiple incoming queries and update instructions. Networking applications rely on one-pass access and modification during parsing; they pass the unparsed document through the node to match the parsed queries and update instructions that reside in the node.

Computing Practices, Computer, September 2008

Authorized licensed use limited to: West Virginia University. Downloaded on June 26, 2009 at 12:18 from IEEE Xplore. Restrictions apply.

XML PARSING STEPS

An XML parser first groups a bit sequence into characters, then groups the characters into tokens, and finally verifies the tokens and organizes them into certain data representations for analysis at the access stage.

Character conversion

The first parsing step involves converting a bit sequence from an XML document to the character sets the host programming language understands. For example, documents written in Western, Latin-style alphabets are usually created in UTF-8, while Java usually reads characters in UTF-16. In most cases (namely, for the single-byte ASCII characters), a UTF-8 character can be converted to UTF-16 by simply padding eight leading zero bits. For example, the parser converts “<” “a” “>” from “3C 61 3E” to “003C 0061 003E” in hexadecimal representation. It is possible to avoid such a character conversion by composing the documents in UTF-16, but UTF-16 takes twice as much space as UTF-8 for such characters, which has tradeoffs in storage and character scanning speed.
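The conversion described above can be sketched in a few lines. This is only an illustration in Python (the article's setting is Java), and the helper name `utf8_to_utf16_hex` is hypothetical. The second call shows a character outside the ASCII range, where the conversion is a real re-encoding rather than zero padding.

```python
# Sketch of the character-conversion step: UTF-8 bytes in, UTF-16 code units out.
# The function name is an illustrative assumption, not part of any parser API.
def utf8_to_utf16_hex(hex_bytes):
    """Decode UTF-8 bytes given as hex and re-encode them as big-endian UTF-16."""
    text = bytes.fromhex(hex_bytes).decode("utf-8")
    utf16 = text.encode("utf-16-be")  # big-endian, no byte-order mark
    # Group the output into 16-bit code units for display.
    return " ".join(utf16[i:i + 2].hex().upper() for i in range(0, len(utf16), 2))


ascii_result = utf8_to_utf16_hex("3C 61 3E")  # the characters "<", "a", ">"
accent_result = utf8_to_utf16_hex("C3 A9")    # "é", which takes two bytes in UTF-8

print(ascii_result)   # 003C 0061 003E
print(accent_result)  # 00E9
```

For the ASCII input each byte simply gains a leading zero byte, while the two-byte UTF-8 sequence collapses into a single 16-bit unit.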

Lexical analysis

The second parsing step involves partitioning the character stream into subsequences called tokens. Major tokens include a start element, text, and an end element, as Table 1 shows. A token can itself consist of multiple tokens. Each token is defined by a regular expression in the World Wide Web Consortium (W3C) XML specifications, as shown in Table 2. For example, a start element consists of a “<”, followed by an element name, zero or more attributes preceded by a space-like character, and a “>”. Each attribute consists of an attribute name, followed by an “=” enclosed by zero or one space-like character on each side, and then an attribute value.

Table 1. XML token examples.

Token                 Example
Start element         <Record>John</Record>
End element           <Record>John</Record>
Text                  <Record>John</Record>
Start element name    <Record private = “yes”>
Attribute name        <Record private = “yes”>
Attribute value       <Record private = “yes”>

[Figure 1 diagram: an input XML document passes through four stages (parsing, access, modification, and serialization) to produce an output XML document. Parsing is marked as the performance bottleneck; the performance of the later stages is affected by the parsing model. Parsing has three steps. Character conversion (for example, pad zeros) turns a bit sequence (for example, 3C 61 3E) into a character sequence (for example, 003C 0061 003E = ‘<’ ‘a’ ‘>’). Lexical analysis (FSM) turns the character sequence into a token sequence (for example, ‘<a>’ ‘x’ ‘</a>’). Syntactic analysis (PDA) turns the token sequence into a data representation that depends on the parsing model (for example, tree, events, integer arrays). The first two steps are invariant among different parsing models; the third varies. Semantic analysis is managed by the application (access, modification, and so on). The PDA panel traces the stack, starting from “$”, while reading <a>, <b1>, <c>, </c>, </b1>, <b2>, </b2>, and </a>.]

Figure 1. XML processing stages and parsing steps. The three-step parsing process is the most expensive operation in XML processing.

zero or one space-like character on each side, and then an attribute value.

A finite-state machine (FSM) processes the character stream to match the regular expressions. The simplified FSM in Figure 1 processes the start element, text, and the end element only, without processing attributes. To achieve full tokenization, an FSM must evaluate many conditions that occur at every character. Depending on the nature of these conditions and the frequency with which they occur, this can result in a less predictable flow of instructions and thus potentially low performance on a general-purpose processor. Proposed tokenization improvements include assigning priority to transition rules, changing instruction sets for “<” and “>”, and duplicating the FSM for parallel processing.
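A simplified FSM of this kind might look like the following Python sketch. It recognizes only start elements, text, and end elements, as the simplified FSM in Figure 1 does; the state names and the `tokenize` function are illustrative assumptions, and attributes, comments, and entities are deliberately ignored.

```python
# Minimal character-driven FSM for start elements, text, and end elements.
# State names (TEXT, TAG_OPEN, START_NAME, END_NAME) are illustrative assumptions.
def tokenize(chars):
    """Split a character stream into (token type, value) pairs."""
    tokens = []
    state = "TEXT"
    buffer = ""
    for ch in chars:
        if state == "TEXT":
            if ch == "<":
                if buffer:
                    tokens.append(("text", buffer))
                buffer = ""
                state = "TAG_OPEN"
            else:
                buffer += ch
        elif state == "TAG_OPEN":
            if ch == "/":
                state = "END_NAME"
            elif ch == ">":
                raise ValueError("empty element name")
            else:
                buffer = ch
                state = "START_NAME"
        else:  # START_NAME or END_NAME: accumulate the name until '>'
            if ch == ">":
                kind = "start" if state == "START_NAME" else "end"
                tokens.append((kind, buffer))
                buffer = ""
                state = "TEXT"
            else:
                buffer += ch
    if state != "TEXT":
        raise ValueError("input ended inside a tag")
    if buffer:
        tokens.append(("text", buffer))
    return tokens


tokens = tokenize("<a>x</a>")
print(tokens)  # [('start', 'a'), ('text', 'x'), ('end', 'a')]
```

Even this reduced machine tests several conditions per character, which hints at why full tokenization produces the branch-heavy instruction flow described above.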

Syntactic analysis

The third parsing step involves verifying the tokens’ well-formedness, mainly by ensuring that they have properly nested tags. The pushdown automaton (PDA) in Figure 1 verifies the nested structure using the following transition rules:

1. The PDA initially pushes a “$” symbol to the stack.
2. If it finds a start element, the PDA pushes it to the stack.
3. If it finds an end element, the PDA checks whether it is equal to the top of the stack.
   • If yes, the PDA pops the element from the stack. If the top element is then “$”, the document is “well-formed.” Done! Otherwise, the PDA continues to read the next element.
   • If no, the document is not “well-formed.” Done!
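The three transition rules can be sketched as a stack-based check. This is a minimal Python illustration, assuming tokens arrive as (kind, name) pairs from some earlier lexical step; the function name `is_well_formed` is hypothetical. Like the rules themselves, it stops as soon as the root element closes and checks nesting only.

```python
# Stack-based nesting check following the three transition rules.
# The (kind, name) token format is an assumption made for this sketch.
def is_well_formed(tokens):
    """Return True if start and end elements are properly nested."""
    stack = ["$"]                    # rule 1: push the bottom marker
    for kind, name in tokens:
        if kind == "start":          # rule 2: push every start element
            stack.append(name)
        elif kind == "end":          # rule 3: compare with the top of the stack
            if stack[-1] != name:
                return False         # mismatch: not well-formed
            stack.pop()
            if stack[-1] == "$":
                return True          # the root has closed: well-formed
    return False                     # input ended with elements still open


# The tag sequence traced in Figure 1.
figure_tokens = [("start", "a"), ("start", "b1"), ("start", "c"), ("end", "c"),
                 ("end", "b1"), ("start", "b2"), ("end", "b2"), ("end", "a")]
nested_ok = is_well_formed(figure_tokens)
crossed = is_well_formed([("start", "a"), ("start", "b"), ("end", "a")])
print(nested_ok, crossed)  # True False
```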

In the complete well-formedness check, the PDA must verify more constraints; for example, attribute names of the same element cannot repeat. If schema validation is required, a more sophisticated PDA checks extra constraints such as specific element names, the number of child elements, and the data type of attribute values.

In accordance with the parsing model, the PDA organizes tokens into data representations for subsequent processing. For example, it can produce a tree object using the following variation of transition rule 2:

2. If it finds a start element, the PDA checks the top element before pushing it to the stack.
   • If the top element is “$”, then this start element is the root.
   • Otherwise, this start element becomes the top element’s child.
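The variation of rule 2 can be sketched as follows: the stack holds node objects, and each new start element is attached to whatever node is on top. This is a minimal Python illustration; the `Node` class, the `build_tree` function, and the (kind, name) token format are assumptions made for the sketch, not part of any DOM API.

```python
# Tree construction with the modified rule 2. Node and build_tree are
# hypothetical names used only for this sketch.
class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []


def build_tree(tokens):
    """Return the root Node for well-nested (kind, name) tokens."""
    stack = ["$"]
    root = None
    for kind, name in tokens:
        if kind == "start":
            top = stack[-1]
            if top == "$":
                node = Node(name)              # top is "$": this element is the root
                root = node
            else:
                node = Node(name, parent=top)  # otherwise: child of the top element
                top.children.append(node)
            stack.append(node)
        elif kind == "end":
            top = stack[-1]
            if top == "$" or top.name != name:
                raise ValueError("not well-formed at end element " + name)
            stack.pop()
            if stack[-1] == "$":
                break                          # the root has closed
    if len(stack) != 1:
        raise ValueError("unclosed element")
    return root


tree_tokens = [("start", "a"), ("start", "b1"), ("start", "c"), ("end", "c"),
               ("end", "b1"), ("start", "b2"), ("end", "b2"), ("end", "a")]
root = build_tree(tree_tokens)
child_names = [child.name for child in root.children]
print(root.name, child_names)  # a ['b1', 'b2']
```

Because every node stays reachable from the root, the whole structure remains in memory after parsing, which is the long-lived representation the DOM model relies on.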

After syntactic analysis, the data representations are available for access or modification by the application via various APIs provided by different parsing models, including DOM, SAX, StAX, and VTD.

PARSING MODEL DATA REPRESENTATIONS

XML parsers use different models to create data representations. DOM creates a tree object, VTD creates integer arrays, and SAX and StAX create a sequence of events. Both DOM and VTD maintain long-lived structural data for sophisticated operations in the access and modification stages, while SAX and StAX do not. DOM as well as SAX and StAX create objects for their data representations, while VTD eliminates the object-creation overhead via integer arrays.

DOM and VTD maintain different types of long-lived structural data. DOM produces many node objects to build the tree object. Each node object stores the element name, attributes, namespaces, and pointers to indicate the parent-child-sibling relationship. For example, in Figure 2 the node object stores the element name of Phone as well as the pointers to its parent (Home), child (1234), and next sibling (Address). In contrast, VTD creates no object but stores the original document and produces arrays of 64-bit integers called VTD records (VRs) and location caches (LCs). VRs store token positions in the original document, while LCs store the parent-child-sibling relationship among tokens.
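The pointer structure of a DOM node can be observed with any DOM implementation. The sketch below uses Python's `xml.dom.minidom` rather than a Java DOM parser, and the sample document is a hypothetical one modeled on the Record example, written without whitespace so that no whitespace-only text nodes sit between elements.

```python
from xml.dom.minidom import parseString

# Hypothetical sample modeled on the Record document; the address values are
# assumptions made for this sketch.
XML = ("<Record><Name>John</Name>"
       "<Home><Phone>1234</Phone><Address>11th St.</Address></Home>"
       "<Work><Phone>5678</Phone><Address>M Ave.</Address></Work></Record>")

doc = parseString(XML)
phone = doc.getElementsByTagName("Phone")[0]  # first Phone in document order

parent_name = phone.parentNode.tagName        # the enclosing Home element
child_text = phone.firstChild.data            # the text node under Phone
sibling_name = phone.nextSibling.tagName      # the Address element that follows
previous = phone.previousSibling              # None: Phone is Home's first child

print(parent_name, child_text, sibling_name, previous)  # Home 1234 Address None
```

Each of these lookups is a pointer dereference on an object that already exists, which is why DOM supports back-and-forth navigation once parsing has finished.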

While DOM produces many node objects that include pointers to indicate the parent-child-sibling relationship, SAX and StAX associate different objects with different events and do not maintain the structures among objects. For example, the start element event is associated with three String objects and an Attributes object for the namespace uniform resource identifier (URI), local name, qualified name, and attribute list. The end element event is similar to the start element event without an attribute list. The character event is associated with an array of characters and two integers to denote the start position and text length. In Figure 2, Phone’s start element has no attribute and namespace, so SAX and StAX associate it with two String objects to store its local and qualified names.

Table 2. Regular expressions of XML tokens.

Token                         Regular expression
Start element                 ‘<’ Name (S Attribute)* S? ‘>’
End element                   ‘</’ Name S? ‘>’
Attribute                     Name Eq AttValue
S (space-like characters)     (0x20 | 0x9 | 0xD | 0xA)+
Eq (equal-like characters)    S? ‘=’ S?
Name                          Some other regular expressions
AttValue                      Some other regular expressions

* = 0 or more; ? = 0 or 1; + = 1 or more.
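The event-based representation can be observed with a small handler. The sketch below uses Python's `xml.sax` module as a stand-in for the Java SAX API; Python passes a single element name and a text string to the callbacks instead of the separate String objects and character array described above, and the `EventRecorder` class is a hypothetical name.

```python
import xml.sax


# EventRecorder is an illustrative handler; it keeps only a flat list of events
# and no parent-child structure, as a push parser leaves that to the application.
class EventRecorder(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("start document")

    def endDocument(self):
        self.events.append("end document")

    def startElement(self, name, attrs):
        self.events.append("start element: " + name)

    def endElement(self, name):
        self.events.append("end element: " + name)

    def characters(self, content):
        # A parser may deliver one text run in several chunks; merge them.
        if self.events and self.events[-1].startswith("character: "):
            self.events[-1] += content
        else:
            self.events.append("character: " + content)


recorder = EventRecorder()
xml.sax.parseString(b"<Home><Phone>1234</Phone></Home>", recorder)
for event in recorder.events:
    print(event)
```

Once a callback returns, the parser holds no reference to the event data, so anything the application needs later has to be copied into its own structure.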

OPERATIONAL AND PERFORMANCE CHARACTERISTICS

Different data representations result in different operational and performance characteristics, as summarized in Tables 3 and 4, respectively. They also affect the choice of parsing models for various applications, as indicated in Table 5. We focus on how different data representations impact three XML processing capabilities: streaming, access and modification, and hardware acceleration.

Streaming capability

Streaming requires low latency and memory usage, and usually the parser only needs to extract a small portion of the document sequentially without knowing the entire document structure. To understand parsing models’ impact on streaming capability, it is important to understand how the parser and application interact during data access.
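The interaction that matters for streaming can be illustrated with a pull-style parser that hands over data before the document is complete. The sketch below uses Python's `xml.etree.ElementTree.XMLPullParser` as a stand-in for a StAX-like API; the two-part document is a hypothetical example. Note that this particular parser still builds element objects in the background, so it illustrates the latency aspect rather than the memory aspect.

```python
from xml.etree.ElementTree import XMLPullParser

parser = XMLPullParser(events=("start", "end"))

# Feed only the beginning of a document, as if the rest were still in transit.
parser.feed("<Record><Name>John</Name><Home>")
if hasattr(parser, "flush"):  # newer Python versions may defer parsing until flushed
    parser.flush()

early_events = []
name_text = None
for event, elem in parser.read_events():
    early_events.append((event, elem.tag))
    if event == "end" and elem.tag == "Name":
        name_text = elem.text  # available before the document is complete

# The remainder arrives later; the application keeps pulling events.
parser.feed("<Phone>1234</Phone></Home></Record>")
parser.close()
late_events = [(event, elem.tag) for event, elem in parser.read_events()]

print(name_text)
print(early_events)
print(late_events)
```

The application reads the Name value while the Home element is still open, which is the kind of early access that a model requiring complete parsing cannot offer.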

DOM and VTD. As Figure 3a shows, DOM and VTD can access data only after parsing is complete, that is, when the loop inside the parser program can draw no more tokens from lexical analysis to construct the tree or VRs. A large document will significantly delay data access. Moreover, the two models’ long-lived data

[Figure 2 diagram: three data representations of the same Record document (Name: John; Home containing Phone 1234 and Address; Work). DOM: tree object (life: long; object: yes), in which each node object holds parent, child, nextSibling, and prevSibling pointers. VTD: integer arrays (life: long; object: no), made up of the original document, VTD records (64-bit integers holding token type, nested depth, offset, lengths, and so on), and location caches LC1 (depth = 1) and LC2 (depth = 2), which pair each token index with its first child index. SAX/StAX: events (life: short; object: yes), in the order start document, start element: Record, ..., start element: Name, character: John, end element: Name, start element: Home, start element: Phone, character: 1234, end element: Phone, ..., end element: Record, end document.]

Figure 2. Data representation example. The start element of Phone is represented by “0, 2, 61:0:5” in VTD records. This entry indicates that there is a token of type 0 (start element) at nested depth 2, and this token’s first character is located at the 61st position of the original document. This token has a prefix name of length 0, indicating that the token does not use a namespace, and a qualified name of length 5. The token indices (offset:prefix length:qname length) of all start elements are stored in location caches at certain nested depths. For example, LC level 2 (LC2) stores the token indices by its first 32-bit field for all start elements at nested depth 2. The second 32-bit field stores the index of its first child. A token with no child has “–1” in this field. For example, the start element of Phone is recorded in LC2 as “61:0:5, –1”.
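The idea of describing a token with one 64-bit integer instead of an object can be sketched with plain bit packing. The field widths below are illustrative assumptions chosen only so that the fields sum to 64 bits; they are not taken from the VTD specification, and `pack_record` and `unpack_record` are hypothetical names.

```python
# Illustrative field widths (assumed for this sketch; they sum to 64 bits).
TYPE_BITS, DEPTH_BITS, PREFIX_BITS, QNAME_BITS, OFFSET_BITS = 4, 8, 9, 11, 32


def pack_record(token_type, depth, offset, prefix_len, qname_len):
    """Pack one token description into a single 64-bit integer."""
    fields = ((token_type, TYPE_BITS), (depth, DEPTH_BITS), (prefix_len, PREFIX_BITS),
              (qname_len, QNAME_BITS), (offset, OFFSET_BITS))
    record = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError("field value out of range")
        record = (record << width) | value
    return record


def unpack_record(record):
    """Return (token_type, depth, offset, prefix_len, qname_len)."""
    offset = record & ((1 << OFFSET_BITS) - 1)
    record >>= OFFSET_BITS
    qname_len = record & ((1 << QNAME_BITS) - 1)
    record >>= QNAME_BITS
    prefix_len = record & ((1 << PREFIX_BITS) - 1)
    record >>= PREFIX_BITS
    depth = record & ((1 << DEPTH_BITS) - 1)
    token_type = record >> DEPTH_BITS
    return token_type, depth, offset, prefix_len, qname_len


# The Phone start element from the caption: type 0, depth 2, entry 61:0:5.
phone = pack_record(token_type=0, depth=2, offset=61, prefix_len=0, qname_len=5)
print(unpack_record(phone))  # (0, 2, 61, 0, 5)
```

Because the record stores only an offset and lengths, the token text itself is never copied; it is read from the original document when needed, which is why that document has to stay in memory.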


Table 3. XML processing operational characteristics.

XML processing stage | Step | DOM | SAX (push) | StAX (pull) | VTD
Parsing | 1 | Extract token as objects. | Extract token as objects. | Extract token as objects. | Do not extract token as objects (use integers instead).
 | 2 | Build tree by objects (for example, Nodes). | Create events by objects (for example, Strings). | Create events by objects (for example, Strings). | Build location cache and 64-bit VTD records.
 | 3 | Not ready for access. | Ready for access: go to step 8 (application handles event). | Ready for access: go to step 8 (application handles or skips event). | Not ready for access.
 | 4 | Do not destroy any objects. | Destroy objects after handling the event. | Destroy objects after handling or skipping the event. | Do not destroy any objects.
 | 5 | Repeat from step 1 until all tokens are processed. | Repeat from step 1 until all tokens are processed. | Repeat from step 1 until all tokens are processed. | Repeat from step 1 until all tokens are processed.
 | 6 | (Optional) Destroy the original document after building the entire tree. | (Optional) Destroy the original document after handling all events. | (Optional) Destroy the original document after handling or skipping all events. | Keep the original document in memory.
 | 7 | Ready for access. | Access is complete: go to step 9. | Access is complete: go to step 9. | Ready for access.
Access | 8 | Back-and-forth access: Parsing provides sufficient data structures (tree). | Sequential access (no skip): The application creates its own data structure if more advanced access or modification is required (go to step 4). | Sequential access (skip forward): The application creates its own data structure if more advanced access or modification is required (go to step 4). | Back-and-forth access: Parsing provides sufficient data structures (VTD records and location caches).
Modification | 9 | Update the tree. | Update the data structure from step 8. | Update the data structure from step 8. | Update by making new copy of the document.
 | 10 | Write the tree in XML format. | Write the data structure from step 9 in XML format. | Write the data structure from step 9 in XML format. | The document is already in XML format.
 | 11 | Destroy the tree. | Destroy the data structure. | Destroy the data structure. | Destroy VTD records and location cache.

Table 4. XML processing performance characteristics.

Category | DOM | SAX (push) | StAX (pull) | VTD
Output | Tree object | Events (all tokens) | Events (interested tokens) | Integer array
Parsing (CPU) | High | Medium | Medium | Low
Parsing (memory) | Intensive | Low | Low | Medium
Access (navigation) | Fast (back and forth) | Slow (sequential: no skipping) | Medium (sequential: skip forward) | Fast (back and forth)
Modification (update) | Medium (not incremental) | Depends (template/forward) | Depends (template/forward) | Fast (incremental)
Estimated throughput, small file (1 Kbyte-15 Kbytes)* | ~10 Mbytes per second | ~20 Mbytes per second | ~20 Mbytes per second | ~50 Mbytes per second
Estimated throughput, large file (1 Mbyte-15 Mbytes)* | ~5 Mbytes per second | ~20 Mbytes per second | ~20 Mbytes per second | ~40 Mbytes per second
Estimated memory,* large file (1 Mbyte-15 Mbytes) | ~7 Mbytes | Does not depend on document size | Does not depend on document size | ~1.5 Mbytes

* The test platform is a Sony VAIO laptop with a Pentium M 1.7-GHz processor (2-Mbyte integrated L2 cache) and 512-Mbyte DDR2 RAM. The front bus is clocked at 400 MHz. The OS is Windows XP Professional Edition with Service Pack 2, and the Java virtual machine is version 1.5.0_06.

