XML Document Parsing: Operational and Performance Characteristics

Abstract
Parsing is an expensive operation that can degrade XML processing performance. A survey of four representative XML parsing models (DOM, SAX, StAX, and VTD) reveals their suitability for different types of applications.

Tak Cheung Lam and Jianxun Jason Ding, Cisco Systems
Jyh-Charn Liu, Texas A&M University

Broadly used in database and networking applications, the
Extensible Markup Language is the de facto standard for the
interoperable document format. As XML becomes widespread,
it is critical for application developers to understand the
operational and performance characteristics of XML processing.
As Figure 1 shows, XML processing occurs in four stages: parsing, access,
modification, and serialization. Although parsing is the most expensive
operation,¹ there are no detailed studies that compare the processing steps
and associated overhead costs of different parsing models, the tradeoffs in
accessing and modifying parsed data, and XML-based applications’ access and
modification requirements.
Figure 1 also illustrates the three-step parsing process. The first two steps,
character conversion and lexical analysis, are usually invariant among
different parsing models, while the third step, syntactic analysis, creates data
representations based on the parsing model used.
To help developers make sensible choices for their target applications, we
compared the data representations of four representative parsing models:
document object model (DOM; www.w3.org/DOM), simple API for XML
(SAX; www.saxproject.org), streaming API for XML (StAX;
http://jcp.org/en/jsr/detail?id=173), and virtual token descriptor (VTD;
http://vtd-xml.sourceforge.net). These data representations result in different operational
and performance characteristics.
XML-based database and networking applications have unique requirements
with respect to access and modification of parsed data. Database
applications must be able to access and modify the document structure back
and forth; the parsed document resides in the database server to receive
multiple incoming queries and update instructions. Networking applications
rely on one-pass access and modification during parsing; they pass the
unparsed document through the node to match the parsed queries and update
instructions that reside in the node.

Computing Practices. Computer, September 2008.
Published by the IEEE Computer Society 0018-9162/08/$25.00 © 2008 IEEE

XML PARSING STEPS
An XML parser first groups a bit sequence into characters,
then groups the characters into tokens, and finally
verifies the tokens and organizes them into certain data
representations for analysis at the access stage.
Character conversion
The first parsing step involves converting a bit sequence
from an XML document to the character sets the host
programming language understands. For example,
documents written in Western, Latin-style alphabets are
usually created in UTF-8, while Java usually reads
characters in UTF-16. In most cases, a UTF-8 character can
be converted to UTF-16 by simply padding 8-bit leading
zeros; this holds for the single-byte (ASCII) characters
that dominate such documents. For example, the parser
converts “<a>” from “3C 61 3E” to “003C 0061 003E” in
hexadecimal representation. It is possible to avoid such a
character conversion by composing the documents in UTF-16,
but UTF-16 takes twice as much space as UTF-8 for such
text, which has tradeoffs in storage and character scanning
speed.
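As a minimal sketch of this step (written in Python rather than Java, and covering only the single-byte case), the zero padding can be expressed as follows. The function names pad_ascii_to_utf16be and hex_units are illustrative and do not come from any of the parsers discussed here.

```python
def pad_ascii_to_utf16be(utf8_bytes: bytes) -> bytes:
    """Convert single-byte (ASCII) UTF-8 input to big-endian UTF-16 by zero padding."""
    out = bytearray()
    for byte in utf8_bytes:
        if byte > 0x7F:
            # Multi-byte UTF-8 sequences need real decoding, not padding.
            raise ValueError("non-ASCII byte: zero padding does not apply")
        out.append(0x00)  # eight leading zero bits
        out.append(byte)
    return bytes(out)


def hex_units(data: bytes, width: int) -> str:
    """Render bytes as space-separated hex code units of `width` bytes each."""
    return " ".join(data[i:i + width].hex().upper()
                    for i in range(0, len(data), width))


utf8 = "<a>".encode("utf-8")
utf16 = pad_ascii_to_utf16be(utf8)
print(hex_units(utf8, 1))   # 3C 61 3E
print(hex_units(utf16, 2))  # 003C 0061 003E
```

The padded result equals what a general-purpose codec produces for the same input, and it is twice as long, which is the storage tradeoff mentioned above.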
Lexical analysis
The second parsing step involves partitioning the
character stream into subsequences called tokens.
Major tokens include a start element, text, and an end
element, as Table 1 shows. A token can itself consist
of multiple tokens. Each token is defined by a regular
expression in the World Wide Web Consortium (W3C)
XML specifications, as shown in Table 2. For example,
a start element consists of a “<”, followed by an
element name, zero or more attributes preceded by a
space-like character, and a “>”. Each attribute consists
of an attribute name, followed by an “=” enclosed by
zero or one space-like character on each side, and then
an attribute value.

Table 1. XML token examples.
Token               Example                    Token within the example
Start element       <Record>John</Record>      <Record>
End element         <Record>John</Record>      </Record>
Text                <Record>John</Record>      John
Start element name  <Record private = “yes”>   Record
Attribute name      <Record private = “yes”>   private
Attribute value     <Record private = “yes”>   “yes”

[Figure 1 diagram. Processing stages: input XML document, parsing (performance bottleneck), access, modification, serialization, output XML document; access, modification, and so on are managed by the application through the API, and their performance is affected by the parsing model. Parsing steps: bit sequence (for example, 3C 61 3E); character conversion (for example, pad zeros); character sequence (for example, 003C 0061 003E = ‘<’ ‘a’ ‘>’); lexical analysis (FSM); token sequence (for example, ‘<a>’ ‘x’ ‘</a>’); syntactic analysis (PDA); data representation (parsing model dependent; for example, tree, events, integer arrays). The first two steps are invariant among different parsing models; the third step varies. FSM panel: from the start state, the machine scans element names and text and reports “start element found”, “end element found”, or “text found” until EOF leads to the final state. PDA panel: stack contents, top first, while reading <a> <b1> <c> </c> </b1> <b2> </b2> </a>: initial $; a $; b1 a $; c b1 a $; b1 a $; a $; b2 a $; a $; $.]
Figure 1. XML processing stages and parsing steps. The three-step parsing process is the most expensive operation in XML
processing.

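The start-element and attribute definitions described in this step can be approximated with ordinary regular expressions. The sketch below is an illustration only: it assumes a simplified Name rule and double-quoted attribute values, and the pattern names (S, NAME, EQ, ATT_VALUE, ATTRIBUTE, START_ELEMENT) are chosen here for readability. The full W3C definitions are considerably longer.

```python
import re

S = r"[\x20\x09\x0D\x0A]+"            # one or more space-like characters
NAME = r"[A-Za-z_][A-Za-z0-9_.\-]*"   # simplified stand-in for the XML Name rule
EQ = rf"(?:{S})?=(?:{S})?"            # "=" with optional space-like characters
ATT_VALUE = r'"[^<"]*"'               # simplified: double-quoted values only
ATTRIBUTE = rf"{NAME}{EQ}{ATT_VALUE}"

# '<' Name (S Attribute)* S? '>'
START_ELEMENT = re.compile(rf"<({NAME})((?:{S}{ATTRIBUTE})*)(?:{S})?>")

match = START_ELEMENT.fullmatch('<Record private = "yes">')
element_name = match.group(1)
attributes = re.findall(rf"({NAME}){EQ}({ATT_VALUE})", match.group(2))
print(element_name)  # Record
print(attributes)    # [('private', '"yes"')]
```

A production tokenizer would not rescan the attribute substring as this sketch does; the second pass is only there to show the nested token structure (a start element that itself contains attribute name and attribute value tokens).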
A finite-state machine (FSM) processes the character
stream to match the regular expressions. The simplified
FSM in Figure 1 processes the start element, text, and
the end element only, without processing attributes. To
achieve full tokenization, an FSM must evaluate many
conditions that occur at every character. Depending on
the nature of these conditions and the frequency with
which they occur, this can result in a less predictable flow
of instructions and thus potentially low performance
on a general-purpose processor. Proposed tokenization
improvements include assigning priority to transition
rules,² changing instruction sets for “<” and “>”,³ and
duplicating the FSM for parallel processing.⁴
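A simplified FSM of this kind can be sketched in a few lines. The version below is a hypothetical illustration, not the machine from Figure 1 itself: the state names (TEXT, TAG_OPEN, START_NAME, END_NAME) and the tokenize function are assumptions made here, and attributes, comments, and other token types are deliberately ignored.

```python
def tokenize(chars: str):
    """Yield (kind, value) tokens using a small hand-written state machine."""
    state, buf = "TEXT", []
    for ch in chars:
        if state == "TEXT":
            if ch == "<":
                if buf:
                    yield ("text", "".join(buf))
                state, buf = "TAG_OPEN", []
            else:
                buf.append(ch)
        elif state == "TAG_OPEN":
            if ch == "/":
                state = "END_NAME"
            elif ch in "<>":
                raise ValueError("element name expected after '<'")
            else:
                state, buf = "START_NAME", [ch]
        else:  # START_NAME or END_NAME: collect the name until '>'
            if ch == ">":
                name = "".join(buf).strip()
                if not name:
                    raise ValueError("empty element name")
                yield ("start" if state == "START_NAME" else "end", name)
                state, buf = "TEXT", []
            elif ch == "<":
                raise ValueError("'<' is not allowed inside a tag")
            else:
                buf.append(ch)
    if state != "TEXT":
        raise ValueError("unexpected end of input inside a tag")
    if buf:
        yield ("text", "".join(buf))


print(list(tokenize("<a>x</a>")))
# [('start', 'a'), ('text', 'x'), ('end', 'a')]
```

Even in this reduced form, every character passes through several data-dependent branches, which is the source of the unpredictable instruction flow described above.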
Syntactic analysis
The third parsing step involves verifying the tokens’
well-formedness, mainly by ensuring that they have
properly nested tags. The pushdown automaton (PDA)
in Figure 1 verifies the nested structure using the
following transition rules:
1. The PDA initially pushes a “$” symbol to the stack.
2. If it finds a start element, the PDA pushes it to the
   stack.
3. If it finds an end element, the PDA checks whether
   it is equal to the top of the stack.
   - If yes, the PDA pops the element from the stack.
     If the top element is “$”, then the document is
     “well-formed.” Done!
     Otherwise, the PDA continues to read the next
     element.
   - If no, the document is not “well-formed.” Done!
In the complete well-formedness check, the PDA must
verify more constraints; for example, attribute names
of the same element cannot repeat. If schema validation
is required, a more sophisticated PDA checks extra
constraints such as specific element names, the number of
child elements, and the data type of attribute values.
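The three transition rules can be sketched directly with a list used as the stack. This is a minimal illustration under stated assumptions: tokens arrive as (kind, name) pairs, text tokens are ignored, and the function name is_well_formed is chosen here. The extra constraints of the complete check (for example, unique attribute names) are not covered.

```python
def is_well_formed(tokens) -> bool:
    """Check tag nesting with the three PDA transition rules."""
    stack = ["$"]                        # rule 1: push the bottom marker
    for kind, name in tokens:
        if kind == "start":
            stack.append(name)           # rule 2: push each start element
        elif kind == "end":              # rule 3: compare with the top
            if stack[-1] != name:
                return False             # mismatch: not well-formed
            stack.pop()
            if stack[-1] == "$":
                return True              # back at the marker: well-formed
    return False                         # input ended with open elements


# Token sequence of the Figure 1 example: <a><b1><c></c></b1><b2></b2></a>
figure_tokens = [("start", "a"), ("start", "b1"), ("start", "c"), ("end", "c"),
                 ("end", "b1"), ("start", "b2"), ("end", "b2"), ("end", "a")]
print(is_well_formed(figure_tokens))                    # True
print(is_well_formed([("start", "a"), ("end", "b")]))   # False
```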
In accordance with the parsing model, the PDA organizes
tokens into data representations for subsequent
processing. For example, it can produce a tree object
using the following variation of transition rule 2:
2. If it finds a start element, the PDA checks the top element
   before pushing it to the stack.
   - If the top element is “$”, then this start element is
     the root.
   - Otherwise, this start element becomes the top
     element’s child.
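The tree-producing variation of rule 2 can be sketched as follows. The Node class, the BOTTOM marker, and the build_tree function are hypothetical names used for illustration; a real DOM node stores far more (attributes, namespaces, sibling pointers) than this minimal version.

```python
from dataclasses import dataclass, field

BOTTOM = "$"  # bottom-of-stack marker


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)


def build_tree(tokens) -> Node:
    """Build a tree while checking nesting, using the modified rule 2."""
    stack = [BOTTOM]
    root = None
    for kind, name in tokens:
        if kind == "start":
            node = Node(name)
            if stack[-1] is BOTTOM:
                root = node                      # first start element is the root
            else:
                stack[-1].children.append(node)  # child of the current top element
            stack.append(node)
        elif kind == "end":
            top = stack[-1]
            if top is BOTTOM or top.name != name:
                raise ValueError("not well-formed")
            stack.pop()
            if stack[-1] is BOTTOM:
                return root
    raise ValueError("not well-formed: unclosed element")


tokens = [("start", "a"), ("start", "b1"), ("start", "c"), ("end", "c"),
          ("end", "b1"), ("start", "b2"), ("end", "b2"), ("end", "a")]
tree = build_tree(tokens)
print(tree.name, [child.name for child in tree.children])  # a ['b1', 'b2']
```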
After syntactic analysis, the data representations are
available for access or modification by the application
via various APIs provided by different parsing models,
including DOM, SAX, StAX, and VTD.
PARSING MODEL DATA REPRESENTATIONS
XML parsers use different models to create data
representations. DOM creates a tree object, VTD creates
integer arrays, and SAX and StAX create a sequence of
events. Both DOM and VTD maintain long-lived structural
data for sophisticated operations in the access and
modification stages, while SAX and StAX do not. DOM
as well as SAX and StAX create objects for their data
representations, while VTD eliminates the object-creation
overhead via integer arrays.
DOM and VTD maintain different types of long-lived
structural data. DOM produces many node objects to
build the tree object. Each node object stores the element
name, attributes, namespaces, and pointers to indicate
the parent-child-sibling relationship. For example, in
Figure 2 the node object stores the element name of Phone
as well as the pointers to its parent (Home), child (1234),
and next sibling (Address). In contrast, VTD creates no
object but stores the original document and produces
arrays of 64-bit integers called VTD records (VRs) and
location caches (LCs). VRs store token positions in the
original document, while LCs store the parent-child-sibling
relationship among tokens.
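The node pointers described for DOM can be observed with any DOM implementation. The sketch below uses Python's standard xml.dom.minidom module as a stand-in for the Java DOM API, on a copy of the Figure 2 document written without whitespace between elements (an assumption made so that sibling pointers lead directly to elements rather than to whitespace text nodes).

```python
from xml.dom.minidom import parseString

DOC = ("<Record><Name>John</Name>"
       "<Home><Phone>1234</Phone><Address>11th St</Address></Home>"
       "<Work><Phone>5678</Phone><Address>M Ave</Address></Work></Record>")

dom = parseString(DOC)                       # the whole tree is built here
phone = dom.getElementsByTagName("Phone")[0]

print(phone.tagName)               # Phone
print(phone.parentNode.tagName)    # Home
print(phone.firstChild.data)       # 1234
print(phone.nextSibling.tagName)   # Address
print(phone.previousSibling)       # None
```

Because every node keeps these pointers for the life of the tree, the application can move back and forth freely, at the cost of one object per node.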
While DOM produces many node objects that include
pointers to indicate the parent-child-sibling relationship,
SAX and StAX associate different objects with different
events and do not maintain the structures among
objects. For example, the start element event is associated
with three String objects and an Attributes object
for the namespace uniform resource identifier (URI),
local name, qualified name, and attribute list. The end
element event is similar to the start element event without
an attribute list. The character event is associated
with an array of characters and two integers to denote
the start position and text length. In Figure 2, Phone’s
start element has no attribute or namespace, so SAX
and StAX associate it with two String objects to store its
local and qualified names.

Table 2. Regular expressions of XML tokens.
Token                        Regular expression
Start element                ‘<’ Name (S Attribute)* S? ‘>’
End element                  ‘</’ Name S? ‘>’
Attribute                    Name Eq AttValue
S (space-like characters)    (0x20 | 0x9 | 0xD | 0xA)+
Eq (equal-like characters)   S? ‘=’ S?
Name                         Some other regular expressions
AttValue                     Some other regular expressions
* = 0 or more; ? = 0 or 1; + = 1 or more.

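The event-based representation can be illustrated with Python's standard xml.sax module, used here as a stand-in for the Java SAX API described above. Note the difference: in its default (non-namespace) mode, Python's startElement callback receives only the element name and an attributes object, not the three strings of the Java interface. The EventLogger class is a hypothetical handler written for this sketch.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Record each event; nothing links one event to another."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start element", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end element", name))

    def characters(self, content):
        # The parser may deliver one text run in several chunks.
        self.events.append(("character", content))


handler = EventLogger()
xml.sax.parseString(b"<Home><Phone>1234</Phone></Home>", handler)
for event in handler.events:
    print(event)
```

The handler sees each event once, in document order. If the application needs the parent of Phone later, it has to record that relationship itself, since the parser keeps no structure between events.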
OPERATIONAL AND PERFORMANCE CHARACTERISTICS
Different data representations result in different
operational and performance characteristics, as summarized
in Tables 3 and 4, respectively. They also affect the choice
of parsing models for various applications, as indicated in
Table 5. We focus on how different data representations
impact three XML processing capabilities: streaming,
access and modification, and hardware acceleration.
Streaming capability
Streaming requires low latency and memory usage,
and usually the parser only needs to extract a small
portion of the document sequentially without knowing
the entire document structure. To understand parsing
models’ impact on streaming capability, it is important
to understand how the parser and application interact
during data access.
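To make the streaming requirement concrete, the sketch below extracts one small portion of a document while it is still arriving in chunks. It uses XMLPullParser from Python's standard xml.etree.ElementTree module as an example of a pull-style interface; the chunk boundaries and the helper name first_text are assumptions made for illustration.

```python
from xml.etree.ElementTree import XMLPullParser


def first_text(chunks, tag):
    """Return (text, chunks_consumed) for the first <tag> element seen."""
    parser = XMLPullParser(events=("end",))
    for consumed, chunk in enumerate(chunks, start=1):
        parser.feed(chunk)                       # hand over the next piece
        for _event, element in parser.read_events():
            if element.tag == tag:
                return element.text, consumed    # stop early: rest is never read
    return None, len(chunks)


chunks = [
    "<Record><Name>John</Name>",
    "<Home><Phone>1234</Phone>",
    "<Address>11th St</Address></Home></Record>",
]
text, consumed = first_text(chunks, "Phone")
print(text, consumed)
```

The application gets its answer before the last chunk is fed, so latency does not depend on the total document size. A model that must finish parsing before any access cannot offer this.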
DOM and VTD. As Figure 3a shows, DOM and VTD
can access data only after parsing is complete, that
is, when the loop inside the parser program can draw
no more tokens from lexical analysis to construct the
tree or VRs. A large document will significantly delay
data access. Moreover, the two models’ long-lived data
[Figure 2 diagram: the original document and its three data representations.]

Original document:
<?xml version = “1.0”?>
<Record>
<Name>John</Name>
<Home>
<Phone>1234</Phone>
<Address>11th St</Address>
</Home>
<Work>
<Phone>5678</Phone>
<Address>M Ave</Address>
</Work>
</Record>

DOM: tree object (life: long; object: yes). Record has the children Name (John), Home, and Work; Home and Work each have the children Phone and Address. Example node object for Phone: parent = Home, child = 1234, nextSibling = Address, prevSibling = null.

SAX/StAX: events (life: short; object: yes). Event sequence (excerpt): start document; start element: Record; start element: Name; character: John; end element: Name; start element: Home; start element: Phone; character: 1234; end element: Phone; end element: Record; end document. Example event object: startEvent: Phone; url: null; attrList: null; l_name: Phone; q_name: Phone.

VTD: integer arrays (life: long; object: no), that is, 64-bit integers (token type, offset, length, and so on).

VTD records:
Token name   Token type   Nested depth   offset:length or offset:prefix length:qname length
version      9            -1             6:7
1.0          10           -1             15:3
Record       0            0              23:0:6
Name         0            1              33:0:4
John         5            1              38:4
Name         1            1              44:0:4
Home         0            1              52:0:4
Phone        0            2              61:0:5
1234         5            2              67:4
Phone        1            2              73:0:5
Record       1            0              187:0:6

Location caches:
LC1 (depth = 1)   Token index   1st child index
                  33:0:4        -1
                  53:0:4        61:0:5
                  121:0:4       130:0:5
LC2 (depth = 2)   Token index   1st child index
                  61:0:5        -1
                  84:0:7        -1
                  130:0:5       -1
                  152:0:7       -1

Figure 2. Data representation example. The start element of Phone is represented by “0, 2, 61:0:5” in VTD records. This entry indicates
that there is a token of type 0 (start element) at nested depth 2, and this token’s first character is located at the 61st position of the
original document. This token has a prefix name of length 0, indicating that the token does not use a namespace, and a qualified
name of length 5. The token indices (offset: prefix length: qname length) of all start elements are stored in location caches at certain
nested depths. For example, LC level 2 (LC2) stores the token indices by its first 32-bit field for all start elements at nested depth 2.
The second 32-bit field stores the index of its first child. A token with no child has “-1” in this field. For example, the start element of
Phone is recorded in LC2 as “61:0:5, -1”.
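The idea of describing a token with one 64-bit integer instead of an object can be sketched as follows. The field widths below are hypothetical and chosen only so that they sum to 64 bits; they are not the actual VTD-XML record layout, and the helper names pack and unpack are likewise illustrative.

```python
from array import array

# Hypothetical field widths in bits, listed from high to low; they sum to 64.
FIELDS = (("token_type", 4), ("depth", 8), ("prefix_len", 8),
          ("qname_len", 12), ("offset", 32))


def pack(**values: int) -> int:
    """Pack the named fields into one unsigned 64-bit integer."""
    record = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        record = (record << width) | value
    return record


def unpack(record: int) -> dict:
    """Recover the named fields from a packed record."""
    fields = {}
    for name, width in reversed(FIELDS):
        fields[name] = record & ((1 << width) - 1)
        record >>= width
    return fields


document = "<Home><Phone>1234</Phone></Home>"
records = array("Q")  # unsigned 64-bit integers, no per-token objects
records.append(pack(token_type=0, depth=1, prefix_len=0, qname_len=5,
                    offset=document.index("Phone")))

phone = unpack(records[0])
name = document[phone["offset"]:phone["offset"] + phone["qname_len"]]
print(phone["depth"], name)  # 1 Phone
```

The token text is never copied: the record only points into the original document, which is why this style of representation has to keep the document in memory.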
Table 3. XML processing operational characteristics.
Stage | Step | DOM | SAX (push) | StAX (pull) | VTD
Parsing | 1 | Extract token as objects. | Extract token as objects. | Extract token as objects. | Do not extract token as objects (use integers instead).
Parsing | 2 | Build tree by objects (for example, Nodes). | Create events by objects (for example, Strings). | Create events by objects (for example, Strings). | Build location cache and 64-bit VTD records.
Parsing | 3 | Not ready for access. | Ready for access: go to step 8 (application handles event). | Ready for access: go to step 8 (application handles or skips event). | Not ready for access.
Parsing | 4 | Do not destroy any objects. | Destroy objects after handling the event. | Destroy objects after handling or skipping the event. | Do not destroy any objects.
Parsing | 5 | Repeat from step 1 until all tokens are processed. | Repeat from step 1 until all tokens are processed. | Repeat from step 1 until all tokens are processed. | Repeat from step 1 until all tokens are processed.
Parsing | 6 | (Optional) Destroy the original document after building the entire tree. | (Optional) Destroy the original document after handling all events. | (Optional) Destroy the original document after handling or skipping all events. | Keep the original document in memory.
Parsing | 7 | Ready for access. | Access is complete: go to step 9. | Access is complete: go to step 9. | Ready for access.
Access | 8 | Back-and-forth access: Parsing provides sufficient data structures (tree). | Sequential access (no skip): The application creates its own data structure if more advanced access or modification is required (go to step 4). | Sequential access (skip forward): The application creates its own data structure if more advanced access or modification is required (go to step 4). | Back-and-forth access: Parsing provides sufficient data structures (VTD records and location caches).
Modification | 9 | Update the tree. | Update the data structure from step 8. | Update the data structure from step 8. | Update by making new copy of the document.
Modification | 10 | Write the tree in XML format. | Write the data structure from step 9 in XML format. | Write the data structure from step 9 in XML format. | The document is already in XML format.
Modification | 11 | Destroy the tree. | Destroy the data structure. | Destroy the data structure. | Destroy VTD records and location cache.
Table 4. XML processing performance characteristics.
Category | DOM | SAX (push) | StAX (pull) | VTD
Output | Tree object | Events (all tokens) | Events (interested tokens) | Integer array
Parsing (CPU) | High | Medium | Medium | Low
Parsing (memory) | Intensive | Low | Low | Medium
Access (navigation) | Fast (back and forth) | Slow (sequential: no skipping) | Medium (sequential: skip forward) | Fast (back and forth)
Modification (update) | Medium (not incremental) | Depends (template/forward) | Depends (template/forward) | Fast (incremental)
Estimated⁵ throughput, small file (1 Kbyte-15 Kbytes)* | ~10 Mbytes per second | ~20 Mbytes per second | ~20 Mbytes per second | ~50 Mbytes per second
Estimated⁵ throughput, large file (1 Mbyte-15 Mbytes)* | ~5 Mbytes per second | ~20 Mbytes per second | ~20 Mbytes per second | ~40 Mbytes per second
Estimated memory,* large file (1 Mbyte-15 Mbytes) | ~7 Mbytes | Does not depend on document size | Does not depend on document size | ~1.5 Mbytes
* The test platform is a Sony VAIO laptop with a Pentium M 1.7-GHz processor (2-Mbyte integrated L2 cache) and 512-Mbyte DDR2 RAM. The front bus is
clocked at 400 MHz. The OS is Windows XP Professional Edition with Service Pack 2, and the Java virtual machine is version 1.5.0_06.
References

XML parsing: a threat to database performance
TL;DR: Comparing relational database performance shows that the desired response times and transaction rates over XML data can not be achieved without major improvements in XML parsing technology, and identifies research topics which are most promising for XML parser performance in database systems.

Parallel XML Parsing Using Meta-DFAs
TL;DR: In this paper, a parallel preparsing scan is used to build an outline of the XML document, which is then used to guide the parallel full parse, and a meta-DFA mechanism is proposed to parallelize the preparser itself.

Dual Processor Performance Characterization for XML Application-Oriented Networking
TL;DR: The results show a significant improvement in dual-core Pentium M processor over Hyperthreaded Xeon processor for AON workload, which will not only provide insight to processor designers, but also help architects of AON devices to select from alternative processors with restrictions to use one or two physical CPUs due to space and power consumption limitations.

Benchmarking XML Based Application Oriented Network Infrastructure and Services
TL;DR: This work presents AONBench specifications and methodology to benchmark networked XML application servers and appliances, which leverages from existing XML microbenchmarks and uses HTTP for end-to-end communication.