Proceedings ArticleDOI

Performance enhancement with speculative execution based parallelism for processing large-scale XML-based application data

11 Jun 2009, pp. 21-30
TL;DR: The design and implementation of a toolkit for processing large-scale XML datasets that utilizes the capabilities for parallelism that are available in the emerging multi-core architectures is presented.
Abstract: We present the design and implementation of a toolkit for processing large-scale XML datasets that utilizes the capabilities for parallelism that are available in the emerging multi-core architectures. Multi-core processors are expected to be widely available in research clusters and scientific desktops, and it is critical to harness the opportunities for parallelism in the middleware, instead of passing on the task to application programmers. An emerging trend is the use of XML as the data format for many distributed/grid applications, with the size of these documents ranging from tens of megabytes to hundreds of megabytes. Our earlier benchmarking results revealed that most of the widely available XML processing toolkits do not scale well for large sized XML data. A significant transformation is necessary in the design of XML processing for distributed applications so that the overall application turn-around time is not negatively affected by XML processing. We discuss XML processing using PiXiMaL, a parallel processing library for large-scale XML datasets. The parallelization approach is to build a DFA-based parser that recognizes a useful subset of the XML specification, and convert the DFA into an NFA that can be applied to an arbitrary subset of the input. Speculative NFAs are scheduled on available cores in a node to effectively utilize the processing capabilities and achieve overall performance gains. We evaluate the efficacy of this approach in terms of potential speedup that can be achieved for representative XML datasets. We also evaluate the effect of two different memory allocation libraries to quantify the memory-bottleneck as different cores access shared data structures.

Summary (3 min read)

1. INTRODUCTION

  • The use of XML as a data-format for distributed applications is due to its support for extensibility, namespace qualification, and databinding to many programming languages.
  • Scalable processing of XML datasets is an immediate concern as the size of XML data used by applications has steadily increased over the years in both scientific and business applications.
  • While the Web service approach of MCS provides interoperability, it also hurts performance compared to using a standard database for storage and retrieval.
  • The schemas used to describe the common patterns in human DNA sequence variation can have tens of thousands of elements.
  • Multiple cores on the same chip can possibly share various caches, including the translation look-aside buffer (TLB), and the bus.

2.1 Memory Bandwidth and State-Scalability

  • PIXIMAL can also be used to determine the effective memory bandwidth in reading large-scale application documents, and the effect of the complexity of the XML specification on performance.
  • The memory bandwidth and state scalability tests were run on 1U nodes configured with two quad-core 2.33 GHz Intel Xeon E5345 CPUs.
  • The balance of the input, input_size × (1 − split_percent/100), is divided evenly among the NFA threads.
  • The NFA recalculates each entry of the state array for each byte of input using the same rule as the DFA.
  • DFAs with fewer states demonstrated similar performance, with a tight grouping around 4.5 times speedup with 8 threads on an 8-core machine.

3. PARALLEL XML DATA PROCESSING

  • A deterministic finite automaton (DFA)-based lexical scanner is generally used to tokenize the input characters of the file (or string, as in the case of XML) into syntactic tokens that are used later in the parse phase.
  • Every time the scanner recognizes a token, it must perform some action to store the token or pass it to a higher level part of the parser.
  • A DFA-based scanner can be custom-designed to process the subset of XML specification used in defining large-scale data files in applications.
  • The DFA approach does not directly lend itself to parallelism.
  • This approach has thus far been acceptable for small files and desktop-style mass storage devices, because the scanner is fast for small input files.

3.1 Speculative NFA Execution in Piximal

  • The authors' parallelization tool, PIXIMAL, is designed for the data sets of applications running on cluster-class hardware, which are much more amenable to parallelization.
  • In these target application cases, data sets defined in XML can be several hundreds of megabytes.
  • The authors' parallelization approach can be readily applied to these cases.
  • The parser built around this NFA reads each character of input, traversing along all execution paths, one for each state Si.
  • There is a single correct execution path: the path started in state Sk, the state the DFA would have been in had it parsed the input up to the beginning of the substring.

4. SERIAL NFA TESTS AND TESTING ENVIRONMENT

  • The tests presented here examine the fundamental hypothesis of this work: the extra work required by using an NFA is offset by dividing processing work across multiple threads.
  • The authors run each component of the PIXIMAL processor for a given configuration (split percent and thread count) independently on its element of the input partition, and examine the time each component takes to complete its processing sub-task.
  • In addition to “black box” performance tests, the authors examine the state usage for various inputs.
  • The integer and MIO arrays simply encode a variety of numbers.
  • This tests a hypothesis that if a document has substantially more PCDATA (character data between tags) than tags, then the NFAs’ states will quickly collapse upon detection of the open angle bracket (<) which invalidates a large number of potential execution paths.

4.1 Experimental Environment

  • The serial NFA tests were run on a variety of system architectures, from older SMP machines to newer multi-core systems.
  • Because the tests are serial and do not take advantage of any hardware concurrency, once the results were normalized by calculating the potential speedup, there was little detectable difference.
  • Therefore, the data for the test results presented were collected by running the test on ten nodes of a cluster of machines with dual quad-core Intel Xeon E5345 chips clocked at 2.33 GHz running Debian etch with Linux kernel 2.6.18.
  • The input is read from local disk, though it is expected (by pre-reading the input file before each test) to be in the system cache to eliminate I/O disturbances.
  • Performance analysis was performed and plots were created using R [16].

5. SERIAL NFA RESULTS

  • The DFA the authors have built has eleven functional states and one error state.
  • The best potential speedup achieved on this input over the range of split percents and thread counts tested was 2.04 times the DFA baseline, splitting 34% of the input for the initial DFA and dividing the rest of the input evenly between the remaining 7 NFAs.
  • Nearly all characters of input are in content sections.
  • Maximal performance in figure 11 is more uniform and hovers around 3.0-3.2.
  • Figure 13 shows the performance difference when running fully concurrent PIXIMAL with and without a specialized malloc implementation on a SOAP-encoded array of 25000 strings.

6. CONCLUSIONS

  • The form of the input XML data affects the overall gains with the PIXIMAL approach.
  • Based on their tests on a variety of CPU configurations, the authors conclude that even with current memory and I/O subsystems, processing large scale data files can potentially provide performance improvements.
  • If this is considered too restrictive for designers of XML datasets, requiring that, for example, every millionth character occur in a text section would also be useful.
  • This would allow the input to be divided such that the NFA would be known to start processing only in the content state, greatly reducing the amount of work it needs to do.
  • For XML data sets that primarily consist of arrays of strings, a greater overall speedup can be obtained.


Performance Enhancement with Speculative Execution
Based Parallelism for Processing Large-scale XML-based
Application Data
Michael R. Head
mike@cs.binghamton.edu
Madhusudhan Govindaraju
mgovinda@cs.binghamton.edu
Grid Computing Research Laboratory
Computer Science Department
Binghamton University
Binghamton, NY, USA
ABSTRACT
We present the design and implementation of a toolkit for processing large-scale XML datasets that utilizes the capabilities for parallelism that are available in the emerging multi-core architectures. Multi-core processors are expected to be widely available in research clusters and scientific desktops, and it is critical to harness the opportunities for parallelism in the middleware, instead of passing on the task to application programmers. An emerging trend is the use of XML as the data format for many distributed/grid applications, with the size of these documents ranging from tens of megabytes to hundreds of megabytes. Our earlier benchmarking results revealed that most of the widely available XML processing toolkits do not scale well for large sized XML data. A significant transformation is necessary in the design of XML processing for distributed applications so that the overall application turn-around time is not negatively affected by XML processing. We discuss XML processing using PiXiMaL, a parallel processing library for large-scale XML datasets. The parallelization approach is to build a DFA-based parser that recognizes a useful subset of the XML specification, and convert the DFA into an NFA that can be applied to an arbitrary subset of the input. Speculative NFAs are scheduled on available cores in a node to effectively utilize the processing capabilities and achieve overall performance gains. We evaluate the efficacy of this approach in terms of potential speedup that can be achieved for representative XML datasets. We also evaluate the effect of two different memory allocation libraries to quantify the memory-bottleneck as different cores access shared data structures.
Categories and Subject Descriptors
D.1.3 [Concurrent Programming]: Parallel Programming;
F.1.2 [Modes of Computation]: Parallelism and concurrency
General Terms
Algorithms, Languages, Performance
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
HPDC’09, June 11–13, 2009, Munich, Germany.
Copyright 2009 ACM 978-1-60558-587-1/09/06 ...$5.00.
Keywords
Chip-level multiprocessing, Parallel techniques, XML Datasets
1. INTRODUCTION
XML is now widely used as an application data format. The use of XML as a data format for distributed applications is due to its support for extensibility, namespace qualification, and data binding to many programming languages. Scalable processing of XML datasets is an immediate concern, as the size of XML data used by applications has steadily increased over the years in both scientific and business applications. For example, recognizing the increasing role of XML in the representation and storage of scientific data, XDF, the eXtensible Data Format for Scientific Data, is being developed at GSFC's Astronomical Data Center (ADC) to describe an XML mark-up language for documents containing major classes of scientific data. This effort is expected to define a generic XML representation to accommodate the diverse needs of various scientific applications. The MetaData Catalog Service (MCS) [17] provides access via a Web service interface to store and retrieve descriptive information (metadata) on millions of data items. While the Web service approach of MCS provides interoperability, it also hurts performance compared to using a standard database for storage and retrieval. Scientific applications such as mesoscale meteorology [6] depend on the orchestration of several workflows, defined in XML format. The international HapMap project aims to develop a haplotype map of the human genome. The schemas used to describe the common patterns in human DNA sequence variation can have tens of thousands of elements. The XML files in the protein sequence database are close to a gigabyte in size. The eBay Web service specification has a few thousand elements and a few hundred complex type definitions. Communication with eBay via the SOAP protocol requires processing of large XML documents.
The emergence of Chip Multi Processors (CMPs), also called multi-core processors, provides both opportunities and challenges for designing an XML processing toolkit tailored for large-size XML datasets. Compared to classic symmetric multi-processing (SMP) systems of independent chips, the communication cost of the on-chip shared secondary cache in CMPs is considerably lower, providing opportunities for performance gains in fine-grained multi-threaded parallel code. CMPs provide special advantages due to locality. The individual cores are more closely connected together than in an SMP system. Multiple cores on the same chip can possibly share various caches, including the translation look-aside buffer (TLB), and the bus. An important design consideration is that off-chip memory access latency can be the choking point in CMP processors.
Our earlier work on benchmarking XML processing showed that for most XML toolkits scalability is adversely affected as the size of the XML dataset increases [10, 11]. These toolkits are typically designed to process small XML datasets. Recent trends and announcements from major vendors indicate that the number of cores per chip will steadily increase in the near future. The performance limitation of existing XML toolkits will likely be exacerbated on multi-core processors, because performance gains need to be achieved mainly by adding more parallelism rather than by increasing serial processing speed. Additionally, scalable processing of XML data is now of critical importance in scientific applications, where the size of XML can exceed hundreds of megabytes. As a result, our focus is on harnessing the benefits of fine-grained parallelism, exploiting SMP programming techniques to process large-scale XML-based application documents, and designing algorithms that scale well as the number of processing cores increases.

Many parallel compilation ideas have been discussed in the literature over the years [2, 5, 8, 12], studying both compilers that generate parallel code and those that divide compilation work across multiple processors. With the popularization of multi-core processors and the disparity between processor and memory speed, we expect that substantial benefits can be uncovered by utilizing more cores during XML document processing. In this paper we use the PIXIMAL toolkit to evaluate the best parallelization strategies for various data structures used in distributed applications. We also compare and contrast the effect of memory allocation libraries on synchronization and management costs as multiple threads compete for access to main memory.
The specific contributions of our work include:
  • We present a multi-threaded parallelization technique to process large-scale XML data.
  • We present a framework that helps evaluate how the size and data types in an XML document affect the distribution of the data to various threads in a multi-core environment.
  • We compare the use of GNU libc 2.7 and Google's thread-caching malloc libraries in a multi-core environment, where use of shared data structures can invoke expensive synchronization algorithms, affecting overall application performance.
  • We present the scalability of PIXIMAL in terms of the speedup achieved as the size of the XML input data and the number of processing threads are increased.
  • We study the usage of various states in the processing automaton to provide insights on why the performance varies for differently shaped input data files.
2. RELATED WORK
A wide range of implementations of XML parsers is available, including Xerces (DOM and SAX) [22], gSOAP [20], Piccolo [14], Libxml [21], VTD-XML [23], Qt4 [18], and Expat [4]. XML primarily uses UTF-8 as the representation format for data, and various studies have shown that this representation format can hinder overall application performance. Sending commonly used data structures via standard implementations of XML-based protocols, such as SOAP, incurs severe performance overheads, making it difficult for applications to adopt Web services based distributed middleware [11]. Several novel efforts to analyze the bottlenecks and address the performance at various stages of a Web services call stack have been discussed in the literature [1, 3, 7, 20]. These optimizations, which are tailored just for the uni-core case, include: (1) the gSOAP parser [20] uses look-aside buffers to efficiently parse frequently encountered XML constructs; (2) parsing of XML schemas has been improved with the use of schema-specific parsing along with trie data structures, so that frequently used XML tags are parsed only once [3, 19]; (3) gSOAP uses a performance-aware compiler to efficiently parse XML constructs that map to C/C++ types, using a single-pass schema-specific recursive-descent parser for XML decoding and dual-pass encoding of the application's object graphs in XML [20]; and (4) the VTD-XML parser achieves performance improvement via incremental update, hardware acceleration, and native XML indexing.

Recent work by Zhang et al. [24] has demonstrated that it is possible to achieve high-performance serialized parsing. They have developed a table-driven parser that combines parsing and validating an XML document in a very efficient way. While this technique works well for serial processing, it is not tailored for processing on multi-core nodes, especially for very large document sizes. In our previous work in this area, we focused just on the memory bandwidth in multi-core architectures when multiple threads operate concurrently to read large input files [9].

A related project, the MetaDFA toolkit [13, 15], presents a parallelization approach that chiefly uses a two-stage DOM parser. It conducts pre-parsing to find the tag structure of the input before, or possibly pipelined with, a parallelized DOM builder run on its output (a list of document offsets of start and end tags). Our toolkit, PIXIMAL, however, generates SAX events and thus serves a different class of applications than MetaDFA. Additionally, PIXIMAL conducts parsing work dynamically, and generates as output a sequence of SAX events. This results in a larger number of DFA states, and more opportunities for optimization for different classes of application data files.
2.1 Memory Bandwidth and State-Scalability
PIXIMAL can also be used to determine the effective memory bandwidth in reading large-scale application documents, and the effect of the complexity of the XML specification on performance. A thorough description and analysis of the effective memory bandwidth of the PIXIMAL approach is presented in another venue [9]. In this section, we present a summary of research findings on these two topics.
The memory bandwidth and state scalability tests were run on 1U nodes configured with two quad-core 2.33 GHz Intel Xeon E5345 CPUs. Each node has 8 gigabytes of RAM and runs a 64-bit distribution of Debian 4.0, using Linux kernel 2.6.18. The filesystem in use in the test directory is XFS.

As an N-way parallel parser would concurrently be reading using N different threads, we conducted tests to check whether the memory subsystem can provide substantial bandwidth when sequentially reading from a very large input. This test has two parameters: split_percent and thread count. The split_percent is particular to the PIXIMAL approach: it denotes the percent of input that is directed at the DFA thread. The number of threads defines the number of concurrent automata: 1 DFA and number_of_threads − 1 NFAs. The balance of the input, input_size × (1 − split_percent/100), is divided evenly among the NFA threads. In the case that number_of_threads = 1, split_percent is overridden to be 100% in order to ensure that the entire input is read.
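To make the partitioning concrete, the sketch below computes the byte ranges handed to the DFA thread and the NFA threads from split_percent and the thread count. It is a minimal illustration of the scheme described above, not code from PIXIMAL; the Partition struct and make_partitions name are our own.

    // Sketch: split the input between the initial DFA and the speculative
    // NFA threads. The DFA gets split_percent of the bytes; the balance,
    // input_size * (1 - split_percent/100), is divided evenly among NFAs.
    #include <cstddef>
    #include <vector>

    struct Partition {
        std::size_t offset;  // first byte of this automaton's substring
        std::size_t length;  // number of bytes it scans
    };

    std::vector<Partition> make_partitions(std::size_t input_size,
                                           int split_percent,
                                           int thread_count) {
        if (thread_count == 1) split_percent = 100;  // lone thread reads it all
        std::vector<Partition> parts;
        std::size_t dfa_len = input_size * split_percent / 100;
        parts.push_back({0, dfa_len});               // initial DFA
        std::size_t rest = input_size - dfa_len;
        std::size_t nfa_count = thread_count - 1;
        std::size_t offset = dfa_len;
        for (std::size_t i = 0; i < nfa_count; ++i) {
            std::size_t len = rest / nfa_count + (i < rest % nfa_count ? 1 : 0);
            parts.push_back({offset, len});          // one speculative NFA each
            offset += len;
        }
        return parts;
    }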
The results of these tests demonstrated that there was plenty of memory bandwidth to effectively read the input concurrently from as many as six cores of an eight-core machine.
The speculative threads in a parser built using NFAs will have substantially more work than the DFA thread. This test models an aspect of that extra workload, the number of states that the NFA must initially consider, to examine the effect of language complexity on the efficacy of this approach.

This test has one more parameter than the memory bandwidth test: the size (number of states) of the DFA. Here, the PIXIMAL DFA is modeled as a thread that has a state_number which is initialized to 0 and takes values between 0 and dfa_size − 1. The next state_number is calculated for each byte of input by looking up the current state_number and current byte in a two-dimensional array. The NFAs are modeled by threads that start with an array of dfa_size − 1 start values, each initialized to a number between 1 and dfa_size − 1. An NFA will never start in the state designated by 0, because that is a start state that is only valid before the DFA begins reading. The NFA recalculates each entry of the state array for each byte of input using the same rule as the DFA.
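The modeled workload can be summarized in a few lines of C++. The sketch below is our reading of the test description above (one table-driven DFA lookup per byte, versus an NFA that re-applies the same rule to every entry of its start-state array); the actual test harness and the contents of the transition table are not given in the paper.

    // Per-byte cost model: the DFA does one lookup per byte, the NFA does
    // (dfa_size - 1) lookups per byte, one for each start-state hypothesis.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    using State = std::uint8_t;

    // next[state][byte] -> next state; dfa_size rows, 256 columns.
    State dfa_scan(const std::vector<std::vector<State>>& next,
                   const std::uint8_t* input, std::size_t n) {
        State s = 0;                    // state 0 is only valid at the true start
        for (std::size_t i = 0; i < n; ++i)
            s = next[s][input[i]];      // one table lookup per byte
        return s;
    }

    // One entry per possible start state (1 .. dfa_size - 1), all advanced
    // by the same rule as the DFA on every byte.
    std::vector<State> nfa_scan(const std::vector<std::vector<State>>& next,
                                std::size_t dfa_size,
                                const std::uint8_t* input, std::size_t n) {
        std::vector<State> states;
        for (std::size_t k = 1; k < dfa_size; ++k)
            states.push_back(static_cast<State>(k));
        for (std::size_t i = 0; i < n; ++i)
            for (auto& s : states)
                s = next[s][input[i]];  // same rule, applied per path
        return states;
    }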
The results of this test showed that the number of states in the DFA is inversely proportional to maximal speedup. Further, the curve between these two variables has a very steep portion for DFA sizes between 6 and 8. DFAs with fewer states demonstrated similar performance, with a tight grouping around 4.5 times speedup with 8 threads on an 8-core machine. DFAs with more states had a similar grouping at a much lower speedup, around 2. The 6-state DFA performed between the two groupings, with speedup around 3.5 with 8 threads. The more complex the DFA, the more work it can do (i.e., it can recognize a language of greater complexity). It is desirable for the DFA to do as much work as possible, because in the table-driven implementation it has a very low per-byte processing cost. On the other hand, more states in the DFA leads to a greater number of paths through the NFA, which limits the benefit of this parallelization approach. This test aids in quantifying just how much extra work is done by the NFA and how that affects overall performance.
3. PARALLEL XML DATA PROCESSING
A deterministic finite automaton (DFA)-based lexical scanner is generally used to tokenize the input characters of the file (or string, as in the case of XML) into syntactic tokens that are used later in the parse phase. The DFA-based lexical scanner is sometimes hand-coded, and frequently generated by a tool such as flex. Every time the scanner recognizes a token, it must perform some action to store the token or pass it to a higher-level part of the parser. The various token types and keywords of XML, used in distributed applications, can be defined as regular expressions. A DFA-based scanner can be custom-designed to process the subset of the XML specification used in defining large-scale data files in applications. The DFA model for processing is efficient: each character in the input XML document is read only once, minimizing the overhead on a per-character basis.
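As a much-simplified illustration of such a custom scanner, the fragment below hard-codes two transitions using the symbolic state names listed later in figure 1. The transition logic shown reflects ordinary XML lexing conventions and is our own guess, not the actual PIXIMAL transition table, which would cover every state and byte value.

    // Illustrative fragment of a hand-built scanner DFA for an XML subset,
    // using the symbolic state names of figure 1. Only two transitions are
    // shown; all others are elided.
    enum DfaState {
        Initial, EnterTag, StartTag, AttrName, AttrEq, AttrValue,
        AttrInterstitial, Content, EndTag, EndTagRest, EndTagInterstitial,
        Error
    };

    DfaState step(DfaState s, char c) {
        switch (s) {
            case Content:
                return c == '<' ? EnterTag : Content;  // '<' leaves PCDATA
            case EnterTag:
                return c == '/' ? EndTag               // "</" closes an element
                     : (c == '>' ? Error : StartTag);  // else a start-tag name
            // ... remaining states elided in this sketch ...
            default:
                return s;
        }
    }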
The DFA approach does not directly lend itself to parallelism. It is required to start at the beginning of the input and process all the characters sequentially. As there is no way to determine which state the DFA will be in after processing a certain section of the input, it is not possible to simply split the input into two (or more) sections and process the different sections independently. For this reason, all the widely used XML parsers are limited to a serialized, indivisible scanner. This approach has thus far been acceptable for small files and desktop-style mass storage devices, because the scanner is fast for small input files. Additionally, this approach blends well with desktop mass storage access algorithms that work well reading from a single stream from disk.
3.1 Speculative NFA Execution in Piximal
Our parallelization tool, PIXIMAL, is designed for the data sets of applications running on cluster-class hardware, which are much more amenable to parallelization. In these target application cases, data sets defined in XML can be several hundreds of megabytes. Unlike the desktop case, in such applications mass storage is more likely to be arranged in higher-performance configurations (e.g., RAID, NAS, SAN) which can more efficiently feed multiple data streams to concurrent threads. Our parallelization approach can be readily applied to these cases.

The speculative execution approach of PIXIMAL is to divide the input XML document, P, into N substrings, P1, P2, ..., PN. The processing on substring P1 is carried out using the standard DFA-based lexical analyzer, as a DFA can only be run from the starting state using the first character of an input string. This DFA instance is termed the "initial DFA". The other processing units in a multi-core processor are utilized by concurrently executing N − 1 speculative scanners on the remaining substrings P2, P3, ..., PN. The processing is speculative as it is not possible to determine the start state of the DFA, L_DFA, except for P1. As a result, we have added a transformation module to the PIXIMAL framework that can be applied to create a scanner which can be applied to any of the substrings.
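A hypothetical driver for this scheme would hand P1 to the ordinary DFA and each remaining substring to a speculative scanner on its own thread. The entry points run_initial_dfa and run_speculative_nfa below are stand-ins of our own, not PIXIMAL's API; the Partition struct is the one sketched in section 2.1.

    #include <cstddef>
    #include <cstdint>
    #include <thread>
    #include <vector>

    struct Partition { std::size_t offset, length; };  // as sketched earlier

    // Stand-ins for the real scanner entry points (declared elsewhere).
    void run_initial_dfa(const std::uint8_t* data, std::size_t n);
    void run_speculative_nfa(const std::uint8_t* data, std::size_t n);

    void scan_in_parallel(const std::uint8_t* input,
                          const std::vector<Partition>& parts) {
        std::vector<std::thread> workers;
        // P1 goes to the ordinary DFA: it alone knows its true start state.
        workers.emplace_back(run_initial_dfa,
                             input + parts[0].offset, parts[0].length);
        // P2..PN each get a speculative NFA, ideally one per core.
        for (std::size_t i = 1; i < parts.size(); ++i)
            workers.emplace_back(run_speculative_nfa,
                                 input + parts[i].offset, parts[i].length);
        for (auto& t : workers) t.join();  // fix-up/merge pass runs afterwards
    }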
The DFA above is transformed into an NFA, L_NFA, containing precisely the same state nodes, transitions, and final states as L_DFA. One significant change is made: each state node, with the exception of the error state, is marked as a start state. The parser built around this NFA reads each character of input, traversing along all execution paths, one for each state Si. If a given transition triggers an action (such as triggering a StartElement SAX event in the user code), that action is stored into an action list A(Si) for that execution path, since it cannot be triggered immediately.

There is a single correct execution path: the path started in state Sk, the state that L_DFA would have been in had it parsed the input up to the beginning of this input substring. Sk will be known when the DFA or NFA running on the input behind it is complete and, if it is an NFA, knows its own correct execution path. Once Sk is known, the actions in action list A(Sk) can be triggered, after some minor fix-up to merge the parser state from the previous automaton with the first action in this automaton's action list. This is necessary because the NFA may have started in the middle of a token or, more problematically, in the middle of an XML tag, which contains several tokens: a tag name and zero or more attribute name/value pairs. This fix-up is minor and a function of the number of automata used, as opposed to the size of the input, so it can be viewed as an O(1) cost once the number of available computing cores is set.
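The action-list mechanism might look like the following sketch: each speculative path buffers the SAX events it would have fired, and only the list belonging to the true start state is replayed. The types here are illustrative; the paper does not specify PIXIMAL's internal data structures.

    // Sketch of deferred SAX actions. Each live execution path of the NFA
    // accumulates the events it *would* have fired; once the preceding
    // automaton reports the true start state Sk, exactly one list replays.
    #include <functional>
    #include <vector>

    using Action = std::function<void()>;        // e.g. a StartElement callback

    struct SpeculativePath {
        int start_state;                         // the Si this path assumed
        std::vector<Action> deferred;            // A(Si): actions held back
    };

    void replay_correct_path(std::vector<SpeculativePath>& paths,
                             int correct_start) {
        for (auto& p : paths) {
            if (p.start_state != correct_start) continue;  // wrong guess: drop
            for (auto& act : p.deferred) act();  // fire the stored SAX events
        }
        // A small fix-up merging a token split across the substring boundary
        // with the first replayed action is omitted here.
    }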
4. SERIAL NFA TESTS AND TESTING ENVIRONMENT
The tests presented here examine the fundamental hypothesis of this work: the extra work required by using an NFA is offset by dividing processing work across multiple threads. We run each component (automaton) of the PIXIMAL processor for a given configuration (split percent and thread count) independently on its element of the input partition, and examine the time each component takes to complete its processing sub-task. We run the test on several classes of homogeneously configured systems and average the results for equivalent cases. Equivalent cases here are those that are taken from the same class of computer systems running the same configuration and occur on the same subsequence of input. For each configuration, we calculate the maximum time over all automaton runs. The maximum time here represents the minimal time the complete parser would take to process the full input when running those automata concurrently on independent processors, minus the fix-up time, which is small. Each component performs all the work it must do in a multi-threaded PIXIMAL run, from reading input, to traversing the state table, to storing actions for each live execution path. However, the work is all done sequentially, in a single thread, to isolate each NFA in its own execution environment and obtain the best possible timing in the absence of other processes.

0: Initial State
1: Enter Tag State
2: Start Tag State
3: Attribute Name State
4: Attribute EQ State
5: Attribute Value State
6: Attribute Interstitial State
7: Content State
8: End Tag State
9: End Tag Rest State
10: End Tag Interstitial State
Figure 1: Symbolic names of each DFA state, for reference when examining figures 5 and 9.
We present these results as potential speedup, which is calculated using the usual formula for speedup, dividing the baseline time by the maximum component time found above (T_1 / T_N). We call these tests the serial NFA tests, as they measure the best potential speedup, using measurements taken from the serialized form of the parser.
In addition to “black box” performance tests, we examine the state usage for various inputs. Comparing state usage information is helpful in understanding why the performance varies for differently shaped input.
Some tests presented here use a collection of SOAP request documents, each of which encodes an array of a certain type and length, to demonstrate performance with respect to varying input size. The documents encode arrays of integers, strings, and “mesh interface objects” (MIOs, a complex type combining two integer values with a floating point value, often used in scientific computing). The array lengths range from 10 elements up to 50,000 elements. This allows us to examine documents ranging in size from a few hundred bytes to tens of megabytes. The integer and MIO arrays simply encode a variety of numbers. The string array encodes strings which are many times longer than the representations of the integers. This tests the hypothesis that if a document has substantially more PCDATA (character data between tags) than tags, then the NFAs’ states will quickly collapse upon detection of an open angle bracket (<), which invalidates a large number of potential execution paths.
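For reference, a generator for this style of test input might look as follows. The element names and document shape are invented for illustration; the paper does not give the exact markup of its SOAP-encoded arrays.

    // Sketch: emit an XML array of "mesh interface objects" (two ints plus
    // a float, per the description above). Tag names are hypothetical.
    #include <cstdio>

    struct MeshInterfaceObject { int a, b; float w; };

    void write_mio_array(std::FILE* out, const MeshInterfaceObject* v, int n) {
        std::fprintf(out, "<mioArray>\n");
        for (int i = 0; i < n; ++i)
            std::fprintf(out, "  <mio><a>%d</a><b>%d</b><w>%g</w></mio>\n",
                         v[i].a, v[i].b, v[i].w);
        std::fprintf(out, "</mioArray>\n");
    }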
Another potential bottleneck of the PIXIMAL approach is the requirement that each NFA frequently allocate memory to store actions along all live execution paths. Unless it is specially written, malloc(3) may be a hidden synchronization point that protects access to the shared heap resource and thereby reduces concurrency. In addition to the serial tests described above, we tested PIXIMAL itself, with multiple NFAs running in concurrent threads, against both the default GNU libc 2.7 malloc implementation and Google’s thread-caching malloc implementation, to quantify the memory bottleneck in multi-core systems.
4.1 Experimental Environment
The serial NFA tests were run on a variety of system architectures, from older SMP machines to newer multi-core systems. Because the tests are serial and do not take advantage of any hardware concurrency, once the results were normalized by calculating the potential speedup, there was little detectable difference. Therefore, the data for the test results presented were collected by running the test on ten nodes of a cluster of machines with dual quad-core Intel Xeon E5345 chips clocked at 2.33 GHz, running Debian etch with Linux kernel 2.6.18. The input is read from local disk, though it is expected (by pre-reading the input file before each test) to be in the system cache, to eliminate I/O disturbances.

Figure 2: Potential scalability for XML input encoding an array of 10,000 integer values. The number of threads available is the independent variable here. A slight speedup is possible by adding more threads for this class of input.

The malloc tests were run on a separate machine with a single quad-core Intel Xeon E5320 clocked at 1.86 GHz, running Ubuntu 8.04 LTS with Linux kernel 2.6.24. The input for this test is also pre-read to avoid noise from the I/O subsystem.

Performance analysis was performed and plots were created using R [16].
5. SERIAL NFA RESULTS
Figure 1 presents the symbolic names used in figures 5 and 9. The DFA we have built has eleven functional states and one (unlabeled) error state.
Figures 2, 3, and 5 present results for a representative input case: a SOAP-encoded array of 10,000 integers. Figure 2 presents potential speedup (the time it takes for a DFA to parse the input divided by the maximal time of each NFA component to parse its subsequence of the input) on this file. The “Max Speedup” line represents the potential speedup from the best possible selection of split percent, of those given in the range of test values, for each thread count. Similarly, the “Min Speedup” line represents the speedup associated with the worst possible selection of split percent for each thread count. Of interest here is that there is a potential speedup available in all cases. The best potential speedup achieved on this input over the range of split percents and thread counts tested was 2.04 times the DFA baseline, splitting 34% of the input for the initial DFA and dividing the rest of the input evenly between the remaining 7 NFAs. Using four threads, there is a maximal speedup of 1.59 times the baseline, with 60% of the input being processed by the initial DFA and the 3 NFAs each processing approximately 13%. It is particularly important to note that many split percent selections will lead to a slowdown: input splitting greatly affects the performance. Figure 3 presents the same data as figure 2 along a different axis, tracking split percent rather than thread count. Here the shape of the data is quite different. Some of the same high points are present here: the split percent of 34% is naturally still the global maximum, and the speedup at 60% is high here, too. Not all split percents have an associated thread count that provides any speedup. This again indicates that the partitioning of the input is critical to achieve performance gains with this approach.

Figure 3: Similar to figure 2, this graph presents the effect on speedup of varying the split percent parameter when processing an encoded array of 10,000 integers. Maximal and minimal speedups for each selection of split percent are shown. The range of values comes from varying the number of threads.

Figure 4: The “exploded” view of the data presented in figures 2 and 3 (arrays of 10,000 integers). All points in the parameter space are presented to give a better view of the space.

Figure 5 depicts a histogram of the states used when parsing the encoded array of 10,000 integers. This gives some indication of why potential speedup caps out around 2.0 for this input. Most characters in this input are in PCDATA (content) sections, DFA state 7, which can be discerned by using figure 1. However, there is a significant number of characters which trigger state 1, the enter tag state. These represent open angle brackets in the input, and each one leads to an action (either a Start Element or an End Element SAX event). NFAs must store each one of these actions, so even in the best case, there is a linear relation between the amount of work the NFA must do and the number of times the DFA enters state 1.
Figure 5: Histogram of DFA state usage when parsing the XML-encoded array of 10,000 integers. State 7 represents characters in PCDATA sections (text between tags). Other state names are described in figure 1.
Figure 6 presents the potential speedup achievable on an input SOAP encoding of 10,000 strings in XML as a function of the number of threads scheduled. Compared with figure 2, the results reach a much higher global maximum and have a much greater rate of increase. The maximal performance is achieved when processing 26% of the input with the initial DFA and dividing the remainder of the input evenly across 7 NFAs. The potential speedup over the baseline mean of DFA runs on the entire input is 3.17 times. It is also noteworthy that even the mean speedup is greater than 1 in many cases here.

Similarly to figure 3, figure 7 displays the potential speedup when reading an array of 10,000 strings as a function of a predetermined split percent. The results are much smoother for strings than for integers. Naturally, the high point here is the same as in figure 6, 26% with 8 threads, with a clear trend of results sloping up from both sides. This strongly indicates that 26% is nearly the optimal split percent. Further, this indicates that on this input the NFA is doing roughly 26 / (74/7) ≈ 2.5 times as much work as the DFA when the work is divided well.

Figure 9 indicates why the performance is so much more regular. The distribution of node usage is, by design, substantially different from the integer case. Nearly all characters of input are in content sections. Further, the actual file is frequently punctuated by tags. The input has long content sections and short elements, because it represents an array of lengthy strings. This means that the NFA will, with greater probability, start on a character in a content section and will quickly be able to eliminate most of the incorrect execution paths when the open angle bracket character is read, which will happen within a short amount of input. Thus, it is easy to “luck into” a good division of work due to the structure of the document. In the integer case, where content sections are shorter, there is a greater probability that the NFA will be started at some point in a tag where it is not possible to determine, for example, whether the correct execution path started in a content state or a tag state, because the close angle bracket character may legally appear in content sections. Thus, it does not lead to a contradiction in the way that encountering an open angle bracket does. Performance is similar for the MIO array input because its XML representation more closely matches the representation of integer arrays.
Citations
Journal ArticleDOI
TL;DR: A unified view of the research efforts aimed at SOAP performance enhancement is provided, covering almost every phase of SOAP processing, ranging over message parsing, serialization, deserialization, compression, multicasting, security evaluation, and data/instruction-level processing.
Abstract: The web services (WS) technology provides a comprehensive solution for representing, discovering, and invoking services in a wide variety of environments, including Service Oriented Architectures (SOA ) and grid computing systems. At the core of WS technology lie a number of XML-based standards, such as the Simple Object Access Protocol (SOAP), that have successfully ensured WS extensibility, transparency, and interoperability. Nonetheless, there is an increasing demand to enhance WS performance, which is severely impaired by XML's verbosity. SOAP communications produce considerable network traffic, making them unfit for distributed, loosely coupled, and heterogeneous computing environments such as the open Internet. Also, they introduce higher latency and processing delays than other technologies, like Java RMI and CORBA. WS research has recently focused on SOAP performance enhancement. Many approaches build on the observation that SOAP message exchange usually involves highly similar messages (those created by the same implementation usually have the same structure, and those sent from a server to multiple clients tend to show similarities in structure and content). Similarity evaluation and differential encoding have thus emerged as SOAP performance enhancement techniques. The main idea is to identify the common parts of SOAP messages, to be processed only once, avoiding a large amount of overhead. Other approaches investigate nontraditional processor architectures, including micro- and macrolevel parallel processing solutions, so as to further increase the processing rates of SOAP/XML software toolkits. This survey paper provides a concise, yet comprehensive review of the research efforts aimed at SOAP performance enhancement. A unified view of the problem is provided, covering almost every phase of SOAP processing, ranging over message parsing, serialization, deserialization, compression, multicasting, security evaluation, and data/instruction-level processing.

61 citations


Cites background from "Performance enhancement with specul..."

  • ...Experimental results on Piximal’s macrolevel parallelization technique show that securing additional resources for each thread by distributing the workload to a cluster of machines using MapReduce can increase performance [23], [30], [31]....

  • ...1) memory bandwidth, which could become a bottleneck [31], and 2) the amount of computation required to parse the...

  • ...[23], [30], [31] have addressed cluster computing, a....

  • ...Hence, the authors in [23], [30], [31] also address macrolevel parallelism....

  • ...[30], [31] presents a parallelized SAX parsing solution,...

Journal ArticleDOI
TL;DR: This survey paper provides a concise and comprehensive review of the methods related to XML-based semi-structured semantic analysis and disambiguation, and describes current and potential application scenarios that can benefit from XML semantic analysis.
Abstract: Since the last two decades, XML has gained momentum as the standard for web information management and complex data representation. Also, collaboratively built semi-structured information resources, such as Wikipedia, have become prevalent on the Web and can be inherently encoded in XML. Yet most methods for processing XML and semi-structured information handle mainly the syntactic properties of the data, while ignoring the semantics involved. To devise more intelligent applications, one needs to augment syntactic features with machine-readable semantic meaning. This can be achieved through the computational identification of the meaning of data in context, also known as (a.k.a.) automated semantic analysis and disambiguation, which is nowadays one of the main challenges at the core of the Semantic Web. This survey paper provides a concise and comprehensive review of the methods related to XML-based semi-structured semantic analysis and disambiguation. It is made of four logical parts. First, we briefly cover traditional word sense disambiguation methods for processing flat textual data. Second, we describe and categorize disambiguation techniques developed and extended to handle semi-structured and XML data. Third, we describe current and potential application scenarios that can benefit from XML semantic analysis, including: data clustering and semantic-aware indexing, data integration and selective dissemination, semantic-aware and temporal querying, web and mobile services matching and composition, blog and social semantic network analysis, and ontology learning. Fourth, we describe and discuss ongoing challenges and future directions, including: the quantification of semantic ambiguity, expanding XML disambiguation context, combining structure and content, using collaborative/social information sources, integrating explicit and implicit semantic analysis, emphasizing user involvement, and reducing computational complexity.

41 citations

Proceedings ArticleDOI
11 Dec 2009
TL;DR: This work has adapted the Hadoop implementation to determine the threshold data sizes and computation work required per node, for a distributed solution to be effective and presents both a parallel and distributed approach to analyze how the scalability and performance requirements of large-scale XML-based data processing can be achieved.
Abstract: An emerging trend is the use of XML as the data format for many distributed scientific applications, with the size of these documents ranging from tens of megabytes to hundreds of megabytes. Our earlier benchmarking results revealed that most of the widely available XML processing toolkits do not scale well for large sized XML data. A significant transformation is necessary in the design of XML processing for scientific applications so that the overall application turn-around time is not negatively affected. We present both a parallel and distributed approach to analyze how the scalability and performance requirements of large-scale XML-based data processing can be achieved. We have adapted the Hadoop implementation to determine the threshold data sizes and computation work required per node, for a distributed solution to be effective. We also present an analysis of parallelism using our Piximal toolkit for processing large-scale XML datasets that utilizes the capabilities for parallelism that are available in the emerging multi-core architectures. Multi-core processors are expected to be widely available in research clusters and scientific desktops, and it is critical to harness the opportunities for parallelism in the middleware, instead of passing on the task to application programmers. Our parallelization approach for a multi-core node is to employ a DFA-based parser that recognizes a useful subset of the XML specification, and convert the DFA into an NFA that can be applied to an arbitrary subset of the input. Speculative NFAs are scheduled on available cores in a node to effectively utilize the processing capabilities and achieve overall performance gains. We evaluate the efficacy of this approach in terms of potential speedup that can be achieved for representative XML data sets.

22 citations


Cites background from "Performance enhancement with specul..."

  • ...Our past work demonstrates that the level of speedup obtainable using micro-parallelization techniques is limited: other system resources, such as memory bandwidth become bottlenecks [ 21 ]....


Journal ArticleDOI
TL;DR: This paper proposes general parallelism techniques for holistic twig join algorithms to process queries against Extensible Markup Language (XML) databases on a multi‐core system.
Abstract: Purpose – The purpose of this paper is to propose general parallelism techniques for holistic twig join algorithms to process queries against Extensible Markup Language (XML) databases on a multi‐core system.Design/methodology/approach – The parallelism techniques comprised data and task parallelism. As for data parallelism, the paper adopted the stream‐based partitioning for XML to partition XML data as the basis of parallelism on multiple CPU cores. The XML data partitioning was performed in two levels. The first level was to create buckets for creating data independence and balancing loads among CPU cores; each bucket was assigned onto a CPU core. Within each bucket, the second level of XML data partitioning was performed to create finer partitions for providing finer parallelism. Each CPU core performed the holistic twig join algorithm on each finer partition of its own in parallel with other CPU cores. In task parallelism, the holistic twig join algorithm was decomposed into two main tasks, which wer...

10 citations


Cites background from "Performance enhancement with specul..."

  • ...Most of the works allocated their partitions statically on different CPU cores (Head and Govindaraju, 2009; Kim and Yoo, 2009; Li et al., 2009; Lu et al., 2006; Pan et al., 2007a, b; Shah et al., 2009), while Lu and Gannon (2007) applied dynamic allocation by work stealing....


Patent
18 May 2011
TL;DR: In this paper, a huge XML document is converted into SDML format, which can be processed with a high degree of parallelism to achieve high performance; SDML can also be used as a standalone protocol for data representation.
Abstract: The present invention relates to the field of high performance computation. Particularly, the invention relates to converting a huge XML document into SDML format, which can be processed with a high degree of parallelism to achieve high performance. In addition, SDML can be used as a standalone protocol for data representation. SDML deals with one-time write and many-times read. Further, SDML files can be split on line boundaries, which makes it easier to distribute them among multiple cores and even across servers.

3 citations

References
Book
01 Jan 1972
TL;DR: It is the hope that the algorithms and concepts presented in this book will survive the next generation of computers and programming languages, and that at least some of them will be applicable to fields other than compiler writing.
Abstract: From volume 1 Preface (See Front Matter for full Preface) This book is intended for a one or two semester course in compiling theory at the senior or graduate level. It is a theoretically oriented treatment of a practical subject. Our motivation for making it so is threefold. (1) In an area as rapidly changing as Computer Science, sound pedagogy demands that courses emphasize ideas, rather than implementation details. It is our hope that the algorithms and concepts presented in this book will survive the next generation of computers and programming languages, and that at least some of them will be applicable to fields other than compiler writing. (2) Compiler writing has progressed to the point where many portions of a compiler can be isolated and subjected to design optimization. It is important that appropriate mathematical tools be available to the person attempting this optimization. (3) Some of the most useful and most efficient compiler algorithms, e.g. LR(k) parsing, require a good deal of mathematical background for full understanding. We expect, therefore, that a good theoretical background will become essential for the compiler designer. While we have not omitted difficult theorems that are relevant to compiling, we have tried to make the book as readable as possible. Numerous examples are given, each based on a small grammar, rather than on the large grammars encountered in practice. It is hoped that these examples are sufficient to illustrate the basic ideas, even in cases where the theoretical developments are difficult to follow in isolation. From volume 2 Preface (See Front Matter for full Preface) Compiler design is one of the first major areas of systems programming for which a strong theoretical foundation is becoming available. Volume I of The Theory of Parsing, Translation, and Compiling developed the relevant parts of mathematics and language theory for this foundation and developed the principal methods of fast syntactic analysis. Volume II is a continuation of Volume I, but except for Chapters 7 and 8 it is oriented towards the nonsyntactic aspects of compiler design. The treatment of the material in Volume II is much the same as in Volume I, although proofs have become a little more sketchy. We have tried to make the discussion as readable as possible by providing numerous examples, each illustrating one or two concepts. Since the text emphasizes concepts rather than language or machine details, a programming laboratory should accompany a course based on this book, so that a student can develop some facility in applying the concepts discussed to practical problems. The programming exercises appearing at the ends of sections can be used as recommended projects in such a laboratory. Part of the laboratory course should discuss the code to be generated for such programming language constructs as recursion, parameter passing, subroutine linkages, array references, loops, and so forth.

1,727 citations


"Performance enhancement with specul..." refers methods in this paper

  • ...The MetaData Catalog Service (MCS) [17] provides access via a Web service interface to store and retrieve de­scriptive information (metadata) on millions of data items....

    [...]

Proceedings ArticleDOI
24 Jul 2002
TL;DR: A high-performance SOAP implementation and a schema-specific parser based on the results of this investigation are presented and a multiprotocol approach that uses SOAP to negotiate faster binary protocols between messaging participants is recommended.
Abstract: The growing synergy between Web Services and Grid-based technologies will potentially enable profound, dynamic interactions between scientific applications dispersed in geographic, institutional, and conceptual space. Such deep interoperability requires the simplicity, robustness, and extensibility for which SOAP was conceived, thus making it a natural lingua franca. Concomitant with these advantages, however is a degree of inefficiency that may limit the applicability of SOAP to some situations. We investigate the limitations of SOAP for high-performance scientific computing. We analyze the processing of SOAP messages, and identify the issues of each stage. We present a high-performance SOAP implementation and a schema-specific parser based on the results of our investigation. After our SOAP optimizations are implemented, the most significant bottleneck is ASCII/double conversion. Instead of handling this using extensions to SOAP we recommend a multiprotocol approach that uses SOAP to negotiate faster binary protocols between messaging participants.

309 citations

Proceedings ArticleDOI
15 Nov 2003
TL;DR: The Metadata Catalog Service (MCS) as mentioned in this paper provides a mechanism for storing and accessing descriptive metadata and allows users to query for data items based on desired attributes, such as attributes.
Abstract: Advances in computational, storage and network technologies as well as middle ware such as the Globus Toolkit allow scientists to expand the sophistication and scope of data-intensive applications. These applications produce and analyze terabytes and petabytes of data that are distributed in millions of files or objects. To manage these large data sets efficiently, metadata or descriptive information about the data needs to be managed. There are various types of metadata, and it is likely that a range of metadata services will exist in Grid environments that are specialized for particular types of metadata cataloguing and discovery. In this paper, we present the design of a Metadata Catalog Service (MCS) that provides a mechanism for storing and accessing descriptive metadata and allows users to query for data items based on desired attributes. We describe our experience in using the MCS with several applications and present a scalability study of the service.

258 citations

Proceedings Article
01 Jan 2002
TL;DR: The design of a Metadata Catalog Service (MCS) is presented that provides a mechanism for storing and accessing descriptive metadata and allows users to query for data items based on desired attributes and a scalability study of the service is presented.
Abstract: Advances in computational, storage and network technologies as well as middle ware such as the Globus Toolkit allow scientists to expand the sophistication and scope of data-intensive applications. These applications produce and analyze terabytes and petabytes of data that are distributed in millions of files or objects. To manage these large data sets efficiently, metadata or descriptive information about the data needs to be managed. There are various types of metadata, and it is likely that a range of metadata services will exist in Grid environments that are specialized for particular types of metadata cataloguing and discovery. In this paper, we present the design of a Metadata Catalog Service (MCS) that provides a mechanism for storing and accessing descriptive metadata and allows users to query for data items based on desired attributes. We describe our experience in using the MCS with several applications and present a scalability study of the service.

177 citations


"Performance enhancement with specul..." refers methods in this paper

  • ...An emerging trend is the use of XML as the data format for many distributed/grid ap­plications, with the size of these documents ranging from tens of megabytes to hundreds of megabytes....

    [...]

Proceedings ArticleDOI
28 Sep 2006
TL;DR: The design and implementation of an initial preparsing phase to determine the structure of the XML document, followed by a full, parallel parse, which shows that the approach applies to real-world, production quality parsers.
Abstract: A language for semi-structured documents, XML has emerged as the core of the Web services architecture, and is playing crucial roles in messaging systems, databases, and document processing However, the processing of XML documents has a reputation for poor performance, and a number of optimizations have been developed to address this performance problem from different perspectives, none of which have been entirely satisfactory In this paper, we present a seemingly quixotic, but novel approach: parallel XML parsing Parallel XML parsing leverages the growing prevalence of multicore architectures in all sectors of the computer market, and yields significant performance improvements This paper presents our design and implementation of parallel XML parsing Our design consists of an initial preparsing phase to determine the structure of the XML document, followed by a full, parallel parse The results of the preparsing phase are used to help partition the XML document for data parallel processing Our parallel parsing phase is a modification of the libxml2 in Veillard, D (2004) XML parser, which shows that our approach applies to real-world, production quality parsers Our empirical study shows our parallel XML parsing algorithm can improved the XML parsing performance significantly and scales well

164 citations

Frequently Asked Questions (2)
Q1. What have the authors contributed in "Performance enhancement with speculative execution based parallelism for processing large-scale xml-based application data" ?

The authors present the design and implementation of a toolkit for processing large-scale XML datasets that utilizes the capabilities for parallelism that are available in the emerging multi-core architectures. The authors discuss XML processing using PiXiMaL, a parallel processing library for large-scale XML datasets. The authors evaluate the efficacy of this approach in terms of potential speedup that can be achieved for representative XML datasets. 

In future work the authors plan to explore pre-fetching and pipelined implementation techniques that can enhance the performance of PIXIMAL. The authors will further study the scalability of PIXIMAL as processors with multiple cores (greater than 8) become available for research and testing purposes on grid infrastructures. The authors will study the effect of operating system-level caching on the processing of large documents that may be read more than once. The authors will develop algorithms for optimal layouts of DFA tables in memory to efficiently process frequently occurring transitions.