
Showing papers on "Simple API for XML" published in 2010


Journal ArticleDOI
TL;DR: jmzML, a Java API for the Proteomics Standards Initiative mzML data standard, can handle arbitrarily large files in minimal memory, allowing easy and efficient processing of mzML files using the Java programming language.
Abstract: We here present jmzML, a Java API for the Proteomics Standards Initiative mzML data standard. Based on the Java Architecture for XML Binding (JAXB) and an XPath-based, random-access XML indexer, jmzML can handle arbitrarily large files in minimal memory, allowing easy and efficient processing of mzML files using the Java programming language. jmzML also automatically resolves internal XML references on-the-fly. The library (which includes a viewer) can be downloaded from http://jmzml.googlecode.com.
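As a language-neutral illustration of the core idea (processing an arbitrarily large XML file in minimal memory by reacting to parse events instead of building a tree), the following Python sketch uses the standard `xml.sax` module; it is not jmzML itself, and the `<spectrum>` element name is just a stand-in for an mzML record:

```python
import xml.sax

class SpectrumCounter(xml.sax.ContentHandler):
    """Counts <spectrum> elements without ever building a tree in memory."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        # The parser calls us once per start tag; memory use stays constant.
        if name == "spectrum":
            self.count += 1

def count_spectra(xml_text):
    handler = SpectrumCounter()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.count

doc = "<mzML><run><spectrum/><spectrum/><spectrum/></run></mzML>"
print(count_spectra(doc))
```

The same handler would work unchanged on a multi-gigabyte file fed through `xml.sax.parse`, which is the property the jmzML authors are after.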

51 citations


Journal ArticleDOI
TL;DR: The proposed model is verified in a couple of scenarios for distributed manufacturing planning that involve feature mapping from a CAD file, process selection for several part designs integrated with scheduling, and simulation of the FMS model using alternative routings.
Abstract: An efficient model for communications between CAD, CAPP, and CAM applications in a distributed manufacturing planning environment has been seen as a key ingredient for CIM. Integration of the design model with process and scheduling information in real time is necessary in order to increase product quality, reduce cost, and shorten the product manufacturing cycle. This paper describes an approach to integrate key product realization activities using a neutral data representation. The representation is based on established standards for product data exchange and serves as a prototype implementation of these standards. The product and process models are based on an object-oriented representation of geometry, features, and resulting manufacturing processes. Relationships between objects are explicitly represented in the model (for example, feature precedence relations, process sequences, etc.). The product model uses an XML-based representation of the product data required for process planning, and the process model likewise uses an XML representation of the data required for scheduling and FMS control. The procedures for writing and parsing XML representations have been developed in an object-oriented fashion, such that each object from the object-oriented model is responsible for storing its own data in XML format. A similar approach is adopted for reading and parsing the XML model. Parsing is performed by a stack of XML handlers, each corresponding to a particular object in the hierarchical XML model. This allows for a very flexible representation, in which only a portion of the model (for example, only feature data, or only the part of a process plan for a single machine) may be stored and successfully parsed into another application. This is very useful for distributed applications, in which data are passed in the form of XML streams to allow real-time on-line communication.
The feasibility of the proposed model is verified in a couple of scenarios for distributed manufacturing planning that involve feature mapping from a CAD file, process selection for several part designs integrated with scheduling, and simulation of the FMS model using alternative routings.
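The "stack of XML handlers" idea can be sketched in a few lines of Python with `xml.sax`: a single handler maintains a stack of open elements, and each element on the stack collects its own data. The element names (`feature`, `machine`) are hypothetical placeholders, not taken from the paper:

```python
import xml.sax

class StackHandler(xml.sax.ContentHandler):
    """Keeps a stack of open elements; each element accumulates its own
    text, mirroring the per-object handler stack described in the paper."""
    def __init__(self):
        super().__init__()
        self.stack = []
        self.objects = {}  # element name -> collected text

    def startElement(self, name, attrs):
        self.stack.append(name)

    def characters(self, content):
        if self.stack:
            top = self.stack[-1]
            self.objects[top] = self.objects.get(top, "") + content

    def endElement(self, name):
        self.stack.pop()

def parse_fragment(xml_text):
    handler = StackHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.objects

plan = "<process><feature>hole</feature><machine>M1</machine></process>"
print(parse_fragment(plan))
```

Because each element is handled independently of its siblings, a fragment containing only `<feature>` data parses just as well as the full model, which is the flexibility the authors highlight.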

34 citations


Proceedings ArticleDOI
21 Feb 2010
TL;DR: The design of the first complete field programmable gate array (FPGA) accelerator capable of XML well-formed checking, schema validation, and tree construction at a throughput of 1 cycle per byte (CPB) is detailed.
Abstract: Extensible Markup Language (XML) is playing an increasingly important role in web services and database systems. However, the task of XML parsing is often the bottleneck, and as a result, the target of acceleration using custom hardware or multicore CPUs. In this paper, we detail the design of the first complete field programmable gate array (FPGA) accelerator capable of XML well-formed checking, schema validation, and tree construction at a throughput of 1 cycle per byte (CPB). This is a significant advancement from 40 CPB, the best previously reported commercial result. We demonstrate our design on a Xilinx Virtex-5 board, which successfully saturates a 1 Gbps Ethernet link.

30 citations


Proceedings ArticleDOI
22 Mar 2010
TL;DR: This paper proposes a novel, end-to-end parallelization framework that determines the optimal way of parallelizing an XML query, based on a statistics-based approach that relies both on the query specifics and the data statistics.
Abstract: The wide availability of commodity multi-core systems presents an opportunity to address the latency issues that have plagued XML query processing. However, simply executing multiple XML queries over multiple cores merely addresses the throughput issue: intra-query parallelization is needed to exploit multiple processing cores for better latency. Toward this effort, this paper investigates the parallelization of individual XPath queries over shared-address-space multi-core processors. Much previous work on parallelizing XPath in a distributed setting failed to exploit the shared-memory parallelism of multi-core systems. We propose a novel, end-to-end parallelization framework that determines the optimal way of parallelizing an XML query. This decision is based on a statistics-based approach that relies both on the query specifics and the data statistics. At each stage of the parallelization process, we evaluate three alternative approaches, namely, data-, query-, and hybrid-partitioning. For a given XPath query, our parallelization algorithm uses XML statistics to estimate the relative efficiencies of these different alternatives and find an optimal parallel XPath processing plan. Our experiments using well-known XML documents validate our parallel cost model and optimization framework, and demonstrate that it is possible to accelerate XPath processing using commodity multi-core systems.

27 citations


Journal ArticleDOI
TL;DR: This work presents the design philosophy, implementation, and various applications of an XML-based genetic programming (GP) framework, which contributes to the achievement of fast prototyping of GP by using the standard built-in API of DOM parsers for manipulating the genetic programs.
Abstract: We present the design philosophy, implementation, and various applications of an XML-based genetic programming (GP) framework (XGP). The key feature of XGP is the distinct representation of genetic programs as DOM parsing trees featuring corresponding flat XML text. XGP contributes to the achievement of: (i) fast prototyping of GP by using the standard built-in API of DOM parsers for manipulating the genetic programs; (ii) human readability and modifiability of the genetic representations; (iii) generic support for representing the grammar of a strongly typed GP using W3C-standardized XML Schema; and (iv) inherent inter-machine migratability of the text-based genetic representation (i.e., the XML text) in distributed implementations of GP.

23 citations


Patent
29 Nov 2010
TL;DR: In this article, the authors propose an architecture that extends conventional computer programming languages that compile into an instance of an extensible markup language (XML) document object model (DOM) to provide support for XML literals in the underlying programming language.
Abstract: An architecture that extends conventional computer programming languages that compile into an instance of an extensible markup language (XML) document object model (DOM) to provide support for XML literals in the underlying programming language. This architecture facilitates a convenient shortcut by replacing the complex explicit construction required by conventional systems to create an instance of a DOM with a concise XML literal, which conventional compilers can translate into the appropriate code. The architecture allows these XML literals to be embedded with expressions, statement blocks, or namespaces to further enrich their power and versatility. In accordance therewith, context information describing the position and data types that an XML DOM can accept can be provided to the programmer via, for example, an integrated development environment. Additionally, the architecture supports escaping XML identifiers, a reification mechanism, and a conversion mechanism to convert between collections and singletons.
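The contrast the patent draws can be approximated in Python (which has no XML literals; the parsed string below stands in for what a literal-supporting compiler would accept directly). Both paths below produce the same DOM-like tree:

```python
import xml.etree.ElementTree as ET

# Explicit construction: the verbose style the patent aims to replace.
order = ET.Element("order")
item = ET.SubElement(order, "item")
item.set("sku", "42")
item.text = "widget"

# Literal-style construction: embed the markup directly and parse it,
# approximating what a compiler with XML-literal support would generate.
order2 = ET.fromstring('<order><item sku="42">widget</item></order>')

print(ET.tostring(order) == ET.tostring(order2))
```

In a language with true XML literals (e.g. VB.NET), the second form is checked and translated at compile time rather than parsed at run time, which is the substance of the claimed architecture.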

21 citations


Proceedings ArticleDOI
21 Sep 2010
TL;DR: An implementation of a three-way XML merge algorithm that is faster, uses less memory and is more precise than existing tools is presented and a graphical interface for visualizing and resolving conflicts is provided.
Abstract: XML has become the standard document representation for many popular tools in various domains. When multiple authors collaborate to produce a document, they must be able to work in parallel and periodically merge their efforts into a single work. While there exist a small number of three-way XML merging tools, their performance could be improved in several areas and they lack any form of user interface for resolving conflicts. In this paper, we present an implementation of a three-way XML merge algorithm that is faster, uses less memory, and is more precise than existing tools. It uses a specialized versioning tree data structure that supports node identity and change detection. The algorithm applies the traditional three-way merge found in GNU diff3 to the children of changed nodes. The editing operations it supports are addition, deletion, update, and move. A graphical interface for visualizing and resolving conflicts is also provided. An evaluation experiment was conducted comparing the proposed algorithm with three other tools on randomly generated XML data.

20 citations


Patent
20 Jan 2010
TL;DR: In this paper, a method and apparatus is provided for efficiently searching and navigating XML data stored in a relational database; the approach includes identifying a reference address within an XML tree index entry and storing the address in an xmltable index.
Abstract: A method and apparatus is provided for efficiently searching and navigating XML data stored in a relational database. When storing a collection of XML documents, certain scalar elements may be shredded and stored in a relational table, whereas unstructured data may be stored as a CLOB or BLOB column. The approach includes identifying a reference address within an XML tree index entry and storing the address in an xmltable index. The tree index entry allows for navigation in all axes. A path-based expression may be evaluated in the context of the reference address of the LOB. The result of the evaluation identifies another XML tree index entry containing a LOB locator used to retrieve the content from the document. The tree index, node index, and secondary function indexes are used together to enhance the performance of querying the XML data.

14 citations


Patent
01 Jun 2010
TL;DR: In this paper, the column values that are to be stored for shredded XML documents are separately analyzed for an XML document to determine whether to store a particular column in column-major format or row-major format, and what compression technique to use.
Abstract: A database server exploits the power of compression and a form of storing relational data referred to as column-major format to store XML documents in shredded form. The column values that are to be stored for shredded XML documents are separately analyzed for an XML document to determine whether to store a particular column in column-major format or row-major format, and what compression technique to use, if any.

13 citations


Posted Content
Mustafa Atay, Yezhou Sun, Dapeng Liu, Shiyong Lu, Farshad Fotouhi
TL;DR: In this article, an efficient linear algorithm for mapping XML data to relational data is proposed, which can be easily adapted to other inlining algorithms and is based on our previous proposed inlining algorithm.
Abstract: XML has emerged as the standard for representing and exchanging data on the World Wide Web. It is critical to have efficient mechanisms to store and query XML data to exploit the full power of this new technology. Several researchers have proposed to use relational databases to store and query XML data. While several algorithms of schema mapping and query mapping have been proposed, the problem of mapping XML data to relational data, i.e., mapping an XML INSERT statement to a sequence of SQL INSERT statements, has not been addressed thoroughly in the literature. In this paper, we propose an efficient linear algorithm for mapping XML data to relational data. This algorithm is based on our previous proposed inlining algorithm for mapping DTDs to relational schemas and can be easily adapted to other inlining algorithms.
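A minimal sketch of data shredding — mapping an XML fragment to a sequence of SQL INSERTs — is shown below in Python with `sqlite3`. This uses a generic edge-table layout (one row per element with a parent pointer), not the authors' inlining algorithm, whose tables are derived from the DTD:

```python
import sqlite3
import xml.etree.ElementTree as ET

def shred_to_sql(xml_text, conn):
    """Map an XML fragment to SQL INSERTs: one row per element,
    carrying its tag, text, and parent id (a generic edge-table shred)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS node "
        "(id INTEGER, parent INTEGER, tag TEXT, text TEXT)")
    counter = [0]
    def walk(elem, parent_id):
        counter[0] += 1
        node_id = counter[0]
        conn.execute("INSERT INTO node VALUES (?, ?, ?, ?)",
                     (node_id, parent_id, elem.tag, (elem.text or "").strip()))
        for child in elem:
            walk(child, node_id)
    walk(ET.fromstring(xml_text), None)
    conn.commit()

conn = sqlite3.connect(":memory:")
shred_to_sql("<book><title>XML</title><year>2010</year></book>", conn)
print(conn.execute("SELECT tag, text FROM node ORDER BY id").fetchall())
```

The traversal is a single linear pass over the document, matching the linear-time property the paper claims for its data-mapping algorithm.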

13 citations


01 Jan 2010
TL;DR: A method for the semiautomatic transition from the design models of a Web application to a running implementation using the XML publishing framework Cocoon which provides a very flexible way to generate documents comprising XSLT and XSP processors.
Abstract: In this paper we present a method for the semiautomatic transition from the design models of a Web application to a running implementation. The design phase consists of constructing a set of UML models such as the conceptual model, the navigation model, and the presentation model. We use the UML extension mechanisms, i.e. stereotypes, tagged values, and OCL constraints, thereby defining a UML profile for the Web application domain. We show how these design models can automatically be mapped to XML documents with a structure conforming to their respective XML Schema definitions. We further demonstrate techniques by which XML documents for the conceptual model are automatically mapped to conceptual DOM (Document Object Model) objects. DOM objects corresponding to interactional objects are automatically derived from conceptual DOM objects and/or other interactional DOM objects. The XSLT mechanism serves to transform the logical presentation objects representing the user interface into physical presentation objects, e.g. HTML or WAP pages. Finally we present a production system architecture for Web applications using the XML publishing framework Cocoon, which provides a very flexible way to generate documents by combining XSLT and XSP (eXtensible Server Pages) processors.

Proceedings ArticleDOI
06 Jun 2010
TL;DR: This work believes that the key contribution of this system is an improved schema-based clustering storage strategy efficient for both XML querying and updating, and powered by a novel memory management technique.
Abstract: We present a native XML database management system, Sedna, which is implemented from scratch as a full-featured database management system for storing large amounts of XML data. We believe that the key contribution of this system is an improved schema-based clustering storage strategy efficient for both XML querying and updating, and powered by a novel memory management technique. We position our approach with respect to state-of-the-art methods.

Book ChapterDOI
01 Apr 2010
TL;DR: This work proposes object-level matching semantics called Interested Single Object (ISO) and Interested Related Object (IRO) to capture a single object and multiple objects as users' search targets, respectively, and designs a novel relevance-oriented ranking framework for the matching results.
Abstract: Keyword search is widely recognized as a convenient way to retrieve information from XML data. In order to precisely meet users' search concerns, we study how to effectively return the targets that users intend to search for. We model an XML document as a set of interconnected object-trees, where each object contains a subtree representing a concept in the real world. Based on this model, we propose object-level matching semantics called Interested Single Object (ISO) and Interested Related Object (IRO) to capture a single object and multiple objects as users' search targets, respectively, and design a novel relevance-oriented ranking framework for the matching results. We propose efficient algorithms to compute and rank the query results in one phase. Finally, comprehensive experiments show the efficiency and effectiveness of our approach, and an online demo of our system on DBLP data is available at http://xmldb.ddns.comp.nus.edu.sg.

Journal ArticleDOI
TL;DR: A new method of XML document clustering by a global criterion function, considering the weight of common structures, is proposed, which extracts representative structures of frequent patterns from schemaless XML documents using a sequential pattern mining algorithm.

Proceedings Article
01 Jan 2010
TL;DR: CluX uses a grammar for sharing similar substructures within the XML tree structure and a cluster-based heuristic for greedily selecting the best compression options in the grammar, which makes CluX a promising technique for XML data exchange whenever the exchanged data volume is a bottleneck in enterprise information systems.
Abstract: XML has become the de facto standard for data exchange in enterprise information systems. But whenever XML data is stored or processed, e.g. in the form of a DOM tree representation, the XML markup causes a huge blow-up of the memory consumption compared to the data, i.e., text and attribute values, contained in the XML document. In this paper, we present CluX, an XML compression approach based on clustering XML sub-trees. CluX uses a grammar for sharing similar substructures within the XML tree structure and a cluster-based heuristic for greedily selecting the best compression options in the grammar. Thereby, CluX allows for storing and exchanging XML data in a space-efficient and still queryable way. We evaluate different strategies for XML structure sharing, and we show that CluX often compresses better than XMill, Gzip, and Bzip2, which makes CluX a promising technique for XML data exchange whenever the exchanged data volume is a bottleneck in enterprise information systems.
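The structure-sharing idea behind grammar-based XML compression can be illustrated with a toy Python sketch: canonicalize each subtree bottom-up, give each distinct shape one grammar rule, and count the serialized bytes that repeated shapes would save. This is a drastically simplified stand-in for CluX, not its algorithm:

```python
import xml.etree.ElementTree as ET

def share_subtrees(root):
    """Map each structurally distinct subtree to one rule id and tally
    how many serialized bytes repeated subtrees would save when shared."""
    rules = {}   # canonical form -> rule id
    savings = 0
    def canon(elem):
        nonlocal savings
        # A subtree's canonical form: tag, sorted attributes, child rules.
        form = (elem.tag, tuple(sorted(elem.attrib.items())),
                tuple(canon(c) for c in elem))
        if form in rules:
            savings += len(ET.tostring(elem))  # repeat: share, don't store
        else:
            rules[form] = len(rules)
        return rules[form]
    canon(root)
    return len(rules), savings

doc = ET.fromstring("<a><b><c/></b><b><c/></b><b><c/></b></a>")
print(share_subtrees(doc))
```

Three identical `<b><c/></b>` subtrees collapse into a single rule; CluX's contribution is choosing such sharing opportunities with cluster-based heuristics over *similar* (not just identical) substructures.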

01 Jan 2010
TL;DR: XWT permits answering XPath queries more efficiently than using the uncompressed version of the documents, and is also competitive with inverted indexes over the XML document (if both structures use the same space).
Abstract: This paper presents a structure we call the XML Wavelet Tree (XWT) to represent any XML document in a compressed and self-indexed form. Any query or procedure that could be performed over the original document can therefore be performed more efficiently over the XWT representation, because it is shorter and has some indexing properties. In fact, XWT permits answering XPath queries more efficiently than using the uncompressed version of the documents. XWT is also competitive with inverted indexes over the XML document (if both structures use the same space).

Proceedings ArticleDOI
01 Mar 2010
TL;DR: A weighted similarity measurement approach for detecting the similarity between homogeneous XML documents is suggested, and a new clustering model is proposed that is implemented using open-source Java technology and validated experimentally.
Abstract: XML (eXtensible Markup Language) has been adopted by a number of software vendors today; it has become the standard for data interchange over the web and is also platform- and application-independent. An XML document consists of a number of attributes such as document data, structure, and style sheet. Clustering is a method of creating groups of similar objects. In this paper a weighted similarity measurement approach for detecting the similarity between homogeneous XML documents is suggested. Using this similarity measurement, a new clustering technique is also proposed. Methods for calculating the similarity of a document's structure and styling have been given by a number of researchers, mostly based on tree edit distances. For calculating the distance between documents' contents there are a number of text-similarity techniques such as cosine, Jaccard, and tf-idf. In this paper both kinds of similarity techniques are combined to propose a new distance measurement technique for calculating the distance between a pair of homogeneous XML documents. The proposed clustering model is implemented using open-source Java technology and is validated experimentally. Given a collection of XML documents, distances between documents are calculated and stored in Java collections, and these distances are then used to cluster the XML documents.
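The general shape of such a combined measure — a weighted blend of a structural similarity and a content similarity — can be sketched as follows. This Python version uses tag-frequency cosine for structure and term cosine for content, which is simpler than the tree-edit-distance measures the paper builds on; the weight `w_struct` and the example documents are illustrative choices:

```python
import math
import re
import xml.etree.ElementTree as ET

def vec_cosine(a, b):
    """Cosine similarity of two sparse frequency vectors (dicts)."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def tag_vector(root):
    v = {}
    for e in root.iter():
        v[e.tag] = v.get(e.tag, 0) + 1
    return v

def text_vector(root):
    v = {}
    text = " ".join(e.text or "" for e in root.iter()).lower()
    for w in re.findall(r"\w+", text):
        v[w] = v.get(w, 0) + 1
    return v

def weighted_similarity(x1, x2, w_struct=0.5):
    """Weighted blend of structural and content similarity."""
    d1, d2 = ET.fromstring(x1), ET.fromstring(x2)
    s = vec_cosine(tag_vector(d1), tag_vector(d2))
    c = vec_cosine(text_vector(d1), text_vector(d2))
    return w_struct * s + (1 - w_struct) * c

a = "<doc><t>xml data</t></doc>"
print(round(weighted_similarity(a, a), 3))
```

A distance for clustering is then simply `1 - weighted_similarity(x1, x2)`, which is the quantity the paper stores and feeds to its clustering step.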

Proceedings ArticleDOI
21 May 2010
TL;DR: A processing model for storing and building XML documents in data transfer between XML and a relational database, in which an XML document is parsed and its elements are stored in a single database table, thus shifting the workload of DOM building into memory via an algorithm called Tree-Branch inter-growth.
Abstract: The processing of XML documents has been regarded as the performance bottleneck in most systems and applications. A number of techniques have been developed to improve the performance of XML processing, ranging from schema-specific models to streaming-based models to hardware acceleration. These methods only address parsing and scheduling the XML document in memory [1]. Although a few works have discussed the efficiency of data read-write between XML and relational databases, they constructed the DOM and read the relational database synchronously, neglecting the difference in pace between DOM (a general in-memory format for XML documents) building and relational database reading [2], which reduces the performance of the entire system. In this paper, we present a processing model for storing and building XML documents in data transfer between XML and a relational database. In this model, an XML document is parsed and its elements are stored in a single database table; it is not necessary to read the nodes according to their hierarchical structure, thus shifting the workload of DOM building into memory via an algorithm called Tree-Branch inter-growth. To show the feasibility and effectiveness of our approach, we present our C# implementation of XML processing in this paper. Our empirical study shows that our algorithm can improve XML document processing performance significantly.

Proceedings ArticleDOI
15 Dec 2010
TL;DR: This work focuses on the vertical fragmentation design of XML documents, and two fragmentation models are proposed: query-based fragmentation and structure- and size-based fragmentation.
Abstract: As XML documents are distributed across the web, they can be considered a distributed repository of XML documents and are subject to distribution design. However, there is no adequate work on XML document distribution design. To address the shortcomings in XML document fragmentation design, in this work we focus on the vertical fragmentation design of XML documents. Two fragmentation models are proposed: query-based fragmentation and structure- and size-based fragmentation. For the query-based fragmentation model, vertical fragmentation techniques are proposed using the bond energy algorithm and a graph-based algorithm. We have implemented both algorithms and evaluated their performance. The performance of our fragmentation algorithms is compared with centralized and fully replicated XML documents, and better results are obtained. The structure- and size-based fragmentation model and its implementation algorithms are also evaluated, and encouraging results are achieved.

Proceedings ArticleDOI
05 Jul 2010
TL;DR: This paper proposes a novel modulo-based labeling scheme that uses modular arithmetic and number theory to label the XML tree, and shows that it outperforms other XML labeling schemes by having a smaller label size regardless of the fan-out or the depth of the tree.
Abstract: XML is becoming the de facto standard for exchanging and querying documents over the Web. Many XML query languages such as XQuery and XPath use label paths to traverse the irregularly structured XML data. Several labeling schemes have been proposed to identify the structural relationships in the tree, as well as to support incremental updates at low cost. In this paper, we conduct a comprehensive survey of labeling schemes for XML trees, and classify these schemes according to their labeling mechanism. We also propose a novel modulo-based labeling scheme that uses modular arithmetic and number theory to label the XML tree. Our algorithm labels nodes in the tree in a way similar to an encryption-decryption function using modular multiplication and a prime modulus. We show that our algorithm outperforms other XML labeling schemes by having a smaller space size for the node label regardless of the fan-out or the depth of the tree, and completely eliminates the need to re-label the whole XML tree in case of future insertions.
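For readers unfamiliar with XML labeling, here is the classic baseline such schemes compete against: Dewey-style prefix labels, under which the ancestor-descendant relationship is a simple string-prefix test. This Python sketch shows the baseline only, not the paper's modulo-based scheme:

```python
import xml.etree.ElementTree as ET

def dewey_labels(root):
    """Assign Dewey prefix labels ('1', '1.1', '1.1.1', ...) so that
    ancestry between two nodes reduces to a label-prefix test."""
    labels = {}
    def walk(elem, label):
        labels[label] = elem.tag
        for i, child in enumerate(elem, start=1):
            walk(child, f"{label}.{i}")
    walk(root, "1")
    return labels

def is_ancestor(a, b):
    """True if the node labeled a is a proper ancestor of the node labeled b."""
    return b.startswith(a + ".")

doc = ET.fromstring("<lib><shelf><book/></shelf><desk/></lib>")
labels = dewey_labels(doc)
print(labels)
print(is_ancestor("1.1", "1.1.1"))
```

Dewey labels grow with depth and require re-labeling siblings on insertion; smaller labels and update-free insertion are exactly the properties the proposed modulo-based scheme targets.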

Proceedings ArticleDOI
TL;DR: A novel XML-based document format for web publishing, called CEBX, is proposed, which has optimized document content organization, physical structure and protection scheme to support web publishing.
Abstract: Although many XML-based document formats are available for printing or publishing on the Internet, none of them is well designed to support both high-quality printing and web publishing. Therefore, we propose a novel XML-based document format for web publishing, called CEBX, in this paper. The proposed format is a fixed-layout document format supporting high-quality printing, with document content organization, physical structure, and a protection scheme optimized for web publishing. There are four noteworthy features of CEBX documents: (1) CEBX preserves the original fixed layout through graphic units, for printing quality. (2) The content in a CEBX document can be reflowed to fit the display device, based on content blocks and additional fluid information. (3) The XML Document Archiving model (XDA), the packaging model used in CEBX, supports document linearization and incremental editing well. (4) By introducing a segment-based content protection scheme into CEBX, part of a document can be previewed directly while the remaining part is effectively protected, so that readers need only purchase the partial content of a book that they are interested in. This is very helpful for document distribution and supports flexible business models such as try-before-buy, on-demand reading, superdistribution, etc.

Journal IssueDOI
TL;DR: This paper presents a project focused on designing a general-purpose query language in support of mining XML data, and reports the results of a first set of experiments showing that a good trade-off between expressiveness and efficiency in XML DM is not a chimera.
Abstract: With the spread of XML sources, mining XML data can be an important objective in the near future. This paper presents a project focused on designing a general-purpose query language in support of mining XML data. In our framework, raw data, mining models, and domain knowledge are represented by way of XML documents and stored inside native XML databases. Data mining (DM) tasks are expressed in an extension of XQuery. Special attention is given to the frequent pattern discovery problem, and a way of exploiting domain-dependent optimizations and efficient data structures as deeply as possible in the extraction process is presented. We report the results of a first set of experiments, showing that a good trade-off between expressiveness and efficiency in XML DM is not a chimera. Copyright © 2009 John Wiley & Sons, Ltd.

Book ChapterDOI
13 Sep 2010
TL;DR: An effort to evaluate basic XML data management trade-offs for current commercial systems is reported on, including a simple micro-benchmark that methodically evaluates the impact of query characteristics on the comparison of shredded and native XML.
Abstract: As we approach the ten-year anniversary of the first working draft of the XQuery language, one finds XML storage and query support in a number of commercial database systems. For many XML use cases, database vendors now recommend storing and indexing XML natively and using XQuery or SQL/XML to query and update XML directly. If the complexity of the XML data allows, shredding and reconstructing XML to/from relational tables is still an alternative as well, and might in fact outperform native XML processing. In this paper we report on an effort to evaluate these basic XML data management trade-offs for current commercial systems. We describe EXRT (Experimental XML Readiness Test), a simple micro-benchmark that methodically evaluates the impact of query characteristics on the comparison of shredded and native XML. We describe our experiences and preliminary results from EXRT'ing pressure on the XML data management facilities offered by two relational databases and one XML database system.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: This paper introduces an XML keyword search method that provides high precision, recall, and ranking quality for data-centric XML, even when long text fields are present, and presents algorithms to compute NTPCs efficiently.
Abstract: Users who are unfamiliar with database query languages can search XML data sets using keyword queries. Current approaches for supporting such queries are either for text-centric XML, where the structure is very simple and long text fields predominate; or data-centric, where the structure is very rich. However, long text fields are becoming more common in data-centric XML, and existing approaches deliver relatively poor precision, recall, and ranking for such data sets. In this paper, we introduce an XML keyword search method that provides high precision, recall, and ranking quality for data-centric XML, even when long text fields are present. Our approach is based on a new group of structural relationships called normalized term presence correlation (NTPC). In a one-time setup phase, we compute the NTPCs for a representative DB instance, then use this information to rank candidate answers for all subsequent queries, based on each answer's structure. Our experiments with 65 user-supplied queries over two real-world XML data sets show that NTPC-based ranking is always as effective as the best previously available XML keyword search method for data-centric data sets, and provides better precision, recall, and ranking than previous approaches when long text fields are present. As the straightforward approach for computing NTPCs is too slow, we also present algorithms to compute NTPCs efficiently.

Journal ArticleDOI
01 Sep 2010
TL;DR: This work has implemented an XML schema transformation toolkit within IBM Master Data Management Server (MDM) that includes an extendible schema matching algorithm that was designed with evolving XML schemas in mind and takes advantage of hierarchical structure of XML.
Abstract: Database systems often use XML Schema to describe the format of valid XML documents. Usually, this format is determined when the system is designed. Sometimes, in an already functioning system, a need arises to change the XML schemas. In such a situation, the system has to transform the old XML documents so that they conform to the new format and so that as little information as possible is lost in the process. This process is called schema evolution. We have implemented an XML schema transformation toolkit within IBM Master Data Management Server (MDM). MDM uses XML documents to describe products that an enterprise may be offering to its clients. In this work we focus on evolving schemas rather than on integrating separate or heterogeneous data sources. Our solution includes an extendible schema matching algorithm that was designed with evolving XML schemas in mind and takes advantage of the hierarchical structure of XML. It also includes a data transformation and migration method appropriate for environments where migration is performed in an abstraction layer above the DBMS. Finally, we describe a novel way of extending an XSLT editor with an XSLT visualization feature to allow the user's input and evaluation of the transformation.

Patent
30 Jul 2010
TL;DR: In this paper, a method for selecting user-desirable content from web pages is presented, which includes receiving a web page, representing it as a Document Object Model (DOM) tree, computing visual and coordinate information for each DOM node, determining the desirable DOM path, determining the desirable DOM node from that path, and selecting the single DOM node with the highest final score.
Abstract: A method for selecting user-desirable content from web pages includes receiving a web page, representing the web page as a Document Object Model (DOM) tree, computing visual and coordinate information for each DOM node within the tree, determining the desirable DOM path, determining the desirable DOM node from that path, and selecting the single DOM node with the highest final score. The DOM node with the highest final score is selected as the user-desirable content of the web page.
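The patent's scoring relies on rendered visual and coordinate information, which is only available inside a browser. Outside one, the select-the-best-node pattern can still be sketched with a stand-in score; here subtree text length substitutes for the visual score, and the sample page is invented for illustration.

```python
# Hedged sketch: walk a DOM-like tree, score each node (text length as a
# stand-in for the patent's visual/coordinate score), and select the single
# node with the highest final score as the "main content".
import xml.etree.ElementTree as ET

def select_main_content(xml_text):
    root = ET.fromstring(xml_text)
    best_elem, best_score = None, -1.0
    for elem in root.iter():
        # Text directly owned by this element (its text plus children's tails).
        own_text = (elem.text or "") + "".join(c.tail or "" for c in elem)
        score = len(own_text.strip())
        if score > best_score:
            best_score, best_elem = score, elem
    return best_elem.tag, best_score

doc = ("<page><nav>Home</nav>"
       "<article>This is the long main article body of the page.</article></page>")
tag, score = select_main_content(doc)
# tag is "article": the node with the highest stand-in score.
```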

Proceedings ArticleDOI
22 Nov 2010
TL;DR: Experimental results are reported for an approach that uses the BM25E model for retrieval over a large-scale XML collection with Score Sharing, which assigns parent scores by sharing scores from leaf nodes to their parents in a top-down scheme, improving efficiency and response time.
Abstract: In this paper, we report experimental results of our approach, which uses the BM25E model for retrieval over a large-scale XML collection to improve the effectiveness of XML retrieval. This model is commonly used in the information retrieval community. We propose a new Score Sharing algorithm that assigns parent scores by sharing scores from leaf nodes to their parents in a top-down scheme. Response-time efficiency is high: the Score Sharing algorithm processes 10,000 leaf nodes in around 0.135 ms per topic after obtaining the result list from Zettair, and Zettair itself processes a topic in less than 1 second on average, so total processing time is up to 1 second per topic. Our experiments show that BM25E with Score Sharing improves iP[0.10] by 24.40% and MAiP by 31.89% over the original BM25E. In addition, our algorithm can handle both the element level and the document level by setting a single parameter.
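The core Score Sharing idea — leaves carry retrieval scores and each ancestor receives a damped share of its descendants' scores, so both elements and whole documents can be ranked — can be sketched recursively. The leaf scores, tree shape, and decay factor 0.5 below are invented for illustration; in the paper the leaf scores come from BM25E via Zettair.

```python
# Hedged sketch of Score Sharing: each parent's score is a damped sum of the
# scores shared up from its descendants. Leaf scores are supplied directly
# here instead of coming from a BM25E run.
import xml.etree.ElementTree as ET

def share_scores(elem, leaf_scores, decay=0.5):
    children = list(elem)
    if not children:
        return leaf_scores.get(elem.tag, 0.0)
    score = decay * sum(share_scores(c, leaf_scores, decay) for c in children)
    elem.set("score", str(score))  # record the element-level score
    return score

doc = ET.fromstring("<article><sec><p/><p/></sec><sec><p/></sec></article>")
total = share_scores(doc, {"p": 2.0})
# Each <p> leaf scores 2.0; secs get 2.0 and 1.0; the article gets 1.5.
```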

Proceedings ArticleDOI
01 Dec 2010
TL;DR: BFilter is proposed, which evaluates user queries by backward matching from branch points, delaying further matching until the branch points in the XML document and the user query match; it outperforms the well-known YFilter on complex queries.
Abstract: In publish/subscribe systems, XML message filtering performed at the application layer is an important operation for XML message multicast. As a specific case of content-based multicast in the application layer, XML message multicast depends on the data filtering and matching processes and on the forwarding and routing schemes. As more XML data travels through such systems, XML message filtering and matching become more and more desirable. BFilter, proposed in this paper, performs XML message filtering and matching by leveraging branch points in both the XML document and the user query. It evaluates user queries by backward matching from branch points, delaying further matching until the branch points in the XML document and the user query match. In this way, XML message filtering can be performed more efficiently, as the probability of mismatching is reduced. A number of experiments have been conducted, and the results demonstrate that BFilter has better performance than the well-known YFilter for complex queries.
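The intuition behind branch-point-first matching can be sketched on a twig query such as //book[author][price], which branches at "book": locating the branch element first and only then verifying its branches cheaply rejects non-matching documents. The real BFilter is an automaton over streamed XPath subscriptions; this sketch, with an invented document, only illustrates the ordering idea.

```python
# Hedged sketch: filter a document against a twig query by finding the
# branch-point element first, then checking that all required branches
# (here restricted to child tags) are present under it.
import xml.etree.ElementTree as ET

def matches_twig(xml_text, branch_tag, required_children):
    root = ET.fromstring(xml_text)
    for elem in root.iter(branch_tag):          # branch point first
        child_tags = {c.tag for c in elem}
        if all(t in child_tags for t in required_children):
            return True                          # document passes the filter
    return False

doc = ("<bib><book><author>A</author><price>10</price></book>"
       "<book><author>B</author></book></bib>")
# //book[author][price] matches the first <book>; //book[title] matches none.
```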

Patent
Bian Li1, Yuan Li1, Chang H. Liu1, Xiaoyi Wang1, Yunting Wang1, Shuo Wu1, Kang Xu1 
27 Sep 2010
TL;DR: XPath evaluation in an XML data repository includes parsing an input XPath query using a simple path file to generate an execution tree for the query; the simple path file is an XML file generated from the hierarchical architecture of the XML files in the repository, with node names derived from the tag information of the corresponding nodes in those files.
Abstract: XPath evaluation in an XML data repository includes parsing an input XPath query using a simple path file to generate an execution tree for the XPath query, where the simple path file is an XML file generated based on the hierarchical architecture of the plurality of XML files in the data repository, and the names of the nodes in the generated XML file are derived from the tag information of the corresponding nodes in those XML files. Executing the execution tree against the data repository generates the final evaluation result.
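A simple path file is essentially a structural summary of the repository: the set of distinct root-to-element tag paths across all documents. The patent stores this summary as an XML file; in the sketch below (documents invented for illustration) a plain set of path strings plays the same role, letting a query path be checked against the repository's structure before any real evaluation.

```python
# Hedged sketch: build a structural summary (all distinct root-to-element
# paths) over a repository of XML documents, as a stand-in for the patent's
# simple path file.
import xml.etree.ElementTree as ET

def path_summary(docs):
    paths = set()
    for xml_text in docs:
        root = ET.fromstring(xml_text)
        stack = [(root, "/" + root.tag)]
        while stack:
            elem, path = stack.pop()
            paths.add(path)
            for child in elem:
                stack.append((child, path + "/" + child.tag))
    return paths

docs = ["<a><b><c/></b></a>", "<a><d/></a>"]
summary = path_summary(docs)
# "/a/b/c" occurs in the repository; "/a/x" does not, so a query using it
# could be rejected without touching any document.
```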

Proceedings ArticleDOI
09 Jul 2010
TL;DR: A parallel solution for XML query applications is presented, combining parallel XML parsing with parallel XML querying; it exploits multi-core environments by parallelizing the key execution stages of the query process.
Abstract: Since various XML query applications have recently come to the fore, performance optimization has become a research hotspot. With the growing availability of multi-core hardware, parallelization is an important optimization measure. This paper presents a parallel solution for XML query applications that combines parallel XML parsing with parallel XML querying. Parsing is based on arbitrary partitioning of the XML data and parallel sub-tree construction, followed by a final merging step. After parsing, region encodings of the XML data are obtained and used to construct a relation matrix, since the XPath evaluation in the query procedure is based on this matrix. The matrix construction procedure and the query primitives are parallelized to boost performance. As a whole, our solution exploits multi-core environments by parallelizing the key execution stages of the query process. These key stages are verified by experiment, and the overall effect of the solution is presented.