
Showing papers on "Simple API for XML" published in 2007


Journal ArticleDOI
TL;DR: This survey reviews, classifies, and compares two classes of major XML query processing techniques, the relational approach and the native approach; integrating the two could result in higher query processing performance and significantly reduced system reengineering costs.
Abstract: Extensible markup language (XML) is emerging as a de facto standard for information exchange among various applications on the World Wide Web. There has been a growing need for developing high-performance techniques to query large XML data repositories efficiently. One important problem in XML query processing is twig pattern matching, that is, finding in an XML data tree D all matches that satisfy a specified twig (or path) query pattern Q. In this survey, we review, classify, and compare major techniques for twig pattern matching. Specifically, we consider two classes of major XML query processing techniques: the relational approach and the native approach. The relational approach directly utilizes existing relational database systems to store and query XML data, which enables the use of all important techniques that have been developed for relational databases, whereas in the native approach, specialized storage and query processing systems tailored for XML data are developed from scratch to further improve XML query performance. As implied by existing work, XML data querying and management are developing in the direction of integrating the relational approach with the native approach, which could result in higher query processing performance and also significantly reduce system reengineering costs.

191 citations
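A minimal sketch of the region-encoding idea that underlies many of the surveyed structural-join techniques: each node gets a (start, end, level) label from a single document scan, and the ancestor-descendant test reduces to interval containment. The Java below is a naive illustration with invented names and data, not an algorithm from the survey; practical joins exploit the sorted start positions to avoid the nested loop.

```java
import java.util.ArrayList;
import java.util.List;

public class StructuralJoinSketch {
    // (start, end, level) region label assigned by one preorder scan of the document.
    record Label(String tag, int start, int end, int level) {
        boolean isAncestorOf(Label d) {   // ancestor-descendant is interval containment
            return start < d.start() && d.end() < end;
        }
    }

    // Naive ancestor-descendant join; real structural joins use the sorted
    // start positions (e.g. with stacks) to avoid this O(n*m) nested loop.
    static List<Label[]> adJoin(List<Label> ancestors, List<Label> descendants) {
        List<Label[]> out = new ArrayList<>();
        for (Label a : ancestors)
            for (Label d : descendants)
                if (a.isAncestorOf(d)) out.add(new Label[] { a, d });
        return out;
    }

    public static void main(String[] args) {
        // <book><title/><author><name/></author></book>, positions from one scan
        List<Label> books = List.of(new Label("book", 1, 8, 0));
        List<Label> names = List.of(new Label("name", 5, 6, 2));
        adJoin(books, names).forEach(p ->
                System.out.println(p[0].tag() + " contains " + p[1].tag()));
    }
}
```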


Proceedings ArticleDOI
Charu C. Aggarwal, Na Ta, Jianyong Wang, Jianhua Feng, Mohammed J. Zaki
12 Aug 2007
TL;DR: This paper proposes an effective clustering algorithm for XML data which uses substructures of the documents in order to gain insights about the important underlying structures, and proposes new ways of using multiple kinds of sub-structural information in XML documents to evaluate the quality of intermediate cluster solutions.
Abstract: XML has become a popular method of data representation both on the web and in databases in recent years. One of the reasons for the popularity of XML has been its ability to encode structural information about data records. However, this structural characteristic of data sets also makes them a challenging target for a variety of data mining problems. One such problem is that of clustering, in which the structural aspects of the data result in a high implicit dimensionality of the data representation. As a result, it becomes more difficult to cluster the data in a meaningful way. In this paper, we propose an effective clustering algorithm for XML data which uses substructures of the documents in order to gain insights about the important underlying structures. We propose new ways of using multiple kinds of sub-structural information in XML documents to evaluate the quality of intermediate cluster solutions, and to guide the algorithms to a final solution which reflects the true structural behavior in individual partitions. We test the algorithm on a variety of real and synthetic data sets.

127 citations
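To make "sub-structural information" concrete, a hedged sketch: represent each document by its set of root-to-leaf tag paths and compare documents with Jaccard similarity. This only illustrates the kind of structural signal such clustering algorithms consume; the paper's algorithm selects and weighs substructures far more carefully, and all names here are invented.

```java
import java.util.HashSet;
import java.util.Set;

public class StructuralSimilarity {
    // Jaccard similarity of two path sets: |intersection| / |union|.
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a); inter.retainAll(b);
        Set<String> union = new HashSet<>(a); union.addAll(b);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        // Each document reduced to its root-to-leaf tag paths (toy data).
        Set<String> doc1 = Set.of("book/title", "book/author/name");
        Set<String> doc2 = Set.of("book/title", "book/year");
        System.out.println(jaccard(doc1, doc2)); // 0.333...: one shared path of three
    }
}
```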


Patent
17 Oct 2007
TL;DR: In this article, a device control system is presented that includes at least one device operable by the system, at least one processor, and two or more communication components, each including an XML parser for parsing the XML document and extracting the message data.
Abstract: A device control system including at least one device operable by the system, at least one processor, software executing on the at least one processor for receiving message data and determining a corresponding XML document type, software executing on the at least one processor for generating a XML document based on the XML document type, the XML document including the message data, software executing on the processor for packetizing the XML document, and two or more communication components, each communication component including an XML parser for parsing the XML document and extracting the message data.

113 citations


Patent
01 Nov 2007
TL;DR: In this article, an application program interface (API) is provided for requesting, storing, and accessing data within a health integration network; it facilitates secure and seamless access to the centrally-stored data by offering authentication/authorization and the ability to receive requests in an extensible language format, such as XML, and to return resulting data in XML format.
Abstract: An application program interface (API) is provided for requesting, storing, and otherwise accessing data within a health integration network. The API facilitates secure and seamless access to the centrally-stored data by offering authentication/authorization, as well as the ability to receive requests in an extensible language format, such as XML, and returns resulting data in XML format. The data can also have transformation, style and/or schema information associated with it which can be returned in the resulting XML and/or applied to the data beforehand by the API. The API can be utilized in many environment architectures including XML over HTTP and a software development kit (SDK).

93 citations


Proceedings ArticleDOI
11 Jun 2007
TL;DR: This paper has developed an application-oriented and domain-specific benchmark called "Transaction Processing over XML" (TPoX), which exercises all aspects of XML databases, including storage, indexing, logging, transaction processing, and concurrency control.
Abstract: XML database functionality has been emerging in "XML-only" databases as well as in the major relational database products. Yet, there is no industry standard XML database benchmark to evaluate alternative implementations. The research community has proposed several benchmarks which are all useful in their respective scope, such as evaluating XQuery processors. However, they do not aim to evaluate a database system in its entirety and do not represent all relevant characteristics of a real-world XML application. Often they only define read-only single-user tests on a single XML document. We have developed an application-oriented and domain-specific benchmark called "Transaction Processing over XML" (TPoX). It exercises all aspects of XML databases, including storage, indexing, logging, transaction processing, and concurrency control. Based on our analysis of real XML applications, TPoX simulates a financial multi-user workload with XML data conforming to the FIXML standard. In this paper we describe TPoX and present early performance results. We also make its implementation publicly available.

83 citations


Journal ArticleDOI
TL;DR: This work presents a schema clustering process that organises heterogeneous XML schemas into various groups, considering not only the linguistic meaning and context of the elements but also their hierarchical structural similarity.
Abstract: With the growing popularity of XML as the data representation language, collections of XML data have exploded in number. Methods are required to manage them and to discover useful information from them for improved document handling. We present a schema clustering process that organises heterogeneous XML schemas into various groups. The methodology considers not only the linguistic meaning and context of the elements but also their hierarchical structural similarity. We support our findings with experiments and analysis.

66 citations


Proceedings ArticleDOI
15 Apr 2007
TL;DR: A definition for cube adapted for XML data warehouse is proposed, including a suitably generalized specification mechanism, and the results of an extensive performance evaluation experiment gauging the behavior of alternative algorithms for cube computation are presented.
Abstract: With increasing amounts of data being exchanged and even generated or stored in XML, a natural question is how to perform OLAP on XML data, which can be structurally heterogeneous (e.g., parse trees) and/or marked-up text documents. A core operator for OLAP is the data cube. While the relational cube can be extended in a straightforward way to XML, we argue such an extension would not address the specific issues posed by XML. While in a relational warehouse, facts are flat records and dimensions may have hierarchies, in an XML warehouse, both facts and dimensions may be hierarchical. Second, XML is flexible: (a) an element may have missing or repeated subelements; (b) different instances of the same element type may have different structure. We identify the challenges introduced by these features of XML for cube definition and computation. We propose a definition for cube adapted for XML data warehouse, including a suitably generalized specification mechanism. We define a cube lattice over the aggregates so defined. We then identify properties of this cube lattice that can be leveraged to allow optimized computation of the cube. Finally, we present the results of an extensive performance evaluation experiment gauging the behavior of alternative algorithms for cube computation.

63 citations


Book ChapterDOI
03 Sep 2007
TL;DR: This paper proposes a novel fragmentation technique founded on structural constraints of XML documents (size, tree-width, and tree-depth) and on special-purpose structure histograms able to meaningfully summarize XML documents, which allows it to predict bounding intervals of structural properties of output (XML) fragments for efficient query processing of distributed XML data.
Abstract: Fragmentation techniques for XML data are gaining momentum within both distributed and centralized XML query engines and pose novel and unrecognized challenges to the community. Albeit not novel, and clearly inspired by the classical divide et impera principle, fragmentation for XML trees has proved successful in boosting querying performance and in cutting down memory requirements. However, the fragmentation considered so far has been driven by semantics, i.e. built around query predicates. In this paper, we propose a novel fragmentation technique founded on structural constraints of XML documents (size, tree-width, and tree-depth) and on special-purpose structure histograms able to meaningfully summarize XML documents. This allows us to predict bounding intervals of structural properties of output (XML) fragments for efficient query processing of distributed XML data. An experimental evaluation of our study confirms the effectiveness of our fragmentation methodology on some representative XML data sets.

54 citations
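A sketch of the size constraint alone, assuming a DOM input: subtrees that fit a node-count bound become fragments, and oversized nodes recurse into their children. The paper's technique additionally bounds tree-width and tree-depth and is guided by structure histograms; this miniature omits all of that, and its names are invented.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class SizeBoundedFragmenter {
    // Number of nodes in the subtree rooted at n.
    static int size(Node n) {
        int c = 1;
        for (Node ch = n.getFirstChild(); ch != null; ch = ch.getNextSibling()) c += size(ch);
        return c;
    }

    // A subtree small enough to fit the bound becomes one fragment; otherwise
    // its root keeps only a "shell" fragment and we recurse into the children.
    static void fragment(Node n, int maxNodes, List<Node> roots) {
        roots.add(n);
        if (size(n) <= maxNodes) return;
        for (Node ch = n.getFirstChild(); ch != null; ch = ch.getNextSibling())
            if (ch.getNodeType() == Node.ELEMENT_NODE) fragment(ch, maxNodes, roots);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<a><b><c/><d/></b><e/></a>";
        Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        List<Node> roots = new ArrayList<>();
        fragment(root, 3, roots);
        roots.forEach(r -> System.out.println(r.getNodeName())); // a, b, e
    }
}
```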


Patent
23 Apr 2007
TL;DR: In this article, a Wrapper class is provided for the XML Document class and the Element class, which can be used to access external components as required by a user application.
Abstract: Systems and methods for loading XML documents on demand are described. The system provides a Wrapper class for the XML Document class and the Element class. A user application then utilizes the Wrapper class in the same way that the Element class and Document class would be used to access any element in the XML Document. The Wrapper class loads external components as required. The external component retrieval is completely transparent to the user application and the user application is able to access the entire XML document as if it were completely loaded into a DOM object in memory. Accordingly, each element is accessible in a random manner. In one configuration, the XML document components or external components are stored in a database in a BLOB field as a Digital Document. The system uses external components to efficiently use resources as compared to systems using Xlink and external entities.

48 citations
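The on-demand idea can be sketched in a few lines: a facade object holds only a reference to the external component and parses it on first access, so callers see a fully materialized element without the document ever being loaded eagerly. The class and method names below are hypothetical, not those of the patented system.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class LazyElement {
    private final String externalUri;  // e.g. a URL or a key into a BLOB store
    private Element loaded;            // stays null until first access

    public LazyElement(String externalUri) { this.externalUri = externalUri; }

    // Callers use this as if the whole document were already in memory; the
    // external component is fetched and parsed only on the first call.
    public synchronized Element element() throws Exception {
        if (loaded == null) {
            Document d = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(externalUri);
            loaded = d.getDocumentElement();
        }
        return loaded;
    }
}
```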


Proceedings ArticleDOI
25 Jun 2007
TL;DR: A stealing-based dynamic load-balancing mechanism, called ThreadCrew, is presented, by which multiple threads are able to process disjoint parts of an XML document in parallel with balanced load distribution, and a novel mechanism to trace the stealing actions is provided.
Abstract: As a language for semi-structured documents, XML has emerged as the core of the web services architecture, and is playing crucial roles in messaging systems, databases, and document processing. However, the processing of XML documents has been regarded as the performance bottleneck in most systems and applications. Meanwhile, the multicore processor, which emerged as a solution to the clock-speed limitations of modern CPUs, has become increasingly prevalent. Leveraging the parallelism provided by multicore resources to speed up software execution is becoming the trend in software development. In this paper, we present a parallel processing model for XML documents. The model is not designed for just one specific XML processing task; instead, it is a general model by which we are able to explore various kinds of parallel XML document processing. The kernel of the model is a stealing-based dynamic load-balancing mechanism, called ThreadCrew, by which multiple threads are able to process disjoint parts of the XML document in parallel with balanced load distribution. The model also provides a novel mechanism to trace the stealing actions, so the equivalent sequential result can be obtained by gluing the multiple parallel-running results together. To show the feasibility and effectiveness of our approach, we present our C# implementation of parallel XML serialization. Our empirical study shows that our parallel XML serialization algorithm can improve XML serialization performance significantly on a multicore machine.

48 citations
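The paper's implementation is in C#; as a loose Java analogue, the sketch below uses ForkJoinPool (the JDK's own work-stealing scheduler, reached here through parallelStream) to process disjoint fragments in parallel and glue the results back in document order, the property that ThreadCrew's stealing trace is designed to preserve. The per-fragment "serialization" is a placeholder.

```java
import java.util.List;

public class ParallelSerializeSketch {
    // Stand-in for real per-fragment serialization work.
    static String serialize(String fragment) {
        return "<part>" + fragment + "</part>";
    }

    public static void main(String[] args) {
        List<String> fragments = List.of("a", "b", "c", "d");
        // parallelStream schedules the map on the common ForkJoinPool, whose
        // workers steal tasks from each other; the ordered reduction glues the
        // per-fragment results back together in their original order.
        String body = fragments.parallelStream()
                .map(ParallelSerializeSketch::serialize)
                .reduce("", String::concat);
        System.out.println("<doc>" + body + "</doc>");
    }
}
```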


Proceedings ArticleDOI
15 Apr 2007
TL;DR: A new synopsis for XML documents is introduced which can be effectively used to estimate the selectivity of complex path queries, based on a lossy compression of the document tree that underlies the XML document, and can be computed in one pass from the document.
Abstract: Estimating the selectivity of queries is a crucial problem in database systems. Virtually all database systems rely on the use of selectivity estimates to choose amongst the many possible execution plans for a particular query. In terms of XML databases, the problem of selectivity estimation of queries presents new challenges: many evaluation operators are possible, such as simple navigation, structural joins, or twig joins, and many different indexes are possible. A new synopsis for XML documents is introduced which can be effectively used to estimate the selectivity of complex path queries. The synopsis is based on a lossy compression of the document tree that underlies the XML document, and can be computed in one pass from the document. It has several advantages over existing approaches: (1) it allows one to estimate the selectivity of queries containing all XPath axes, including the order-sensitive ones, (2) the estimator returns a range within which the actual selectivity is guaranteed to lie, with the size of this range implicitly providing a confidence measure of the estimate, and (3) the synopsis can be incrementally updated to reflect changes in the XML database.
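The paper's synopsis supports all XPath axes and returns guaranteed selectivity ranges; as a far simpler baseline that makes the problem concrete, the sketch below builds a path synopsis in one SAX pass by counting occurrences of each root-to-node tag path, which answers plain child-axis path queries exactly. The structure is an assumption for illustration, not the paper's.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PathSynopsis extends DefaultHandler {
    private final Deque<String> stack = new ArrayDeque<>();
    final Map<String, Integer> counts = new HashMap<>();

    @Override public void startElement(String uri, String local, String qName, Attributes a) {
        stack.addLast(qName);                              // extend the current path
        counts.merge(String.join("/", stack), 1, Integer::sum);
    }
    @Override public void endElement(String uri, String local, String qName) {
        stack.removeLast();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<lib><book><title/></book><book><title/><title/></book></lib>";
        PathSynopsis s = new PathSynopsis();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), s);
        System.out.println(s.counts.get("lib/book/title")); // 3: exact count for /lib/book/title
    }
}
```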

Proceedings ArticleDOI
10 Dec 2007
TL;DR: In this paper, a parallel preparsing scan is used to build an outline of the XML document, which is then used to guide the parallel full parse, and a meta-DFA mechanism is proposed to parallelize the preparser itself.
Abstract: By leveraging the growing prevalence of multicore CPUs, parallel XML parsing (PXP) can significantly improve the performance of XML, enhancing its suitability for scientific data which is often dominated by floating-point numbers. One approach is to divide the XML document into equal-sized chunks, and parse each chunk in parallel. XML parsing is inherently sequential, however, because the state of an XML parser when reading a given character depends potentially on all preceding characters. In previous work, we addressed this by using a fast preparsing scan to build an outline of the document which we called the skeleton. The skeleton is then used to guide the parallel full parse. The preparse is a sequential phase that limits scalability, however, and so in this paper, we show how the preparse itself can be parallelized using a mechanism we call a meta-DFA. For each state q of the original preparser the meta-DFA incorporates a complete copy of the preparser state machine as a sub-DFA which starts in state q. The meta-DFA thus runs multiple instances of the preparser simultaneously when parsing a chunk, with each possible preparser state at the beginning of a chunk represented by an instance. By pursuing all possibilities simultaneously, the meta-DFA allows each chunk to be preparsed independently in parallel. The parallel full parse following the preparse is performed using libxml2, and outputs DOM trees that are fully compatible with existing applications that use libxml2. Our implementation scales well on a 30 CPU Sun E6500 machine.
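The meta-DFA construction can be miniaturized with a two-state toy preparser (inside a tag vs. text): running the DFA from every possible start state turns a chunk into a state mapping, and because mappings compose associatively, chunks can be preparsed independently and combined afterwards. The real preparser has more states and also records the skeleton; this Java sketch shows only the parallelization principle.

```java
import java.util.List;

public class MetaDfaSketch {
    static final int TEXT = 0, TAG = 1, STATES = 2;

    // Toy preparser transition: track whether we are inside a tag.
    static int step(int state, char c) {
        if (state == TEXT) return c == '<' ? TAG : TEXT;
        return c == '>' ? TEXT : TAG;
    }

    // State mapping of a chunk: mapping[s] = state after reading the chunk
    // starting from state s. This is the "one sub-DFA per start state" idea.
    static int[] mapChunk(String chunk) {
        int[] mapping = new int[STATES];
        for (int s = 0; s < STATES; s++) {
            int cur = s;
            for (int i = 0; i < chunk.length(); i++) cur = step(cur, chunk.charAt(i));
            mapping[s] = cur;
        }
        return mapping;
    }

    // Function composition of mappings; associative, so reduce() may regroup.
    static int[] compose(int[] first, int[] second) {
        int[] r = new int[STATES];
        for (int s = 0; s < STATES; s++) r[s] = second[first[s]];
        return r;
    }

    public static void main(String[] args) {
        List<String> chunks = List.of("<a><b", ">text</b", "></a>");
        int[] total = chunks.parallelStream()   // each chunk preparsed independently
                .map(MetaDfaSketch::mapChunk)
                .reduce(new int[] {TEXT, TAG},  // identity mapping
                        MetaDfaSketch::compose);
        System.out.println("final state from TEXT: " + total[TEXT]); // 0 = TEXT
    }
}
```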

Proceedings ArticleDOI
08 May 2007
TL;DR: This paper presents a new storage scheme for XML data that supports all navigational operations in near constant time, and features a small memory footprint that increases cache locality, whilst still supporting standard APIs and necessary database operations, such as queries and updates, efficiently.
Abstract: As XML database sizes grow, the amount of space used for storing the data and auxiliary data structures becomes a major factor in query and update performance. This paper presents a new storage scheme for XML data that supports all navigational operations in near constant time. In addition to supporting efficient queries, the space requirement of the proposed scheme is within a constant factor of the information theoretic minimum, while insertions and deletions can be performed in near constant time as well. As a result, the proposed structure features a small memory footprint that increases cache locality, whilst still supporting standard APIs, such as DOM, and necessary database operations, such as queries and updates, efficiently. Analysis and experiments show that the proposed structure is space and time efficient.

Journal ArticleDOI
01 Jan 2007
TL;DR: This research develops a space-efficient DOM parser, called SEDOM, based on a new compression approach and a set of manipulation algorithms, which enable many DOM operations to be performed while the data are in the compressed format, and allow individual parts of a document to be compressed, decompressed and manipulated.
Abstract: In many XML applications, parsing is a key operation. When the processing involves modifying data, random access, and/or access in an order different from the one in which elements are stored, a DOM parser has to be used. A major problem with using a DOM parser is memory consumption. The size of a DOM tree created from an XML document may be as large as 10 times the size of the original document. Maintaining the tree of a big document requires a large amount of memory, which may cause costly swapping. In the worst case, a DOM parser cannot handle a document at all because of its size. In this research, we develop a space-efficient DOM parser, called SEDOM. It is based on a new compression approach and a set of manipulation algorithms, which enable many DOM operations to be performed while the data are in the compressed format, and allow individual parts of a document to be compressed, decompressed and manipulated. It can be used to efficiently manipulate very large XML documents. In this paper, we describe SEDOM and compare its performance with three existing DOM parsers and an XML compressor.

Patent
05 Oct 2007
TL;DR: In this paper, the authors present a method and system for performing operations, such as addition, subtraction, multiplication, and division, on data using XML streams.
Abstract: The present invention provides a method and system for performing operations on data using XML streams. An XML schema defines a limited set of operations that may be performed on data. These operations include addition, subtraction, multiplication and division. The operations are placed in an XML stream that conforms to the XML schema. The XML stream may perform one or more of the defined operations on the data. The limited set of operations allows data to be validated and processed without excessive overhead.
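A toy evaluator for the kind of operation stream the patent describes, restricted to the four permitted operations. The element and attribute names are invented for illustration; the actual vocabulary would be fixed by the XML schema the patent defines.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class XmlOpEvaluator {
    // Evaluate one operation element; anything outside the allowed set is
    // rejected, mirroring the schema-enforced restriction described above.
    static double eval(Element op) {
        double a = Double.parseDouble(op.getAttribute("left"));
        double b = Double.parseDouble(op.getAttribute("right"));
        return switch (op.getTagName()) {
            case "add"      -> a + b;
            case "subtract" -> a - b;
            case "multiply" -> a * b;
            case "divide"   -> a / b;
            default -> throw new IllegalArgumentException("operation not permitted");
        };
    }

    public static void main(String[] args) throws Exception {
        Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader("<multiply left='6' right='7'/>")));
        System.out.println(eval(d.getDocumentElement())); // 42.0
    }
}
```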

Proceedings Article
23 Sep 2007
TL;DR: A novel early pruning approach called Bounding-based XML Filtering or BoXFilter is proposed, based on a new tree-like indexing structure that organizes the queries based on their similarity and provides lower and upper bound estimations needed to prune queries not related to the incoming documents.
Abstract: Publish-subscribe applications are an important class of content-based dissemination systems where the message transmission is defined by the message content, rather than its destination IP address. With the increasing use of XML as the standard format in many Internet-based applications, XML-aware pub-sub applications become necessary. In such systems, the messages (generated by publishers) are encoded as XML documents, and the profiles (defined by subscribers) as XML query statements. As the number of documents and query requests grows, the performance and scalability of the matching phase (i.e. matching of queries to incoming documents) become vital. Current solutions have limited or no flexibility to prune out queries in advance. In this paper, we overcome this limitation by proposing a novel early pruning approach called Bounding-based XML Filtering, or BoXFilter. The BoXFilter is based on a new tree-like indexing structure that organizes the queries based on their similarity and provides the lower and upper bound estimations needed to prune queries not related to the incoming documents. Our experimental evaluation shows that the early profile pruning approach offers drastic performance improvements over the current state-of-the-art in XML filtering.

Journal ArticleDOI
TL;DR: Experiments on large XML documents show that instance patterns allow a significant reduction in storage space, while preserving almost entirely the completeness of the query result, thus overcoming the document size limitation of most current XQuery engines.
Abstract: XML is a rather verbose representation of semistructured data, which may require huge amounts of storage space. We propose a summarized representation of XML data, based on the concept of instance pattern, which can both provide succinct information and be directly queried. The physical representation of instance patterns exploits itemsets or association rules to summarize the content of XML datasets. Instance patterns may be used for (possibly partially) answering queries, either when fast and approximate answers are required, or when the actual dataset is not available, for example, it is currently unreachable. Experiments on large XML documents show that instance patterns allow a significant reduction in storage space, while preserving almost entirely the completeness of the query result. Furthermore, they provide fast query answers and show good scalability on the size of the dataset, thus overcoming the document size limitation of most current XQuery engines.

Patent
27 Mar 2007
TL;DR: In this article, a two-thread SAX parser with a main thread and a parsing thread is implemented, where the main thread controls the parsing thread by sending target content to be searched for and wake-up signals to the parser, and receives the content found by the parser for further processing.
Abstract: A method for processing XML documents using a SAX parser, implemented in a two-thread architecture having a main thread and a parsing thread. The parsing procedure is located in a parsing thread, which implements callback functions of a SAX parser and creates and executes the SAX parser. The main thread controls the parsing thread by sending target content to be searched for and wakeup signals to the parsing thread, and receives the content found by the parsing thread for further processing. In the parsing thread, each time a callback function is invoked by the SAX parser, it is determined whether the target content has been found. If it has, the parsing thread sends the found content to the main thread with a wakeup signal, and enters a sleep mode, whereby further parsing is halted until a wakeup signal with additional target content is received from the main thread.
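A minimal Java rendering of the described two-thread pattern, with a SynchronousQueue standing in for the explicit sleep/wake-up signalling: the SAX callback blocks on put() after each find, so parsing halts until the main thread takes the result, which is also what releases (wakes) the parsing thread. The target-matching logic is deliberately simplified.

```java
import java.io.StringReader;
import java.util.concurrent.SynchronousQueue;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TwoThreadSax {
    public static void main(String[] args) throws Exception {
        SynchronousQueue<String> found = new SynchronousQueue<>();
        String xml = "<m><item>a</item><item>b</item></m>";

        // Parsing thread: creates and executes the SAX parser and its callbacks.
        Thread parsing = new Thread(() -> {
            try {
                SAXParserFactory.newInstance().newSAXParser().parse(
                        new InputSource(new StringReader(xml)),
                        new DefaultHandler() {
                            @Override public void startElement(String u, String l,
                                    String qName, Attributes a) {
                                if (qName.equals("item")) {
                                    try { found.put(qName); }  // blocks: parsing halts here
                                    catch (InterruptedException e) { throw new RuntimeException(e); }
                                }
                            }
                        });
            } catch (Exception e) { throw new RuntimeException(e); }
        });
        parsing.start();

        // Main thread: each take() receives found content and wakes the parser.
        System.out.println("got: " + found.take());
        System.out.println("got: " + found.take());
        parsing.join();
    }
}
```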

Journal Article
TL;DR: It is shown how the system can be used to tackle various two-level transformation scenarios, such as XML schema evolution coupled with document migration, and hierarchical-relational data mappings that convert between XML documents and SQL databases.
Abstract: A two-level data transformation consists of a type-level transformation of a data format coupled with value-level transformations of data instances corresponding to that format. We have implemented a system for performing two-level transformations on XML schemas and their corresponding documents, and on SQL schemas and the databases that they describe. The core of the system consists of a combinator library for composing type-changing rewrite rules that preserve structural information and referential constraints. We discuss the implementation of the system's core library, and of its SQL and XML front-ends in the functional language Haskell. We show how the system can be used to tackle various two-level transformation scenarios, such as XML schema evolution coupled with document migration, and hierarchical-relational data mappings that convert between XML documents and SQL databases.

Patent
24 Apr 2007
TL;DR: In this paper, a modification is applied directly to the persistently stored binary form without constructing an object tree or materializing the XML document into a corresponding textual form, taking into account the nature of the binary form in which the document is encoded, including identifying where in the binary representation the corresponding actual changes need to be made.
Abstract: An XML document can be represented in a compact binary form that maintains all of the features of XML data in a useable form. In response to a request for a modification (e.g., insert, delete or update a node) to an XML document that is stored in the compact binary form, a certain representation of the requested modification is computed for application directly to the binary form of the document. Thus, the requested modification is applied directly to the persistently stored binary form without constructing an object tree or materializing the XML document into a corresponding textual form. Taking into account the nature of the binary form in which the document is encoded, the bytes that actually require change are identified, including identifying where in the binary representation the corresponding actual changes need to be made.

Proceedings ArticleDOI
21 May 2007
TL;DR: An algorithm is implemented for relocating partitioned XML data based on the CPU load of query processing, and experiments show a performance advantage in this approach for executing distributed query processing of large XML data.
Abstract: We propose an efficient distributed query processing method for large XML data by partitioning and distributing XML data to multiple computation nodes. There are several steps involved in this method; however, we focused particularly on XML data partitioning and dynamic relocation of partitioned XML data in our research. Since the efficiency of query processing depends on both XML data size and its structure, these factors should be considered when XML data is partitioned. Each partitioned XML data is distributed to computation nodes so that the CPU load can be balanced. In addition, it is important to take account of the query workload among each of the computation nodes because it is closely related to the query processing cost in distributed environments. In case of load skew among computation nodes, partitioned XML data should be relocated to balance the CPU load. Thus, we implemented an algorithm for relocating partitioned XML data based on the CPU load of query processing. From our experiments, we found that there is a performance advantage in our approach for executing distributed query processing of large XML data.

Book ChapterDOI
23 Sep 2007
TL;DR: In XPath query evaluation, indices similar to those used in relational database systems - namely, value indices on tags and text values - are first used, together with structural join algorithms, which turn out to be simple and efficient.
Abstract: Supporting efficient access to XML data using XPath [3] continues to be an important research problem [6, 12]. XPath queries are used to specify node-labeled trees which match portions of the hierarchical XML data. In XPath query evaluation, indices similar to those used in relational database systems - namely, value indices on tags and text values - are first used, together with structural join algorithms [1, 2, 19]. This approach turns out to be simple and efficient. However, the structural containment relationships native to XML data are not directly captured by value indices.

Proceedings ArticleDOI
07 May 2007
TL;DR: An extensive comparative study and benchmarking of the popular XML parsers on the market today is presented, and a non-validating SAX-based XML parser, xParse, is proposed; performance results demonstrate the viability of the approach.
Abstract: Due to its flexibility and efficiency in the transmission of data, XML has become the emerging standard for data transfer and data exchange across the Internet. An XML document must always be checked for well-formedness before data transfer and exchange can take place. Choosing the right parser for an organization's systems is crucial, since an improper parser will lead to degraded performance and decreased productivity. In this paper, we conduct an extensive comparative study and benchmarking of the popular XML parsers found in the market today. In addition, we propose a non-validating SAX-based XML parser, xParse. We implemented our technique and present the performance results, which prove the viability of our approach.
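xParse itself is not publicly available, but the well-formedness check being benchmarked is easy to express with any non-validating SAX parser: a complete parse with a no-op handler either succeeds or fails at the first violation. The sketch below uses the JDK's bundled parser as a stand-in.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {
    // A document is well-formed iff a full SAX parse completes without error.
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<a><b/></a>")); // true
        System.out.println(isWellFormed("<a><b></a>"));  // false: mismatched tags
    }
}
```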

Proceedings ArticleDOI
11 Jun 2007
TL;DR: ReXSA proposes candidate database schemas given an information model of the enterprise data; it has the advantage of considering qualitative properties of the information model, such as reuse, evolution, and performance profiles, when deciding how to persist the data.
Abstract: In response to the widespread use of the XML format for document representation and message exchange, major database vendors support XML in terms of persistence, querying and indexing. Specifically, the recently released IBM DB2 9 (for Linux, Unix and Windows) is a hybrid data server with optimized management of both XML and relational data. With the new option of storing and querying XML in a relational DBMS, data architects face the decision of what portion of their data to persist as XML and what portion as relational data. This problem has not been addressed yet and represents a serious need in the industry. Hence, this paper describes ReXSA, a schema advisor tool that is being prototyped for IBM DB2 9. ReXSA proposes candidate database schemas given an information model of the enterprise data. It has the advantage of considering qualitative properties of the information model, such as reuse, evolution and performance profiles, for deciding how to persist the data. Finally, we show the viability and practicality of ReXSA by applying it to custom and real use cases.

Proceedings ArticleDOI
20 May 2007
TL;DR: The tool TAXI is presented, which implements the XML-based partition testing approach for the automated generation of XML instances conforming to a given XML schema and provides a set of weighted test strategies to guide the systematic derivation of instances.
Abstract: We present the tool TAXI, which implements the XML-based partition testing approach for the automated generation of XML instances conforming to a given XML schema. In addition, it provides a set of weighted test strategies to guide the systematic derivation of instances. TAXI can be used for black-box testing of applications that accept XML instances as input and for benchmarking of database management systems.

Book ChapterDOI
18 Jun 2007
TL;DR: Information retrieval and database-related techniques are jointly applied to effectively tolerate XML data diversity in the evaluation of flexible queries, and a query execution technique is outlined as a first step toward efficiently addressing structural pattern queries together with predicate support over XML element content.
Abstract: XML has become a key technology for interoperability, providing a common data model to applications. However, diverse data modeling choices may lead to heterogeneous XML structure and content. In this paper, information retrieval and database-related techniques are jointly applied to effectively tolerate XML data diversity in the evaluation of flexible queries. Approximate structure and content matching is supported via a straightforward extension to standard XPath syntax. Also, we outline a query execution technique representing a first step toward efficiently addressing structural pattern queries together with predicate support over XML element content.

Proceedings ArticleDOI
15 Apr 2007
TL;DR: The popularity of XML as a data exchange standard has led to the emergence of powerful XML query languages like XQuery and studies on XML query optimization, but even for data integration, there is a compelling need for performing group-by style aggregate operations.
Abstract: The popularity of XML as a data exchange standard has led to the emergence of powerful XML query languages like XQuery and studies on XML query optimization. Of late, there is considerable interest in analytical processing of XML data. As pointed out by Borkar and Carey, even for data integration, there is a compelling need for performing various group-by style aggregate operations. A core operator needed for analytics is the group-by operator, which is widely used in relational as well as OLAP database applications. XQuery requires group-by operations to be simulated using nesting.

Patent
31 May 2007
TL;DR: In this article, a method and apparatus for creating a Document Object Model of an XML document of predetermined type is presented, comprising a first process for receiving and opening a compressed input file containing the XML document, and a second process for opening and parsing the contents of a relationships file to create a map of name-value pairs and detecting a value for identifying the predetermined type from among a plurality of types of XML documents.
Abstract: A method and apparatus are set forth for creating a Document Object Model of an XML document of predetermined type, comprising a first process for receiving and opening a compressed input file containing the XML document; a second process for opening and parsing the contents of a relationships file to create a map of name-value pairs and detecting a value for identifying the predetermined type from among a plurality of types of XML documents; and a further process for parsing data in the XML document according to the predetermined type, and building the Document Object Model.
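A sketch of the described pipeline for a ZIP-packaged XML document, assuming an OPC-style layout such as .docx (the file name and part names below are assumptions): open the archive, parse the relationships part, detect the document type from its Type attribute, then parse the main part into a DOM.

```java
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class PackagedXmlLoader {
    public static void main(String[] args) throws Exception {
        try (ZipFile zip = new ZipFile("document.docx")) {   // hypothetical input file
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();

            // Second process: parse the relationships file into name/value pairs.
            Document rels = db.parse(zip.getInputStream(zip.getEntry("_rels/.rels")));
            NodeList rs = rels.getElementsByTagName("Relationship");
            String target = null;
            for (int i = 0; i < rs.getLength(); i++) {
                Element r = (Element) rs.item(i);
                // The Type attribute identifies the document kind; Target names its part.
                if (r.getAttribute("Type").endsWith("/officeDocument"))
                    target = r.getAttribute("Target");
            }
            if (target == null) throw new IllegalStateException("unrecognized package type");

            // Further process: parse the main part for the detected type, building the DOM.
            Document dom = db.parse(zip.getInputStream(zip.getEntry(target)));
            System.out.println("root element: " + dom.getDocumentElement().getNodeName());
        }
    }
}
```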

Proceedings ArticleDOI
21 Nov 2007
TL;DR: This paper proposes a novel clustering method for XML documents that uses the weight of frequent structures in XML documents, treating an XML document as a transaction and the structures extracted from it as the items of that transaction.
Abstract: Previous clustering methods for XML documents group documents with similar structures by measuring the structural similarity and distance between XML documents. In this paper, however, we propose a novel clustering method for XML documents that uses the weight of frequent structures in XML documents, treating an XML document as a transaction and the structures extracted from XML documents as the items of a transaction. Our experimental results show the high speed and cluster cohesion of our clustering method.

Book ChapterDOI
29 Sep 2007
TL;DR: The results of conducted tests show that the proposed scheme attains compression ratios rivaling the best available algorithms, and fast compression, decompression, and query processing.
Abstract: This paper describes a new XML compression scheme that offers both high compression ratios and short query response time. Its core is a fully reversible transform featuring substitution of every word in an XML document using a semi-dynamic dictionary, effective encoding of dictionary indices, as well as numbers, dates and times found in the document, and grouping data within the same structural context in individual containers. The results of conducted tests show that the proposed scheme attains compression ratios rivaling the best available algorithms, and fast compression, decompression, and query processing.
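The transform's first ingredient, word-to-index substitution against a growing dictionary, in its simplest form. The real scheme is semi-dynamic, encodes indices compactly, special-cases numbers, dates and times, and groups values by structural context in containers; none of that is attempted here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DictionarySubstitution {
    public static void main(String[] args) {
        String[] words = "<note> <to> Tove </to> <to> Jani </to> </note>".split(" ");
        Map<String, Integer> dict = new LinkedHashMap<>();
        List<Integer> encoded = new ArrayList<>();
        for (String w : words)
            // A previously unseen word is assigned the next free index.
            encoded.add(dict.computeIfAbsent(w, k -> dict.size()));
        System.out.println(dict);     // each distinct word stored once
        System.out.println(encoded);  // repeated words repeat only their index
    }
}
```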