Book Chapter

XBeGene: Scalable XML Documents Generator by Example Based on Real Data

TLDR
XBeGene, a novel XML By example Generator, produces synthetic XML data that closely reflects the user’s requirements. Experiments demonstrate high correlation levels between the specified user requirements and the characteristics of the generated XML data.
Abstract
XML datasets of various sizes and properties are needed to evaluate the correctness and efficiency of XML-based algorithms and applications. While several downloadable datasets can be found online, these are predefined by system experts and might not be suitable for evaluating every algorithm. Tools for generating synthetic XML documents offer an alternative solution, promoting flexibility and adaptability in generating synthetic document collections. Nonetheless, the usefulness of existing XML generators remains rather limited because of the restricted expressiveness they allow users. In this paper, we develop a novel XML By example Generator (XBeGene) for producing synthetic XML data that closely reflects the user’s requirements. Inspired by the query-by-example paradigm in information retrieval, our generator system (i) allows the user to provide her own sample XML documents as input, (ii) analyzes the structure, occurrence frequencies, and content distributions of each XML element in the user's input documents, and (iii) produces synthetic XML documents that closely match the user’s input data in both structural and content features. The size of each synthetic document, as well as that of the entire document collection, is also specified by the user. Clustering experiments demonstrate high correlation levels between the specified user requirements and the characteristics of the generated XML data, while timing results confirm our approach’s scalability to large-scale document collections.
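
The three steps listed in the abstract (accept sample documents, analyze per-element structure, occurrence frequencies, and content, then generate matching documents) can be illustrated with a minimal Python sketch. This is not the XBeGene implementation, whose details the abstract does not give. It assumes a deliberately simple model in which every element occurrence in the samples is stored as one observation and generation resamples those observations. The names `learn_profile`, `generate_element`, and `generate_collection` are hypothetical, and the sketch only controls the number of documents, not the size of each one.

```python
import random
import xml.etree.ElementTree as ET
from collections import defaultdict


def learn_profile(sample_docs):
    """Record, per tag, the child-tag sequences and text values seen in the samples."""
    profile = {"roots": [], "children": defaultdict(list), "texts": defaultdict(list)}
    for doc in sample_docs:
        root = ET.fromstring(doc)
        profile["roots"].append(root.tag)
        for elem in root.iter():
            # One observation per element occurrence, so later sampling
            # follows the empirical occurrence frequencies of the input.
            profile["children"][elem.tag].append([child.tag for child in elem])
            profile["texts"][elem.tag].append((elem.text or "").strip())
    return profile


def generate_element(profile, tag, rng, depth=0, max_depth=20):
    """Build one synthetic element by resampling the learned observations."""
    elem = ET.Element(tag)
    # An empty string means "no text was observed for that occurrence".
    elem.text = rng.choice(profile["texts"][tag]) or None
    if depth < max_depth:  # assumed guard against runaway recursive structures
        for child_tag in rng.choice(profile["children"][tag]):
            elem.append(generate_element(profile, child_tag, rng, depth + 1, max_depth))
    return elem


def generate_collection(sample_docs, n_docs, seed=0):
    """Return n_docs synthetic XML strings modeled on the sample documents."""
    rng = random.Random(seed)
    profile = learn_profile(sample_docs)
    return [
        ET.tostring(
            generate_element(profile, rng.choice(profile["roots"]), rng),
            encoding="unicode",
        )
        for _ in range(n_docs)
    ]


if __name__ == "__main__":
    samples = [
        "<lib><book><title>A</title><year>2001</year></book>"
        "<book><title>B</title></book></lib>"
    ]
    for doc in generate_collection(samples, n_docs=3, seed=1):
        print(doc)
```

Resampling whole child-tag sequences keeps sibling combinations that actually occurred in the input, at the cost of never producing a combination that was not observed. A generator that must hit a user-specified document size would need an additional mechanism, which this sketch leaves out.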

