Showing papers on "XML published in 2002"

PDF

Open Access

Proceedings Article•DOI•

Storing and querying ordered XML using a relational database system

[...]

Igor Tatarinov¹, Stratis D. Viglas², Kevin Scott Beyer³, Jayavel Shanmugasundaram⁴, Eugene J. Shekita³, Chun Zhang² - Show less +2 more•Institutions (4)

University of Washington¹, University of Wisconsin-Madison², IBM³, Cornell University⁴

03 Jun 2002

TL;DR: This paper shows that XML's ordered data model can indeed be efficiently supported by a relational database system, and proposes three order encoding methods that can be used to represent XML order in the relational data model, and also proposes algorithms for translating ordered XPath expressions into SQL using these encoding methods.

...read moreread less

Abstract: XML is quickly becoming the de facto standard for data exchange over the Internet. This is creating a new set of data management requirements involving XML, such as the need to store and query XML documents. Researchers have proposed using relational database systems to satisfy these requirements by devising ways to "shred" XML documents into relations, and translate XML queries into SQL queries over these relations. However, a key issue with such an approach, which has largely been ignored in the research literature, is how (and whether) the ordered XML data model can be efficiently supported by the unordered relational data model. This paper shows that XML's ordered data model can indeed be efficiently supported by a relational database system. This is accomplished by encoding order as a data value. We propose three order encoding methods that can be used to represent XML order in the relational data model, and also propose algorithms for translating ordered XPath expressions into SQL using these encoding methods. Finally, we report the results of an experimental study that investigates the performance of the proposed order encoding methods on a workload of ordered XML queries and updates.

...read moreread less

2,402 citations

Proceedings Article•DOI•

Holistic twig joins: optimal XML pattern matching

[...]

Nicolas Bruno¹, Nick Koudas², Divesh Srivastava²•Institutions (2)

Columbia University¹, AT&T Labs²

03 Jun 2002

TL;DR: This paper proposes a novel holistic twig join algorithm, TwigStack, that uses a chain of linked stacks to compactly represent partial results to root-to-leaf query paths, which are then composed to obtain matches for the twig pattern.

...read moreread less

Abstract: XML employs a tree-structured data model, and, naturally, XML queries specify patterns of selection predicates on multiple elements related by a tree structure. Finding all occurrences of such a twig pattern in an XML database is a core operation for XML query processing. Prior work has typically decomposed the twig pattern into binary structural (parent-child and ancestor-descendant) relationships, and twig matching is achieved by: (i) using structural join algorithms to match the binary relationships against the XML database, and (ii) stitching together these basic matches. A limitation of this approach for matching twig patterns is that intermediate result sizes can get large, even when the input and output sizes are more manageable.In this paper, we propose a novel holistic twig join algorithm, TwigStack, for matching an XML query twig pattern. Our technique uses a chain of linked stacks to compactly represent partial results to root-to-leaf query paths, which are then composed to obtain matches for the twig pattern. When the twig pattern uses only ancestor-descendant relationships between elements, TwigStack is I/O and CPU optimal among all sequential algorithms that read the entire input: it is linear in the sum of sizes of the input lists and the final result list, but independent of the sizes of intermediate results. We then show how to use (a modification of) B-trees, along with TwigStack, to match query twig patterns in sub-linear time. Finally, we complement our analysis with experimental results on a range of real and synthetic data, and query twig patterns.

...read moreread less

1,014 citations

Proceedings Article•DOI•

Structural joins: a primitive for efficient XML query pattern matching

[...]

Shurug Al-Khalifa¹, H. V. Jagadish, Nick Koudas¹, Jignesh M. Patel¹, Divesh Srivastava¹, Yuqing Wu¹ - Show less +2 more•Institutions (1)

University of Michigan¹

07 Aug 2002

TL;DR: It is shown that, in some cases, tree-merge algorithms can have performance comparable to stack-tree algorithms, in many cases they are considerably worse, and this behavior is explained by analytical results that demonstrate that, on sorted inputs, the stack- tree algorithms have worst-case I/O and CPU complexities linear in the sum of the sizes of inputs and output, while the tree-MERge algorithms do not have the same guarantee.

...read moreread less

Abstract: XML queries typically specify patterns of selection predicates on multiple elements that have some specified tree structured relationships. The primitive tree structured relationships are parent-child and ancestor-descendant, and finding all occurrences of these relationships in an XML database is a core operation for XML query processing. We develop two families of structural join algorithms for this task: tree-merge and stack-tree. The tree-merge algorithms are a natural extension of traditional merge joins and the multi-predicate merge joins, while the stack-tree algorithms have no counterpart in traditional relational join processing. We present experimental results on a range of data and queries using the TIMBER native XML query engine built on top of SHORE. We show that while, in some cases, tree-merge algorithms can have performance comparable to stack-tree algorithms, in many cases they are considerably worse. This behavior is explained by analytical results that demonstrate that, on sorted inputs, the stack-tree algorithms have worst-case I/O and CPU complexities linear in the sum of the sizes of inputs and output, while the tree-merge algorithms do not have the same guarantee.

...read moreread less

895 citations

Book Chapter•DOI•

XMark: a benchmark for XML data management

[...]

Albrecht Schmidt, Florian Waas¹, Martin L. Kersten, Michael J. Carey, Ioana Manolescu², Ralph Busse - Show less +2 more•Institutions (2)

Microsoft¹, French Institute for Research in Computer Science and Automation²

20 Aug 2002

TL;DR: This work provides a framework to assess the abilities of an XML database to cope with a broad range of different query types typically encountered in real-world scenarios and offers a set of queries where each query is intended to challenge a particular aspect of the query processor.

...read moreread less

Abstract: While standardization efforts for XML query languages have been progressing, researchers and users increasingly focus on the database technology that has to deliver on the new challenges that the abundance of XML documents poses to data management: validation, performance evaluation and optimization of XML query processors are the upcoming issues. Following a long tradition in database research, we provide a framework to assess the abilities of an XML database to cope with a broad range of different query types typically encountered in real-world scenarios. The benchmark can help both implementors and users to compare XML databases in a standardized application scenario. To this end, we offer a set of queries where each query is intended to challenge a particular aspect of the query processor. The overall workload we propose consists of a scalable document database and a concise, yet comprehensive set of queries which covers the major aspects of XML query processing ranging from textual features to data analysis queries and ad hoc queries. We complement our research with results we obtained from running the benchmark on several XML database platforms. These results are intended to give a first baseline and illustrate the state of the art.

...read moreread less

822 citations

Journal Article•DOI•

Annotea: an open RDF infrastructure for shared Web annotations

[...]

José Kahan¹, Marja-Riitta Koivunen², Eric Prud'hommeaux², Ralph R. Swick²•Institutions (2)

French Institute for Research in Computer Science and Automation¹, Massachusetts Institute of Technology²

05 Aug 2002-Computer Networks

TL;DR: The paper presents the overall design of Annotea and describes some of the issues the project has faced and how it has solved them, including combining RDF with XPointer, XLink, and HTTP.

...read moreread less

565 citations

Proceedings Article•DOI•

Accelerating XPath location steps

[...]

Torsten Grust¹•Institutions (1)

University of Konstanz¹

03 Jun 2002

TL;DR: This work is a proposal for a database index structure that has been specifically designed to support the evaluation of XPath queries, capable to support all XPath axes and able to start traversals from arbitrary context nodes in an XML document.

...read moreread less

Abstract: This work is a proposal for a database index structure that has been specifically designed to support the evaluation of XPath queries. As such, the index is capable to support all XPath axes (including ancestor, following, preceding-sibling, descendant-or-self, etc.). This feature lets the index stand out among related work on XML indexing structures which had a focus on regular path expressions (which correspond to the XPath axes children and descendant-or-self plus name tests). Its ability to start traversals from arbitrary context nodes in an XML document additionally enables the index to support the evaluation of path traversals embedded in XQuery expressions. Despite its flexibility, the new index can be implemented and queried using purely relational techniques, but it performs especially well if the underlying database host provides support for R-trees. A performance assessment which shows quite promising results completes this proposal.

...read moreread less

531 citations

Book•

XML Schema

[...]

Eric van der Vlist

15 Jun 2002

TL;DR: This book explains XML Schema foundations, a variety of different styles for writing schemas, simple and complex types, datatypes and facets, keys, extensibility, documentation, design choices, best practices, and limitations.

...read moreread less

Abstract: The W3C's XML Schema offers a powerful set of tools for defining acceptable XML document structures and content. While schemas are powerful, that power comes with substantial complexity. This book explains XML Schema foundations, a variety of different styles for writing schemas, simple and complex types, datatypes and facets, keys, extensibility, documentation, design choices, best practices, and limitations. Complete with references, a glossary, and examples throughout.

...read moreread less

525 citations

Book Chapter•DOI•

Translating web data

[...]

Lucian Popa¹, Yannis Velegrakis², Mauricio A. Hernández¹, Renée J. Miller², Ronald Fagin¹ - Show less +1 more•Institutions (2)

IBM¹, University of Toronto²

20 Aug 2002

TL;DR: A novel framework for mapping between any combination of XML and relational schemas is presented, in which a high-level, user-specified mapping is translated into semantically meaningful queries that transform source data into the target representation.

...read moreread less

Abstract: We present a novel framework for mapping between any combination of XML and relational schemas, in which a high-level, user-specified mapping is translated into semantically meaningful queries that transform source data into the target representation. Our approach works in two phases. In the first phase, the high-level mapping, expressed as a set of inter-schema correspondences, is converted into a set of mappings that capture the design choices made in the source and target schemas (including their hierarchical organization as well as their nested referential constraints). The second phase translates these mappings into queries over the source schemas that produce data satisfying the constraints and structure of the target schema, and preserving the semantic relationships of the source. Nonnull target values may need to be invented in this process. The mapping algorithm is complete in that it produces all mappings that are consistent with the schema constraints. We have implemented the translation algorithm in Clio, a schema mapping tool, and present our experience using Clio on several real schemas.

...read moreread less

495 citations

Book•

Document Object Model

[...]

Joe Marini

14 Aug 2002

TL;DR: The Document Object Model: Processing Structured Documents will help you flatten your learning curve, standardize programming, reuse code, and reduce development time.

...read moreread less

Abstract: From the Publisher: Here's a practical guide to using the W3C's standardized DOM interfaces to process XML and HTML documents Learn the concepts, design, theory, and origins of the DOM Use the DOM to inspect, navigate, and manipulate a document's nodes and content; then learn to build useful applications that can easily be ported to any DOM-compliant implementation without re-coding Get easy-to-follow advice on using the DOM in real-world scenarios such as manipulating document content, creating user interfaces, and offloading processing to the client side The Document Object Model: Processing Structured Documents will help you flatten your learning curve, standardize programming, reuse code, and reduce development time

...read moreread less

483 citations

Proceedings Article•

Evaluating Structural Similarity in XML Documents

[...]

Andrew Nierman, H. V. Jagadish

01 Jan 2002

TL;DR: A dynamic programming algorithm is developed that can compute pair-wise distances between documents in the collection, and then use these distances to cluster the documents, and finds that the resulting clusters match the original DTDs almost perfectly.

...read moreread less

Abstract: XML documents on the web are often found without DTDs, particularly when these documents have been created from legacy HTML. Yet having knowledge of the DTD can be valuable in querying and manipulating such documents. Recent work (cf. [10]) has given us a means to (re-)construct a DTD to describe the structure common to a given set of document instances. However, given a collection of documents with unknown DTDs, it may not be appropriate to construct a single DTD to describe every document in the collection. Instead, we would wish to partition the collection into smaller sets of “similar” documents, and then induce a separate DTD for each such set. It is this partitioning problem that we address in this paper. Given two XML documents, how can one measure structural (DTD) similarity between the two? We define a tree edit distance based measure suited to this task, taking into account XML issues such as optional and repeated sub-elements. We develop a dynamic programming algorithm to find this distance for any pair of documents. We validate our proposed distance measure experimentally. Given a collection of documents derived from multiple DTDs, we can compute pair-wise distances between documents in the collection, and then use these distances to cluster the documents. We find that the resulting clusters match the original DTDs almost perfectly, and demonstrate performance superior to alternatives based on previous proposals for measuring similarity of trees. The overall algorithm runs in time that is quadratic in document collection size, and quadratic in the combined size of the two documents involved in a given pair-wise distance calculation.

...read moreread less

479 citations

Proceedings Article•DOI•

Detecting changes in XML documents

[...]

Gregory Cobena¹, Serge Abiteboul, Amélie Marian²•Institutions (2)

French Institute for Research in Computer Science and Automation¹, Columbia University²

26 Feb 2002

TL;DR: This work is motivated by the support for change control in the context of the Xyleme project that is investigating dynamic warehouses capable of storing massive volumes of XML data, and offers a diff algorithm for XML data that runs in average in linear time vs. quadratic time.

...read moreread less

Abstract: We present a diff algorithm for XML data This work is motivated by the support for change control in the context of the Xyleme project that is investigating dynamic warehouses capable of storing massive volumes of XML data Because of the context, our algorithm has to be very efficient in terms of speed and memory space even at the cost of some loss of quality Also, it considers, besides insertions, deletions and updates (standard in diffs), a move operation on subtrees that is essential in the context of XML Intuitively, our diff algorithm uses signatures to match (large) subtrees that were left unchanged between the old and new versions Such exact matchings are then possibly propagated to ancestors and descendants to obtain more matchings It also uses XML specific information such as ID attributes We provide a performance analysis of the algorithm We show that it runs in average in linear time vs quadratic time for previous algorithms We present experiments on synthetic data that confirm the analysis Since this problem is NP-hard, the linear time is obtained by trading some quality We present experiments (again on synthetic data) that show that the output of our algorithm is reasonably close to the optimal in terms of quality Finally we present experiments on a small sample of XML pages found on the Web

...read moreread less

Journal Article•DOI•

TIMBER: A native XML database

[...]

H. V. Jagadish¹, Shurug Al-Khalifa¹, Adriane Chapman¹, Laks V. S. Lakshmanan², Andrew Nierman¹, Stelios Paparizos¹, Jignesh M. Patel¹, Divesh Srivastava³, Nuwee Wiwatwattana¹, Yuqing Wu¹, Cong Yu¹ - Show less +7 more•Institutions (3)

University of Michigan¹, University of British Columbia², AT&T³

12 Dec 2002

TL;DR: The overall design and architecture of the Timber XML database system currently being implemented at the University of Michigan is described, believing that the key intellectual contribution of this system is a comprehensive set-at-a-time query processing ability in a native XML store.

...read moreread less

Abstract: This paper describes the overall design and architecture of the Timber XML database system currently being implemented at the University of Michigan. The system is based upon a bulk algebra for manipulating trees, and natively stores XML. New access methods have been developed to evaluate queries in the XML context, and new cost estimation and query optimization techniques have also been developed. We present performance numbers to support some of our design decisions. We believe that the key intellectual contribution of this system is a comprehensive set-at-a-time query processing ability in a native XML store, with all the standard components of relational query processing, including algebraic rewriting and a cost-based optimizer.

...read moreread less

Proceedings Article•DOI•

Declarative composition and peer-to-peer provisioning of dynamic Web services

[...]

Boualem Benatallah¹, Marlon Dumas¹, Quan Z. Sheng¹, Anne H. H. Ngu²•Institutions (2)

University of New South Wales¹, Queensland University of Technology²

07 Aug 2002

TL;DR: This paper describes the design and implementation of a system through which existing Web services can be declaratively composed, and the resulting composite Services can be executed following a peer-to-peer paradigm, within a dynamic environment.

...read moreread less

Abstract: The development of new services through the integration of existing ones has gained a considerable momentum as a means to create and streamline business-to-business collaborations. Unfortunately, as Web services are often autonomous and heterogeneous entities, connecting and coordinating them in order to build integrated services is a delicate and time-consuming task. In this paper, we describe the design and implementation of a system through which existing Web services can be declaratively composed, and the resulting composite services can be executed following a peer-to-peer paradigm, within a dynamic environment. This system provides tools for specifying composite services through. statecharts, data conversion rules, and provider selection, policies. These specifications are then translated into XML documents that can be interpreted by peer-to-peer inter-connected software components, in order to provision the composite service without requiring a central authority.

...read moreread less

Patent•

Schema-based services for identity-based data access

[...]

Mark H. Lucovsky¹, Shaun D. Pierce¹, Ramu Movva¹, Jagadeesh Kalki¹, David Benjamin Auerbach¹, Peter S. Ford¹, Yun-Qi Yuan¹, Yi-Wen Guu¹, George Samuel John¹, William Raymond Hoffman¹, Jay C. Jacobs¹, Paul A. Steckler¹, Walter C. Hsueh¹, Kendall D. Keil¹, Burra Gopal¹, Steven D. White¹, Paul J. Leach¹, Richard B. Ward¹, Philip Michael Smoot¹, Lijiang Fang¹, Michael B. Taylor¹, Suresh Kannan¹, Winnie C. Wu¹ - Show less +19 more•Institutions (1)

Microsoft¹

14 Mar 2002

TL;DR: In this article, a schema-based service for Internet access to per-user services data is proposed, where access to data is based on each user's identity and each user manipulates (e.g., reads or writes) data in the logical document by data access requests through defined methods.

...read moreread less

Abstract: A schema-based service for Internet access to per-user services data, wherein access to data is based on each user's identity. The service includes a schema that defines rules and a structure for each user's data, and also includes methods that provide access to the data in a defined way. The services schema thus corresponds to a logical document containing the data for each user. The user manipulates (e.g., reads or writes) data in the logical document by data access requests through defined methods. In one implementation, the services schemas are arranged as XML documents, and the services provide methods that control access to the data based on the requesting user's identification, defined role and scope for that role. In this way, data can be accessed by its owner, and shared to an extent determined by the owner.

...read moreread less

Patent•

System and method for the delivery of electronic books

[...]

George Hay, Gerald Rasmussen

30 May 2002

TL;DR: In this paper, an electronic book on a computer readable medium, e.g., a CD or the like, has pre-recorded audio and visual text seamlessly linked together via a linking file (preferably, in XML format) such that a reader can switch back and forth "at will" between visually reading on a display screen and/or listening to the book being read aloud by an actual narrator.

...read moreread less

Abstract: An electronic book on a computer readable medium, e.g., a CD or the like, has “real-life” pre-recorded audio (preferably, in MP3 format) and visual text (preferably, in RTF format) seamlessly linked together via a linking file (preferably, in XML format) such that a reader can switch back and forth “at will” between visually reading on a computer display screen and/or listening to the book being read aloud by an actual narrator. The computer readable medium includes a reader program installed thereon and an automatic installation program. A novel process for creating the electronic book includes a creator program that may have a similar graphical user interface to the reader program. The electronic book may combine advantages of physical hard-cover books with new e-reading functionality developed by the present inventor.

...read moreread less

Journal Article•DOI•

BioMOBY: An open source biological web services proposal

[...]

Mark Wilkinson¹, Matthew G. Links²•Institutions (2)

National Research Council¹, University of Saskatchewan²

01 Dec 2002-Briefings in Bioinformatics

TL;DR: Native BioMOBY objects are lightweight XML, and make up both the query and the response of a simple object access protocol (SOAP) transaction.

...read moreread less

Abstract: BioMOBY is an Open Source research project which aims to generate an architecture for the discovery and distribution of biological data through web services; data and services are decentralised, but the availability of these resources, and the instructions for interacting with them, are registered in a central location called MOBY Central. BioMOBY adds to the web services paradigm, as exemplified by Universal Data Discovery and Integration (UDDI), by having an object-driven registry query system with object and service ontologies. This allows users to traverse expansive and disparate data sets where each possible next step is presented based on the data object currently in-hand. Moreover, a path from the current data object to a desired final data object could be automatically discovered using the registry. Native BioMOBY objects are lightweight XML, and make up both the query and the response of a simple object access protocol (SOAP) transaction.

...read moreread less

Journal Article•DOI•

Secure and selective dissemination of XML documents

[...]

Elisa Bertino¹, Elena Ferrari²•Institutions (2)

University of Milan¹, University of Insubria²

01 Aug 2002-ACM Transactions on Information and System Security

TL;DR: This article defines a formal model of access control policies for XML documents and proposes an approach, based on an extension of the Cryptolope#8482; approach, which essentially allows one to send the same document to all users, and yet to enforce the statedAccess control policies.

...read moreread less

Abstract: XML (eXtensible Markup Language) has emerged as a prevalent standard for document representation and exchange on the Web. It is often the case that XML documents contain information of different sensitivity degrees that must be selectively shared by (possibly large) user communities. There is thus the need for models and mechanisms enabling the specification and enforcement of access control policies for XML documents. Mechanisms are also required enabling a secure and selective dissemination of documents to users, according to the authorizations that these users have. In this article, we make several contributions to the problem of secure and selective dissemination of XML documents. First, we define a formal model of access control policies for XML documents. Policies that can be defined in our model take into account both user profiles, and document contents and structures. We also propose an approach, based on an extension of the Cryptolope™ approach [Gladney and Lotspiech 1997], which essentially allows one to send the same document to all users, and yet to enforce the stated access control policies. Our approach consists of encrypting different portions of the same document according to different encryption keys, and selectively distributing these keys to the various users according to the access control policies. We show that the number of encryption keys that have to be generated under our approach is minimal and we present an architecture to support document distribution.

...read moreread less

Proceedings Article•DOI•

Efficient filtering of XML documents with XPath expressions

[...]

Chee-Yong Chan¹, Pascal Felber², Minos Garofalakis², Rajeev Rastogi²•Institutions (2)

Bell Labs¹, Alcatel-Lucent²

07 Aug 2002

TL;DR: This paper proposes a novel index structure, termed XTrie, that supports the efficient filtering of XML documents based on XPath expressions and offers several novel features that, it believes, make it especially attractive for large-scale publish/subscribe systems.

...read moreread less

Abstract: We propose a novel index structure, termed XTrie, that supports the efficient filtering of XML documents based on XPath expressions. Our XTrie index structure offers several novel features that make it especially attractive for large scale publish/subscribe systems. First, XTrie is designed to support effective filtering based on complex XPath expressions (as opposed to simple, single-path specifications). Second, our XTrie structure and algorithms are designed to support both ordered and unordered matching of XML data. Third, by indexing on sequences of element names organized in a trie structure and using a sophisticated matching algorithm, XTrie is able to both reduce the number of unnecessary index probes as well as avoid redundant matchings, thereby providing extremely efficient filtering. Our experimental results over a wide range of XML document and XPath expression workloads demonstrate that our XTrie index structure outperforms earlier approaches by wide margins.

...read moreread less

Proceedings Article•DOI•

From XML schema to relations: a cost-based approach to XML storage

[...]

Philip Bohannon¹, Juliana Freire¹, Prasan Roy¹, Jérôme Siméon¹•Institutions (1)

Bell Labs¹

07 Aug 2002

TL;DR: LegionDB as discussed by the authors is a cost-based XML storage mapping engine that explores a space of possible XML-to-relational mappings and selects the best mapping for a given application.

...read moreread less

Abstract: As Web applications manipulate an increasing amount of XML, there is a growing interest in storing XML data in relational databases. Due to the mismatch between the complexity of XML's tree structure and the simplicity of flat relational tables, there are many ways to store the same document in an RDBMS, and a number of heuristic techniques have been proposed. These techniques typically define fixed mappings and do not take application characteristics into account. However, a fixed mapping is unlikely to work well for all possible applications. In contrast, LegoDB is a cost-based XML storage mapping engine that explores a space of possible XML-to-relational mappings and selects the best mapping for a given application. LegoDB leverages current XML and relational technologies: (1) it models the target application with an XML Schema, XML data statistics, and an XQuery workload; (2) the space of configurations is generated through XML-Schema rewritings; and (3) the best among the derived configurations is selected using cost estimates obtained through a standard relational optimizer. We describe the LegoDB storage engine and provide experimental results that demonstrate the effectiveness of this approach.

...read moreread less

Journal Article•DOI•

xlinkit: a consistency checking and smart link generation service

[...]

Christian Nentwich¹, Licia Capra¹, Wolfgang Emmerich¹, Anthony Finkelsteiin¹•Institutions (1)

University College London¹

01 May 2002-ACM Transactions on Internet Technology

TL;DR: A novel semantics for first-order logic that produces links instead of truth values is described and a content management strategy is given to validate UML models supplied by industrial partners.

...read moreread less

Abstract: xlinkit is a lightweight application service that provides rule-based link generation and checks the consistency of distributed Web content. It leverages standard Internet technologies, notably XML, XPath, and XLink. xlinkit can be used as part of a consistency management scheme or in applications that require smart link generation, including portal construction and management of large document repositories. In this article we show how consistency constraints can be expressed and checked. We describe a novel semantics for first-order logic that produces links instead of truth values and give an account of our content management strategy. We present the architecture of our service and the results of two substantial case studies that use xlinkit for checking course syllabus information and for validating UML models supplied by industrial partners.

...read moreread less

Proceedings Article•DOI•

YFilter: efficient and scalable filtering of XML documents

[...]

Yanlei Diao¹, Peter M. Fischer, Michael J. Franklin, R. To•Institutions (1)

University of California, Berkeley¹

07 Aug 2002

TL;DR: A filtering engine called YFilter is built, which filters streaming XML documents according to XQuery or XPath queries that involve both path expressions and predicates, and uses a novel NFA-based execution model.

...read moreread less

Abstract: Much of the data exchanged over the Internet will soon be encoded in XML, allowing for sophisticated filtering and content-based routing. We have built a filtering engine called YFilter, which filters streaming XML documents according to XQuery or XPath queries that involve both path expressions and predicates. Unlike previous work, YFilter uses a novel NFA-based execution model. We present the structures and algorithms underlying YFilter, and show its efficiency and scalability under various workloads.

...read moreread less

Book Chapter•DOI•

Efficient structural joins on indexed XML documents

[...]

Shu-Yao Chien¹, Zografoula Vagena², Donghui Zhang², Vassilis J. Tsotras², Carlo Zaniolo¹ - Show less +1 more•Institutions (2)

University of California, Los Angeles¹, University of California, Riverside²

20 Aug 2002

TL;DR: This paper proposes efficient structural join algorithms in the presence of tag indices using B+- trees and an enhancement based on sibling pointers that further improves performance, and presents a structural join algorithm that utilizes R-trees.

...read moreread less

Abstract: Queries on XML documents typically combine selections on element contents, and, via path expressions, the structural relationships between tagged elements. Structural joins are used to find all pairs of elements satisfying the primitive structural relationships specified in the query, namely, parent-child and ancestor-descendant relationships. Efficient support for structural joins is thus the key to efficient implementations of XML queries. Recently proposed node numbering schemes enable the capturing of the XML document structure using traditional indices (such as B+-trees or R-trees). This paper proposes efficient structural join algorithms in the presence of tag indices. We first concentrate on using B+- trees and show how to expedite a structural join by avoiding collections of elements that do not participate in the join. We then introduce an enhancement (based on sibling pointers) that further improves performance. Such sibling pointers are easily implemented and dynamically maintainable. We also present a structural join algorithm that utilizes R-trees. An extensive experimental comparison shows that the B+-tree structural joins are more robust. Furthermore, they provide drastic improvement gains over the current state of the art.

...read moreread less

Patent•

Visual and interactive wrapper generation, automated information extraction from web pages, and translation into xml

[...]

Robert Baumgartner, Sergio Flesca, Georg Gottlob, Marcus Herzog

28 May 2002

TL;DR: A method and a system for information extraction from Web pages formatted with markup languages such as HTML is described in this paper, where each pattern is defined via the (interactive) specification of one or more filters.

...read moreread less

Abstract: A method and a system for information extraction from Web pages formatted with markup languages such as HTML [8]. A method and system for interactively and visually describing information patterns of interest based on visualized sample Web pages [5,6,16-29]. A method and data structure for representing and storing these patterns [1]. A method and system for extracting information corresponding to a set of previously defined patterns from Web pages [2], and a method for transforming the extracted data into XML is described. Each pattern is defined via the (interactive) specification of one or more filters. Two or more filters for the same pattern contribute disjunctively to the pattern definition [3], that is, an actual pattern describes the set of all targets specified by any of its filters. A method and for extracting relevant elements from Web pages by interpreting and executing a previously defined wrapper program of the above form on an input Web page [9-14] and producing as output the extracted elements represented in a suitable data structure. A method and system for automatically translating said output into XML format by exploiting the hierarchical structure of the patterns and by using pattern names as XML tags is described.

...read moreread less

Proceedings Article•DOI•

XGrind: a query-friendly XML compressor

[...]

Pankaj M. Tolani¹, Jayant R. Haritsa¹•Institutions (1)

Indian Institute of Science¹

07 Aug 2002

TL;DR: Performance evaluations over a variety of XML documents and user queries indicate that XGrind simultaneously delivers improved query processing times and reasonable compression ratios.

...read moreread less

Abstract: XML documents are extremely verbose since the "schema" is repeated for every "record" in the document. While a variety of compressors are available to address this problem, they are not designed to support direct querying of the compressed document, a useful feature from a database perspective. In this paper, we propose a new compression tool, called XGrind, that directly supports queries in the compressed domain. A special feature of XGrind is that the compressed document retains the structure of the original document, permitting reuse of the standard XML techniques for processing the compressed document. Performance evaluations over a variety of XML documents and user queries indicate that XGrind simultaneously delivers improved query processing times and reasonable compression ratios.

...read moreread less

Patent•

Method and System for Cross-Platform Form Creation and Deployment

[...]

George Wesley Bradley¹, Jean Louis Brousseau¹, Kevin Paul Matassa¹, Ernest Herscheal James Foster¹, Andrew John Neilson¹, Mark C. Leyden¹, Keith Rolland McLellan¹, Mark Andrew Brooks¹, Zbigniew Rachniowski¹, Anthony Robert Rumsey¹, Nasif Hussain Dawd¹ - Show less +7 more•Institutions (1)

Adobe Systems¹

15 Apr 2002

TL;DR: In this paper, the authors present a system and methods of creating and deploying electronic forms for collecting information from a user using a browser, where the browser may be one of a plurality of browser platforms and the characteristics of forms are entered by a human designer using a form designer by using drag-and-drop operations, and stored in XML template files.

...read moreread less

Abstract: The present invention is directed to systems and methods of creating and deploying electronic forms for collecting information from a user using a browser, where the browser may be one of a plurality of browser platforms. Characteristics of forms are entered by a human designer using a form designer by using drag-and-drop operations, and stored in XML template files. The form may be previewed by the designer. When a user on the Internet (or an intranet) requests a form by a browser, the characteristics of the browser are sensed and a form appropriate for the browser is deployed to the browser by a form server. Information is then captured from the user. The form may also be saved or printed.

...read moreread less

Proceedings Article•DOI•

Labeling dynamic XML trees

[...]

Edith Cohen¹, Haim Kaplan², Tova Milo²•Institutions (2)

AT&T Labs¹, Tel Aviv University²

03 Jun 2002

TL;DR: Algorithms to label the nodes of an XML tree which is subject to insertions and deletions of nodes are presented and it is proved that their algorithms assign the shortest possible labels which satisfy these requirements.

...read moreread less

Abstract: We present algorithms to label the nodes of an XML tree which is subject to insertions and deletions of nodes. The labeling is done such that (1) we label each node immediately when it is inserted and this label remains unchanged, and (2) from a pair of labels alone, we can decide whether one node is an ancestor of the other. This problem arises in the context of XML databases that support queries on the structure of the documents as well us on the changes made to the documents over time. We prove that our algorithms assign the shortest possible labels (up to a constant factor) which satisfy these requirements.We also consider the same problem when "clues" that provide guarantees on possible future insertions are given together with newly inserted nodes. Such clues can be derived from the DTD or from statistics on similar XML trees. We present algorithms that use the clues to assign shorter labels. We also prove that the length of our labels is close to the minimum possible.

...read moreread less

Book Chapter•DOI•

XPath: Looking Forward

[...]

Dan Olteanu¹, Holger Meuss¹, Tim Furche¹, François Bry¹•Institutions (1)

Ludwig Maximilian University of Munich¹

24 Mar 2002

TL;DR: Equivalences of XPath 1.0 location paths involving reverse axes, such as anc and prec, are established and used as rewriting rules in an algorithm for transforming location paths with reverse axes into equivalent reverse-axis-free ones.

...read moreread less

Abstract: The location path language XPath is of particular importance for XML applications since it is a core component of many XML processing standards such as XSLT or XQuery. In this paper, based on axis symmetry of XPath, equivalences of XPath 1.0 location paths involving reverse axes, such as anc and prec, are established. These equivalences are used as rewriting rules in an algorithm for transforming location paths with reverse axes into equivalent reverse-axis-free ones. Location paths without reverse axes, as generated by the presented rewriting algorithm, enable efficient SAX-like streamed data processing of XPath.

...read moreread less

Patent•

Method for dynamically generating a user interface from XML-based documents

[...]

George Lin¹, Xiaodong Xu¹•Institutions (1)

Intel¹

29 Mar 2002

TL;DR: In this article, a method for dynamically generating a graphical user interface (GUI) from XML-based documents is presented, where visual components or display objects for building a GUI are defined, as well as a layout hierarchy describing layout relationships between the display objects, specifying how related display objects are to be laid out relative to each other on a layout window in the GUI.

...read moreread less

Abstract: A method for dynamically generating a graphical user interface (GUI) from XML-based documents. In accordance with the method, visual components or display objects for building a GUI are defined, as well as a layout hierarchy describing layout relationships between the display objects, specifying how related display objects are to be laid out relative to each other on a layout window in the GUI. XML elements in an XML document pertaining to respective display objects are identified. A GUI is generated by rendering the identified display objects on the layout window, wherein the size and the position of each display object is based on layout rules defined by the layout hierarchy and a hierarchical position of the XML element pertaining to the display object within a hierarchy of XML elements of the XML document. The appearance of display objects in the GUI may also be altered through the use layout descriptors.

...read moreread less

Proceedings Article•DOI•

XClust: clustering XML schemas for effective integration

[...]

Mong Li Lee¹, Liang Huai Yang¹, Wynne Hsu¹, Xia Yang¹•Institutions (1)

National University of Singapore¹

04 Nov 2002

TL;DR: XClust is introduced, a novel integration strategy that involves the clustering of DTDs that are similar in structure and semantics and a matching algorithm based on the semantics, immediate descendents and leaf-context similarity of D TD elements is developed.

...read moreread less

Abstract: It is increasingly important to develop scalable integration techniques for the growing number of XML data sources. A practical starting point for the integration of large numbers of Document Type Definitions (DTDs) of XML sources would be to first find clusters of DTDs that are similar in structure and semantics. Reconciling similar DTDs within such a cluster will be an easier task than reconciling DTDs that are different in structure and semantics as the latter would involve more restructuring. We introduce XClust, a novel integration strategy that involves the clustering of DTDs. A matching algorithm based on the semantics, immediate descendents and leaf-context similarity of DTD elements is developed. Our experiments to integrate real world DTDs demonstrate the effectiveness of the XClust approach.

...read moreread less

Journal Article•DOI•

Anatomy of a native XML base management system

[...]

Thorsten Fiebig¹, Sven Helmer², Carl-Christian Kanne, Guido Moerkotte², Julia Neumann², Robert Schiele², Till Westmann - Show less +3 more•Institutions (2)

Software AG¹, University of Mannheim²

12 Dec 2002

TL;DR: This paper gives a tour of Natix, a database management system designed from scratch for storing and processing XML data, showing how to design and optimize areas such as storage, transaction management - comprising recovery and multi-user synchronization - as well as query processing for XML.

...read moreread less

Abstract: Several alternatives to manage large XML document collections exist, ranging from file systems over relational or other database systems to specifically tailored XML base management systems. In this paper we give a tour of Natix, a database management system designed from scratch for storing and processing XML data. Contrary to the common belief that management of XML data is just another application for traditional databases like relational systems, we illustrate how almost every component in a database system is affected in terms of adequacy and performance. We show how to design and optimize areas such as storage, transaction management - comprising recovery and multi-user synchronization - as well as query processing for XML.

...read moreread less

Collapse