
Showing papers on "XML" published in 2005


Proceedings ArticleDOI
03 Jan 2005
TL;DR: This paper introduces a novel approach for declaring information-object-related access restrictions, based on a valid XML encoding, and shows how the access restrictions can be declared using XACML and XPath.
Abstract: Web Services, the new building blocks of today's Internet, provide the power to access distributed and heterogeneous information objects, which is the basis for more advanced uses such as electronic commerce. But access to these information objects is not always unrestricted: the owner of the information objects may control access for various reasons. This paper introduces a novel approach for declaring information-object-related access restrictions, based on a valid XML encoding. The paper shows how the access restrictions can be declared using XACML and XPath. Because the specified policies are 'fine grained', multiple policies can be applicable at once. If these policies declare positive and negative permissions for the same subject, policy inconsistencies exist. The paper also focuses on identifying the grounds of policy inconsistencies and how to resolve them.

731 citations
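
To make the setting concrete, here is a minimal Python sketch of XPath-targeted permit/deny policies and the inconsistency check the abstract describes. The policy-triple format is invented for illustration (real XACML policies are themselves XML documents); only the conflict-detection idea follows the paper.

```python
# Toy sketch: 'fine grained' permit/deny policies whose targets are
# XPath expressions over the protected document, plus a check for the
# inconsistency the paper studies (the same subject both permitted and
# denied on overlapping nodes). Policy format invented, not real XACML.
import xml.etree.ElementTree as ET

DOC = ET.fromstring(
    "<patients>"
    "<record id='1'><name>Ann</name><diagnosis>flu</diagnosis></record>"
    "<record id='2'><name>Bob</name><diagnosis>cold</diagnosis></record>"
    "</patients>"
)

POLICIES = [                      # (subject, xpath target, effect)
    ("nurse", ".//record", "Permit"),
    ("nurse", ".//diagnosis", "Deny"),
]

def selected(xpath):
    """Node set an XPath target covers, including descendants."""
    return {id(d) for n in DOC.findall(xpath) for d in n.iter()}

def conflicts(policies):
    """Subjects holding Permit and Deny over overlapping node sets."""
    found = []
    for i, (s1, x1, e1) in enumerate(policies):
        for s2, x2, e2 in policies[i + 1:]:
            if s1 == s2 and e1 != e2 and selected(x1) & selected(x2):
                found.append((s1, x1, e1, x2, e2))
    return found

print(conflicts(POLICIES))
# -> [('nurse', './/record', 'Permit', './/diagnosis', 'Deny')]
```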


01 Dec 2005
TL;DR: This document specifies Atom, an XML-based format for syndicating Web content and metadata.
Abstract: This document specifies Atom, an XML-based Web content and metadata syndication format. [STANDARDS-TRACK]

326 citations
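
As a hedged illustration of the format being specified, the following Python sketch builds and re-reads a minimal Atom document using only the standard library. The namespace and the id/title/updated elements come from the Atom specification (RFC 4287); the entry content is invented.

```python
# Build a minimal Atom 1.0 feed and read it back.
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"      # namespace from RFC 4287
ET.register_namespace("", ATOM)

feed = ET.Element(f"{{{ATOM}}}feed")
ET.SubElement(feed, f"{{{ATOM}}}title").text = "Example Feed"
ET.SubElement(feed, f"{{{ATOM}}}id").text = "urn:example:feed"
ET.SubElement(feed, f"{{{ATOM}}}updated").text = "2005-12-01T00:00:00Z"
entry = ET.SubElement(feed, f"{{{ATOM}}}entry")
ET.SubElement(entry, f"{{{ATOM}}}title").text = "First post"
ET.SubElement(entry, f"{{{ATOM}}}id").text = "urn:example:entry-1"
ET.SubElement(entry, f"{{{ATOM}}}updated").text = "2005-12-01T00:00:00Z"

xml_bytes = ET.tostring(feed, xml_declaration=True, encoding="utf-8")
print(xml_bytes.decode())

# Reading it back: list entry titles.
parsed = ET.fromstring(xml_bytes)
for t in parsed.findall(f"{{{ATOM}}}entry/{{{ATOM}}}title"):
    print("entry:", t.text)
```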


Proceedings Article
30 Aug 2005
TL;DR: This paper designs a novel holistic twig join algorithm, called TJFast, which needs to access only the labels of the leaf query nodes to answer a twig query, and reports experimental results showing that the algorithms are superior to previous approaches in terms of the number of elements scanned, the size of intermediate results, and query performance.
Abstract: Finding all the occurrences of a twig pattern in an XML database is a core operation for efficient evaluation of XML queries. A number of algorithms have been proposed to process a twig query based on the region encoding labeling scheme. While region encoding supports efficient determination of the structural relationship between two elements, we observe that the information within a single label is very limited. In this paper, we propose a new labeling scheme, called extended Dewey. This is a powerful labeling scheme, since from the label of an element alone, we can derive all the element names along the path from the root to the element. Based on extended Dewey, we design a novel holistic twig join algorithm, called TJFast. Unlike all previous algorithms based on region encoding, TJFast only needs to access the labels of the leaf query nodes to answer a twig query. Through this, not only do we reduce disk access, but we also support the efficient evaluation of queries with wildcards in branching nodes, which is very difficult for algorithms based on region encoding. Finally, we report experimental results showing that our algorithms are superior to previous approaches in terms of the number of elements scanned, the size of intermediate results and query performance.

309 citations
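
The key property claimed for extended Dewey, deriving the full name path from a label alone, can be illustrated with a toy decoder. In the paper the possible child tags of each element type come from a schema-derived finite state transducer; here they are hard-coded invented lists, so this is only a sketch of the modulo idea:

```python
# Toy decoder for an extended-Dewey-style label: each label component,
# taken modulo the number of possible child tags of the current
# element, identifies the child's tag, so the root-to-element name
# path is recoverable from the label alone. Tag lists are invented.
CHILD_TAGS = {
    "dblp":    ["article", "inproceedings"],
    "article": ["author", "title", "year"],
}

def name_path(label, root_tag="dblp"):
    """Derive element names along the path from the root, given only
    the label: the property TJFast exploits."""
    path, tag = [root_tag], root_tag
    for component in label:
        options = CHILD_TAGS[tag]
        tag = options[component % len(options)]
        path.append(tag)
    return path

# The label (2, 4) decodes component by component:
#   2 % 2 == 0 -> "article";  4 % 3 == 1 -> "title"
print(name_path((2, 4)))   # ['dblp', 'article', 'title']
```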


Journal ArticleDOI
TL;DR: A look at how developers are going back to the future by building Web applications using Ajax (Asynchronous JavaScript and XML), a set of technologies mostly developed in the 1990s.
Abstract: This article looks at how developers are going back to the future by building Web applications using Ajax (Asynchronous JavaScript and XML), a set of technologies mostly developed in the 1990s. A key advantage of Ajax applications is that they look and act more like desktop applications. Proponents argue that Ajax applications perform better than traditional Web programs. For example, an Ajax application can add or retrieve new data for the page it is working with, and the page updates immediately without reloading.

300 citations


Proceedings ArticleDOI
Laura M. Haas, Mauricio A. Hernández, Howard Ho, Lucian Popa, Mary Roth
14 Jun 2005
TL;DR: The architecture and algorithms behind Clio are revisited, and some implementation issues, optimizations needed for scalability, and general lessons learned on the road toward creating an industrial-strength tool are discussed.
Abstract: Clio, the IBM Research system for expressing declarative schema mappings, has progressed in the past few years from a research prototype into the technology behind some of IBM's mapping tools. Clio provides a declarative way of specifying schema mappings between either XML or relational schemas. Mappings are compiled into an abstract query graph representation that captures the transformation semantics of the mappings. The query graph can then be serialized into different query languages, depending on the kind of schemas and systems involved in the mapping. Clio currently produces XQuery, XSLT, SQL, and SQL/XML queries. In this paper, we revisit the architecture and algorithms behind Clio. We then discuss some implementation issues, optimizations needed for scalability, and general lessons learned on the road toward creating an industrial-strength tool.

298 citations
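
A hedged sketch of the general idea, declarative correspondences applied mechanically to data, is below. The mapping format is invented; Clio itself compiles mappings into XQuery, XSLT, SQL, or SQL/XML rather than executing them directly:

```python
# Schematic of a declarative schema mapping applied to data: source
# columns map to target XML paths. The mapping vocabulary is invented
# for illustration only.
import xml.etree.ElementTree as ET

ROWS = [{"pid": 1, "pname": "Ann"}, {"pid": 2, "pname": "Bob"}]

# source column -> target element (relative to one <person> element)
MAPPING = {"pid": "id", "pname": "name"}

def apply_mapping(rows, mapping, collection="people", element="person"):
    root = ET.Element(collection)
    for row in rows:
        target = ET.SubElement(root, element)
        for src_col, tgt_elem in mapping.items():
            ET.SubElement(target, tgt_elem).text = str(row[src_col])
    return root

print(ET.tostring(apply_mapping(ROWS, MAPPING)).decode())
# <people><person><id>1</id><name>Ann</name></person>...</people>
```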


Journal Article
Wang Jun
TL;DR: This paper introduces the OAI Protocol for Metadata Harvesting (OAI-PMH), explains its main technical ideas, and discusses how to implement the protocol.

296 citations
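
For readers unfamiliar with the protocol, the following sketch issues an OAI-PMH ListRecords request and prints record identifiers. The verb and metadataPrefix parameters and the OAI namespace are part of the protocol; the repository URL is a placeholder.

```python
# Issue an OAI-PMH ListRecords request and read record headers.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
BASE = "https://example.org/oai"   # hypothetical repository endpoint

def list_records(base_url, prefix="oai_dc"):
    query = urllib.parse.urlencode({"verb": "ListRecords",
                                    "metadataPrefix": prefix})
    with urllib.request.urlopen(f"{base_url}?{query}") as resp:
        tree = ET.parse(resp)
    for header in tree.iter(f"{OAI}header"):
        yield header.findtext(f"{OAI}identifier")

if __name__ == "__main__":
    for identifier in list_records(BASE):
        print(identifier)
```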


Journal ArticleDOI
TL;DR: The OME Data Model, expressed in Extensible Markup Language (XML) and realized in a traditional database, is both extensible and self-describing, allowing it to meet emerging imaging and analysis needs.
Abstract: The Open Microscopy Environment (OME) defines a data model and a software implementation to serve as an informatics framework for imaging in biological microscopy experiments, including representation of acquisition parameters, annotations and image analysis results. OME is designed to support high-content cell-based screening as well as traditional image analysis applications. The OME Data Model, expressed in Extensible Markup Language (XML) and realized in a traditional database, is both extensible and self-describing, allowing it to meet emerging imaging and analysis needs.

289 citations


Patent
10 Jun 2005
TL;DR: An intelligent document recognition-based document management system as discussed by the authors includes modules for image capture, image enhancement, image identification, optical character recognition (OCR), data extraction, and quality assurance.
Abstract: An intelligent document recognition-based document management system (Fig. 2) includes modules for image capture (32), image enhancement (32), image identification (34), optical character recognition (36), data extraction (37) and quality assurance (42). The system captures data from electronic documents as diverse as facsimile images, scanned images and images from document management systems. It processes these images and presents the data in, for example, a standard XML format. The document management system processes both structured document images (40) (ones which have a standard format) and unstructured document images (38) (ones which do not have a standard format). The system can extract images directly from a facsimile machine, a scanner or a document management system for processing.

233 citations


Book ChapterDOI
28 Aug 2005
TL;DR: A technique is presented that represents the tree structure of an XML document efficiently by “compressing” it, which makes it possible to execute queries directly without prior decompression.
Abstract: Implementations that load XML documents and give access to them via, e.g., the DOM suffer from huge memory demands: the space needed to load an XML document is usually many times larger than the size of the document. A considerable amount of memory is needed to store the tree structure of the XML document. Here a technique is presented that allows the tree structure of an XML document to be represented efficiently. The representation exploits the high regularity in XML documents by “compressing” their tree structure, that is, by detecting and removing repetitions of tree patterns. The functionality of basic tree operations, like traversal along edges, is preserved in the compressed representation. This makes it possible to execute queries (and in particular, bulk operations) directly, without prior decompression. For certain tasks, like validation against an XML type or checking equality of documents, the representation allows for provably more efficient algorithms than those running on conventional representations.

225 citations
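
The core idea, sharing repeated tree patterns so the structure becomes a DAG, can be sketched in a few lines. This toy version hashes (tag, children) pairs and ignores text content; the paper's representation is considerably richer:

```python
# Minimal DAG compression of an XML tree: identical subtrees (same
# tag, same children) are stored once. Navigation still works on the
# compressed form because edges point to shared nodes.
import xml.etree.ElementTree as ET

def build_dag(elem, table):
    """Return a node id; identical (tag, children) subtrees share one."""
    key = (elem.tag, tuple(build_dag(c, table) for c in elem))
    if key not in table:
        table[key] = len(table)
    return table[key]

doc = ET.fromstring(
    "<library>"
    "<book><title/><author/></book>"
    "<book><title/><author/></book>"
    "<book><title/><author/></book>"
    "</library>"
)
table = {}
root_id = build_dag(doc, table)
tree_size = sum(1 for _ in doc.iter())
print(f"tree nodes: {tree_size}, dag nodes: {len(table)}")
# tree nodes: 10, dag nodes: 4 -- the repeated <book> pattern is shared
```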


01 Jan 2005
TL;DR: This paper presents a mapping between the data model elements of XML and OWL, gives an account of its implementation within a ready-to-use XSLT framework, and evaluates it for common use cases.
Abstract: By now, XML has reached wide acceptance as a data exchange format in E-Business. Efficient collaboration between different participants in E-Business is thus only possible when business partners agree on a common syntax and have a common understanding of the basic concepts in the domain. XML covers the syntactic level but lacks support for efficient sharing of conceptualizations. The Web Ontology Language (OWL [Bec04]), in turn, supports the representation of domain knowledge using classes, properties and instances for use in a distributed environment such as the World Wide Web. In this paper we present a mapping between the data model elements of XML and OWL. We give an account of its implementation within a ready-to-use XSLT framework, as well as its evaluation for common use cases.

208 citations
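
A schematic of one direction of such a mapping, with element names becoming classes, elements individuals, and attributes datatype properties, is sketched below. The paper's actual rules are realized as an XSLT framework; the Turtle-like output here is ad hoc:

```python
# Schematic XML -> OWL mapping: element names become classes, elements
# become individuals, attributes become datatype property assertions.
import itertools
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<order id='42'><item sku='A7' qty='2'/><item sku='B9' qty='1'/></order>"
)
_ids = itertools.count(1)

def xml_to_owl(elem):
    ind = f"ex:{elem.tag}_{next(_ids)}"        # individual for the element
    print(f"ex:{elem.tag} a owl:Class .")
    print(f"{ind} a ex:{elem.tag} .")
    for attr, value in elem.attrib.items():    # attributes -> properties
        print(f'{ind} ex:{attr} "{value}" .')
    for child in elem:
        print(f"{ind} ex:hasChild {xml_to_owl(child)} .")
    return ind

xml_to_owl(doc)
```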


Patent
03 May 2005
TL;DR: A chart view as discussed by the authors is a component of a data viewer used to retrieve, manipulate, and view documents in the Reusable Data Markup Language (RDML) format, which facilitates the browsing and manipulation of numbers, as opposed to text as in HTML.
Abstract: Methods and systems provide a “chart view” for a markup language referred to as Reusable Data Markup Language (“RDML”). Generally, a chart view comprises the components necessary for automatically manipulating and displaying a graphical display of numerical data contained in RDML markup documents. RDML is a markup language, such as the Hypertext Markup Language (“HTML”) or the Extensible Markup Language (“XML”). Generally, RDML facilitates the browsing and manipulation of numbers, as opposed to text as in HTML, and does so by requiring attributes describing the meaning of the numbers to be attached to the numbers. Upon receiving RDML markup documents, the chart view transforms, formats, manipulates and displays data stored in the markup documents using the attributes describing the meaning of the data. The chart view uses the attributes of the numbers to, for example, facilitate the simultaneous display of different series of numbers of different types on a single chart and automatically display appropriate axis labels, axis titles, chart titles, number precision, etc. A chart view may be a component of a data viewer used to retrieve, manipulate, and view documents in the RDML format.
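
The patent's central idea, numbers carrying attributes that describe their meaning so that a viewer can label a chart automatically, can be sketched as follows. The element and attribute names are invented and are not actual RDML:

```python
# Numbers carry metadata describing their meaning, so a "chart view"
# can derive titles and axis labels without asking the user.
import xml.etree.ElementTree as ET

DOC = ET.fromstring("""
<series title="Quarterly revenue" unit="USD millions" type="currency">
  <value period="2005-Q1">12.4</value>
  <value period="2005-Q2">13.1</value>
</series>
""")

print("chart title:", DOC.get("title"))
print("y-axis label:", DOC.get("unit"))
for v in DOC.findall("value"):
    print(f"  {v.get('period')}: {float(v.text):.1f} {DOC.get('unit')}")
```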

Proceedings ArticleDOI
14 Jun 2005
TL;DR: This paper develops a method to perform holistic twig pattern matching on XML documents partitioned using various streaming schemes; the method can process a large class of twig patterns consisting of both ancestor-descendant and parent-child relationships while avoiding redundant intermediate results.
Abstract: Searching for all occurrences of a twig pattern in an XML document is an important operation in XML query processing. Recently a holistic method, TwigStack [2], has been proposed. The method avoids generating large intermediate results which do not contribute to the final answer and is CPU and I/O optimal when twig patterns only have ancestor-descendant relationships. Another important direction of XML query processing is to build structural indexes [3][8][13][15] over XML documents to avoid unnecessary scanning of source documents. We regard XML structural indexing as a technique to partition XML documents and call it a streaming scheme in this paper. In this paper we develop a method to perform holistic twig pattern matching on XML documents partitioned using various streaming schemes. Our method avoids unnecessary scanning of irrelevant portions of XML documents. More importantly, depending on the different streaming schemes used, it can process a large class of twig patterns consisting of both ancestor-descendant and parent-child relationships and avoid generating redundant intermediate results. Our experiments demonstrate the applicability and the performance advantages of our approach.

Journal ArticleDOI
TL;DR: The technical contribution of the infrastructure is augmented by several research contributions: the first decomposition of an architecture description language into modules, insights about how to develop new language modules and a process for integrating them, and insights about the roles of different kinds of tools in a modular ADL-based infrastructure.
Abstract: Research over the past decade has revealed that modeling software architecture at the level of components and connectors is useful in a growing variety of contexts. This has led to the development of a plethora of notations for representing software architectures, each focusing on different aspects of the systems being modeled. In general, these notations have been developed without regard to reuse or extension. This makes the effort in adapting an existing notation to a new purpose commensurate with developing a new notation from scratch. To address this problem, we have developed an approach that allows for the rapid construction of new architecture description languages (ADLs). Our approach is unique because it encapsulates ADL features in modules that are composed to form ADLs. We achieve this by leveraging the extension mechanisms provided by XML and XML schemas. We have defined a set of generic, reusable ADL modules called xADL 2.0, useful as an ADL by itself, but also extensible to support new applications and domains. To support this extensibility, we have developed a set of reflective syntax-based tools that adapt to language changes automatically, as well as several semantically-aware tools that provide support for advanced features of xADL 2.0. We demonstrate the effectiveness, scalability, and flexibility of our approach through a diverse set of experiences. First, our approach has been applied in industrial contexts, modeling software architectures for aircraft software and spacecraft systems. Second, we show how xADL 2.0 can be extended to support the modeling features found in two different representations for modeling product-line architectures. Finally, we show how our infrastructure has been used to support its own development. The technical contribution of our infrastructure is augmented by several research contributions: the first decomposition of an architecture description language into modules, insights about how to develop new language modules and a process for integrating them, and insights about the roles of different kinds of tools in a modular ADL-based infrastructure.

Journal ArticleDOI
TL;DR: In this paper, the authors extend the syntax and semantics of RDF to cover named graphs, which enables RDF statements that describe graphs, useful in many Semantic Web application areas.

Patent
18 Nov 2005
TL;DR: A method and system for transforming an electronic document by learning transformation rules during training, using visual user feedback, and applying the learned rules to the original electronic document, to a second electronic document having a similar structure, or to all future instances of the original document.
Abstract: The present invention relates to a method and system for transforming an electronic document by learning transformation rules during training from the original electronic document using visual user feedback, and applying the learned transformation rules to the original electronic document, to a second electronic document having a similar structure as the original document, or to all future instances of the original electronic document. Accordingly, the transformed document is customized to the user's preference learned during training. Preferably, the transformed document is created in a queriable form. For example, the original electronic document can be defined in any type of mark-up language or electronic document generation language, such as Hypertext Markup Language (HTML), Extensible Markup Language (XML), Portable Document Format (PDF) or Microsoft® Word, and the like, and the transformed document is defined in a queriable language such as XML views and the like. For example, a virtual page can be a customization of an instance of a Web page which can be used to transform all future instances of the original Web page. Alternatively, the virtual page is formed from a customization of an original electronic document, such as a chapter in a book, which is applied to a second electronic document having a similar structure, such as all chapters in the book.

01 Jan 2005
TL;DR: A meta model for event logs is proposed that gives the requirements for the data that should be available, both informally and formally, and backs this up with an XML format called MXML and a tooling framework capable of reading MXML files.
Abstract: Modern process-aware information systems store detailed information about processes as they are being executed. This kind of information can be used for very different purposes. The term process mining refers to the techniques and tools to extract knowledge (e.g., in the form of models) from this. Several key players in this area have developed sophisticated process mining tools, such as Aris PPM and the HP Business Cockpit, that are capable of using the information available to generate meaningful insights. What most of these commercial process mining tools have in common is that installation and maintenance of the systems requires enormous effort and deep knowledge of the underlying information system. Moreover, information systems log events in different ways. Therefore, the interface between process-aware information systems and process mining tools is far from trivial, and it is vital to correctly map and interpret the event logs recorded by the underlying information systems. We therefore propose a meta model for event logs. We give the requirements for the data that should be available, both informally and formally. Furthermore, we back our meta model up with an XML format called MXML and a tooling framework that is capable of reading MXML files. Although the approach presented in this paper is very pragmatic, it can be seen as a first step towards an ontological analysis of process mining data.
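
A small example of reading such a log: the element layout below follows MXML as commonly described (process instances holding audit trail entries), while the log content itself is invented.

```python
# Read a simplified MXML-style event log.
import xml.etree.ElementTree as ET

LOG = """
<WorkflowLog>
  <Process id="order-handling">
    <ProcessInstance id="case-1">
      <AuditTrailEntry>
        <WorkflowModelElement>receive order</WorkflowModelElement>
        <EventType>complete</EventType>
        <Timestamp>2005-01-03T10:00:00</Timestamp>
        <Originator>alice</Originator>
      </AuditTrailEntry>
      <AuditTrailEntry>
        <WorkflowModelElement>ship order</WorkflowModelElement>
        <EventType>complete</EventType>
        <Timestamp>2005-01-03T15:30:00</Timestamp>
        <Originator>bob</Originator>
      </AuditTrailEntry>
    </ProcessInstance>
  </Process>
</WorkflowLog>
"""

root = ET.fromstring(LOG)
for case in root.iter("ProcessInstance"):
    print("case", case.get("id"))
    for entry in case.iter("AuditTrailEntry"):
        print("  ", entry.findtext("Timestamp"),
              entry.findtext("WorkflowModelElement"),
              entry.findtext("EventType"),
              "by", entry.findtext("Originator"))
```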

Patent
13 Apr 2005
TL;DR: Techniques for detecting, managing, and presenting syndication XML (feeds) are disclosed, in which a web browser automatically determines that a web site is publishing feeds and notifies the user, who can then access the feed easily.
Abstract: Techniques for detecting, managing, and presenting syndication XML (feeds) are disclosed. In one embodiment, a web browser automatically determines that a web site is publishing feeds and notifies the user, who can then access the feed easily. In another embodiment, a browser determines that a web page or feed is advertising relationship XML, and displays information about the people identified in the relationship XML. In yet another embodiment, a browser determines that a file contains a feed and enables the user to view it in a user-friendly way. In yet another embodiment, feed state information is stored in a repository that is accessible by applications that are used to view the feed. In yet another embodiment, if a feed's state changes, an application notifies the repository, and the state is updated. In yet another embodiment, a feed is parsed and stored in a structured way.
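
The first embodiment, a browser detecting that a page advertises feeds, is conventionally done via link autodiscovery tags; a minimal sketch (with an invented page):

```python
# Feed autodiscovery: scan a page's <link> tags for advertised
# Atom/RSS feeds. The rel/type convention is the standard one.
from html.parser import HTMLParser

FEED_TYPES = {"application/atom+xml", "application/rss+xml"}

class FeedFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if (tag == "link" and a.get("rel") == "alternate"
                and a.get("type") in FEED_TYPES):
            self.feeds.append(a.get("href"))

page = """<html><head>
<link rel="alternate" type="application/atom+xml" href="/feed.atom">
</head><body>...</body></html>"""

finder = FeedFinder()
finder.feed(page)
print("feeds advertised by this page:", finder.feeds)
```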

Journal ArticleDOI
TL;DR: A practical algorithm that, unlike classical algorithms based on determinization of tree automata, checks the inclusion relation by a top-down traversal of the original type expressions, which can exploit the property that type expressions being compared often share portions of their representations.
Abstract: We propose regular expression types as a foundation for statically typed XML processing languages. Regular expression types, like most schema languages for XML, introduce regular expression notations such as repetition (*), alternation (|), etc., to describe XML documents. The novelty of our type system is a semantic presentation of subtyping, as inclusion between the sets of documents denoted by two types. We give several examples illustrating the usefulness of this form of subtyping in XML processing. The decision problem for the subtype relation reduces to the inclusion problem between tree automata, which is known to be EXPTIME-complete. To avoid this high complexity in typical cases, we develop a practical algorithm that, unlike classical algorithms based on determinization of tree automata, checks the inclusion relation by a top-down traversal of the original type expressions. The main advantage of this algorithm is that it can exploit the property that type expressions being compared often share portions of their representations. Our algorithm is a variant of Aiken and Murphy's set-inclusion constraint solver, to which are added several new implementation techniques, correctness proofs, and preliminary performance measurements on some small programs in the domain of typed XML processing.
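
As a hedged illustration of the algorithmic style, checking inclusion by a memoized top-down traversal of the expressions themselves rather than by determinizing automata, here is the analogous procedure for ordinary string regular expressions using Brzozowski derivatives. The paper's algorithm works on regular expression types over XML trees, which this simplification deliberately sidesteps.

```python
# Language inclusion L(r) <= L(s) by traversing the expressions
# top-down via derivatives, memoizing visited pairs.
EMPTY, EPS = ("empty",), ("eps",)

def sym(c): return ("sym", c)

def cat(r, s):
    if EMPTY in (r, s): return EMPTY
    if r == EPS: return s
    if s == EPS: return r
    return ("cat", r, s)

def alt(r, s):
    if r == EMPTY: return s
    if s == EMPTY or r == s: return r
    return ("alt", *sorted((r, s)))   # canonical order: a|b == b|a

def star(r):
    if r in (EMPTY, EPS): return EPS
    return r if r[0] == "star" else ("star", r)

def nullable(r):
    op = r[0]
    if op in ("eps", "star"): return True
    if op in ("empty", "sym"): return False
    if op == "cat": return nullable(r[1]) and nullable(r[2])
    return nullable(r[1]) or nullable(r[2])          # alt

def deriv(r, c):
    """Brzozowski derivative: words of L(r) starting with c, minus c."""
    op = r[0]
    if op in ("eps", "empty"): return EMPTY
    if op == "sym": return EPS if r[1] == c else EMPTY
    if op == "alt": return alt(deriv(r[1], c), deriv(r[2], c))
    if op == "star": return cat(deriv(r[1], c), r)
    d = cat(deriv(r[1], c), r[2])                    # cat
    return alt(d, deriv(r[2], c)) if nullable(r[1]) else d

def included(r, s, alphabet, seen=None):
    """Check L(r) <= L(s) without determinizing automata."""
    seen = set() if seen is None else seen
    if (r, s) in seen: return True
    seen.add((r, s))
    if nullable(r) and not nullable(s): return False
    return all(included(deriv(r, c), deriv(s, c), alphabet, seen)
               for c in alphabet)

a, b = sym("a"), sym("b")
print(included(star(a), star(alt(a, b)), "ab"))   # True:  a* <= (a|b)*
print(included(star(alt(a, b)), star(a), "ab"))   # False
```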

Journal ArticleDOI
TL;DR: An agent-based framework for providing proactive services in domotic environments is presented, together with an agent architecture that adopts interoperability techniques for an adaptive domotic framework.
Abstract: The evolution of the microprocessor industry, combined with falling cost and increasing efficiency, gives rise to new scenarios for ubiquitous computing in which humans seamlessly trigger activities and tasks using unusual (often imperceptible) interfaces, according to physical space and context. Many problems must be faced: adaptivity, hybrid control strategies, system (hardware) integration, and ubiquitous networking access. In this paper, a solution that attempts to provide a flexible and dependable answer to these complicated problems is illustrated. First, an extensible markup language (XML)-derived technology is proposed to define the fuzzy markup language (FML), a markup language suited to defining the detailed structure of fuzzy control independently of its legacy representation. FML is essentially composed of three layers: 1) XML, in order to create a new markup language for fuzzy logic control; 2) a document type definition, in order to define the legal building blocks; and 3) extensible stylesheet language transformations, in order to convert a fuzzy controller description into a specific programming language. Then an agent-based framework designed for providing proactive services in domotic environments is presented. The agent architecture, exploiting mobile computation, is able to maximize fuzzy control deployment for the native FML representation by performing an efficient distribution of pieces of the global control flow over the different computers. Agents are also used to capture user habits, to identify requests, and to apply the artefact-mediated activity through an adaptive fuzzy control strategy. The architecture adopts interoperability techniques that, combined with sophisticated control facilities, make for an effective adaptive domotic framework.
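
A toy flavor of the layered idea, fuzzy control structure carried as XML and turned into executable logic, is sketched below. The element names are invented for illustration and do not follow the actual FML schema.

```python
# An XML-encoded fuzzy rule plus a toy membership function; the rule
# structure is invented, not real FML.
import xml.etree.ElementTree as ET

RULE = ET.fromstring("""
<fuzzyRule id="comfort-1">
  <antecedent variable="temperature" term="cold"/>
  <consequent variable="heating" term="high"/>
</fuzzyRule>
""")

# toy membership functions, keyed by (variable, term)
MEMBERSHIP = {
    ("temperature", "cold"): lambda t: max(0.0, min(1.0, (18 - t) / 8)),
}

def fire(rule, inputs):
    """Degree to which the rule's antecedent holds for crisp inputs."""
    a = rule.find("antecedent")
    mu = MEMBERSHIP[(a.get("variable"), a.get("term"))]
    return mu(inputs[a.get("variable")])

print(fire(RULE, {"temperature": 12.0}))   # 0.75 -> 'heating high' at 0.75
```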

Journal ArticleDOI
01 Apr 2005
TL;DR: The TurboXPath path processor is proposed, which accepts a language equivalent to a subset of the for-let-where constructs of XQuery over a single document, and can be extended to provide full XQuery support or used to augment federated database engines for efficient handling of queries over XML data streams produced by external sources.
Abstract: Efficient querying of XML streams will be one of the fundamental features of next-generation information systems. In this paper we propose the TurboXPath path processor, which accepts a language equivalent to a subset of the for-let-where constructs of XQuery over a single document. TurboXPath can be extended to provide full XQuery support or used to augment federated database engines for efficient handling of queries over XML data streams produced by external sources. Internally, TurboXPath uses a tree-shaped path expression with multiple outputs to drive the execution. The result of a query execution is a sequence of tuples of XML fragments matching the output nodes. Based on a streamed execution model, TurboXPath scales up to large documents and has limited memory consumption for increased concurrency. Experimental evaluation of a prototype demonstrates performance gains compared to other state-of-the-art path processors.
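
In the same spirit, though far simpler than TurboXPath's tree-shaped expressions with multiple outputs, here is a streamed evaluator for one linear child path with bounded memory:

```python
# Streamed path evaluation: consume parse events, never materializing
# the full tree. The path language is just '/'-separated child steps.
import io
import xml.etree.ElementTree as ET

def stream_match(source, path):
    """Yield the text of elements reached by the given child path."""
    steps = path.split("/")
    stack = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
        else:
            if stack == steps:
                yield elem.text
            stack.pop()
            elem.clear()      # drop processed content: bounded memory

doc = io.BytesIO(
    b"<bib><book><title>XML in 2005</title></book>"
    b"<book><title>Streams</title></book></bib>"
)
print(list(stream_match(doc, "bib/book/title")))
# ['XML in 2005', 'Streams']
```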

Proceedings ArticleDOI
14 Jun 2005
TL;DR: It is shown that NaLIX, while far from being able to pass the Turing test, is perfectly usable in practice, and able to handle even quite complex queries in a variety of application domains.
Abstract: Database query languages can be intimidating to the non-expert, leading to the immense recent popularity of keyword-based search in spite of its significant limitations. The holy grail has been the development of a natural language query interface. We present NaLIX, a generic interactive natural language query interface to an XML database. Our system can accept an arbitrary English language sentence as query input, which can include aggregation, nesting, and value joins, among other things. This query is translated, potentially after reformulation, into an XQuery expression that can be evaluated against an XML database. The translation is done by mapping the grammatical proximity of natural language parsed tokens to the proximity of corresponding elements in the result XML. In this demonstration, we show that NaLIX, while far from being able to pass the Turing test, is perfectly usable in practice, and able to handle even quite complex queries in a variety of application domains. In addition, we also demonstrate how carefully designed features in NaLIX facilitate the interactive query process and improve the usability of the interface.

Proceedings ArticleDOI
12 Oct 2005
TL;DR: This work proposes Program Trace Query Language (PTQL), a language based on relational queries over program traces, in which programmers can write expressive, declarative queries about program behavior, and describes the compiler, Partiqle, which instruments the program to execute the query on-line.
Abstract: Instrumenting programs with code to monitor runtime behavior is a common technique for profiling and debugging. In practice, instrumentation is either inserted manually by programmers, or automatically by specialized tools that monitor particular properties. We propose Program Trace Query Language (PTQL), a language based on relational queries over program traces, in which programmers can write expressive, declarative queries about program behavior. We also describe our compiler, Partiqle. Given a PTQL query and a Java program, Partiqle instruments the program to execute the query on-line. We apply several PTQL queries to a set of benchmark programs, including the Apache Tomcat Web server. Our queries reveal significant performance bugs in the jack SpecJVM98 benchmark, in Tomcat, and in the IBM Java class library, as well as some correct though uncomfortably subtle code in the Xerces XML parser. We present performance measurements demonstrating that our prototype system has usable performance.
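
PTQL itself targets Java and compiles queries into on-line instrumentation; as a conceptual Python analogue only, the sketch below records trace events and answers a relational-style query over them after the fact.

```python
# Record runtime events as rows, then run a relational-style query
# over the trace. A conceptual analogue of PTQL, not PTQL itself.
import sys
from collections import Counter

TRACE = []   # rows: (event, function, line)

def tracer(frame, event, arg):
    if event in ("call", "return"):
        TRACE.append((event, frame.f_code.co_name, frame.f_lineno))
    return tracer

def workload():
    def helper(n):
        return n * 2
    return sum(helper(i) for i in range(3))

sys.settrace(tracer)
workload()
sys.settrace(None)

# "SELECT function, COUNT(*) FROM trace WHERE event='call' GROUP BY function"
calls = Counter(fn for ev, fn, _ in TRACE if ev == "call")
print(calls)
```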

Journal ArticleDOI
12 Jun 2005
TL;DR: From such descriptions, the PADS compiler generates libraries and tools for manipulating the data, including parsing routines, statistical profiling tools, translation programs to produce well-behaved formats such as XML or those required for loading relational databases, and tools for running XQueries over raw PADS data sources.
Abstract: PADS is a declarative data description language that allows data analysts to describe both the physical layout of ad hoc data sources and semantic properties of that data. From such descriptions, the PADS compiler generates libraries and tools for manipulating the data, including parsing routines, statistical profiling tools, translation programs to produce well-behaved formats such as XML or those required for loading relational databases, and tools for running XQueries over raw PADS data sources. The descriptions are concise enough to serve as "living" documentation while flexible enough to describe most of the ASCII, binary, and Cobol formats that we have seen in practice. The generated parsing library provides for robust, application-specific error handling.
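
The flavor of the approach, a declarative description from which a parser is derived mechanically, can be suggested in miniature. The description format below is invented; PADS descriptions are far richer and the compiler generates C libraries and tools:

```python
# A declarative description of an ad hoc record layout, from which a
# parser is derived mechanically.
FIELDS = [            # name, converter -- one line of a log file
    ("ip", str),
    ("status", int),
    ("bytes", int),
]

def make_parser(fields, sep=" "):
    def parse(line):
        parts = line.rstrip("\n").split(sep)
        if len(parts) != len(fields):
            raise ValueError(f"expected {len(fields)} fields: {line!r}")
        return {name: conv(p) for (name, conv), p in zip(fields, parts)}
    return parse

parse = make_parser(FIELDS)
print(parse("10.0.0.1 200 5120"))
# -> {'ip': '10.0.0.1', 'status': 200, 'bytes': 5120}
```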

Journal ArticleDOI
TL;DR: The correspondences between the PDB dictionary and the XML schema metadata are described as well as the XML representations of PDB dictionaries and data files.
Abstract: Summary: The Protein Data Bank (PDB) has recently released versions of the PDB Exchange dictionary and the PDB archival data files in XML format, collectively named PDBML. The automated generation of these XML files is driven by the data dictionary infrastructure in use at the PDB. The correspondences between the PDB dictionary and the XML schema metadata are described, as well as the XML representations of PDB dictionaries and data files. Availability: The current software-translated XML schema file is located at http://deposit.pdb.org/pdbML/pdbx-v1.000.xsd, and on the PDB mmCIF resource page at http://deposit.pdb.org/mmcif/. PDBML files are stored on the PDB beta ftp site at ftp://beta.rcsb.org/pub/pdb/uniformity/data/XML. Contact: jwest@rcsb.rutgers.edu

Patent
02 Feb 2005
TL;DR: In this paper, a power management architecture for an electrical power distribution system, or portion thereof, is disclosed, which includes multiple electronic devices distributed throughout the power distribution systems to manage the flow and consumption of power from the system using real-time communications.
Abstract: A power management architecture for an electrical power distribution system, or portion thereof, is disclosed The architecture includes multiple electronic devices distributed throughout the power distribution system to manage the flow and consumption of power from the system using real time communications Power management application software and/or hardware components operate on the electronic devices and the back-end servers and inter-operate via the network to implement a power management application The architecture provides a scalable and cost effective framework of hardware and software upon which such power management applications can operate to manage the distribution and consumption of electrical power by one or more utilities/suppliers and/or customers which provide and utilize the power distribution system Autonomous communication on the network between IED's, back-end servers and other entities coupled with secure networks, themselves interconnected, via firewalls, by one or more unsecure networks, is facilitated by the use of an XML firewall using SOAP SOAP allows a device to communicate without knowledge of how the sender's system operates or data formats are organized
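
For context, the SOAP messaging the patent relies on wraps payloads in a standard XML envelope. The envelope structure and namespace below are standard SOAP 1.1; the meter-reading payload is invented:

```python
# Build a minimal SOAP 1.1 envelope with the standard library.
import xml.etree.ElementTree as ET

SOAP = "http://schemas.xmlsoap.org/soap/envelope/"
ET.register_namespace("soap", SOAP)

envelope = ET.Element(f"{{{SOAP}}}Envelope")
body = ET.SubElement(envelope, f"{{{SOAP}}}Body")
reading = ET.SubElement(body, "MeterReading")   # hypothetical payload
ET.SubElement(reading, "deviceId").text = "IED-17"
ET.SubElement(reading, "kWh").text = "1234.5"

print(ET.tostring(envelope, encoding="unicode"))
```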

Journal ArticleDOI
TL;DR: This paper investigates the challenges of providing effective mechanisms for enforcing enterprise policy across distributed domains, ensuring secure content-based access to enterprise resources at all user levels, and allowing the specification of temporal and nontemporal context conditions to support fine-grained dynamic access control.
Abstract: Modern day enterprises exhibit a growing trend toward adoption of enterprise computing services for efficient resource utilization, scalability, and flexibility. These environments are characterized by heterogeneous, distributed computing systems exchanging enormous volumes of time-critical data with varying levels of access control in a dynamic business environment. The enterprises are thus faced with significant challenges as they endeavor to achieve their primary goals, and simultaneously ensure enterprise-wide secure interoperation among the various collaborating entities. Key among these challenges are providing effective mechanisms for enforcing enterprise policy across distributed domains, ensuring secure content-based access to enterprise resources at all user levels, and allowing the specification of temporal and nontemporal context conditions to support fine-grained dynamic access control. In this paper, we investigate these challenges, and present X-GTRBAC, an XML-based GTRBAC policy specification language and its implementation for enforcing enterprise-wide access control. Our specification language is based on the GTRBAC model that incorporates the content- and context-aware dynamic access control requirements of an enterprise. An X-GTRBAC system has been implemented as a Java application. We discuss the salient features of the specification language, and present the software architecture of our system. A comprehensive example is included to discuss and motivate the applicability of the X-GTRBAC framework to a generic enterprise environment. An application level interface for implementing the policy in the X-GTRBAC system is also provided to consolidate the ideas presented in the paper.

Proceedings Article
30 Aug 2005
TL;DR: This work proposes novel XML scoring methods that are inspired by tf*idf and that account for both structure and content while considering query relaxations and proposes efficient data structures in order to speed up ranked query processing.
Abstract: XML repositories are usually queried both on structure and content. Due to the structural heterogeneity of XML, queries are often interpreted approximately and their answers are returned ranked by scores. Computing answer scores in XML is an active area of research that oscillates between pure content scoring, such as the well-known tf*idf, and taking structure into account. However, none of the existing proposals fully accounts for structure and combines it with content to score query answers. We propose novel XML scoring methods that are inspired by tf*idf and that account for both structure and content while considering query relaxations. Twig scoring accounts for the most structure and content and is thus used as our reference method. Path scoring is an approximation that loosens correlations between query nodes, hence reducing the amount of time required to manipulate scores during top-k query processing. We propose efficient data structures in order to speed up ranked query processing. We run extensive experiments that validate our scoring methods and that show that path scoring provides very high precision while improving score computation time.
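
A schematic (and deliberately simplified) illustration of combining content and structure scores, with exact structural matches weighted above relaxed ones, is given below; the weights and formula are invented, not the paper's:

```python
# tf*idf-style content score combined with a structure factor: exact
# path matches outrank relaxed (same leaf tag) matches.
import math

# (path, text) pairs standing in for indexed XML elements
ELEMENTS = [
    ("/book/title", "xml ranking"),
    ("/book/chapter/title", "xml basics"),
    ("/article/title", "databases"),
]

def score(query_path, term):
    n = len(ELEMENTS)
    df = sum(1 for _, text in ELEMENTS if term in text.split())
    idf = math.log((n + 1) / (df + 1))
    results = []
    for path, text in ELEMENTS:
        tf = text.split().count(term)
        if tf == 0:
            continue
        if path == query_path:
            structure = 1.0          # exact structural match
        elif path.endswith(query_path.rsplit("/", 1)[-1]):
            structure = 0.5          # relaxed match: same leaf tag
        else:
            continue
        results.append((path, tf * idf * structure))
    return sorted(results, key=lambda r: -r[1])

print(score("/book/title", "xml"))
```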

Proceedings ArticleDOI
14 Jun 2005
TL;DR: The overall architecture and design aspects of a hybrid relational and XML database system called System RX are described, which is the first truly hybrid system that comingles XML and relational data, giving them equal footing.
Abstract: This paper describes the overall architecture and design aspects of a hybrid relational and XML database system called System RX. We believe that such a system is fundamental in the evolution of enterprise data management solutions: XML and relational data will co-exist and complement each other in enterprise solutions. Furthermore, a successful XML repository requires much of the same infrastructure that already exists in a relational database management system. Finally, XML query languages have considerable conceptual and functional overlap with relational dataflow engines. System RX is the first truly hybrid system that comingles XML and relational data, giving them equal footing. The new support for XML includes native support for storage and indexing as well as query compilation and evaluation support for the latest industry-standard query languages, SQL/XML and XQuery. By building a hybrid system, we leverage more than 20 years of data management research to advance XML technology to the same standards expected from mature relational systems.

Journal ArticleDOI
TL;DR: For multiple data systems to cooperate with each other, they must understand each other’s schemas; without such understanding, the multitude of data sources amounts to a digital version of the Tower of Babel.
Abstract: When independent parties develop database schemas for the same domain, they will almost always be quite different from each other. These differences are referred to as semantic heterogeneity, which also appears in the presence of multiple XML documents, Web services, and ontologies—or more broadly, whenever there is more than one way to structure a body of data. The presence of semi-structured data exacerbates semantic heterogeneity, because semi-structured schemas are much more flexible to start with. For multiple data systems to cooperate with each other, they must understand each other’s schemas. Without such understanding, the multitude of data sources amounts to a digital version of the Tower of Babel.

Proceedings Article
30 Aug 2005
TL;DR: TopX as discussed by the authors is a top-k query engine for XML documents with a focus on inexpensive sequential access to index lists and only a few judiciously scheduled random accesses.
Abstract: This paper presents a novel engine, coined TopX, for efficient ranked retrieval of XML documents over semistructured but nonschematic data collections. The algorithm follows the paradigm of threshold algorithms for top-k query processing with a focus on inexpensive sequential accesses to index lists and only a few judiciously scheduled random accesses. The difficulties in applying the existing top-k algorithms to XML data lie in 1) the need to consider scores for XML elements while aggregating them at the document level, 2) the combination of vague content conditions with XML path conditions, 3) the need to relax query conditions if too few results satisfy all conditions, and 4) the selectivity estimation for both content and structure conditions and their impact on evaluation strategies. TopX addresses these issues by precomputing score and path information in an appropriately designed index structure, by largely avoiding or postponing the evaluation of expensive path conditions so as to preserve the sequential access pattern on index lists, and by selectively scheduling random accesses when they are cost-beneficial. In addition, TopX can compute approximate top-k results using probabilistic score estimators, thus speeding up queries with a small and controllable loss in retrieval precision.
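
The skeleton TopX builds on, threshold-style top-k processing over score-sorted index lists, can be sketched compactly. The lists are invented, and this eager version performs random accesses immediately, whereas TopX's point is to schedule only a few of them and to add XML path conditions and probabilistic pruning:

```python
# Threshold algorithm: consume score-sorted index lists, complete each
# newly seen document's score by random accesses, stop once the k-th
# best score reaches the threshold for unseen documents.
import heapq

LISTS = {   # one score-sorted index list per query condition
    "xml":     [("d1", 0.9), ("d3", 0.8), ("d2", 0.1)],
    "ranking": [("d3", 0.9), ("d2", 0.7), ("d1", 0.2)],
}

def full_score(doc, lists):
    """Random accesses: complete a document's aggregate score."""
    return sum(dict(lst).get(doc, 0.0) for lst in lists.values())

def top_k(lists, k):
    scores, pos = {}, {term: 0 for term in lists}
    while True:
        advanced = False
        for term, lst in lists.items():            # sorted accesses
            if pos[term] < len(lst):
                doc, _ = lst[pos[term]]
                pos[term] += 1
                advanced = True
                if doc not in scores:
                    scores[doc] = full_score(doc, lists)
        # best possible score of any document not yet seen
        threshold = sum(lst[pos[term]][1] if pos[term] < len(lst) else 0.0
                        for term, lst in lists.items())
        best = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
        if not advanced or (len(best) == k and best[-1][1] >= threshold):
            return best

print(top_k(LISTS, k=2))   # [('d3', 1.7), ('d1', 1.1)]
```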