Journal ArticleDOI

Semantic-based Structural and Content indexing for the efficient retrieval of queries over large XML data repositories

01 Jul 2014-Future Generation Computer Systems (North-Holland)-Vol. 37, pp 212-231
TL;DR: The article adopts an optimization approach that takes the semantics of the dataset into account in order to deal with the complexity of multi-disciplinary domains in Big Data, in particular when the data is represented as XML documents.
About: This article was published in Future Generation Computer Systems on 2014-07-01. It has received 16 citations to date. The article focuses on the topics: Streaming XML & Efficient XML Interchange.
Citations
Journal ArticleDOI
TL;DR: This special issue presents advances in Semantics, Intelligent Processing and Services for Big Data and their applications to a variety of domains including mobile computing, smart cities, forensics and medicine.

29 citations


Cites background from "Semantic-based Structural and Conte..."

  • ...[6], address semantic approaches for structural and content indexing as a basis for efficient retrieval of queries over large XML data repositories....

    [...]

Posted Content
TL;DR: The share of publications containing empirical results is well below the average compared to computer science research as a whole, and Variety is considered the most promising uncharted area in Big Data.
Abstract: Background: Big Data is a relatively new field of research and technology, and literature reports a wide variety of concepts labeled with Big Data. The maturity of a research field can be measured in the number of publications containing empirical results. In this paper we present the current status of empirical research in Big Data. Method: We employed a systematic mapping method with which we mapped the collected research according to the labels Variety, Volume and Velocity. In addition, we addressed the application areas of Big Data. Results: We found that 151 of the assessed 1778 contributions contain a form of empirical result and can be mapped to one or more of the 3 V’s and 59 address an application area. Conclusions: The share of publications containing empirical results is well below the average compared to computer science research as a whole. In order to mature the research on Big Data, we recommend applying empirical methods to strengthen the confidence in the reported results. Based on our trend analysis we consider Variety to be the most promising uncharted area in Big Data.

18 citations


Additional excerpts

  • ...Volume 85 [3][4][6][14][17] [18][19][21][22][27][28] [29][32][34][35][37][39] [41][49][58][57][60][63] [64][66][67][71][73] [78][81][82][83][88] [93] [94] [97] [105][107] [110] [116] [118] [122] [123] [132][134] [136] [137] [143] [146] [148] [151] [160] [169][172] [173] [180] [181] [182] [188] [193] [190] [192] [194] [201] [198][199][200] [205] [203] [204] [207] [208] [212] [210] [213] [215][216] [227] [220] [222][223] [221] [228] [231] [234] Velocity 11 [51][59][77] [79][84][103] [164][183][209][214] [219]...

    [...]

Journal ArticleDOI
TL;DR: To the best of the authors' knowledge, this is the first work that provides a detailed description of XML query processing techniques that are related to structural aspects and that contains information about their theoretical and practical features as well as about their mutual compatibility and general usability.
Abstract: Since the boom in new proposals on techniques for efficient querying of XML data is now over and the research world has shifted its attention toward new types of data formats, we believe that it is crucial to review what has been done in the area to help users choose an appropriate strategy and scientists exploit the contributions in new areas of data processing. The aim of this work is to provide a comprehensive study of the state-of-the-art of approaches for the structural querying of XML data. In particular, we start with a description of labeling schemas to capture the structure of the data and the respective storage strategies. Then we deal with the key part of every XML query processing: a twig query join, XML query algebras, optimizations of query plans, and selectivity estimation of XML queries. To the best of our knowledge, this is the first work that provides such a detailed description of XML query processing techniques that are related to structural aspects and that contains information about their theoretical and practical features as well as about their mutual compatibility and general usability.

16 citations


Additional excerpts

  • ...Another type of partitioning is semantic partitioning (Alghamdi et al. 2014) that partitions the XML document according to the structure specified in an XML schema....

    [...]

Journal ArticleDOI
TL;DR: This research project reviews some of the latest techniques for each node indexing group and identifies trends that can be useful for new researchers.
Abstract: Background/Objectives: Node indexing has been developed to optimize query retrieval. Since its inception early in the century, many node indexing techniques have been proposed. Methods/Statistical Analysis: Node indexing can be grouped into four major groups: subtree labeling, prefix-based labeling, multiplicative labeling and hybrid labeling. Each indexing technique has its advantages and disadvantages. However, there is an absence of literature reviews of the recent techniques; the latest one was in 2009. As such, this research project aims to review some of the latest techniques for each node indexing group. Findings: Choosing a correct indexing scheme is critical. For example, the size of a prefix-based indexing scheme grows too large, while a high computation cost is needed to annotate using multiplicative labeling. On the other hand, the subtree group is weak in data updates, while a hybrid scheme combines various schemes with the aim of creating a scheme with the strengths of several schemes. Application/Improvements: Most importantly, this review explores and identifies trends which can be useful for new researchers.

6 citations

Journal ArticleDOI
TL;DR: A Data Service Framework, which integrates the data from distributed sources such as databases, Simple Object Access Protocol (SOAP) based web services and flat files, and performs create, read, update and delete (CRUD) operations on it through Representational State Transfer (REST) services over the Hyper Text Transfer Protocol (HTTP).
Abstract: Heterogeneous data on distributed computing sources are growing day by day. To manage the data from the distributed sources into a distinct type of application like mobile, cloud, desktop, web etc. is a challenging issue in the global information systems, particularly for cooperation and interoperability. This paper proposes a Data Service Framework, which integrates the data from distributed sources such as databases, Simple Object Access Protocol (SOAP) based web services and flat files, and performs create, read, update and delete (CRUD) operations on it through Representational State Transfer (REST) services over the Hyper Text Transfer Protocol (HTTP). The proposed data service framework also supports java database connectivity (JDBC). Detailed description of the proposed framework and experimental results are reported in this paper.

5 citations


Cites background or methods from "Semantic-based Structural and Conte..."

  • ...Alghamdi [18] optimize the structural and constant part of XML queries by introducing the method of indexing and processing XML data based on the concept of objects that is formed from the semantic...

    [...]

  • ...for cooperation and interoperability within the global information systems [18, 19]....

    [...]

References
Proceedings ArticleDOI
03 Jun 2002
TL;DR: This paper shows that XML's ordered data model can indeed be efficiently supported by a relational database system, and proposes three order encoding methods that can be used to represent XML order in the relational data model, and also proposes algorithms for translating ordered XPath expressions into SQL using these encoding methods.
Abstract: XML is quickly becoming the de facto standard for data exchange over the Internet. This is creating a new set of data management requirements involving XML, such as the need to store and query XML documents. Researchers have proposed using relational database systems to satisfy these requirements by devising ways to "shred" XML documents into relations, and translate XML queries into SQL queries over these relations. However, a key issue with such an approach, which has largely been ignored in the research literature, is how (and whether) the ordered XML data model can be efficiently supported by the unordered relational data model. This paper shows that XML's ordered data model can indeed be efficiently supported by a relational database system. This is accomplished by encoding order as a data value. We propose three order encoding methods that can be used to represent XML order in the relational data model, and also propose algorithms for translating ordered XPath expressions into SQL using these encoding methods. Finally, we report the results of an experimental study that investigates the performance of the proposed order encoding methods on a workload of ordered XML queries and updates.

2,402 citations


"Semantic-based Structural and Conte..." refers background or methods in this paper

  • ...Thereafter, [31] utilized the idea of Dewey labeling in XML data as for each node, there is an associated vector of numbers that represents the ID of the node in a path from the root to that node by including its ancestors coding as a prefix and it also includes the node number within its siblings of the same parent....

    [...]

  • ...Prefix-based labels are much easier to update than range-based labels because only the nodes in the sub-tree rooted at the following sibling need to be updated when a new node is inserted [31]....

    [...]

  • ...[32] propose ORDPATH, which is similar to the Dewey ID [31], but the ORDPATH label differs from the Dewey ID in that the ORDPATH label uses only odd numbers in its coding and reserves even numbers for further node insertions....

    [...]
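
The excerpts above describe Dewey labels (each node's label is its parent's label plus its position among its siblings) and ORDPATH (the same idea, but with only odd ordinals assigned initially so that even values stay free for later insertions). A minimal sketch of the initial labeling is given below. It is an illustration under stated assumptions, not the cited papers' implementation: the function names `dewey_labels` and `is_ancestor` are hypothetical, the root is assumed to carry the label `(1,)`, and the ORDPATH variant covers only the initial odd-number assignment, not the insertion logic that uses the reserved even numbers.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

Label = Tuple[int, ...]


def dewey_labels(root: ET.Element, odd_only: bool = False) -> List[Tuple[Label, str]]:
    """Return (label, tag) pairs in document order.

    With odd_only=False each child gets ordinals 1, 2, 3, ... (Dewey style).
    With odd_only=True each child gets ordinals 1, 3, 5, ... which mimics
    only the *initial* ORDPATH assignment (assumption: no later inserts).
    """
    out: List[Tuple[Label, str]] = []

    def visit(node: ET.Element, label: Label) -> None:
        out.append((label, node.tag))
        for i, child in enumerate(node, start=1):
            ordinal = 2 * i - 1 if odd_only else i
            # The child's label has the parent's label as a prefix.
            visit(child, label + (ordinal,))

    visit(root, (1,))
    return out


def is_ancestor(a: Label, d: Label) -> bool:
    """For prefix-based labels, a is an ancestor of d iff a is a proper prefix of d."""
    return len(a) < len(d) and d[: len(a)] == a


doc = ET.fromstring("<a><b><c/></b><d/></a>")
dewey = dewey_labels(doc)
ordpath_like = dewey_labels(doc, odd_only=True)
```

Because ancestry reduces to a prefix test, no tree traversal is needed to decide whether one labeled node contains another; that property is what the prefix-based schemes in the excerpts rely on.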

Proceedings Article
25 Aug 1997
TL;DR: The theoretical foundations of DataGuides are presented along with an algorithm for their creation and an overview of incremental maintenance, and performance results based on the implementation of DataGuides in the Lore DBMS for semistructured data are provided.
Abstract: In semistructured databases there is no schema fixed in advance. To provide the benefits of a schema in such environments, we introduce DataGuides: concise and accurate structural summaries of semistructured databases. DataGuides serve as dynamic schemas, generated from the database; they are useful for browsing database structure, formulating queries, storing information such as statistics and sample values, and enabling query optimization. This paper presents the theoretical foundations of DataGuides along with an algorithm for their creation and an overview of incremental maintenance. We provide performance results based on our implementation of DataGuides in the Lore DBMS for semistructured data. We also describe the use of DataGuides in Lore, both in the user interface to enable structure browsing and query formulation, and as a means of guiding the query processor and optimizing query execution.

1,341 citations


"Semantic-based Structural and Conte..." refers background or methods in this paper

  • ...TwigX-Guide has outperformed other twig querying systems, taking advantage of both the path summary in DataGuide [17] for efficient path queries with parent–child edges and the region encoding in TwigStack [45] in its ability to process twig queries....

    [...]

  • ...The DataGuides index [17,18] summarizes all the unique paths in XML data starting from the root....

    [...]

  • ...Since this approach uses DataGuide and the edge approach to join paths, it does not keep the hierarchy information to answer complex twig queries....

    [...]

  • ...The path index consists of the index tree corresponding to DataGuide [17] and an instance function for each edge of the tree index....

    [...]

  • ...There are two DataGuide index types: (i) a minimal DataGuide which has less traversal paths, thus its index size is compact, and (ii) a strong DataGuide where every label path in the XML data is described exactly once in the DataGuides....

    [...]
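
The excerpts above characterize a DataGuide as a summary of the unique label paths that start at the root, with a strong DataGuide describing every label path exactly once. For tree-shaped XML this can be approximated by grouping elements by their root-to-node label path. The sketch below makes that assumption explicitly; the function name `path_summary` is hypothetical, and the code does not reproduce the DataGuide construction algorithm for general graph-structured data.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, List, Tuple

LabelPath = Tuple[str, ...]


def path_summary(root: ET.Element) -> Dict[LabelPath, List[ET.Element]]:
    """Map each distinct root-to-node label path to the elements reached by it.

    Every label path appears as exactly one key, however many elements share
    it, which is the property the excerpts attribute to a strong DataGuide.
    Assumption: the input is a tree (plain XML without ID/IDREF links).
    """
    summary: Dict[LabelPath, List[ET.Element]] = defaultdict(list)

    def visit(node: ET.Element, prefix: LabelPath) -> None:
        path = prefix + (node.tag,)
        summary[path].append(node)
        for child in node:
            visit(child, path)

    visit(root, ())
    return dict(summary)


doc = ET.fromstring(
    "<lib><book><title/></book><book><title/><year/></book></lib>"
)
summary = path_summary(doc)
# A simple path query such as /lib/book/title becomes a single dictionary lookup.
titles = summary[("lib", "book", "title")]
```

A lookup answers a parent-child path query without scanning the document, but the summary alone keeps no ancestor-descendant positions for individual elements, which is consistent with the excerpt's remark that hierarchy information for complex twig queries is not retained.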

Book
21 Oct 1999
TL;DR: A Syntax for Data: Typing semistructured data and the Lore system and database products supporting XML are explained.
Abstract: 1 Introduction 2 A Syntax for Data 3 XML 4 Query Languages 5 Query Languages for XML 6 Interpretation and advanced features 7 Typing semistructured data 8 Query Processing 9 The Lore system 10 Strudel 11 Database products supporting XML

1,195 citations


"Semantic-based Structural and Conte..." refers background or methods in this paper

  • ...The bi-similarity-based index is constructed based on bisimilarity from the root element to the indexed element and consists of two types, namely the forward bi-similarity index as 1-index, A(k)-index [23] and D(k)-index [24] and the forward and backward bi-similarity index as F&B-index [25] and (F + B)Kindex [26]....

    [...]

  • ...The F&B-index [25] is a ‘‘covering index’’ that is created by inversing XML data edges to obtain structural summaries of in-coming (forward) and out-going (backward) paths....

    [...]

Proceedings ArticleDOI
03 Jun 2002
TL;DR: This paper proposes a novel holistic twig join algorithm, TwigStack, that uses a chain of linked stacks to compactly represent partial results to root-to-leaf query paths, which are then composed to obtain matches for the twig pattern.
Abstract: XML employs a tree-structured data model, and, naturally, XML queries specify patterns of selection predicates on multiple elements related by a tree structure. Finding all occurrences of such a twig pattern in an XML database is a core operation for XML query processing. Prior work has typically decomposed the twig pattern into binary structural (parent-child and ancestor-descendant) relationships, and twig matching is achieved by: (i) using structural join algorithms to match the binary relationships against the XML database, and (ii) stitching together these basic matches. A limitation of this approach for matching twig patterns is that intermediate result sizes can get large, even when the input and output sizes are more manageable. In this paper, we propose a novel holistic twig join algorithm, TwigStack, for matching an XML query twig pattern. Our technique uses a chain of linked stacks to compactly represent partial results to root-to-leaf query paths, which are then composed to obtain matches for the twig pattern. When the twig pattern uses only ancestor-descendant relationships between elements, TwigStack is I/O and CPU optimal among all sequential algorithms that read the entire input: it is linear in the sum of sizes of the input lists and the final result list, but independent of the sizes of intermediate results. We then show how to use (a modification of) B-trees, along with TwigStack, to match query twig patterns in sub-linear time. Finally, we complement our analysis with experimental results on a range of real and synthetic data, and query twig patterns.

1,014 citations


"Semantic-based Structural and Conte..." refers methods in this paper

  • ...TwigX-Guide has outperformed other twig querying systems, taking advantage of both the path summary in DataGuide [17] for efficient path queries with parent–child edges and the region encoding in TwigStack [45] in its ability to process twig queries....

    [...]

  • ...Twig2Stack evaluates a query in a bottom-up manner to reduce the double phase of TwigStack to a single phase....

    [...]

  • ...TwigX-Guide outperformed other well known methods, such TwigStack, TwigStackXB[10] [45], TwigINLAB [53] and TwigStackList [54], in most queries for the comparison done by [7]....

    [...]

  • ...[45] proposed novel approaches called PathStack and TwigStack to minimize the intermediate results in the memory....

    [...]

  • ...The difference between the two methods is that PathStack is proposed for a linear path query; while TwigStack, which is an extended version of PathStack aims to evaluate a general twig query....

    [...]

Proceedings ArticleDOI
07 Aug 2002
TL;DR: It is shown that while, in some cases, tree-merge algorithms can have performance comparable to stack-tree algorithms, in many cases they are considerably worse; this behavior is explained by analytical results demonstrating that, on sorted inputs, the stack-tree algorithms have worst-case I/O and CPU complexities linear in the sum of the sizes of inputs and output, while the tree-merge algorithms do not have the same guarantee.
Abstract: XML queries typically specify patterns of selection predicates on multiple elements that have some specified tree structured relationships. The primitive tree structured relationships are parent-child and ancestor-descendant, and finding all occurrences of these relationships in an XML database is a core operation for XML query processing. We develop two families of structural join algorithms for this task: tree-merge and stack-tree. The tree-merge algorithms are a natural extension of traditional merge joins and the multi-predicate merge joins, while the stack-tree algorithms have no counterpart in traditional relational join processing. We present experimental results on a range of data and queries using the TIMBER native XML query engine built on top of SHORE. We show that while, in some cases, tree-merge algorithms can have performance comparable to stack-tree algorithms, in many cases they are considerably worse. This behavior is explained by analytical results that demonstrate that, on sorted inputs, the stack-tree algorithms have worst-case I/O and CPU complexities linear in the sum of the sizes of inputs and output, while the tree-merge algorithms do not have the same guarantee.

895 citations


"Semantic-based Structural and Conte..." refers methods in this paper

  • ...The stack based approach [44] improved MPMGJN by utilizing a single stack to operate binary structural join....

    [...]
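
The excerpt above refers to a stack-based binary structural join that uses a single stack. A minimal sketch in that spirit is shown below; it is not the cited paper's pseudocode. It assumes each element is encoded as a hypothetical `(start, end)` region pair taken from one document, so that one element is an ancestor of another exactly when its region strictly contains the other's, and that both input lists are sorted by `start`. The function name `stack_structural_join` is likewise an assumption.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # (start, end) positions; assumed properly nested


def stack_structural_join(
    ancestors: List[Region], descendants: List[Region]
) -> List[Tuple[Region, Region]]:
    """Return all (ancestor, descendant) pairs, ordered by descendant.

    Both lists must be sorted by start position. The stack holds candidate
    ancestors that are still "open", each nested inside the one below it.
    """
    out: List[Tuple[Region, Region]] = []
    stack: List[Region] = []
    i = j = 0
    while j < len(descendants) and (i < len(ancestors) or stack):
        a = ancestors[i] if i < len(ancestors) else None
        d = descendants[j]
        next_start = min(a[0], d[0]) if a is not None else d[0]
        # Discard candidates whose region closed before the next element starts.
        while stack and stack[-1][1] < next_start:
            stack.pop()
        if a is not None and a[0] < d[0]:
            stack.append(a)  # a opens before d, so it may contain d
            i += 1
        else:
            # Everything left on the stack starts before d and ends after d starts.
            for s in stack:
                out.append((s, d))
            j += 1
    return out


a_list = [(1, 10), (2, 5), (11, 14)]
d_list = [(3, 4), (6, 7), (12, 13), (15, 16)]
pairs = stack_structural_join(a_list, d_list)
```

Each input element is pushed or consumed once, so apart from emitting results the work is a single merge-style pass over the two sorted lists, which matches the linear behavior on sorted inputs that the abstract above attributes to the stack-tree family.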