A heterogeneous cloud system, for example, a Hadoop 2.6.0 platform, provides distributed but cohesive services with rich features on large-scale management, reliability, and error tolerance. As big data processing is concerned, newly built cloud clusters meet the challenges of performance optimization focusing on faster task execution and more efficient usage of computing resources. Presently proposed approaches concentrate on temporal improvement, that is, shortening MapReduce time, but seldom focus on storage occupation; however, unbalanced cloud storage strategies could exhaust those nodes with heavy MapReduce cycles and further challenge the security and stability of the entire cluster. In this paper, an adaptive method is presented aiming at spatial-temporal efficiency in a heterogeneous cloud environment. A prediction model based on an optimized Kernel-based Extreme Learning Machine algorithm is proposed for faster forecast of job execution duration and space occupation, which consequently facilitates the process of task scheduling through a multi-objective algorithm called time and space optimized NSGA-II TS-NSGA-II. Experiment results have shown that compared with the original load-balancing scheme, our approach can save approximate 47-55i¾źs averagely on each task execution. Simultaneously, 1.254i¾ź of differences on hard disk occupation were made among all scheduled reducers, which achieves 26.6% improvement over the original scheme. Copyright © 2016 John Wiley & Sons, Ltd.

https://onlinelibrary.wiley.com/doi/pdf/10.1002/sec.1582

A speculative approach to spatial-temporal efficiency with multi-objective optimization in a heterogeneous cloud environment

RDF has become very popular for semantic data publishing due to its flexible and universal graph-like data model. Thus, the ever-increasing size of RDF data collections raises the need for scalable distributed approaches. We endorse the usage of existing infrastructures for Big Data processing like Hadoop for this purpose. Yet, SPARQL query performance is a major challenge as Hadoop is not intentionally designed for RDF processing. Existing approaches often favor certain query pattern shapes while performance drops significantly for other shapes. In this paper, we introduce a novel relational partitioning schema for RDF data called ExtVP that uses a semi-join based preprocessing, akin to the concept of Join Indices in relational databases, to efficiently minimize query input size regardless of its pattern shape and diameter. Our prototype system S2RDF is built on top of Spark and uses SQL to execute SPARQL queries over ExtVP. We demonstrate its superior performance in comparison to state of the art SPARQL-on-Hadoop approaches.

/pdf/s2rdf-rdf-querying-with-sparql-on-spark-3awwjknuwg.pdf

S2RDF: RDF querying with SPARQL on spark

In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data which has called for a paradigm shift in the computing architecture and large scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program such as issues on data distribution, scheduling and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in several followup works after its introduction. This article provides a comprehensive survey for a family of approaches and mechanisms of large scale data processing mechanisms that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both research and industrial communities. We also cover a set of introduced systems that have been implemented to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.

The Family of MapReduce and Large Scale Data Processing Systems

The Resource Description Framework (RDF) pioneered by the W3C is increasingly being adopted to model data in a variety of scenarios, in particular data to be published or exchanged on the Web. Managing large volumes of RDF data is challenging, due to the sheer size, the heterogeneity, and the further complexity brought by RDF reasoning. To tackle the size challenge, distributed storage architectures are required. Cloud computing is an emerging paradigm massively adopted in many applications for the scalability, fault-tolerance, and elasticity feature it provides, enabling the easy deployment of distributed and parallel architectures. In this article, we survey RDF data management architectures and systems designed for a cloud environment, and more generally, those large-scale RDF data management systems that can be easily deployed therein. We first give the necessary background, then describe the existing systems and proposals in this area, and classify them according to dimensions related to their capabilities and implementation techniques. The survey ends with a discussion of open problems and perspectives.

https://hal.inria.fr/hal-01020977/document

RDF in the clouds: a survey

In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data which has called for a paradigm shift in the computing architecture and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program such as issues on data distribution, scheduling, and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in several followup works after its introduction. This article provides a comprehensive survey for a family of approaches and mechanisms of large-scale data processing mechanisms that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both research and industrial communities. We also cover a set of introduced systems that have been implemented to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.

The family of mapreduce and large-scale data processing systems

Existing MapReduce systems support relational style join operators which translate multi-join query plans into severalMap-Reduce cycles. This leads to high I/O and communication costs due to the multiple data transfer steps between map and reduce phases. SPARQL graph pattern matching is dominated by join operations, and is unlikely to be efficiently processed using existing techniques. This cost is prohibitive for RDF graph pattern matching queries which typically involve several join operations. In this paper, we propose an approach for optimizing graph pattern matching by reinterpreting certain join tree structures as grouping operations. This enables a greater degree of parallelism in join processing resulting in more "bushy" like query execution plans with fewer Map-Reduce cycles. This approach requires that the intermediate results are managed as sets of groups of triples or TripleGroups. We therefore propose a data model and algebra - Nested TripleGroup Algebra for capturing and manipulating TripleGroups. The relationship with the traditional relational style algebra used in Apache Pig is discussed. A comparative performance evaluation of the traditional Pig approach and RAPID+ (Pig extended with NTGA) for graph pattern matching queries on the BSBM benchmark dataset is presented. Results show up to 60% performance improvement of our approach over traditional Pig for some tasks.

/pdf/an-intermediate-algebra-for-optimizing-rdf-graph-pattern-12opilwx7t.pdf

An intermediate algebra for optimizing RDF graph pattern matching on MapReduce

MapReduce-based data processing platforms offer a promising approach for cost-effective and Web-scale processing of Semantic Web data. However, one major challenge is that this computational paradigm leads to high I/O and communication costs when processing tasks with several join operations typical in SPARQL queries. The goal of this demonstration is to show how a system RAPID+, an extension of Apache Pig, enables more efficient SPARQL query processing on MapReduce using an alternative query algebra called the Nested TripleGroup Algebra (NTGA). The demonstration will offer opportunities for users to explore NTGA-Hadoop query plans for different SPARQL query structures as well as explore relationships between query plans based on relational algebra operators and those using NTGA operators.

From SPARQL to MapReduce: the journey using a nested TripleGroup algebra

Recently, the number and size of RDF data collections has increased rapidly making the issue of scalable processing techniques crucial. The MapReduce model has become a de facto standard for large scale data processing using a cluster of machines in the cloud. Generally, RDF query processing creates join-intensive workloads, resulting in lengthy MapReduce workflows with expensive I/O, data transfer, and sorting costs. However, the MapReduce computation model provides limited static optimization techniques used in relational databases (e.g., indexing and cost-based optimization). Consequently, dynamic optimization techniques for such join-intensive tasks on MapReduce need to be investigated. In some previous efforts, we propose a Nested Triple Group data model and Algebra (NTGA) for efficient graph pattern query processing in the cloud. Here, we extend this work with a scan-sharing technique that is used to optimize the processing of graph patterns with repeated properties. Specifically, our scan-sharing technique eliminates the need for repeated scanning of input relations when properties are used repeatedly in graph patterns. A formal foundation underlying this scan sharing technique is discussed as well as an implementation strategy that has been integrated in the Apache Pig framework is presented. We also present a comprehensive evaluation demonstrating performance benefits of our NTGA plus scan-sharing approach.

Scan-Sharing for Optimizing RDF Graph Pattern Matching on MapReduce

Scalable processing of Semantic Web queries has become a critical need given the rapid upward trend in availability of Semantic Web data. The MapReduce paradigm is emerging as a platform of choice for large scale data processing and analytics due to its ease of use, cost effectiveness, and potential for unlimited scaling. Processing queries on Semantic Web triple models is a challenge on the mainstream MapReduce platform called Apache Hadoop, and its extensions such as Pig and Hive. This is because such queries require numerous joins which leads to lengthy and expensive MapReduce workflows. Further, in this paradigm, cloud resources are acquired on demand and the traditional join optimization machinery such as statistics and indexes are often absent or not easily supported.In this demonstration, we will present RAPID+, an extended Apache Pig system that uses an algebraic approach for optimizing queries on RDF data models including queries involving inferencing. The basic idea is that by using logical and physical operators that are more natural to MapReduce processing, we can reinterpret such queries in a way that leads to more concise execution workflows and small intermediate data footprints that minimize disk I/Os and network transfer overhead. RAPID+ evaluates queries using the Nested TripleGroup Data Model and Algebra(NTGA). The demo will show comparative performance of NTGA query plans vs. relational algebra-like query plans used by Apache Pig and Hive.

Optimizing RDF(S) queries on cloud platforms

Broadened adoption of the Linking Open Data tenets has led to a significant surge in the amount of Semantic Web data, particularly RDF data This has positioned the issue of scalable data processing techniques for RDF as a central issue in the Semantic Web research community The RDF data model is a fine-grained model representing relationships as binary relations Thus, answering queries (typically graph pattern matching queries) over RDF data requires several join operations to reassemble related data While MapReduce based processing is emerging as the de facto paradigm for processing large scale data, it is known to be inefficient for join-intensive workloads In addition, most of the existing techniques for optimizing RDF data processing do not transfer well to the MapReduce model and often require significant lead time for pre-processing Such a requirement may not be desirable for on-demand cloud database scenarios where the goal is to reduce the Time-To-Result (TTR) In this position paper, we argue that some of these challenges can be overcome by rethinking the operators for graph pattern processing, as well as adopting dynamic optimization techniques that exploit information from the previous execution steps to eliminate intermediate results that are irrelevant in the context of future execution steps We present some preliminary evaluation results of the proposed techniques

http://datasys.cs.iit.edu/events/DataCloud-SC11/s02.pdf

HyeongSik Kim

Papers

An intermediate algebra for optimizing RDF graph pattern matching on MapReduce

From SPARQL to MapReduce: the journey using a nested TripleGroup algebra

Scan-Sharing for Optimizing RDF Graph Pattern Matching on MapReduce

Optimizing RDF(S) queries on cloud platforms

Efficient processing of RDF graph pattern matching on MapReduce platforms