
Showing papers on "Online analytical processing published in 2020"


Journal ArticleDOI
TL;DR: A new approach named SkyViz is proposed, focused on the visualization area, in particular on how to specify the user’s objectives and describe the dataset to be visualized, how to translate this specification into a platform-independent visualization type, and how to concretely implement this visualization type on the target execution platform.
Abstract: In big data analytics, advanced analytic techniques operate on big datasets aimed at complementing the role of traditional OLAP for decision making. To enable companies to take benefit of these tec...

52 citations


Journal ArticleDOI
TL;DR: Understanding how the research in trajectory data is being conducted, what main techniques have been used, and how they can be embedded in an Online Analytical Processing (OLAP) architecture can enhance the efficiency and development of decision-making systems that deal with trajectory data.
Abstract: Trajectory data allow the study of the behavior of moving objects, from humans to animals. Wireless communication, mobile devices, and technologies such as the Global Positioning System (GPS) have contributed to the growth of the trajectory research field. With the considerable growth in the volume of trajectory data, storing such data in Spatial Database Management Systems (SDBMS) has become challenging. Hence, Spatial Big Data emerges as a data management technology for indexing, storing, and retrieving large volumes of spatio-temporal data. A Data Warehouse (DW) is one of the premier Big Data analysis and complex query processing infrastructures. Trajectory Data Warehouses (TDW) emerge as DWs dedicated to trajectory data analysis. A list and discussion of problems that use TDWs, and directions for future work in this field, are the primary goals of this survey. This article collects the state of the art on Big Data trajectory analytics. Understanding how the research in trajectory data is being conducted, what main techniques have been used, and how they can be embedded in an Online Analytical Processing (OLAP) architecture can enhance the efficiency and development of decision-making systems that deal with trajectory data.

35 citations


Journal ArticleDOI
01 Apr 2020
TL;DR: This paper focuses on eleven choke points where the optimizations are beneficial independently of the database system, and finds that the flattening of subqueries and the placement of predicates have the biggest impact.
Abstract: TPC-H continues to be the most widely used benchmark for relational OLAP systems. It poses a number of challenges, also known as "choke points", which database systems have to solve in order to achieve good benchmark results. Examples include joins across multiple tables, correlated subqueries, and correlations within the TPC-H data set. Knowing the impact of such optimizations helps in developing optimizers as well as in interpreting TPC-H results across database systems. This paper provides a systematic analysis of choke points and their optimizations. It complements previous work on TPC-H choke points by providing a quantitative discussion of their relevance. It focuses on eleven choke points where the optimizations are beneficial independently of the database system. Of these, the flattening of subqueries and the placement of predicates have the biggest impact. Three queries (Q2, Q17, and Q21) are strongly influenced by the choice of an efficient query plan; three others (Q1, Q13, and Q18) are less influenced by plan optimizations and more dependent on an efficient execution engine.
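The flattening of correlated subqueries called out above can be illustrated independently of any particular engine. Below is a minimal sketch in Python/pandas, using a toy lineitem table rather than the real TPC-H data, of how a Q17-style correlated aggregate subquery can be rewritten into a pre-aggregation plus join; it illustrates the rewrite idea only, not the paper's measurements.

```python
# Illustrative sketch: flattening a TPC-H Q17-style correlated subquery.
# Toy data; column names only loosely follow the TPC-H schema.
import pandas as pd

lineitem = pd.DataFrame({
    "l_partkey":  [1, 1, 1, 2, 2, 3],
    "l_quantity": [1, 30, 40, 10, 12, 7],
    "l_extendedprice": [100.0, 600.0, 800.0, 200.0, 240.0, 140.0],
})

# "Correlated" form: re-evaluate the per-part average for every row, as in
#   WHERE l_quantity < 0.2 * (SELECT AVG(l_quantity) FROM lineitem li
#                             WHERE li.l_partkey = outer.l_partkey)
naive = lineitem[lineitem.apply(
    lambda r: r.l_quantity <
              0.2 * lineitem.loc[lineitem.l_partkey == r.l_partkey, "l_quantity"].mean(),
    axis=1)]

# Flattened form: compute each per-part aggregate once, then join.
thresholds = (lineitem.groupby("l_partkey", as_index=False)["l_quantity"]
              .mean().rename(columns={"l_quantity": "avg_qty"}))
flat = lineitem.merge(thresholds, on="l_partkey")
flat = flat[flat.l_quantity < 0.2 * flat.avg_qty]

# Both plans return the same rows; the flattened one avoids a per-row re-scan.
assert sorted(naive.l_extendedprice) == sorted(flat.l_extendedprice)
print(flat[["l_partkey", "l_quantity", "l_extendedprice"]])
```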

33 citations


Journal ArticleDOI
TL;DR: This study describes a method for the exploitation of historical data that are related to production performance and aggregated from IoT, to elicit the future behavior of the production while indicating the measured values that are responsible for negative production performance, without training.
Abstract: A dashboard application is proposed and developed to act as a Digital Twin that would indicate the Measured Value to be held accountable for any future failures. The current study describes a method for the exploitation of historical data that are related to production performance and aggregated from IoT, to elicit the future behavior of the production while indicating the measured values that are responsible for negative production performance, without training. The dashboard is implemented in the Java programming language, while information is stored in a database that is aggregated by an Online Analytical Processing (OLAP) server. This enables easy visualization of Key Performance Indicators (KPIs) through the dashboard. Finally, indicative cases of a simulated transfer line are presented and numerical examples are given for validation and demonstration purposes. The need for human intervention is pointed out.

29 citations


Proceedings ArticleDOI
10 Dec 2020
TL;DR: In this article, a machine learning and big data analytic tool for processing and analyzing COVID-19 epidemiological data is presented, which makes good use of taxonomy and OLAP to generalize some specific attributes into some generalized attributes for effective big data analytics.
Abstract: In the current technological era, huge amounts of big data are generated and collected from a wide variety of rich data sources. These big data can be of different levels of veracity in the sense that some of them are precise while some others are imprecise and uncertain. Embedded in these big data are useful information and valuable knowledge to be discovered. An example of these big data is healthcare and epidemiological data such as data related to patients who suffered from epidemic diseases like the coronavirus disease 2019 (COVID-19). Knowledge discovered from these epidemiological data, via data science techniques such as machine learning, data mining, and online analytical processing (OLAP), helps researchers, epidemiologists and policy makers to get a better understanding of the disease, which may inspire them to come up with ways to detect, control and combat the disease. In this paper, we present a machine learning and big data analytic tool for processing and analyzing COVID-19 epidemiological data. Specifically, the tool makes good use of taxonomy and OLAP to generalize some specific attributes into some generalized attributes for effective big data analytics. Instead of ignoring unknown or unstated values of some attributes, the tool provides users with the flexibility of including or excluding these values, depending on their preference and applications. Moreover, the tool discovers frequent patterns and their related patterns, which help reveal some useful knowledge such as absolute and relative frequency of the patterns. Furthermore, the tool learns from the patterns discovered from historical data and predicts useful information such as clinical outcomes for future data. As such, the tool helps users to get a better understanding of information about the confirmed cases of COVID-19. Although this tool is designed for machine learning and analytics of big epidemiological data, it would be applicable to machine learning and analytics of big data in many other real-life applications and services.
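The taxonomy-driven generalization the tool relies on is essentially a roll-up from specific attribute values to broader concepts before frequent patterns are counted. The following sketch illustrates that idea; the taxonomy, attribute names, and records are invented for illustration and are not taken from the paper's dataset.

```python
# Minimal sketch of taxonomy-based attribute generalization before pattern mining.
# The taxonomy and records below are invented for illustration.
from collections import Counter
from itertools import combinations

AGE_TAXONOMY = {  # specific age group -> generalized category
    "0-19": "young", "20-39": "adult", "40-59": "adult", "60+": "senior",
}

records = [
    {"age_group": "20-39", "exposure": "travel",  "outcome": "recovered"},
    {"age_group": "40-59", "exposure": "contact", "outcome": "recovered"},
    {"age_group": "60+",   "exposure": "contact", "outcome": "hospitalized"},
    {"age_group": "unstated", "exposure": "contact", "outcome": "recovered"},
]

def generalize(rec, include_unstated=True):
    """Roll specific values up the taxonomy; optionally keep 'unstated' values."""
    age = AGE_TAXONOMY.get(rec["age_group"], "unstated")
    if age == "unstated" and not include_unstated:
        return None
    return (("age", age), ("exposure", rec["exposure"]), ("outcome", rec["outcome"]))

def frequent_patterns(recs, min_support=2, include_unstated=True):
    """Count every attribute-value combination and keep the frequent ones."""
    counts = Counter()
    for rec in recs:
        items = generalize(rec, include_unstated)
        if items is None:
            continue
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return {p: c for p, c in counts.items() if c >= min_support}

print(frequent_patterns(records, min_support=2))
```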

28 citations


Posted Content
TL;DR: This work proposes an in-memory system design which is non-intrusive to the current state-of-the-art OLTP and OLAP engines, and uses it to evaluate the performance of the approach.
Abstract: Modern Hybrid Transactional/Analytical Processing (HTAP) systems use an integrated data processing engine that performs analytics on fresh data, which are ingested from a transactional engine. HTAP systems typically consider data freshness at design time, and are optimized for a fixed range of freshness requirements, addressed at a performance cost for either OLTP or OLAP. The data freshness and the performance requirements of both engines, however, may vary with the workload. We approach HTAP as a scheduling problem, addressed at runtime through elastic resource management. We model an HTAP system as a set of three individual engines: an OLTP, an OLAP and a Resource and Data Exchange (RDE) engine. We devise a scheduling algorithm which traverses the HTAP design spectrum through elastic resource management, to meet the data freshness requirements of the workload. We propose an in-memory system design which is non-intrusive to the current state-of-the-art OLTP and OLAP engines, and we use it to evaluate the performance of our approach. Our evaluation shows that the performance benefit of our system for OLAP queries increases over time, reaching up to 50% compared to static schedules for 100 query sequences, while maintaining a small, and controlled, drop in the OLTP throughput.
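A scheduling loop of the kind described can be sketched as a simple control policy that trades cores between the two engines whenever the observed freshness lag drifts from the workload's target. The sketch below is a hypothetical illustration of that idea; the interfaces, thresholds, and numbers are assumptions, not the paper's actual algorithm.

```python
# Hypothetical sketch of elastic resource scheduling for an HTAP system:
# cores are shifted between the OLTP and OLAP engines so that the observed
# data-freshness lag stays within the workload's requirement. The state fields
# and thresholds are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class HTAPState:
    oltp_cores: int
    olap_cores: int
    freshness_lag_s: float      # how stale the analytical copy currently is
    freshness_target_s: float   # what the workload requires

def reschedule(state: HTAPState, step: int = 1) -> HTAPState:
    """One scheduling decision: trade cores to meet the freshness target."""
    if state.freshness_lag_s > state.freshness_target_s and state.olap_cores > step:
        # Analytical copy is too stale: favour ingestion/transactional work.
        state.olap_cores -= step
        state.oltp_cores += step
    elif state.freshness_lag_s < 0.5 * state.freshness_target_s and state.oltp_cores > step:
        # Plenty of freshness headroom: favour analytical throughput.
        state.oltp_cores -= step
        state.olap_cores += step
    return state

state = HTAPState(oltp_cores=8, olap_cores=8, freshness_lag_s=4.0, freshness_target_s=2.0)
print(reschedule(state))
```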

24 citations


Proceedings ArticleDOI
11 Jun 2020
TL;DR: In this article, the authors propose a new framework called a query-data routing tree, or qd-tree, to address the problem of best assigning records to data blocks on storage.
Abstract: Corporations today collect data at an unprecedented and accelerating scale, making the need to run queries on large datasets increasingly important. Technologies such as columnar block-based data organization and compression have become standard practice in most commercial database systems. However, the problem of best assigning records to data blocks on storage is still open. For example, today's systems usually partition data by arrival time into row groups, or range/hash partition the data based on selected fields. For a given workload, however, such techniques are unable to optimize for the important metric of the number of blocks accessed by a query. This metric directly relates to the I/O cost, and therefore performance, of most analytical queries. Further, they are unable to exploit additional available storage to drive this metric down further. In this paper, we propose a new framework called a query-data routing tree, or qd-tree, to address this problem, and propose two algorithms for their construction based on greedy and deep reinforcement learning techniques. Experiments over benchmark and real workloads show that a qd-tree can provide physical speedups of more than an order of magnitude compared to current blocking schemes, and can reach within 2X of the lower bound for data skipping based on selectivity, while providing complete semantic descriptions of created blocks.
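The routing idea behind a qd-tree can be illustrated with a hand-built tree: internal nodes hold cut predicates, leaves correspond to storage blocks, and a range query only reads the blocks whose subtrees it can reach. The simplified sketch below shows routing and block skipping only; the greedy and reinforcement-learning construction algorithms from the paper are not reproduced.

```python
# Simplified sketch of a query-data routing tree (qd-tree): internal nodes hold
# a cut predicate, leaves correspond to storage blocks. The tree below is
# hand-built; the paper constructs it with greedy or RL-based algorithms.
class Node:
    def __init__(self, column=None, value=None, left=None, right=None, block_id=None):
        self.column, self.value = column, value   # cut: record[column] < value
        self.left, self.right = left, right
        self.block_id = block_id                  # set only on leaves

    def route(self, record):
        """Return the block a record is stored in."""
        if self.block_id is not None:
            return self.block_id
        branch = self.left if record[self.column] < self.value else self.right
        return branch.route(record)

    def blocks_for_range(self, column, lo, hi):
        """Blocks a query with predicate lo <= column < hi must read."""
        if self.block_id is not None:
            return {self.block_id}
        blocks = set()
        if self.column != column or lo < self.value:
            blocks |= self.left.blocks_for_range(column, lo, hi)
        if self.column != column or hi > self.value:
            blocks |= self.right.blocks_for_range(column, lo, hi)
        return blocks

tree = Node("price", 100,
            left=Node(block_id=0),
            right=Node("price", 500, left=Node(block_id=1), right=Node(block_id=2)))

print(tree.route({"price": 250}))                 # -> block 1
print(tree.blocks_for_range("price", 120, 300))   # -> {1}: blocks 0 and 2 are skipped
```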

23 citations


Journal ArticleDOI
01 Aug 2020
TL;DR: The main challenges come from the need to balance data immutability, tamper evidence, and performance, and a clean-slate approach is examined by describing a new system, Spitz, specifically designed for efficiently supporting immutable and tamper-evident transaction management.
Abstract: Databases in the past have helped businesses maintain and extract insights from their data. Today, it is common for a business to involve multiple independent, distrustful parties. This trend towards decentralization introduces a new and important requirement to databases: the integrity of the data, the history, and the execution must be protected. In other words, there is a need for a new class of database systems whose integrity can be verified (or verifiable databases). In this paper, we identify the requirements and the design challenges of verifiable databases. We observe that the main challenges come from the need to balance data immutability, tamper evidence, and performance. We first consider approaches that extend existing OLTP and OLAP systems with support for verification. We next examine a clean-slate approach by describing a new system, Spitz, specifically designed for efficiently supporting immutable and tamper-evident transaction management. We conduct a preliminary performance study of both approaches against a baseline system, and provide insights on their performance.
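Tamper evidence for a transaction history is commonly obtained by chaining a cryptographic digest through the log, so that any retroactive change invalidates every later digest. The sketch below shows that generic technique; it is not Spitz's actual storage or verification design.

```python
# Generic sketch of tamper-evident transaction logging via a hash chain.
# This illustrates the general technique, not the Spitz system's design.
import hashlib, json

def chain(transactions):
    """Return (log, head_digest) where each entry embeds the previous digest."""
    log, prev = [], "0" * 64
    for txn in transactions:
        payload = json.dumps({"prev": prev, "txn": txn}, sort_keys=True)
        prev = hashlib.sha256(payload.encode()).hexdigest()
        log.append({"txn": txn, "digest": prev})
    return log, prev

def verify(log, head_digest):
    """Recompute the chain; any in-place tampering changes the head digest."""
    prev = "0" * 64
    for entry in log:
        payload = json.dumps({"prev": prev, "txn": entry["txn"]}, sort_keys=True)
        prev = hashlib.sha256(payload.encode()).hexdigest()
    return prev == head_digest

log, head = chain([{"op": "insert", "key": 1}, {"op": "update", "key": 1, "val": 9}])
assert verify(log, head)
log[0]["txn"]["val"] = 42          # tamper with history
assert not verify(log, head)
```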

22 citations


Proceedings ArticleDOI
11 Jun 2020
TL;DR: In this paper, the authors propose an in-memory system design which is nonintrusive to the current state-of-the-art OLTP and OLAP engines, and use it to evaluate the performance of their approach.
Abstract: Modern Hybrid Transactional/Analytical Processing (HTAP) systems use an integrated data processing engine that performs analytics on fresh data, which are ingested from a transactional engine. HTAP systems typically consider data freshness at design time, and are optimized for a fixed range of freshness requirements, addressed at a performance cost for either OLTP or OLAP. The data freshness and the performance requirements of both engines, however, may vary with the workload. We approach HTAP as a scheduling problem, addressed at runtime through elastic resource management. We model an HTAP system as a set of three individual engines: an OLTP, an OLAP and a Resource and Data Exchange (RDE) engine. We devise a scheduling algorithm which traverses the HTAP design spectrum through elastic resource management, to meet the workload data freshness requirements. We propose an in-memory system design which is non-intrusive to the current state-of-the-art OLTP and OLAP engines, and we use it to evaluate the performance of our approach. Our evaluation shows that the performance benefit of our system for OLAP queries increases over time, reaching up to 50% compared to static schedules for 100 query sequences, while maintaining a small, and controlled, drop in the OLTP throughput.

19 citations


Proceedings ArticleDOI
15 Jun 2020
TL;DR: One of the first experimental studies on characterizing Intel® Optane™ DC PMM's performance behavior in the context of analytical database workloads is presented, revealing interesting performance tradeoffs that can help guide the design of next-generation OLAP systems in presence of persistent memory in the storage hierarchy.
Abstract: New data storage technologies such as the recently introduced Intel® Optane™ DC Persistent Memory Module (PMM) offer exciting opportunities for optimizing the query processing performance of database workloads. In particular, the unique combination of low latency, byte-addressability, persistence, and large capacity make persistent memory (PMem) an attractive alternative along with DRAM and SSDs. Exploring the performance characteristics of this new medium is the first critical step in understanding how it will impact the design and performance of database systems. In this paper, we present one of the first experimental studies on characterizing Intel® Optane™ DC PMM's performance behavior in the context of analytical database workloads. First, we analyze basic access patterns common in such workloads, such as sequential, selective, and random reads as well as the complete Star Schema Benchmark, comparing standalone DRAM- and PMem-based implementations. Then we extend our analysis to join algorithms over larger datasets, which require using DRAM and PMem in a hybrid fashion while paying special attention to the read-write asymmetry of PMem. Our study reveals interesting performance tradeoffs that can help guide the design of next-generation OLAP systems in presence of persistent memory in the storage hierarchy.

19 citations


Proceedings Article
01 Jan 2020
TL;DR: It is demonstrated that the standard PCIe interconnect substantially limits the performance of state-of-the-art GPUs, a hybrid materialization approach which combines eager with lazy data transfers is proposed, and it is shown that the wide gap between GPU and PCIe throughput can be bridged through efficient data sharing techniques.
Abstract: GPUs are becoming increasingly popular in large scale data center installations due to their strong, embarrassingly parallel, processing capabilities. Data management systems are riding the wave by using GPUs to accelerate query execution, mainly for analytical workloads. However, this acceleration comes at the price of a slow interconnect which imposes strong restrictions in bandwidth and latency when bringing data from the main memory to the GPU for processing. The related research in data management systems mostly relies on late materialization and data sharing to mitigate the overheads introduced by slow interconnects even in the standard CPU processing case. Finally, workload trends move beyond analytical to fresh data processing, typically referred to as Hybrid Transactional and Analytical Processing (HTAP). Therefore, we experience an evolution in three different axes: interconnect technology, GPU architecture, and workload characteristics. In this paper, we break the evolution of the technological landscape into steps and we study the applicability and performance of late materialization and data sharing in each one of them. We demonstrate that the standard PCIe interconnect substantially limits the performance of state-of-the-art GPUs and we propose a hybrid materialization approach which combines eager with lazy data transfers. Further, we show that the wide gap between GPU and PCIe throughput can be bridged through efficient data sharing techniques. Finally, we provide an H2TAP system design which removes software-level interference and we show that the interference in the memory bus is minimal, allowing data transfer optimizations as in OLAP workloads.

Proceedings ArticleDOI
20 Apr 2020
TL;DR: The key challenges involved in building the DBIM-on-ADG infrastructure are explored, including synchronized maintenance of the In-Memory Column Store on the Standby database, with high-speed OLTP activity continuously modifying data on the Primary database.
Abstract: Oracle Database In-Memory (DBIM) provides orders of magnitude speedup for analytic queries with its highly compressed, transactionally consistent, memory-optimized Column Store. Customers can use Oracle DBIM for making real-time decisions by analyzing vast amounts of data at blazingly fast speeds. Active Data Guard (ADG) is Oracle’s comprehensive solution for high-availability and disaster recovery for the Oracle Database. Oracle ADG eliminates the high cost of idle redundancy by allowing reporting applications, ad-hoc queries and data extracts to be offloaded to the synchronized, physical Standby database replicated using Oracle ADG. In Oracle 12.2, we extended the DBIM advantage to Oracle ADG architecture. DBIM-on-ADG significantly boosts the performance of analytic, read-only workloads running on the physical Standby database, while the Primary database continues to process high-speed OLTP workloads. Customers can partition their data across the In-Memory Column Stores on the Primary and Standby databases based on access patterns, and reap the benefits of fault-tolerance as well as workload isolation without compromising on critical performance SLAs. In this paper, we explore and address the key challenges involved in building the DBIM-on-ADG infrastructure, including synchronized maintenance of the In-Memory Column Store on the Standby database, with high-speed OLTP activity continuously modifying data on the Primary database.

Journal ArticleDOI
TL;DR: The results show the proposed scheme can provide lower overhead than the traditional SQL-based database while facilitating the scope and flexibility of data warehouse services.
Abstract: The emergence of big data is making more and more enterprises change their data management strategy, from simple data storage to OLAP query analysis; meanwhile, NoSQL-based data warehouses receive increasing attention compared with traditional SQL-based databases. By improving the JFSS model for ETL, this paper proposes the uniform distribution code (UDC), model identification code (MIC), standard dimension code (SDC), and attribute dimensional code (ADC); defines the data storage format; and identifies the extraction, transformation, and loading strategies of the data warehouse. Several experiments are carried out to analyze single-record and range-record queries as typical OLAP workloads on the Hadoop database (HBase). The results show the proposed scheme can provide lower overhead than a traditional SQL-based database while facilitating the scope and flexibility of data warehouse services.

Journal ArticleDOI
01 Oct 2020
TL;DR: Experiments show that the analysis of divergent execution and resource contention helps to improve the accuracy of the cost model, and Pyper significantly outperforms other GPU query engines on TPC-H and SSB queries.
Abstract: In recent years, we have witnessed significant efforts to improve the performance of Online Analytical Processing (OLAP) on graphics processing units (GPUs). Most existing studies have focused on i...

Journal ArticleDOI
TL;DR: A smart intra-query fault tolerance mechanism for MPP databases is proposed that achieves fault tolerance by performing checkpointing, i.e., materializing intermediate results of selected operators, and aims at improving the query success rate within a given time.
Abstract: Intra-query fault tolerance has increasingly been a concern for online analytical processing, as more and more enterprises migrate data analytical systems from mainframes to commodity computers. Most massive parallel processing (MPP) databases do not support intra-query fault tolerance. They may suffer from prolonged query latency when running on unreliable commodity clusters. While SQL-on-Hadoop systems can utilize the fault tolerance support of low-level frameworks, such as MapReduce and Spark, their cost-effectiveness is not always acceptable. In this paper, we propose a smart intra-query fault tolerance (SIFT) mechanism for MPP databases. SIFT achieves fault tolerance by performing checkpointing, i.e., materializing intermediate results of selected operators. Different from existing approaches, SIFT aims at improving the query success rate within a given time. To achieve its goal, it needs to: (1) minimize query rerunning time after encountering failures and (2) introduce as little checkpointing overhead as possible. To evaluate SIFT in real-world MPP database systems, we implemented it in Greenplum. The experimental results indicate that it can effectively improve the success rate of query processing, especially when working with unreliable hardware.
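The core trade-off in checkpoint selection, materialization cost versus expected rerun savings, can be sketched with a toy cost model. The numbers, failure probability, and greedy rule below are assumptions for illustration and do not reproduce SIFT's actual model.

```python
# Hypothetical sketch of checkpoint (materialization) selection for intra-query
# fault tolerance: checkpoint an operator only when its expected rerun savings
# outweigh the cost of materializing its intermediate result.
from dataclasses import dataclass

@dataclass
class Operator:
    name: str
    upstream_cost_s: float        # work redone from the last checkpoint on failure
    materialize_cost_s: float     # time to write this operator's output

def select_checkpoints(pipeline, failure_prob=0.05):
    """Greedy selection along a linear pipeline of operators."""
    chosen, cost_since_ckpt = [], 0.0
    for op in pipeline:
        cost_since_ckpt += op.upstream_cost_s
        expected_saving = failure_prob * cost_since_ckpt
        if expected_saving > op.materialize_cost_s:
            chosen.append(op.name)
            cost_since_ckpt = 0.0   # a failure now only reruns work after this point
    return chosen

pipeline = [
    Operator("scan",      upstream_cost_s=30,  materialize_cost_s=20),
    Operator("join",      upstream_cost_s=400, materialize_cost_s=15),
    Operator("aggregate", upstream_cost_s=60,  materialize_cost_s=5),
]
print(select_checkpoints(pipeline))   # -> ['join'] with these toy numbers
```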

Journal ArticleDOI
TL;DR: This article presents a study that investigates the impact of big data in terms of its dimensions (Variety, Velocity, Volume, and Veracity) on financial report quality, with business intelligence (OLAP, Data Mining, and Data Warehouse) as a moderating variable, in Jordanian telecom companies.
Abstract: This study aimed at discovering the impact of big data in terms of its dimensions (Variety, Velocity, Volume, and Veracity) on financial report quality in the presence of business intelligence in terms of its dimensions (Online Analytical Processing (OLAP), Data Mining, and Data Warehouse) as a moderating variable in Jordanian telecom companies. The sample included 139 employees in Jordanian telecom companies. Multiple and stepwise linear regression were used to test the effect of the independent variable on the dependent variable, and hierarchical regression analysis was used to test the effect of the independent variable on the dependent variable in the presence of the moderating variable. The study reached a set of results, the most prominent of which was a statistically significant effect of using big data to improve the quality of financial reports; business intelligence contributes to improving the impact of big data in terms of its dimensions (Volume, Velocity, Variety, and Veracity) on the quality of financial reports. The study recommends making use of big data and adopting business intelligence solutions because of their great role in improving the quality of financial reports and thus supporting decision-making for a large group of users.

Journal ArticleDOI
01 Feb 2020
TL;DR: The results show that traditional commercial OLAP systems suffer from their long instruction footprint, which results in high response times, and high-performance column-stores execute tight instruction streams; however, they spend 25 to 82% of their CPU cycles on stalls both for sequential- and random-access-heavy workloads.
Abstract: Understanding micro-architectural behavior is important for efficiently using hardware resources. Recent work has shown that in-memory online transaction processing (OLTP) systems severely underutilize their core micro-architecture resources [29]. In contrast, online analytical processing (OLAP) workloads exhibit a completely different computing pattern. OLAP workloads are read-only, bandwidth-intensive, and include various data access patterns. With the rise of column-stores, they run on high-performance engines that are tightly optimized for modern hardware. Consequently, the micro-architectural behavior of modern OLAP systems remains unclear. This work presents a micro-architectural analysis of a set of OLAP systems. The results show that traditional commercial OLAP systems suffer from their long instruction footprint, which results in high response times. High-performance column-stores execute tight instruction streams; however, they spend 25 to 82% of their CPU cycles on stalls both for sequential- and random-access-heavy workloads. Concurrent query execution can improve the utilization, but it creates interference in the shared resources, which results in sub-optimal performance.

Posted Content
TL;DR: This special issue includes five contributions to the fields of business process innovation in the big data era, unstructured big data analytical methods in firms, online analytical processing approach for business intelligence in big data, geospatial insights for retail recommendation using similarity measures, and big data and operational changes through interactive data visualization.
Abstract: This special issue was open for submissions in the field of big data in business. Accordingly, this special issue includes five contributions to the fields of business process innovation in the big data era, unstructured big data analytical methods in firms, online analytical processing (OLAP) approach for business intelligence in big data, geospatial insights for retail recommendation using similarity measures, and big data and operational changes through interactive data visualization. A bibliometric approach is used to visualize and highlight the exciting literature on big data followed by highlighting the contribution of this special issue.

Proceedings Article
01 Jan 2020
TL;DR: This paper envisages a conversational framework specifically devised for OLAP applications that converts natural language text into GPSJ queries and relies on an ad-hoc grammar and a knowledge base storing multidimensional metadata and cube values.
Abstract: The democratization of data access and the adoption of OLAP in scenarios requiring hands-free interfaces push towards the creation of smart OLAP interfaces. In this paper, we envisage a conversational framework specifically devised for OLAP applications. The system converts natural language text into GPSJ (Generalized Projection, Selection and Join) queries. The approach relies on an ad-hoc grammar and a knowledge base storing multidimensional metadata and cube values. In case of an ambiguous or incomplete query description, the system is able to obtain the correct query either through automatic inference or through interactions with the user to disambiguate the text. Our tests show very promising results both in terms of effectiveness and efficiency.
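A GPSJ query pairs grouped dimensions with aggregated measures and optional selection predicates. The toy sketch below maps text to that structure by simple keyword matching against invented cube metadata; the actual framework uses a full grammar and a multidimensional knowledge base, and resolves ambiguity through inference or user interaction.

```python
# Toy sketch of mapping natural-language text to a GPSJ-style query description
# (aggregated measures, group-by attributes, selection predicates). The cube
# metadata and matching rules are invented for illustration.
CUBE_METADATA = {
    "measures":   {"revenue": "SUM", "quantity": "SUM"},
    "dimensions": {"month", "store", "product"},
}

def text_to_gpsj(text):
    tokens = text.lower().replace(",", " ").split()
    measures = [m for m in CUBE_METADATA["measures"] if m in tokens]
    group_by = [d for d in CUBE_METADATA["dimensions"] if d in tokens]
    selections = [tok.split("=") for tok in tokens if "=" in tok]
    if not measures:
        # Ambiguous request: a real system would ask the user to disambiguate.
        raise ValueError("no measure recognized, clarification needed")
    return {
        "projection": {m: CUBE_METADATA["measures"][m] for m in measures},
        "group_by": group_by,
        "selection": {col: val for col, val in selections},
    }

print(text_to_gpsj("total revenue by month and store for product=shoes"))
```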

Journal ArticleDOI
TL;DR: The proposed approach aims to achieve an acceptable trade-off between the two aforementioned objectives, and the problem is addressed using NSGA-II in this paper.
Abstract: A data warehouse is constructed with the purpose of supporting decision making. Decision-making queries, being long and complex, consume a lot of processing time against a continuously growing data warehouse. View materialization is one of the alternative ways of improving the response time of such analytical or decision-making queries. This involves selection and materialization of views that minimize the analytical query response times while adhering to the resource constraints. This is referred to as the view selection problem, which is an NP-Hard problem. The view selection problem is concerned with simultaneously minimizing the cost of evaluating materialized and non-materialized views. Being a bi-objective optimization problem, it is addressed using NSGA-II in this paper. The proposed approach aims to achieve an acceptable trade-off between the aforementioned two objectives.
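The bi-objective formulation can be made concrete by scoring each candidate set of materialized views on query-evaluation cost and maintenance cost and keeping only the non-dominated (Pareto-optimal) sets. The sketch below uses toy costs and brute-force enumeration in place of NSGA-II, purely to illustrate the objective space.

```python
# Sketch of the bi-objective view-selection formulation: each candidate set of
# materialized views is scored on (query cost, maintenance cost) and only
# Pareto-optimal sets are kept. Costs are toy numbers; a real solution would
# explore this space with NSGA-II rather than enumerate it.
from itertools import combinations

VIEWS = {"v1": {"query_saving": 40, "maintenance": 10},
         "v2": {"query_saving": 25, "maintenance": 20},
         "v3": {"query_saving": 15, "maintenance": 5}}
BASE_QUERY_COST = 100

def objectives(selection):
    query_cost = BASE_QUERY_COST - sum(VIEWS[v]["query_saving"] for v in selection)
    maintenance_cost = sum(VIEWS[v]["maintenance"] for v in selection)
    return query_cost, maintenance_cost

def pareto_front(candidates):
    front = []
    for cand in candidates:
        qc, mc = objectives(cand)
        dominated = any(oq <= qc and om <= mc and (oq, om) != (qc, mc)
                        for other in candidates
                        for oq, om in [objectives(other)])
        if not dominated:
            front.append((cand, (qc, mc)))
    return front

candidates = [frozenset(c) for r in range(len(VIEWS) + 1)
              for c in combinations(VIEWS, r)]
for selection, (qc, mc) in pareto_front(candidates):
    print(sorted(selection), "query cost:", qc, "maintenance cost:", mc)
```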

Journal ArticleDOI
15 Dec 2020
TL;DR: In this article, a special issue on big data in business is presented, which includes five contributions to the fields of business process innovation in the big data era, unstructured big data analytical methods in firms, online analytical processing approach for business intelligence in big data, geospatial insights for retail recommendation using similarity measures, and big data and operational changes through interactive data visualization.
Abstract: This special issue was open for submissions in the field of big data in business. Accordingly, this special issue includes five contributions to the fields of business process innovation in the big data era, unstructured big data analytical methods in firms, online analytical processing approach for business intelligence in big data, geospatial insights for retail recommendation using similarity measures, and big data and operational changes through interactive data visualization. A bibliometric approach is used to visualize and highlight the exciting literature on big data followed by highlighting the contribution of this special issue.

Journal ArticleDOI
TL;DR: A benchmark is proposed to aid Big Data OLAP designers in choosing the most suitable cube design for their goals, and the main requirements and trade-offs for effectively designing a Big Data OLAP cube that takes advantage of data pre-aggregation techniques are described.
Abstract: In recent years, several new technologies have enabled OLAP processing over Big Data sources. Among these technologies, we highlight those that allow data pre-aggregation because of their demonstrated performance in data querying. This is the case of Apache Kylin, a Hadoop based technology that supports sub-second queries over fact tables with billions of rows combined with ultra high cardinality dimensions. However, taking advantage of data pre-aggregation techniques to design analytic models for Big Data OLAP is not a trivial task. It requires very advanced knowledge of the underlying technologies and user querying patterns. A wrong design of the OLAP cube significantly alters several key performance metrics, including: (i) the analytic capabilities of the cube (time and ability to provide an answer to a query), (ii) the size of the OLAP cube, and (iii) the time required to build the OLAP cube. Therefore, in this paper we (i) propose a benchmark to aid Big Data OLAP designers in choosing the most suitable cube design for their goals, (ii) identify and describe the main requirements and trade-offs for effectively designing a Big Data OLAP cube that takes advantage of data pre-aggregation techniques, and (iii) validate our benchmark in a case study.

Proceedings ArticleDOI
11 Jun 2020
TL;DR: This Demo presents Grasper, an RDMA-enabled distributed graph OLAP system, which adopts a series of new system designs to overcome the challenges of OLAP on graphs.
Abstract: Achieving high performance OLAP over large graphs is a challenging problem and has received great attention recently because of its broad spectrum of applications. Existing systems have various performance bottlenecks due to limitations such as low parallelism and high network overheads. This Demo presents Grasper, an RDMA-enabled distributed graph OLAP system, which adopts a series of new system designs to overcome the challenges of OLAP on graphs. The take-aways for Demo attendees are: (1) a good understanding of the challenges of processing graph OLAP queries; (2) useful insights about where Grasper's good performance comes from; (3) inspirations about how to design an efficient graph OLAP system by comparing Grasper with existing systems.

Proceedings ArticleDOI
20 Apr 2020
TL;DR: The SETLBI (Semantic Extract-Transform-Load and Business Intelligence) integration platform, which brings together Semantic Web and Business Intelligence technologies, is presented; it helps Data Warehouse designers build a semantic Data Warehouse and enables OLAP-style analyses.
Abstract: With the growing popularity of Semantic Web technologies, more and more organizations natively manage data using Semantic Web standards, in particular RDF. This development gives rise to new requirements for Business Intelligence tools to enable analyses in the style of On-Line Analytical Processing (OLAP) over RDF data. In this demonstration, we therefore present the SETLBI (Semantic Extract-Transform-Load and Business Intelligence) integration platform that brings together Semantic Web and Business Intelligence technologies. SETLBI covers all phases of integration: target definition, source to target mappings generation, semantic and non-semantic source extraction, data transformation, and target population and update. It helps Data Warehouse designers build a semantic Data Warehouse, either from scratch or by defining a multi-dimensional view over existing RDF data sources, and further enables OLAP-style analyses.

Proceedings ArticleDOI
11 Jun 2020
TL;DR: The prototype system SPRINTER is implemented by integrating the proposed methods into an open-source in-memory OLAP system and it is shown that SPRINTER outperforms the state-of-the-art OLAP systems for complex queries.
Abstract: The concept of OLAP query processing is now being widely adopted in various applications. The number of complex queries containing joins between non-unique keys (called FK-FK joins) increases in those applications. However, the existing in-memory OLAP systems tend not to handle such complex queries efficiently since they generate large amounts of intermediate results or incur a huge probe cost. In this paper, we propose an effective query planning method for complex OLAP queries. It generates a query plan containing n-ary join operators based on a cost model. The plan does not generate intermediate results for processing FK-FK joins and significantly reduces the probe cost. We also propose an efficient processing method for n-ary join operators. We implement the prototype system SPRINTER by integrating our proposed methods into an open-source in-memory OLAP system. Through experiments using the TPC-DS benchmark, we have shown that SPRINTER outperforms state-of-the-art OLAP systems for complex queries.
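The benefit of an n-ary join operator can be seen in a small sketch: the probe table is scanned once and each row is checked against all build-side hash tables before any output is produced, so no binary-join intermediate result is materialized. The example below is a generic illustration, not SPRINTER's operator or cost model.

```python
# Sketch of an n-ary hash join: the probe table is scanned once and each row is
# checked against every build-side hash table before any output is produced.
def build(table, key):
    index = {}
    for row in table:
        index.setdefault(row[key], []).append(row)
    return index

def nary_hash_join(probe_table, builds):
    """builds: list of (probe_key, build_key, build_table)."""
    indexes = [(probe_key, build(tbl, build_key)) for probe_key, build_key, tbl in builds]
    for row in probe_table:
        partial = [row]
        for probe_key, index in indexes:
            matches = index.get(row[probe_key], [])
            if not matches:
                partial = []
                break
            partial = [dict(acc, **m) for acc in partial for m in matches]
        yield from partial

sales = [{"cust_id": 1, "item_id": 10, "amount": 5.0},
         {"cust_id": 2, "item_id": 11, "amount": 7.5}]
customers = [{"cust_id": 1, "region": "EU"}, {"cust_id": 2, "region": "US"}]
items = [{"item_id": 10, "category": "book"}]

result = list(nary_hash_join(sales, [("cust_id", "cust_id", customers),
                                     ("item_id", "item_id", items)]))
print(result)   # only the sale that matches both dimension tables survives
```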

Journal ArticleDOI
TL;DR: A cube operator called MC-CUBE (MapReduce Columnar CUBE) is defined, which builds columnar NoSQL cubes according to the columnar approach, taking into account the non-relational and distributed aspects of how the data warehouses are stored.
Abstract: In the Big Data warehouse context, a column-oriented NoSQL database system is considered as the storage model which is highly adapted to data warehouses and online analysis. Indeed, the use of NoSQ...

Journal ArticleDOI
TL;DR: This article introduces a model to overcome null values when converting document-oriented NoSQL databases into relational databases using parallel similarity techniques, and is an efficient and suitable approach for extracting OLAP cubes from a NoSQL database.
Abstract: Today, the relational database is not suitable for data management due to the large variety and volume of data, which are mostly untrusted. Therefore, NoSQL has attracted the attention of companies. Despite being a proper choice for managing a large volume and variety of data, there is a big challenge and difficulty in performing online analytical processing (OLAP) on NoSQL since it is schema-less. This article aims to introduce a model to overcome null values when converting document-oriented NoSQL databases into relational databases using parallel similarity techniques. The proposed model includes four phases: shingling, chunking, minhashing, and locality-sensitive hashing MapReduce (LSHMR). Each phase performs a specific process on the input NoSQL databases. The main idea of LSHMR is based on the nature of both locality-sensitive hashing (LSH) and MapReduce (MR). In this article, the LSH similarity search technique is used on the MR framework to extract OLAP cubes. LSH is used to decrease the number of comparisons. Furthermore, MR enables efficient distributed and parallel computing. The proposed model is an efficient and suitable approach for extracting OLAP cubes from a NoSQL database.
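The shingling, minhash, and LSH-banding steps form a standard similarity-search pipeline. The sketch below runs them in a single process with invented documents and parameters; in the proposed model these phases are executed as MapReduce jobs over the document store.

```python
# Compact sketch of the shingling -> minhash -> LSH-banding pipeline used to
# find similar documents cheaply. Single-process illustration only.
import hashlib
from collections import defaultdict

def shingles(text, k=4):
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(shingle_set, num_hashes=20):
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set))
    return sig

def lsh_candidates(docs, bands=5, rows=4):
    """Group documents whose signatures agree on all rows of some band."""
    buckets, candidates = defaultdict(list), set()
    for doc_id, text in docs.items():
        sig = minhash_signature(shingles(text), num_hashes=bands * rows)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            for other in buckets[key]:
                candidates.add(frozenset({doc_id, other}))
            buckets[key].append(doc_id)
    return candidates

docs = {"a": "customer order 2020 electronics",
        "b": "customer order 2020 electronic",
        "c": "sensor reading temperature stream"}
print(lsh_candidates(docs))   # very likely {frozenset({'a', 'b'})}
```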

Journal ArticleDOI
TL;DR: This paper proposes T+MultiDim, a multidimensional conceptual data model enabling both instant- and interval-based semantics over temporal dimensions, and provides suitable OLAP (On-Line Analytical Processing) operators for querying temporal information.

Journal ArticleDOI
TL;DR: This paper proposes a complete set of techniques for probabilistic data cubes, from cuboid aggregation, over cube materialization, to query evaluation, and studies two types of aggregation: convolution and sketch-based, which take polynomial time complexities for aggregation and jointly enable efficient query processing.
Abstract: On-Line Analytical Processing (OLAP) enables powerful analytics by quickly computing aggregate values of numerical measures over multiple hierarchical dimensions for massive datasets. However, many types of source data, e.g., from GPS, sensors, and other measurement devices, are intrinsically inaccurate (imprecise and/or uncertain) and thus OLAP cannot be readily applied. In this paper, we address the resulting data veracity problem in OLAP by proposing the concept of probabilistic data cubes. Such a cube is comprised of a set of probabilistic cuboids which summarize the aggregated values in the form of probability mass functions (pmfs in short) and thus offer insights into the underlying data quality and enable confidence-aware query evaluation and analysis. However, the probabilistic nature of data poses computational challenges, since a probabilistic database can have an exponential number of possible worlds under the possible world semantics. Even worse, it is hard to share computations among different cuboids, as aggregation functions that are distributive for traditional data cubes, e.g., SUM, become holistic in probabilistic settings. In this paper, we propose a complete set of techniques for probabilistic data cubes, from cuboid aggregation, over cube materialization, to query evaluation. We study two types of aggregation: convolution and sketch-based, which take polynomial time complexities for aggregation and jointly enable efficient query processing. Also, our proposal is versatile in terms of: 1) its capability of supporting common aggregation functions, i.e., SUM, COUNT, MAX, and AVG; 2) its adaptivity to different materialization strategies, e.g., full versus partial materialization, with support of our devised cost models and parallelization framework; 3) its coverage of common OLAP operations, i.e., probabilistic slicing and dicing queries. Extensive experiments over real and synthetic datasets show that our techniques are effective and scalable.
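The convolution-based aggregation can be illustrated for SUM: the pmf of the sum of two independent uncertain cells is the convolution of their pmfs. A minimal sketch with invented pmfs follows.

```python
# Minimal sketch of convolution-based aggregation for a probabilistic data cube:
# the SUM of two uncertain cells, each given as a probability mass function
# (value -> probability), is the convolution of their pmfs.
from collections import defaultdict

def convolve_sum(pmf_a, pmf_b):
    out = defaultdict(float)
    for va, pa in pmf_a.items():
        for vb, pb in pmf_b.items():
            out[va + vb] += pa * pb
    return dict(out)

# Two uncertain sensor readings (invented values).
cell1 = {10: 0.7, 12: 0.3}
cell2 = {5: 0.5, 6: 0.5}

aggregate = convolve_sum(cell1, cell2)
print(aggregate)            # {15: 0.35, 16: 0.35, 17: 0.15, 18: 0.15}
expected = sum(v * p for v, p in aggregate.items())
print(round(expected, 2))   # 16.1, matching E[cell1] + E[cell2] = 10.6 + 5.5
```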

Journal ArticleDOI
29 Aug 2020
TL;DR: The conceptual and logical modelling of the semantic trajectory data warehouse is developed and the produced results prove the efficiency in improving nursing productivity.
Abstract: A Trajectory Data Warehouse is a central repository of large amounts of data focusing on moving objects, which have been collected and integrated from multiple sources with spatial and temporal dimensions as the main metrics of analysis. By adding semantic-related contextual information, it is converted to a Semantic Trajectory Data Warehouse. It transforms raw trajectories to valuable information that can be utilized for decision-making purposes in ubiquitous applications. Human resources management is a domain that may benefit significantly from semantic trajectory data warehouses. In particular, employees working shifts can be considered as trajectories. In this work, standard data warehousing tools are used to store data about nursing personnel shifts as trajectories of moving persons. The conceptual and logical modelling of the semantic trajectory data warehouse is developed. The objective is the observation, management and scheduling of nurses’ shifts data by the computation of OLAP operations over them. A prototype implementation has also been realized to illustrate the functionality of the proposed model. The produced results demonstrate its efficiency in improving nursing productivity.