Showing papers on "Online analytical processing published in 2017"

PDF

Open Access

Proceedings Article•DOI•

Extracting Top-K Insights from Multi-dimensional Data

[...]

Bo Tang¹, Shi Han², Man Lung Yiu¹, Rui Ding², Dongmei Zhang² - Show less +1 more•Institutions (2)

Hong Kong Polytechnic University¹, Microsoft²

09 May 2017

TL;DR: This paper takes the first attempt towards automatically extracting top-k insights from multi-dimensional data by proposing the concept of insight, which captures interesting observation derived from aggregation results in multiple steps.

...read moreread less

Abstract: OLAP tools have been extensively used by enterprises to make better and faster decisions. Nevertheless, they require users to specify group-by attributes and know precisely what they are looking for. This paper takes the first attempt towards automatically extracting top-k insights from multi-dimensional data. This is useful not only for non-expert users, but also reduces the manual effort of data analysts. In particular, we propose the concept of insight which captures interesting observation derived from aggregation results in multiple steps (e.g., rank by a dimension, compute the percentage of measure by a dimension). An example insight is: ``Brand B's rank (across brands) falls along the year, in terms of the increase in sales''. Our problem is to compute the top-k insights by a score function. It poses challenges on (i) the effectiveness of the result and (ii) the efficiency of computation. We propose a meaningful scoring function for insights to address (i). Then, we contribute a computation framework for top-k insights, together with a suite of optimization techniques (i.e., pruning, ordering, specialized cube, and computation sharing) to address (ii). Our experimental study on both real data and synthetic data verifies the effectiveness and efficiency of our proposed solution.

...read moreread less

98 citations

Journal Article•DOI•

Relaxed operator fusion for in-memory databases: making compilation, vectorization, and prefetching work together at last

[...]

Prashanth Menon¹, Todd C. Mowry¹, Andrew Pavlo¹•Institutions (1)

Carnegie Mellon University¹

01 Sep 2017

TL;DR: A query processing model called "relaxed operator fusion" is presented that allows the DBMS to introduce staging points in the query plan where intermediate results are temporarily materialized and reduces the execution time of OLAP queries by up to 2.2× and achieves up to 1.8× better performance compared to other in-memory DBMSs.

...read moreread less

Abstract: In-memory database management systems (DBMSs) are a key component of modern on-line analytic processing (OLAP) applications, since they provide low-latency access to large volumes of data. Because disk accesses are no longer the principle bottleneck in such systems, the focus in designing query execution engines has shifted to optimizing CPU performance. Recent systems have revived an older technique of using just-in-time (JIT) compilation to execute queries as native code instead of interpreting a plan. The state-of-the-art in query compilation is to fuse operators together in a query plan to minimize materialization overhead by passing tuples efficiently between operators. Our empirical analysis shows, however, that more tactful materialization yields better performance.We present a query processing model called "relaxed operator fusion" that allows the DBMS to introduce staging points in the query plan where intermediate results are temporarily materialized. This allows the DBMS to take advantage of inter-tuple parallelism inherent in the plan using a combination of prefetching and SIMD vectorization to support faster query execution on data sets that exceed the size of CPU-level caches. Our evaluation shows that our approach reduces the execution time of OLAP queries by up to 2.2× and achieves up to 1.8× better performance compared to other in-memory DBMSs.

...read moreread less

95 citations

Proceedings Article•DOI•

Hybrid Transactional/Analytical Processing: A Survey

[...]

Fatma Ozcan¹, Yuanyuan Tian¹, Pinar Tözün¹•Institutions (1)

IBM¹

09 May 2017

TL;DR: This tutorial is to quickly review the historical progression of OLTP and OLAP systems, discuss the driving factors for HTAP, and provide a deep technical analysis of existing and emerging HTAP solutions, detailing their key architectural differences and trade-offs.

...read moreread less

Abstract: The popularity of large-scale real-time analytics applications (real-time inventory/pricing, recommendations from mobile apps, fraud detection, risk analysis, IoT, etc.) keeps rising. These applications require distributed data management systems that can handle fast concurrent transactions (OLTP) and analytics on the recent data. Some of them even need running analytical queries (OLAP) as part of transactions. Efficient processing of individual transactional and analytical requests, however, leads to different optimizations and architectural decisions while building a data management system. For the kind of data processing that requires both analytics and transactions, Gartner recently coined the term Hybrid Transactional/Analytical Processing (HTAP). Many HTAP solutions are emerging both from the industry as well as academia that target these new applications. While some of these are single system solutions, others are a looser coupling of OLTP databases or NoSQL systems with analytical big data platforms, like Spark. The goal of this tutorial is to 1-) quickly review the historical progression of OLTP and OLAP systems, 2-) discuss the driving factors for HTAP, and finally 3-) provide a deep technical analysis of existing and emerging HTAP solutions, detailing their key architectural differences and trade-offs.

...read moreread less

82 citations

Proceedings Article•DOI•

BatchDB: Efficient Isolated Execution of Hybrid OLTP+OLAP Workloads for Interactive Applications

[...]

Darko Makreshanski¹, Jana Giceva¹, Claude Barthels¹, Gustavo Alonso¹•Institutions (1)

ETH Zurich¹

09 May 2017

TL;DR: BatchDB achieves good performance, provides a high level of data freshness, and minimizes load interaction between the transactional and analytical engines, thus enabling real time analysis over fresh data under tight SLAs for both OLTP and OLAP workloads.

...read moreread less

Abstract: In this paper we present BatchDB, an in-memory database engine designed for hybrid OLTP and OLAP workloads. BatchDB achieves good performance, provides a high level of data freshness, and minimizes load interaction between the transactional and analytical engines, thus enabling real time analysis over fresh data under tight SLAs for both OLTP and OLAP workloads. BatchDB relies on primary-secondary replication with dedicated replicas, each optimized for a particular workload type (OLTP, OLAP), and a light-weight propagation of transactional updates. The evaluation shows that for standard TPC-C and TPC-H benchmarks, BatchDB can achieve competitive performance to specialized engines for the corresponding transactional and analytical workloads, while providing a level of performance isolation and predictable runtime for hybrid workload mixes (OLTP+OLAP) otherwise unmet by existing solutions.

...read moreread less

82 citations

Journal Article•DOI•

A view-based model of data-cube to support big earth data systems interoperability

[...]

Stefano Nativi¹, Paolo Mazzetti¹, Max Craglia•Institutions (1)

National Research Council¹

04 Dec 2017

TL;DR: A view-based model of Earth Data-Cube systems to design its infrastructural architecture and content schemas is introduced, with the final goal of enabling and facilitating interoperability.

...read moreread less

Abstract: Big Earth Data-Cube infrastructures are becoming more and more popular to provide Analysis Ready Data, especially for managing satellite time series. These infrastructures build on the concept of m...

...read moreread less

58 citations

Journal Article•DOI•

Adaptive work placement for query processing on heterogeneous computing resources

[...]

Tomas Karnagel¹, Dirk Habich¹, Wolfgang Lehner¹•Institutions (1)

Dresden University of Technology¹

01 Mar 2017

TL;DR: This work proposes an adaptive placement approach being independent of cardinality estimation of intermediate results, which significantly improves OLAP query processing on heterogeneous hardware, while being adaptive enough to react to changing cardinalities of intermediate query results.

...read moreread less

Abstract: The hardware landscape is currently changing from homogeneous multi-core systems towards heterogeneous systems with many different computing units, each with their own characteristics. This trend is a great opportunity for data-base systems to increase the overall performance if the heterogeneous resources can be utilized efficiently. To achieve this, the main challenge is to place the right work on the right computing unit. Current approaches tackling this placement for query processing assume that data cardinalities of intermediate results can be correctly estimated. However, this assumption does not hold for complex queries. To overcome this problem, we propose an adaptive placement approach being independent of cardinality estimation of intermediate results. Our approach is incorporated in a novel adaptive placement sequence. Additionally, we implement our approach as an extensible virtualization layer, to demonstrate the broad applicability with multiple database systems. In our evaluation, we clearly show that our approach significantly improves OLAP query processing on heterogeneous hardware, while being adaptive enough to react to changing cardinalities of intermediate query results.

...read moreread less

38 citations

Book•

Main Memory Database Systems

[...]

Alfons Kemper¹, Thomas Neumann¹•Institutions (1)

Technische Universität München¹

20 Jul 2017

TL;DR: The recent advances in processor technology - soon hundreds of cores and terabytes of DRAM in commodity servers - have spawned the academic as well as the industrial interest in main-memory database technology as discussed by the authors.

...read moreread less

Abstract: The recent advances in processor technology - soon hundreds of cores and terabytes of DRAM in commodity servers - have spawned the academic as well as the industrial interest in main-memory database technology. In this panel, we will discuss the virtues of different architectural designs w.r.t. transaction processing as well as OLAP query processing.

...read moreread less

33 citations

Journal Article•DOI•

Parallel replication across formats in SAP HANA for scaling out mixed OLTP/OLAP workloads

[...]

Juchang Lee, Seunghyun Moon¹, Kyu Hwan Kim, Kim Deok Hoe, Sang Kyun Cha², Wook-Shin Han¹ - Show less +2 more•Institutions (2)

Pohang University of Science and Technology¹, Seoul National University²

01 Aug 2017

TL;DR: Asynchronous Parallel Table Replication (ATR) employs a novel optimistic lock-free parallel log replay scheme which exploits characteristics of multi-version concurrency control (MVCC) in order to enable real-time reporting by minimizing the propagation delay between the primary and replicas.

...read moreread less

Abstract: Modern in-memory database systems are facing the need of efficiently supporting mixed workloads of OLTP and OLAP. A conventional approach to this requirement is to rely on ETL-style, application-driven data replication between two very different OLTP and OLAP systems, sacrificing real-time reporting on operational data. An alternative approach is to run OLTP and OLAP workloads in a single machine, which eventually limits the maximum scalability of OLAP query performance. In order to tackle this challenging problem, we propose a novel database replication architecture called Asynchronous Parallel Table Replication (ATR). ATR supports OLTP workloads in one primary machine, while it supports heavy OLAP workloads in replicas. Here, row-store formats can be used for OLTP transactions at the primary, while column-store formats are used for OLAP analytical queries at the replicas. ATR is designed to support elastic scalability of OLAP query performance while it minimizes the overhead for transaction processing at the primary and minimizes CPU consumption for replayed transactions at the replicas. ATR employs a novel optimistic lock-free parallel log replay scheme which exploits characteristics of multi-version concurrency control (MVCC) in order to enable real-time reporting by minimizing the propagation delay between the primary and replicas. Through extensive experiments with a concrete implementation available in a commercial database system, we demonstrate that ATR achieves sub-second visibility delay even for update-intensive workloads, providing scalable OLAP performance without notable overhead to the primary.

...read moreread less

31 citations

Journal Article•DOI•

Reverse engineering aggregation queries

[...]

Wei Chit Tan¹, Meihui Zhang¹, Hazem Elmeleegy, Divesh Srivastava²•Institutions (2)

Singapore University of Technology and Design¹, AT&T Labs²

01 Aug 2017

TL;DR: This paper develops a novel three-phase algorithm named REGAL 1, based on a lattice graph structure, that finds a multi-dimensional filter that is needed to generate the exact query output table for OLAP queries with group-by and aggregation.

...read moreread less

Abstract: Query reverse engineering seeks to re-generate the SQL query that produced a given query output table from a given database. In this paper, we solve this problem for OLAP queries with group-by and aggregation. We develop a novel three-phase algorithm named REGAL 1 for this problem. First, based on a lattice graph structure, we identify a set of group-by candidates for the desired query. Second, we apply a set of aggregation constraints that are derived from the properties of aggregate operators at both the table-level and the group-level to discover candidate combinations of group-by columns and aggregations that are consistent with the given query output table. Finally, we find a multi-dimensional filter, i.e., a conjunction of selection predicates over the base table attributes, that is needed to generate the exact query output table. We conduct an extensive experimental study over the TPC-H dataset to demonstrate the effectiveness and efficiency of our proposal.

...read moreread less

30 citations

SAP HANA – The Evolution of an In-Memory DBMS from Pure OLAP Processing Towards Mixed Workloads

[...]

Norman May, Alexander Böhm, Wolfgang Lehner

01 Jan 2017

TL;DR: The challenges of running mixed workloads with low-latency OLTP queries and complex analytical queries in the context of the same database management system are discussed and an outlook on the future database interaction patterns of modern business applications is given.

...read moreread less

Abstract: The journey of SAP HANA started as an in-memory appliance for complex, analytical applications. The success of the system quickly motivated SAP to broaden the scope from the OLAP workloads the system was initially architected for to also handle transactional workloads, in particular to support its Business Suite flagship product. In this paper, we highlight some of the core design changes to evolve an in-memory column store system towards handling OLTP workloads. We also discuss the challenges of running mixed workloads with low-latency OLTP queries and complex analytical queries in the context of the same database management system and give an outlook on the future database interaction patterns of modern business applications we see emerging currently.

...read moreread less

28 citations

Proceedings Article•DOI•

Efficient Exploration of Telco Big Data with Compression and Decaying

[...]

Constantinos Costa¹, Georgios Chatzimilioudis¹, Demetrios Zeinalipour-Yazti², Mohamed F. Mokbel³•Institutions (3)

University of Cyprus¹, Max Planck Society², University of Minnesota³

19 Apr 2017

TL;DR: SPATE is introduced, an innovative telco big data exploration framework whose objectives are minimizing the storage space needed to incrementally retain data over time, and minimizing the response time for spatiotemporal data exploration queries over recent data.

...read moreread less

Abstract: In the realm of smart cities, telecommunication companies (telcos) are expected to play a protagonistic role as these can capture a variety of natural phenomena on an ongoing basis, e.g., traffic in a city, mobility patterns for emergency response or city planning. The key challenges for telcos in this era is to ingest in the most compact manner huge amounts of network logs, perform big data exploration and analytics on the generated data within a tolerable elapsed time. This paper introduces SPATE, an innovative telco big data exploration framework whose objectives are two-fold: (i) minimizing the storage space needed to incrementally retain data over time, and (ii) minimizing the response time for spatiotemporal data exploration queries over recent data. The storage layer of our framework uses lossless data compression to ingest recent streams of telco big data in the most compact manner retaining full resolution for data exploration tasks. The indexing layer of our system then takes care of the progressive loss of detail in information, coined decaying, as data ages with time. The exploration layer provides visual means to explore the generated spatio-temporal information space. We measure the efficiency of the proposed framework using a 5GB anonymized real telco network trace and a variety of telco-specific tasks, such as OLAP and OLTP querying, privacy-aware data sharing, multivariate statistics, clustering and regression. We show that out framework can achieve comparable response times to the state-of-the-art using an order of magnitude less storage space.

...read moreread less

Journal Article•DOI•

In-Depth Analysis of Energy Efficiency Related Factors in Commercial Buildings Using Data Cube and Association Rule Mining

[...]

Byeongjoon Noh, Juntae Son, Hansaem Park, Seongju Chang

17 Nov 2017-Sustainability

TL;DR: In this paper, a data cube model combined with association rule mining is proposed for more flexible and detailed analysis of building energy consumption profiles using the Commercial Buildings Energy Consumption Survey (CBECS) dataset, which has accumulated over 6700 existing commercial buildings across the U.S.A.

...read moreread less

Abstract: Significant amounts of energy are consumed in the commercial building sector, resulting in various adverse environmental issues. To reduce energy consumption and improve energy efficiency in commercial buildings, it is necessary to develop effective methods for analyzing building energy use. In this study, we propose a data cube model combined with association rule mining for more flexible and detailed analysis of building energy consumption profiles using the Commercial Buildings Energy Consumption Survey (CBECS) dataset, which has accumulated over 6700 existing commercial buildings across the U.S.A. Based on the data cube model, a multidimensional commercial sector building energy analysis was performed based upon on-line analytical processing (OLAP) operations to assess the energy efficiency according to building factors with various levels of abstraction. Furthermore, the proposed analysis system provided useful information that represented a set of energy efficient combinations by applying the association rule mining method. We validated the feasibility and applicability of the proposed analysis model by structuring a building energy analysis system and applying it to different building types, weather conditions, composite materials, and heating/cooling systems of the multitude of commercial buildings classified in the CBECS dataset.

...read moreread less

Journal Article•DOI•

Materialized View Selection using Artificial Bee Colony Optimization

[...]

Biri Arun¹, T. V. Vijay Kumar¹•Institutions (1)

Jawaharlal Nehru University¹

01 Jan 2017-International Journal of Intelligent Information Technologies

TL;DR: The views selected using ABCVSA on materialization would reduce the query response time of OLAP queries and thereby aid analysts in arriving at strategic business decisions in an effective manner.

...read moreread less

Abstract: Data warehouse is an essential component of almost every modern enterprise information system. It stores huge amount of subject-oriented, time-stamped, non-volatile and integrated data. It is highly required of the system to respond to complex online analytical queries posed against its data warehouse in seconds for efficient decision making. Optimization of online analytical query processing OLAP could substantially minimize delays in query response time. Materialized view is an efficient and effective OLAP query optimization technique to minimize query response time. Selecting a set of such appropriate views for materialization is referred to as view selection, which is a nontrivial task. In this regard, an Artificial Bee Colony ABC based view selection algorithm ABCVSA, which has been adapted by incorporating N-point and GBFS based N-point random insertion operations, to select Top-K views from a multidimensional lattice is proposed. Experimental results show that ABCVSA performs better than the most fundamental view selection algorithm HRUA. Thus, the views selected using ABCVSA on materialization would reduce the query response time of OLAP queries and thereby aid analysts in arriving at strategic business decisions in an effective manner.

...read moreread less

Journal Article•DOI•

EXODuS: Exploratory OLAP over Document Stores

[...]

Mohamed Lamine Chouder¹, Stefano Rizzi², Rachid Chalal¹•Institutions (2)

École Normale Supérieure¹, University of Bologna²

01 Nov 2017-Information Systems

TL;DR: This paper proposes EXODuS, an interactive, schema-on-read approach to enable OLAP querying of document stores in the context of self-service BI and exploratory OLAP, which adopts a data-driven approach based on the mining of approximate functional dependencies to discover multidimensional hierarchies in document stores.

...read moreread less

Journal Article•DOI•

An application of OLAP/GIS-Fuzzy AHP-TOPSIS methodology for decision making: Location selection for landfill of industrial wastes as a case study

[...]

Mohamed Hanine¹, Omar Boutkhoum¹, Abdessadek Tikniouine¹, Tarik Agouti¹•Institutions (1)

Cadi Ayyad University¹

01 Sep 2017-Ksce Journal of Civil Engineering

TL;DR: In this paper, an OLAP/GIS-Fuzzy AHP-TOPSIS based methodology for evaluation and selection of best sites for landfill of industrial wastes (LIW) is proposed.

...read moreread less

Abstract: The location selection for Landfill of Industrial Wastes (LIW) is a very significant task in waste management studies, which has significant impacts on sustainable development of the region. Furthermore, the selection of the appropriate and efficient sites for LIWs is an important multi-criteria decision making problem. The present document suggests an OLAP/GIS-Fuzzy AHP-TOPSIS based methodology for evaluation and selection of best sites for LIWs. In this respect, the candidate locations are specified based on the combination of On-Line Analytical Processing and Geographic Information System (OLAP/GIS). The Fuzzy Analytical Hierarchy Process (Fuzzy-AHP), a multi-criteria decision-making method is applied to analyze the structure of the problem and obtain the weights of the qualitative and quantitative criteria, by incorporating the uncertainty values in decision-making. Then, the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) is taken into account to assess and rank the alternative locations. Finally, a hypothetical application of the proposed approach is illustrated by a case study of location selection for LIWs in Morocco. The results show that the proposed methodology can successfully achieve the aim of this work.

...read moreread less

Patent•

Multidimensional online analytical processing (MOLAP)-based data processing method and apparatus

[...]

Li Yinwei

08 Mar 2017

TL;DR: In this paper, a multidimensional online analytical processing (MOLAP)-based data processing method and apparatus is presented, which comprises the steps of creating a data cube according to a fact table and a dimension table; performing data pre-calculation on all possible combinations of dimensions according to data recorded in the data cube; and storing a precalculation result in an open-source database.

...read moreread less

Abstract: The present invention discloses a multidimensional online analytical processing (MOLAP)-based data processing method and apparatus. The data processing method comprises the steps of creating a data cube according to a fact table and a dimension table; performing data pre-calculation on all possible combinations of dimensions according to data recorded in the data cube; and storing a pre-calculation result in an open-source database, so as to determine an inquire result according to the pre-calculation result in inquiry. Through adoption of the method, the existing data inquiry solution can be optimized, and non-technical staff can inquire in massive data.

...read moreread less

Proceedings Article•DOI•

Icarus: Towards a multistore database system

[...]

Marco Vogt¹, Alexander Stiemer¹, Heiko Schuldt¹•Institutions (1)

University of Basel¹

01 Dec 2017

TL;DR: This paper introduces the multistore ICARUS, a database management system that combines OLTP and OLAP elements and is able to speed-up queries up to a factor of three by properly routing queries to the best underlying DBMS.

...read moreread less

Abstract: The last years have seen a vast diversification on the database market. In contrast to the “one-size-fits-all” paradigm according to which systems have been designed in the past, today's database management systems (DBMS) are tuned for particular workloads. This has led to DBMSs optimized for high performance, high throughput read/write workloads in online transaction processing (OLTP) and systems optimized for complex analytical queries (OLAP). However, this approach reaches a limit when systems have to deal with mixed workloads that are neither pure OLAP nor pure OLTP workloads. In such cases, multistores are increasingly gaining popularity. Rather than supporting one single database paradigm and addressing one particular workload, multistores encompass several DBMSs that store data in different schemas and allow to route requests on a per-query level to the most appropriate system. In this paper, we introduce the multistore ICARUS. In our evaluation based on a workload that combines OLTP and OLAP elements, we show that ICARUS is able to speed-up queries up to a factor of three by properly routing queries to the best underlying DBMS.

...read moreread less

Journal Article•DOI•

Multidimensional Model Design using Data Mining: A Rapid Prototyping Methodology

[...]

Sandro Bimonte, Lucile Sautot¹, Ludovic Journaux, Bruno Faivre²•Institutions (2)

Agro ParisTech¹, University of Burgundy²

01 Jan 2017-International Journal of Data Warehousing and Mining

TL;DR: A new rapid prototyping methodology, integrating two different DM algorithms, to define dimension hierarchies according to decision-maker knowledge, and a complete UML Profile to define a DW schema that integrates both the DM algorithms are proposed.

...read moreread less

Abstract: Designing and building a Data Warehouse DW, and associated OLAP cubes, are long processes, during which decision-maker requirements play an important role. But decision-makers are not OLAP experts and can find it difficult to deal with the concepts behind DW and OLAP. To support DW design in this context, we propose: i a new rapid prototyping methodology, integrating two different DM algorithms, to define dimension hierarchies according to decision-maker knowledge; ii a complete UML Profile, to define a DW schema that integrates both the DM algorithms; iii a mapping process to transform multidimensional schemata according to the results of the DM algorithms; iv a tool implementing the proposed methodology; v a full validation, based on a real case study concerning bird biodiversity. In conclusion, we confirm the rapidity and efficacy of our methodology and tool in providing a multidimensional schema to satisfy decision-maker analytical needs.

...read moreread less

Posted Content•

Warehousing complex data from the Web

[...]

Omar Boussaid¹, Jérôme Darmont¹, Fadila Bentayeb¹, Sabine Loudcher¹•Institutions (1)

University of Lyon¹

02 Jan 2017-arXiv: Databases

TL;DR: In this paper, the authors present a complex data warehousing methodology that exploits XML as a pivot language, which includes the integration of complex data in an ODS, under the form of XML documents; their dimensional modeling and storage in an XML data warehouse; and their analysis with combined OLAP and data mining techniques.

...read moreread less

Abstract: The data warehousing and OLAP technologies are now moving onto handling complex data that mostly originate from the Web. However, intagrating such data into a decision-support process requires their representation under a form processable by OLAP and/or data mining techniques. We present in this paper a complex data warehousing methodology that exploits XML as a pivot language. Our approach includes the integration of complex data in an ODS, under the form of XML documents; their dimensional modeling and storage in an XML data warehouse; and their analysis with combined OLAP and data mining techniques. We also address the crucial issue of performance in XML warehouses.

...read moreread less

Journal Article•DOI•

QETL: An approach to on-demand ETL from non-owned data sources

[...]

Lorenzo Baldacci¹, Matteo Golfarelli¹, Simone Graziani¹, Stefano Rizzi¹•Institutions (1)

University of Bologna¹

01 Nov 2017

TL;DR: The experimental tests show that QETL effectively reuses data to cut extraction costs, thus leading to significant performance improvements, and is proposed to feed a multidimensional cube.

...read moreread less

Abstract: In traditional OLAP systems, the ETL process loads all available data in the data warehouse before users start querying them. In some cases, this may be either inconvenient (because data are supplied from a provider for a fee) or unfeasible (because of their size); on the other hand, directly launching each analysis query on source data would not enable data reuse, leading to poor performance and high costs. The alternative investigated in this paper is that of fetching and storing data on-demand, i.e., as they are needed during the analysis process. In this direction we propose the Query-Extract-Transform-Load (QETL) paradigm to feed a multidimensional cube; the idea is to fetch facts from the source data provider, load them into the cube only when they are needed to answer some OLAP query, and drop them when some free space is needed to load other facts. Remarkably, QETL includes an optimization step to cheaply extract the required data based on the specific features of the data provider. The experimental tests, made on a real case study in the genomics area, show that QETL effectively reuses data to cut extraction costs, thus leading to significant performance improvements.

...read moreread less

Patent•

Big data-based online analytical processing system and method

[...]

Lin Jie, Zhao Yanyan, Tang Yuan, Zhong Dejian, Li Nianhua - Show less +1 more

01 Feb 2017

TL;DR: In this article, a big data-based online analytical processing system and method is proposed for carrying out quick multi-dimensional query and analysis on data sets with different scales and levels under a Hadoop environment.

...read moreread less

Abstract: The invention discloses a big data-based online analytical processing system and method. The system can be used for carrying out quick multi-dimensional query and analysis on data sets with different scales and levels under a Hadoop environment. A query plan selected through query, planning and estimation comprises MDX query supporting Hive and Hbase precomputation cache mechanism-based multi-dimensional query. According to the system and method, optimization of the MDX query supporting Hive data warehouses on extensible cluster nodes and of the Hbase precomputation cache mechanism-based multi-dimensional query are realized, the low-delay multi-dimensional query requirements of the data sets with different scales and levels are satisfied, and the OLAP multi-dimensional query of different OLAP data organization models under a single data source background is solved. Aiming at the performance optimization problem of Hive multi-dimensional query on large-scale data sets, an Hbase cache-based segmented layered dimensionality-reduction aggregation algorithm is proposed, and the algorithm brings MOLAP for solving the multi-dimensional query calculation of large-scale data into a big data OLAP system, so that the extendibility and effectiveness of the multi-dimensional query of data with different scales and levels under a big data background are greatly enhanced.

...read moreread less

Proceedings Article•DOI•

Approximate Query Processing for Interactive Data Science

[...]

Tim Kraska¹•Institutions (1)

Brown University¹

09 May 2017

TL;DR: This talk presents some of the recent results from building a third-generation AQP system, called IDEA, which is the first Interactive Data Exploration Accelerator and allows data scientists to connect to a data source and immediately start exploring without any preparation time while still guaranteeing interactive latencies largely independently of the type of operation or data size.

...read moreread less

Abstract: shift in the algorithms and tools used to analyze data towards more interactive systems with highly collaborative and visual interfaces. Ideally, a data scientist and a domain expert should be able to make discoveries together by directly manipulating, analyzing, and visualizing data on the spot, for example, using an interactive whiteboard like the recently released Microsoft Surface Hub. While such an interactive pattern would democratize data science and make it more accessible to a wider range of users, it also requires a rethinking of the full analytical stack. Most importantly, it necessitates the next generation of approximate query processing (AQP) techniques to guarantee (visual) results at interactive speeds during the data exploration process. The first generation of AQP focused on online aggregation for simple OLAP queries; a small subset of the functionality needed for data science workflows. The second generation widened the scope to more complex workflows mainly by taking advantage of pre-computed samples at the cost of assuming that most or all queries are known upfront; again a bad fit as it is rarely the case that all exploration patterns are known in advance. The next, the third, generation of AQP has to give up on this assumption, that most queries are known upfront, but instead can leverage that data exploration pipelines are incrementally created by the user through a visual interface. In this talk, I will present some of our recent results from building a third-generation AQP system, called IDEA. IDEA is the first Interactive Data Exploration Accelerator and allows data scientists to connect to a data source and immediately start exploring without any preparation time while still guaranteeing interactive latencies largely independently of the type of operation or data size. IDEA achieves this through novel AQP- and result reuse-techniques, which better leverage the incremental nature of the exploration process. Most importantly, IDEA automatically creates stratified samples based on the user interaction and is able to reuse approximate intermediate results between interactions. The core idea behind our approximation and reuse technique is a reformulation of the AQP model itself based on the observation that most visualizations convey simple statistics over the data. For example, the commonly used count histograms can be seen as visualizations of the frequency statistic over the value range of an attribute. To leverage this, we propose a new AQP model that treats the aggregate query results as random variables. Surprisingly, this new model makes it not only easier to reuse results and to reason formally about the error bounds but also enables a completely new set of query rewrite rules based on probability theory. Finally, it turns out that online aggregation, which is typically used to approximate results without a pre-computed index, stratified samples, or sketches, struggles to provide high-quality results for rare sub-populations. At the same time, as one of our user studies revealed, it is quite common for users to explore rare sub-populations, as they often contain the most interesting insights (e.g., the habits of the few highly valued customers, the suspicious outliers, etc.). We, therefore, propose a new data structure, called tail index, which is a low-overhead partial index that is created on the fly based on the user interactions. Together with our new AQP model, tail indexes enable us to provide low approximation errors, even on increasingly small sub-population, at interactive speeds without any pre-computation or an upfront known workload.

...read moreread less

Book Chapter•DOI•

On-line analytical processing

[...]

Alberto Abelló Gamazo, Óscar Romero Moral

21 Jan 2017

TL;DR: On-line analytical processing (OLAP) describes an approach to decision support, which aims to extract knowledge from a data warehouse, or more specifically, from data marts, so that they are able to interactively generate ad hoc queries without the intervention of IT professionals.

...read moreread less

Abstract: On-line analytical processing (OLAP) describes an approach to decision support, which aims to extract knowledge from a data warehouse, or more specifically, from data marts. Its main idea is providing navigation through data to non-expert users, so that they are able to interactively generate ad hoc queries without the intervention of IT professionals. This name was introduced in contrast to on-line transactional processing (OLTP), so that it reflected the different requirements and characteristics between these classes of uses. The concept falls in the area of business intelligence.

...read moreread less

Journal Article•DOI•

A Proposed Business Intelligent Framework for Recommender Systems

[...]

Sitalakshmi Venkatraman

15 Nov 2017

TL;DR: A new paradigm of applying business intelligence (BI) concepts to RS for intelligently responding to user changes and business complexities is explored and a BI based framework adopting a hybrid methodology for RS is proposed with a focus on enhancing the RS performance.

...read moreread less

Abstract: In this Internet age, recommender systems (RS) have become popular, offering new opportunities and challenges to the business world. With a continuous increase in global competition, e-businesses, information portals, social networks and more, websites are required to become more user-centric and rely on the presence and role of RS in assisting users in better decision making. However, with continuous changes in user interests and consumer behavior patterns that are influenced by easy access to vast information and social factors, raising the quality of recommendations has become a challenge for recommender systems. There is a pressing need for exploring hybrid models of the five main types of RS, namely collaborative, demographic, utility, content and knowledge based approaches along with advancements in Big Data (BD) to become more context-aware of the technology and social changes and to behave intelligently. There is a gap in literature with a research focus in this direction. This paper takes a step to address this by exploring a new paradigm of applying business intelligence (BI) concepts to RS for intelligently responding to user changes and business complexities. A BI based framework adopting a hybrid methodology for RS is proposed with a focus on enhancing the RS performance. Such a business intelligent recommender system (BIRS) can adopt On-line Analytical Processing (OLAP) tools and performance monitoring metrics using data mining techniques of BI to enhance its own learning, user profiling and predictive models for making a more useful set of personalised recommendations to its users. The application of the proposed framework to a B2C e-commerce case example is presented.

...read moreread less

Posted Content•

Fast OLAP Query Execution in Main Memory on Large Data in a Cluster

[...]

Demian Hespe, Martin Weidner, Jonathan Dees, Peter Sanders

15 Sep 2017-arXiv: Databases

TL;DR: In this paper, the authors explore techniques for efficient execution of analytical SQL queries on large amounts of data in a parallel database cluster while making maximal use of the available hardware, including precompiled query plans for efficient CPU utilization, full parallelization on single nodes and across the cluster, and efficient inter-node communication.

...read moreread less

Abstract: Main memory column-stores have proven to be efficient for processing analytical queries. Still, there has been much less work in the context of clusters. Using only a single machine poses several restrictions: Processing power and data volume are bounded to the number of cores and main memory fitting on one tightly coupled system. To enable the processing of larger data sets, switching to a cluster becomes necessary. In this work, we explore techniques for efficient execution of analytical SQL queries on large amounts of data in a parallel database cluster while making maximal use of the available hardware. This includes precompiled query plans for efficient CPU utilization, full parallelization on single nodes and across the cluster, and efficient inter-node communication. We implement all features in a prototype for running a subset of TPC-H benchmark queries. We evaluate our implementation using a 128 node cluster running TPC-H queries with 30 000 gigabyte of uncompressed data.

...read moreread less

Patent•

OLAP pre-calculation model, automatic modeling method, and automatic modeling system

[...]

Li Dong, Han Qing, Li Yang

01 Aug 2017

TL;DR: In this paper, the authors present an OLAP pre-calculation model, an automatic modeling method and an automatic modelling system, which includes a dimension module, an aggregation group module and a measure module.

...read moreread less

Abstract: The present application relates to an OLAP pre-calculation model, an automatic modeling method and an automatic modeling system The model includes a dimension module, an aggregation group module and a measure module The method includes collecting data statistics on all data sources to obtain data statistics result, conducting query dryrun based on data model and sample queries given by a user to determine a business model, conducting query dryrun on the sample queries and collecting query statistics, carrying out physical modeling and defining dimension, measure and aggregation group of a pre-calculation model, and obtaining a business modeling result and a pre-calculation model The system includes data statistics module, business model module, query statistics module, and model establishing modules A more efficient combination of pre-calculated dimensions can be produced and redundant calculation and data storage can be reduced

...read moreread less

Proceedings Article•DOI•

HTAPBench: Hybrid Transactional and Analytical Processing Benchmark

[...]

Fábio Coelho¹, João Paulo¹, Ricardo Vilaça, José Pereira¹, Rui Oliveira¹ - Show less +1 more•Institutions (1)

University of Minho¹

17 Apr 2017

TL;DR: A load balancer within HTAPBench regulates the coexistence of OLTP and OLAP workloads, proposing a method for the generation of both new data and requests, so that OLAP requests over freshly modified data are comparable across runs.

...read moreread less

Abstract: The increasing demand for real-time analytics requires the fusion of Transactional (OLTP) and Analytical (OLAP) systems, eschewing ETL processes and introducing a plethora of proposals for the so-called Hybrid Analytical and Transactional Processing (HTAP) systems.Unfortunately, current benchmarking approaches are not able to comprehensively produce a unified metric from the assessment of an HTAP system. The evaluation of both engine types is done separately, leading to the use of disjoint sets of benchmarks such as TPC-C or TPC-H.In this paper we propose a new benchmark, HTAPBench, providing a unified metric for HTAP systems geared toward the execution of constantly increasing OLAP requests limited by an admissible impact on OLTP performance. To achieve this, a load balancer within HTAPBench regulates the coexistence of OLTP and OLAP workloads, proposing a method for the generation of both new data and requests, so that OLAP requests over freshly modified data are comparable across runs.We demonstrate the merit of our approach by validating it with different types of systems: OLTP, OLAP and HTAP; showing that the benchmark is able to highlight the differences between them, while producing queries with comparable complexity across experiments with negligible variability.

...read moreread less

Journal Article•DOI•

A data warehouse to explore multidimensional simulated data from a spatially distributed agro-hydrological model to improve catchment nitrogen management

[...]

Tassadit Bouadi¹, Marie-Odile Cordier¹, Pierre Moreau², Ren Quiniou³, Jordy Salmon-Monviola², Chantal Gascuel-Odoux² - Show less +2 more•Institutions (3)

University of Rennes¹, Agrocampus Ouest², French Institute for Research in Computer Science and Automation³

01 Nov 2017-Environmental Modelling and Software

TL;DR: A data warehouse built to store and analyze simulation data from the spatially distributed agro-hydrological model TNT2 is described and how to use OLAP to explore and extract all kinds of useful high-level information by aggregating the data along these three dimensions is shown.

...read moreread less

Abstract: Spatially distributed agro-hydrological models allow researchers and stakeholders to represent, understand and formulate hypotheses about the functioning of agro-environmental systems and to predict their evolution. These models have guided agricultural management by simulating effects of landscape structure, farming system changes and their spatial arrangement on stream water quality. Such models generate many intermediate results that should be managed, analyzed and transformed into usable information. We describe a data warehouse (N-Catch) built to store and analyze simulation data from the spatially distributed agro-hydrological model TNT2. We present scientific challenges to and tools for building data warehouses and describe the three dimensions of N-Catch: space, time and an original hierarchical description of cropping systems. We show how to use OLAP to explore and extract all kinds of useful high-level information by aggregating the data along these three dimensions and how to facilitate exploration of the spatial dimension by coupling N-Catch with GIS. Such tool constitutes an efficient interface between science and society, simulation remaining a research activity, exploration of the results becoming an easy task accessible for a large audience. A data warehouse (DW) as a tool to explore simulated agro-environmental data.N-Catch as an example of a DW for analyzing N emissions across a catchment.DWs for catchment N management.

...read moreread less

Proceedings Article•DOI•

Performing OLAP over Graph Data: Query Language, Implementation, and a Case Study

[...]

Leticia I. Gómez¹, Bart Kuijpers², Alejandro A. Vaisman¹•Institutions (2)

Instituto Tecnológico de Buenos Aires¹, University of Hasselt²

28 Aug 2017

TL;DR: This paper shows how the proposed multidimensional (MD) data model for graph analysis was implemented over the widely used Neo4J graph database, discusses implementation issues, and presents a detailed case study to show how OLAP operations can be used on graphs.

...read moreread less

Abstract: In current Big Data scenarios, traditional data warehousing and Online Analytical Processing (OLAP) operations on cubes are clearly not sufficient to address the current data analysis requirements. Nevertheless, OLAP operations and models can expand the possibilities of graph analysis beyond the traditional graph-based computation. In spite of this, there is not much work on the problem of taking OLAP analysis to the graph data model. In previous work we proposed a multidimensional (MD) data model for graph analysis, that considers not only the basic graph data, but background information in the form of dimension hierarchies as well. The graphs in our model are node- and edge-labelled directed multi-hypergraphs, called graphoids, defined at several different levels of granularity. In this paper we show how we implemented this proposal over the widely used Neo4J graph database, discuss implementation issues, and present a detailed case study to show how OLAP operations can be used on graphs.

...read moreread less

Journal Article•DOI•

Textual aggregation approaches in OLAP context: A survey

[...]

Mustapha Bouakkaz¹, Youcef Ouinten¹, Sabine Loudcher², Yulia A. Strekalova³•Institutions (3)

University of Laghouat¹, University of Lyon², University of Florida³

01 Dec 2017-International Journal of Information Management

TL;DR: A new classification framework is provided in which the existing textual aggregation approaches are grouped into two main classes, namely approach based on cube structure and approaches based on text mining, which are discussed and synthesized.

...read moreread less

Collapse