Evaluating partitioning and bucketing strategies for Hive-based Big Data Warehousing systems

doi:10.1186/S40537-019-0196-1

Open Access Journal Article

Eduarda Costa, +2 more

Journal of Big Data, Vol. 6, Iss. 1, pp. 1-38 (01 Dec 2019)

TLDR

This paper evaluates the impact of data partitioning and bucketing on query performance in Hive-based systems by testing different data organization strategies. The results demonstrate the advantages of implementing Big Data Warehouses based on denormalized models and the potential benefit of adequate partitioning strategies.

Abstract:

Hive has long been one of the industry-leading systems for Data Warehousing in Big Data contexts. It organizes data into databases, tables, partitions and buckets, stored on top of an unstructured distributed file system like HDFS. Several studies have examined how to optimize the performance of storage systems for Big Data Warehousing. However, few of them explore the impact of data organization strategies on query performance when Hive is the storage technology for a Big Data Warehousing system. This paper therefore evaluates the impact of data partitioning and bucketing in Hive-based systems, testing different data organization strategies and measuring their effect on query performance. The results demonstrate the advantages of implementing Big Data Warehouses based on denormalized models and the potential benefit of adequate partitioning strategies. Defining partitions on the attributes that are frequently used in query conditions/filters can significantly reduce response time. In the more intensive workload benchmarked in this paper, overall decreases of about 40% in processing time were observed. The same does not hold for bucketing strategies, which show potential benefits only in very specific scenarios. This suggests a more restricted use of bucketing, namely when two tables are bucketed by their join attribute.
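
The two mechanisms the abstract contrasts can be illustrated with a small in-memory sketch. It is not taken from the paper and does not use Hive. The table contents, the column names (`year`, `customer_id`) and all function names are hypothetical, and bucket assignment uses a plain modulo on integer keys as a stand-in for Hive's hash-based assignment. The sketch only shows why a filter on the partition attribute reads fewer rows, and why bucketing two tables by their join attribute limits the pairs of rows a join has to compare.

```python
from collections import defaultdict


def partition_rows(rows, key):
    """Group rows by the partition attribute (one group per distinct value)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)


def scan_partitioned(partitions, wanted):
    """Read only the partition that matches the filter value."""
    rows = partitions.get(wanted, [])
    return rows, len(rows)  # (matching rows, number of rows read)


def scan_unpartitioned(rows, key, wanted):
    """Full scan: every row is read, then filtered."""
    matches = [r for r in rows if r[key] == wanted]
    return matches, len(rows)


def bucket_rows(rows, key, num_buckets):
    """Assign rows to buckets by key modulo num_buckets (assumes integer keys)."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[row[key] % num_buckets].append(row)
    return buckets


def bucketed_join(left_buckets, right_buckets, key):
    """Join bucket i of the left table only with bucket i of the right table."""
    joined, comparisons = [], 0
    for left, right in zip(left_buckets, right_buckets):
        for l in left:
            for r in right:
                comparisons += 1
                if l[key] == r[key]:
                    joined.append({**l, **r})
    return joined, comparisons


# Hypothetical toy data: 12 sales rows over 3 years and 8 customers.
sales = [{"order_id": i, "year": 2017 + i % 3, "customer_id": i % 8}
         for i in range(12)]
customers = [{"customer_id": c, "name": f"c{c}"} for c in range(8)]

# Partitioning: a filter on the partition attribute touches one partition only.
pruned, scanned_part = scan_partitioned(partition_rows(sales, "year"), 2018)
full_matches, scanned_full = scan_unpartitioned(sales, "year", 2018)
print(scanned_part, scanned_full)  # 4 12

# Bucketing: both tables bucketed by the join attribute, same bucket count.
joined, comparisons = bucketed_join(
    bucket_rows(sales, "customer_id", 4),
    bucket_rows(customers, "customer_id", 4),
    "customer_id",
)
print(len(joined), comparisons)  # 12 24
```

In this toy case the bucketed join compares 24 row pairs instead of the 96 a naive nested loop over both tables would need. The saving appears only because both tables share the bucketing attribute and the bucket count, which matches the narrow scenario in which the abstract reports a benefit from bucketing.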
