Open Access Journal Article

Evaluating partitioning and bucketing strategies for Hive-based Big Data Warehousing systems

TLDR
This paper evaluates the impact of data partitioning and bucketing in Hive-based systems, testing different data organization strategies and verifying the efficiency of those strategies in query performance, demonstrating the advantages of implementing Big Data Warehouses based on denormalized models and the potential benefit of using adequate partitioning strategies.
Abstract
Hive has long been one of the industry-leading systems for Data Warehousing in Big Data contexts, mainly organizing data into databases, tables, partitions and buckets, stored on top of an unstructured distributed file system like HDFS. Several studies have examined how to optimize the performance of storage systems for Big Data Warehousing. However, few of them explore the impact of data organization strategies on query performance when Hive is the storage technology for implementing Big Data Warehousing systems. Therefore, this paper evaluates the impact of data partitioning and bucketing in Hive-based systems, testing different data organization strategies and verifying their effect on query performance. The results demonstrate the advantages of implementing Big Data Warehouses based on denormalized models and the potential benefit of adequate partitioning strategies. Defining partitions aligned with the attributes frequently used in query conditions/filters can significantly improve the system's response time. In the more intensive workload benchmarked in this paper, overall decreases of about 40% in processing time were observed. The same does not hold for bucketing strategies, which show potential benefits only in very specific scenarios, suggesting a more restricted use of this functionality, namely bucketing two tables by their join attribute.
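
The two ideas in the abstract, pruning partitions by a filter attribute and bucketing two tables by their join attribute, can be illustrated with a minimal in-memory sketch. This is a toy model under stated assumptions, not Hive's storage layer or the paper's benchmark: the tables, the column names (`year`, `customer_id`), and the helper functions are all hypothetical, and bucket assignment is simplified to an integer key modulo the bucket count.

```python
from collections import defaultdict

# Toy in-memory model; NOT Hive's actual implementation.
# All table names, column names, and helpers are hypothetical.

def partition_rows(rows, key):
    """Group rows by the partition column, one group per distinct value."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)

def scan_with_pruning(partitions, wanted_value):
    """Read only the partition matching the filter; return (rows, rows_scanned)."""
    rows = partitions.get(wanted_value, [])
    return rows, len(rows)

def scan_without_pruning(rows, key, wanted_value):
    """Full scan: every row is read, then filtered; return (rows, rows_scanned)."""
    matched = [r for r in rows if r[key] == wanted_value]
    return matched, len(rows)

def bucket_rows(rows, key, num_buckets):
    """Assign each row to a bucket by its integer key modulo num_buckets."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[row[key] % num_buckets].append(row)
    return buckets

def bucketed_join(left_buckets, right_buckets, key):
    """Join bucket i of the left table only with bucket i of the right table."""
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        index = defaultdict(list)
        for r in right:
            index[r[key]].append(r)
        for l in left:
            for r in index[l[key]]:
                joined.append({**l, **r})
    return joined

sales = [
    {"order_id": 1, "customer_id": 10, "year": 2016},
    {"order_id": 2, "customer_id": 11, "year": 2017},
    {"order_id": 3, "customer_id": 10, "year": 2017},
    {"order_id": 4, "customer_id": 12, "year": 2018},
]
customers = [
    {"customer_id": 10, "name": "A"},
    {"customer_id": 11, "name": "B"},
    {"customer_id": 12, "name": "C"},
]

# Partitioning: a filter on the partition column touches one partition only.
by_year = partition_rows(sales, "year")
pruned_rows, pruned_scanned = scan_with_pruning(by_year, 2017)
full_rows, full_scanned = scan_without_pruning(sales, "year", 2017)

# Bucketing: both tables use the same key and bucket count,
# so matching keys always land in buckets with the same index.
joined = bucketed_join(
    bucket_rows(sales, "customer_id", 2),
    bucket_rows(customers, "customer_id", 2),
    "customer_id",
)
```

In this sketch the pruned scan returns the same rows as the full scan while reading fewer of them, which is the mechanism behind partitions aligned with frequently filtered attributes. The bucketed join only pays off because both tables share the bucketing key and bucket count, which matches the narrow scenario the abstract describes for bucketing.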


Citations
Journal Article

Supply chain data integration: A literature review

TL;DR: A Systematic Literature Review (SLR) of simulation methods that deal with risks in SCs, with particular emphasis on the type of data integration employed by such works, is proposed.
Journal Article

An empirical study on data warehouse systems effectiveness: the case of Jordanian banks in the business intelligence era

TL;DR: In this article, the authors develop a theoretical model specific to the data warehouse system domain that builds on the DeLone and McLean model, and test it empirically by means of structural equation modelling, applying the partial least squares approach to data collected in a survey questionnaire from 127 respondents at Jordanian banks.
Journal Article

On the use of simulation as a Big Data semantic validator for supply chain management

TL;DR: It is concluded that, while SC simulations using Big Data concepts and technologies are within the grasp of organizations, their data models still require considerable improvements, in order to produce perfect mimics of their SCs.
Journal Article

HaRD: a heterogeneity-aware replica deletion for HDFS

TL;DR: A heterogeneity-aware replica deletion scheme (HaRD) that considers the nodes' processing capabilities when deleting replicas; it therefore stores more replicas on the more powerful nodes and reduces execution time by up to 60% and 17% compared to Hadoop and WBRD, respectively.
Journal Article

A Survey on Data-driven Performance Tuning for Big Data Analytics Platforms

TL;DR: This work reviews performance tuning strategies in the big data environment, focusing on data-driven tuning techniques, proposes an initial classification based on the domain state of the art, and presents general and system-specific tuning recommendations.
References
Journal Article

Data-intensive applications, challenges, techniques and technologies: A survey on Big Data

TL;DR: This paper aims to give a close-up view of Big Data, including Big Data applications, Big Data opportunities and challenges, as well as the state-of-the-art techniques and technologies currently adopted to deal with Big Data problems.
Journal Article

Hive: a warehousing solution over a map-reduce framework

TL;DR: Hadoop is a popular open-source map-reduce implementation which is being used as an alternative to store and process extremely large data sets on commodity hardware.
Book

Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data

TL;DR: This book reveals how IBM is leveraging open source Big Data technology, infused with IBM technologies, to deliver a robust, secure, highly available, enterprise-class Big Data platform.
Proceedings Article

Hive - a petabyte scale data warehouse using Hadoop

TL;DR: Hive is presented, an open-source data warehousing solution built on top of Hadoop that supports queries expressed in a SQL-like declarative language, HiveQL, which are compiled into map-reduce jobs that are executed using Hadoop.
Book

The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling

Ralph Kimball, +1 more
TL;DR: The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition is a complete library of updated dimensional modeling techniques, the most comprehensive collection ever.