Efficient Processing of Data Warehousing Queries
in a Split Execution Environment
Kamil Bajda-Pawlikowski (1,2), Daniel J. Abadi (1,2), Avi Silberschatz (2), Erik Paulson (3)
1 Hadapt Inc., 2 Yale University, 3 University of Wisconsin-Madison
{kbajda,dna}@hadapt.com; avi@cs.yale.edu; epaulson@cs.wisc.edu
ABSTRACT
Hadapt is a start-up company currently commercializing
the Yale University research project called HadoopDB. The
company focuses on building a platform for Big Data analyt-
ics in the cloud by introducing a storage layer optimized for
structured data and by providing a framework for executing
SQL queries efficiently.
This work considers processing data warehousing queries
over very large datasets. Our goal is to maximize perfor-
mance while, at the same time, not giving up fault tolerance
and scalability. We analyze the complexity of this problem
in the split execution environment of HadoopDB. Here, in-
coming queries are examined; parts of the query are pushed
down and executed inside the higher performing database
layer; and the rest of the query is processed in a more generic
MapReduce framework.
In this paper, we discuss in detail performance-oriented
query execution strategies for data warehouse queries in split
execution environments, with particular focus on join and
aggregation operations. The efficiency of our techniques
is demonstrated by running experiments using the TPC-
H benchmark with 3TB of data. In these experiments we
compare our results with a standard commercial parallel
database and an open-source MapReduce implementation
featuring a SQL interface (Hive). We show that HadoopDB
successfully competes with other systems.
Categories and Subject Descriptors
H.2.4 [Database Management]: Systems - Query process-
ing
General Terms
Performance, Algorithms, Experimentation
Keywords
Query Execution, MapReduce, Hadoop
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SIGMOD’11, June 12–16, 2011, Athens, Greece.
Copyright 2011 ACM 978-1-4503-0661-4/11/06 ...$10.00.
1. INTRODUCTION
MapReduce [19] is emerging as a leading framework for
performing scalable parallel analytics and data mining.
Some of the reasons for the popularity of MapReduce
include the availability of a free and open source implemen-
tation (Hadoop) [2], impressive ease-of-use experience [30],
as well as Google’s, Yahoo!’s, and Facebook’s wide usage
[19, 25] and evangelization of this technology. Moreover,
MapReduce has been shown to deliver stellar performance
on extreme-scale benchmarks [17, 3]. All these factors have
resulted in the rapid adoption of MapReduce for many
different kinds of data analysis and processing [15, 18, 32,
29, 25, 11].
Historically, the main applications of the MapReduce
framework included Web indexing, text analytics, and
graph data mining.
Now, however, as MapReduce is steadily developing into
the de facto data analysis standard, it is increasingly
employed for querying structured data, an area traditionally
dominated by relational databases in data warehouse
deployments. Even though many argue that MapReduce
is not optimal for analyzing structured data [21, 30], it is
nonetheless used increasingly frequently for that purpose
because of a growing tendency to unify the data manage-
ment platform. Thus, the standard structured data analysis
can proceed side-by-side with the complex analytics that
MapReduce is well-suited for. Moreover, data warehous-
ing in this new platform enjoys the superior scalability of
MapReduce [9] at a lower price. For example, Facebook
famously ran a proof of concept comparing several paral-
lel relational database vendors before deciding to run their
2.5 petabyte clickstream data warehouse using Hadoop [27]
instead.
Consequently, in recent years a significant amount of re-
search and commercial activity has focused on integrating
MapReduce and relational database technology [31, 9, 24,
16, 34, 33, 22, 14]. There are two approaches to this prob-
lem: (1) Starting with a parallel database system and adding
some MapReduce features [24, 16, 33], and (2) Starting with
MapReduce and adding database system technology [31, 34,
9, 22, 14]. While both options are valid routes towards the
integration, we expect that the second approach will ulti-
mately prevail. This is because while there exists no widely
available open source parallel database system, MapReduce
is offered as an open source project. Furthermore, it is ac-
companied by a plethora of free tools, as well as cluster
availability and support.
HadoopDB [9] follows the second of the approaches mentioned above. The technology developed at Yale University
is commercialized by Hadapt [1]. The research project re-
vealed that many of Hadoop’s problems with performance
on structured data can be attributed to a suboptimal stor-
age layer. The default Hadoop storage layer, HDFS, is a
distributed file system. When HDFS was replaced with mul-
tiple instances of a relational database system (one instance
per node in a shared-nothing cluster), HadoopDB outper-
formed Hadoop’s default configuration by up to an order of
magnitude. The reason for the performance improvement
can be attributed to leveraging decades’ worth of research
in the database systems community. Some optimizations
developed during this period include the careful layout of
data on disk, indexing, sorting, shared I/O, buffer manage-
ment, compression, and query optimization. By combining
the job scheduler, task coordination, and parallelization
layer of Hadoop with the storage layer of the DBMS, we
were able to retain the best features of both systems. While
achieving performance on structured data analysis compara-
ble with commercial parallel database systems, we maintained
Hadoop's fault tolerance, scalability, ability to handle het-
erogeneous node performance, and query interface flexibility.
In this paper, we describe several query execution and
storage layer strategies that we developed to improve per-
formance by yet another order of magnitude in comparison
to the original research project. As a result, HadoopDB
performs up to two orders of magnitude better than stan-
dard Hadoop. Furthermore, these modifications enabled
HadoopDB to efficiently process significantly more compli-
cated SQL queries. These include queries from the TPC-H
benchmark, the most commonly used benchmark for
comparing modern parallel database systems. The techniques
we employ include integrating with a column-store database
system (in particular, one based on the MonetDB/X100
project), introducing referential partitioning to maximize the
number of single-node joins, integrating semijoins into the
Hadoop Map phase, preparing aggregated data before
performing joins, and combining joins and aggregation in a
single Reduce phase.
Some of the strategies we discuss have been previously
used or are currently available in commercial parallel
database systems. What is interesting about these strate-
gies in the context of HadoopDB, however, is the relative
importance of the different techniques in a split query
execution environment where both relational database sys-
tems and MapReduce are responsible for query processing.
Furthermore, many commercial parallel DBMS vendors
do not publish their query execution techniques in the
research community. Therefore, while not necessarily new
to implementation, some of the techniques presented in this
paper are nevertheless new to publication.
In general, there are two heuristics that guide our opti-
mizations:
1. Database systems can process data at a faster rate
than Hadoop.
2. Each MapReduce job typically involves many I/O op-
erations and network transfers. Thus, it is important
to minimize the number of MapReduce jobs in a series
into which a SQL query is translated.
Consequently, HadoopDB attempts to push as much pro-
cessing as possible into single-node database systems and
to perform as many relational query operators as possible
in each “Map” and “Reduce” task. Our focus in this pa-
per is on the processing of SQL queries by splitting their
execution across Hadoop and DBMS. HadoopDB, however,
also retains its ability to accept queries written directly in
MapReduce.
In order to measure the relative effectiveness of our dif-
ferent query execution techniques, we selectively turn them
on and off and measure the effect on the performance of
HadoopDB for the TPC-H benchmark. Our primary com-
parison points are the first version of HadoopDB (without
these techniques), and Hive, the currently dominant SQL
interface to Hadoop. For continuity of comparison, we also
benchmark against the same commercial parallel database
system used in the original HadoopDB paper. HadoopDB
shows consistently impressive performance that positions it
as a legitimate player in the rapidly emerging market of “Big
Data” analytics.
In addition to bringing high performance SQL to Hadoop,
Hadapt adjusts on the fly to changing conditions in cloud
environments. Hadapt is the only analytical database plat-
form designed from scratch for cloud deployments. This pa-
per does not discuss the cloud-based innovations of Hadapt.
Rather, the sole focus is on the recent performance-oriented
innovations developed in the Yale HadoopDB project.
2. BACKGROUND AND RELATED WORK
2.1 Hive and Hadoop
Hive [4] is an open-source data warehousing infrastructure
built on top of Hadoop [2]. Hive accepts queries expressed
in a SQL-like language called HiveQL and executes them
against data stored in the Hadoop Distributed File System
(HDFS).
A big limitation of the current implementation of Hive
is its data storage layer. Because it is typically deployed
on top of a distributed file system, Hive is unable to use
hash-partitioning on a join key for the colocation of related
tables, a typical strategy that parallel databases exploit
to minimize data movement across nodes. Moreover, Hive
workloads are very I/O heavy due to lack of native index-
ing. Furthermore, because the system catalog lacks statis-
tics on data distribution, cost-based algorithms cannot be
implemented in Hive's optimizer. We expect that Hive's
developers will resolve these shortcomings in the future.^1
The original HadoopDB research project replaced HDFS
with many single-node database systems. Besides yielding
short-term performance benefits, this design made it easier
to implement some standard parallel database techniques.
Having achieved this, we can now focus on the more ad-
vanced split query execution techniques presented in this
paper. We describe the original HadoopDB research in more
detail in the following subsection.
2.2 HadoopDB
In this section we overview the architecture and rel-
evant query execution strategies implemented in the
HadoopDB [9, 10] project.
^1 In fact, the most recent version (0.7.0) introduced some of
the missing features. Unfortunately, it was released after we
completed our experiments.

2.2.1 HadoopDB Architecture
The central idea behind HadoopDB is to create a single
system by connecting multiple independent single-node
databases deployed across a cluster (see our previous
work [9] for more details). Figure 1 presents the architec-
ture of the system. Queries are parallelized using Hadoop,
which serves as a coordination layer. To achieve high
efficiency, performance sensitive parts of query processing
are pushed into underlying database systems. HadoopDB
thus resembles a shared-nothing parallel database where
Hadoop provides runtime scheduling and job management
that ensures scalability up to thousands of nodes.
Figure 1: The HadoopDB Architecture
The main components of HadoopDB include:
1. Database Connector that allows Hadoop jobs to access
multiple database systems by executing SQL queries
via a JDBC interface.
2. Data Loader that hash-partitions and splits data into
smaller chunks and coordinates their parallel load into
the database systems.
3. Catalog which contains both metadata about the lo-
cation of database chunks stored in the cluster and
statistics about the data.
4. Query Interface which allows queries to be submitted
via a MapReduce API or SQL.
In the original HadoopDB paper [9], the prototype was
built using PostgreSQL as the underlying DBMS layer.
By design, HadoopDB may leverage any JDBC-compliant
database system. Our solution is able to transform a
single-node DBMS into a highly scalable parallel data
analytics platform that can handle very large datasets and
provide automatic fault tolerance and load balancing. In
this paper, we demonstrate our flexibility by integrating
with a new columnar database engine described in the
following section.
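To make the Database Connector idea concrete, the following is a minimal sketch (not HadoopDB's actual code) of how a worker could pull tuples from its node-local DBMS over JDBC; the connection string, credentials, and the pushed-down SQL are illustrative assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class LocalDbScan {
    public static void main(String[] args) throws Exception {
        // Each worker connects to the database instance on its own node
        // (URL and credentials are assumptions for this sketch).
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/tpch", "hadoopdb", "secret");
        Statement stmt = conn.createStatement();
        // The SQL pushed down by the planner; selection and projection
        // happen inside the DBMS rather than in Java code.
        ResultSet rs = stmt.executeQuery(
                "SELECT l_orderkey, l_extendedprice FROM lineitem " +
                "WHERE l_shipdate <= DATE '1998-09-01'");
        while (rs.next()) {
            // In HadoopDB these tuples would be handed to the Map function;
            // here we simply print them.
            System.out.println(rs.getLong(1) + "\t" + rs.getDouble(2));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}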
2.2.2 VectorWise/X100 Database
We used an early version of the VectorWise (VW) en-
gine [7], a single-node DBMS based on the MonetDB/X100
research project [13, 35]. VW provides high performance in
analytical queries due to vectorized operations on in-cache
data and efficient I/O.
The unique feature of the VW/X100 database engine is its
ability to take advantage of modern CPU capabilities such
as SIMD instructions. This allows a data processing opera-
tion such as a predicate evaluation to be applied to several
values from a column simultaneously on a single processor.
Furthermore, in contrast to the tuple-at-a-time iterators tra-
ditionally employed by database systems, X100 processes
multiple values (typically vectors of length 1024) at once.
Moreover, VW makes an effort to keep the processed vec-
tors in cache to reduce unnecessary RAM access.
In the storage layer, VectorWise is a flexible column-store
that allows for finer-grained I/O, enabling the system to
spend time reading only those attributes which are rele-
vant to a particular query. To further reduce I/O, auto-
matic lightweight compression is applied. Finally, cluster-
ing indices and the exploitation of data correlations through
sparse MinMax indices allow even more savings in disk ac-
cess.
2.2.3 HadoopDB Query Execution
The basic strategy of implementing queries in HadoopDB
involves pushing those parts of query processing that can
be performed independently into single-node database sys-
tems by issuing SQL statements. This approach is effective
for selection, projection, and partial aggregation process-
ing that Hadoop typically performs during the Map and
Combine phases. Employing a database system for these
operations generally results in higher performance because
a DBMS provides more efficient operator implementation,
better I/O handling, and clustering/indexing.
Moreover, when tables are co-partitioned (e.g., hash par-
titioned on the join attribute), join operations can also be
processed inside the database system. The benefit here is
twofold. First, joins become local operations, which eliminates
the need to send data over the network. Sec-
ond, joins are performed inside the DBMS which typically
implements these operations very efficiently.
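As an illustration (the exact SQL below is our assumption, not taken from the paper), the statements pushed into each node-local DBMS might look as follows; table and column names follow TPC-H.

public final class PushdownExamples {

    // Selection, projection, and partial aggregation run inside the DBMS;
    // Hadoop's Reduce phase only merges the per-node partial sums.
    static final String PARTIAL_AGG =
        "SELECT l_returnflag, SUM(l_extendedprice) AS partial_revenue " +
        "FROM lineitem WHERE l_shipdate <= DATE '1998-09-01' " +
        "GROUP BY l_returnflag";

    // When customer and orders are hash-partitioned on the same key,
    // the join itself is local and can also be pushed down.
    static final String LOCAL_JOIN =
        "SELECT o_orderkey, c_name FROM orders " +
        "JOIN customer ON o_custkey = c_custkey";
}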
The initial release of HadoopDB included the implemen-
tation of Hadoop’s InputFormat interface, which allowed, in
a given job, accessing either a single table or a group of co-
partitioned tables. In other words, HadoopDB’s Database
Connector supported only streams of tuples with an identical
schema. In this paper, however, we discuss more advanced
execution plans where some joins require data redistribution
before they can be computed and therefore cannot be performed
entirely within single-node database systems. To accommo-
date such plans, we extended the Database Connector to
give Hadoop access to multiple database tables within the
Map phase of a single job. After repartitioning on the join
key, related records are sent to the Reduce phase in which
the actual join is computed.
Furthermore, in order to handle even more complicated
queries that include multi-stage jobs, we enabled HadoopDB
to consume records from a combined input consisting of data
from both database tables and HDFS files. In addition, we
enhanced HadoopDB so that, at any point during processing, jobs can issue additional SQL queries via an extension
we call SideDB (a “database task done on the side”).
Apart from the SideDB extension, all query execution in
HadoopDB beyond the Map phase is carried out inside the
Hadoop framework. To achieve high performance along the
entire execution path, further optimizations are necessary.
These are described in detail in the next section.
3. SPLIT QUERY EXECUTION
In this section we discuss four techniques that optimize
the execution of data warehouse queries across Hadoop and
single-node database systems installed on every node in a
shared-nothing network. We further discuss implementation
details within HadoopDB.
3.1 Referential Partitioning
Distributed joins are expensive, especially in Hadoop, be-
cause they require one extra MR job [30, 34, 9] to repartition
data on a join key. In general, database system developers
spend a lot of time optimizing the performance of joins which
are very common and costly operations. Typically, joins
computed within a database system will involve far fewer
reads and writes to disk than joins computed across multi-
ple MapReduce jobs inside Hadoop. Hence, for performance
reasons, HadoopDB strongly prefers to compute joins com-
pletely inside the database engine deployed on each node.
To be performed completely inside the database layer in
HadoopDB, a join must be local, i.e., each node must join
data from tables stored locally without shipping any data
over the network. When data needs to be sent across a
cluster, Hadoop takes over query processing, which means
that the join is not done inside the database engines. If
two tables are hash partitioned on the join attribute (e.g.,
both employee and department tables on department id),
then a local join is possible since each single-node database
system can compute a join on its partition of data without
considering partitions stored on other nodes.
As a rule, traditional parallel database systems prefer lo-
cal joins over repartitioned joins since the former are less
expensive. This discrepancy in cost between local and repar-
titioned joins is even greater in HadoopDB due to the per-
formance difference in join implementation between DBMS
and Hadoop. For this reason, HadoopDB is willing to sac-
rifice certain performance benefits, such as quick load time,
in exchange for local joins.
In order to push as many joins as possible into single node
database systems inside HadoopDB, we perform “aggres-
sive” hash-partitioning. Typically, database tables are hash-
partitioned on an attribute selected from a given table. This
method, however, limits the degree of co-partitioning, since
tables can be related to each other via many steps of foreign-
key/primary-key references. For example, in TPC-H, the
lineitem table contains a foreign-key to the orders table via
the order key attribute, while the orders table contains a
foreign-key to the customer table via the customer key at-
tribute. If the lineitem table could be partitioned by the
customer who made the order, then any of the straightfor-
ward join combinations of the customer, orders, and lineitem
tables would be local to each node.
Yet, since the lineitem table does not contain the customer
key attribute, direct partitioning is impossible. HadoopDB
was, therefore, extended to support referential partitioning.
Although a similarly named technique was recently made
available in Oracle 11g [23], it served a different purpose
than in our project where this partitioning scheme facilitates
joins across a shared-nothing network.
Obviously, this method can be extended to an arbitrary
number of tables referenced in a cascading way. During data
load, referential partitioning involves the additional step of
joining with a parent table to retrieve its foreign key. This,
however, is a one time cost that gets amortized quickly by
superior performance on join queries. This technique bene-
fits TPC-H queries 3, 5, 7, 8, 10, and 18, all of which need
joins between the customer, orders, and lineitem tables.
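A rough sketch of this extra load-time step follows, under the assumption of a simple modulo hash function and illustrative staging-table names (the paper does not show the Data Loader implementation):

public final class ReferentialPartitioner {

    // SQL run during load (table names are illustrative): propagate the
    // customer key from the parent table down to lineitem, which does not
    // carry it directly.
    static final String TAG_WITH_CUSTKEY =
        "SELECT l.*, o.o_custkey AS part_key " +
        "FROM lineitem_staging l JOIN orders_staging o " +
        "ON l.l_orderkey = o.o_orderkey";

    // The node that stores a row is chosen by hashing the propagated key,
    // so customer, orders, and lineitem end up co-partitioned.
    static int nodeFor(long custKey, int numNodes) {
        return (int) Math.floorMod(custKey, (long) numNodes);
    }
}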
3.2 Split MR/DB Joins
For tables that are not co-partitioned, the join is generally
performed using the MapReduce framework, usually in the
Reduce phase of a job. The Map phase reads each table
partition and, for each tuple, outputs the join attribute as
the key, which causes the tables to be automatically
repartitioned between the Map and Reduce phases.
Therefore, the same Reduce task is responsible for pro-
cessing all tuples with the same join key. Natural joins and
equi-joins require no further network communication; the
Reduce tasks simply perform the join on their partition of
data.
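The following is a minimal, self-contained sketch of such a repartitioned (Reduce-side) join using the standard Hadoop API; the tab-separated tuple format and the "L|"/"R|" table tags are assumptions made for illustration (in practice the tag would be assigned based on which table the input split belongs to).

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class RepartitionedJoin {

    public static class TagMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            // Assume tab-separated tuples whose first field is the join key
            // and whose remaining fields are already tagged "L|..." or "R|...".
            String[] fields = value.toString().split("\t", 2);
            ctx.write(new Text(fields[0]), new Text(fields[1]));
        }
    }

    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text joinKey, Iterable<Text> tuples, Context ctx)
                throws IOException, InterruptedException {
            List<String> left = new ArrayList<>();
            List<String> right = new ArrayList<>();
            for (Text t : tuples) {
                String s = t.toString();
                if (s.startsWith("L|")) left.add(s.substring(2));
                else right.add(s.substring(2));
            }
            // Emit the per-key cross product, i.e., the equi-join result.
            for (String l : left)
                for (String r : right)
                    ctx.write(joinKey, new Text(l + "\t" + r));
        }
    }
}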
The above algorithm works similarly to a partitioned par-
allel join described in parallel database literature [28, 20].
In general this method requires repartitioning both tables
across nodes. In several specific cases, however, the latter
operation is unnecessary, a situation that parallel DBMS
implementations take advantage of whenever possible. Two
common join optimizations are the directed join and the
broadcast join. The former is applicable when one of the ta-
bles is already partitioned by the join key. In this case only
the other table has to be distributed using the same parti-
tioning function. The join can proceed locally on each node.
The broadcast join is used when one table is much larger
than the other. The large table should be left in its original
location while the entire small table ought to be shipped to
every node in the cluster. Each partition of the larger table
can then be joined locally with the smaller table.
Unfortunately, implementing directed and broadcast joins
in Hadoop requires computing the join in the Map phase.
This is not a trivial task^2 since reading multiple data sets
with an algorithm that might require multiple passes does
not fit well into the Map sequential scan model. Further-
more, HDFS does not promise to keep different datasets
co-partitioned between jobs. Therefore, a Map task can-
not assume that two different datasets partitioned using the
same hash function are actually stored on the same node.
For this reason, previous work on adding specialized joins
to the MapReduce framework typically focused on the rela-
tively simple broadcast join. This algorithm is implemented
in Hive, Pig, and a recent research paper [12].^3 Since none
of the abovementioned systems implement cost-based query
optimizers, a hint must be included in the query to let the
system know that a broadcast join algorithm should be used.
^2 Unless both tables are already sorted by the join key, in
which case one can use Hadoop's merge join operator.
^3 This work goes quite a bit farther than Hive and Pig,
implementing several optimizations on top of the basic broadcast
join, though each optimization maintains the single-pass
sequential scan requirement of the larger table during the Map
phase.

The implementation of the broadcast join in these systems
is as follows. Each Map worker reads the smaller table from
HDFS and stores it in an in-memory hash table. This has
the effect of replicating the small table to each local node. A
sequential scan of the larger table follows. As in a standard
simple hash-join, the in-memory hash map is probed with
each tuple of this larger table to check for a matching key
value. The reading of both tables helps avoid the difficulties
of implementing a multi-pass algorithm. Since the join is
computed in the Map phase, it is called a Map-side join.
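A simplified sketch of this Map-side join follows; the local file name and tuple layout are assumptions, and a production version would ship the small table through Hadoop's distributed cache rather than read a fixed path.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BroadcastJoinMapper extends Mapper<Object, Text, Text, Text> {
    private final Map<String, String> smallTable = new HashMap<>();

    @Override
    protected void setup(Context ctx) throws IOException {
        // Load the broadcast (small) table into an in-memory hash table.
        try (BufferedReader in = new BufferedReader(new FileReader("small_table.tsv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.split("\t", 2);
                smallTable.put(f[0], f[1]);   // join key -> rest of tuple
            }
        }
    }

    @Override
    protected void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String[] f = value.toString().split("\t", 2);
        // Probe the hash table as in a simple hash join.
        String match = smallTable.get(f[0]);
        if (match != null) {
            ctx.write(new Text(f[0]), new Text(f[1] + "\t" + match));
        }
    }
}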
Split execution environments enable the implementation
of a variety of joins in the Map phase and reveal some in-
teresting new tradeoffs. First, take the case of the broad-
cast join. There are two ways that the latter can be imple-
mented in a split execution framework. The first way is to
use the standard Map-side join discussed above. The sec-
ond way, possible only in HadoopDB, involves writing the
smaller table to a temporary table in the database system
on each node. Then the join is computed completely inside
the DBMS and the resulting tuples are read by the Map
tasks for further processing.
The significance of the tradeoff between these two ap-
proaches depends on the DBMS software used. A partic-
ularly important factor is the cost of writing to a temporary
table and sharing this table across multiple partitions on
the same node. In general, as long as this cost is not too
high, computing the join inside the DBMS will yield better
performance than computing it in the Java code of the Map
task. This is explored further in Section 4.
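The DBMS-side variant can be sketched with plain JDBC as below; the connection details, temporary-table schema, and join query are illustrative assumptions rather than HadoopDB's actual code.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class DbSideBroadcastJoin {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/tpch", "hadoopdb", "secret");
        // Stage the broadcast rows in a temporary table on this node.
        try (Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE TEMP TABLE small_bcast (k BIGINT, v VARCHAR)");
        }
        try (PreparedStatement ins =
                 conn.prepareStatement("INSERT INTO small_bcast VALUES (?, ?)")) {
            ins.setLong(1, 42L);              // rows would come from HDFS
            ins.setString(2, "example row");
            ins.addBatch();
            ins.executeBatch();
        }
        // The join runs entirely inside the DBMS; the Map task only
        // consumes the result stream.
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT l.l_orderkey, s.v FROM lineitem l " +
                 "JOIN small_bcast s ON l.l_partkey = s.k")) {
            while (rs.next()) {
                System.out.println(rs.getLong(1) + "\t" + rs.getString(2));
            }
        }
        conn.close();
    }
}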
Another type of join enabled by split execution environ-
ments is the directed join. Here, HadoopDB runs a stan-
dard MapReduce job to repartition the second table. First
we look up in the HadoopDB catalog how the first table
was distributed and use this function to repartition the sec-
ond table. Any selection operations on the second table are
performed in the Map phase of this job. The OutputFor-
mat feature of Hadoop is then used to circumvent HDFS
and write the output of this repartitioning directly into the
database systems located on each node. HadoopDB provides
native support for keeping data co-partitioned between jobs.
Therefore, once both tables are partitioned on the same at-
tribute inside the HadoopDB storage layer, the next MapRe-
duce job can compute the join by pushing it entirely into the
database systems. The resulting tuples get fed to the Map
phase as a single stream.
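For the repartitioning step of the directed join, the key ingredient is a Partitioner that reuses the first table's hash function as recorded in the catalog; a minimal sketch is shown below (assuming the simple modulo hashing of the loader sketch above), with the database-bound OutputFormat omitted.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class DirectedJoinPartitioner extends Partitioner<LongWritable, Text> {
    @Override
    public int getPartition(LongWritable joinKey, Text tuple, int numPartitions) {
        // Must match the function used when the first table was loaded,
        // so matching keys land on the node that already stores them.
        return (int) Math.floorMod(joinKey.get(), (long) numPartitions);
    }
}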
In the experimental results presented later in this paper,
we will further explore the performance of split MR/DB
joins. This technique proved to be particularly beneficial in
TPC-H queries 11, 16, and 17.
3.2.1 Split MR/DB Semijoin
A semijoin is one more type of join that can be split into
two MapReduce jobs, the second of which computes the join
in the Map phase. Here, not only does the first MapReduce
job perform selection operations on the table, but it also
projects the join attribute. The resulting column is then
replicated as in a Map-side join. If the projected column
is very small (for example, the key from a dictionary ta-
ble or a table after applying a very selective predicate), the
Map-side join is replaced with a selection predicate using
the SQL clause ’foreignKey IN (listOfValues)’ and pushed
into the DBMS. This allows the join to be performed inside
the database system without first loading the data into a
temporary table inside the DBMS.
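A small sketch of this rewrite, with hypothetical key values and column names:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SemijoinRewrite {
    public static void main(String[] args) {
        // Keys produced by the first phase (or fetched via a SideDB query).
        List<Long> keys = Arrays.asList(12L, 57L, 3048L);
        String inList = keys.stream().map(String::valueOf)
                            .collect(Collectors.joining(", "));
        // The IN-list predicate is pushed into each node-local DBMS.
        String pushedDown =
            "SELECT * FROM lineitem WHERE l_partkey IN (" + inList + ")";
        System.out.println(pushedDown);
        // -> SELECT * FROM lineitem WHERE l_partkey IN (12, 57, 3048)
    }
}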
Furthermore, in some cases, HadoopDB’s SideDB exten-
sion can be used to entirely eliminate the first MapReduce
job for a split semijoin. At job setup, a SideDB query
extracts the projected join key column instead of running a
separate MapReduce job.
The SideDB extension is also helpful for looking up and
extracting attributes from small tables such as dictionary
tables. Such a situation typically occurs at the very end
of the query plan, right before outputting the results. For
example, integer identifiers that were carried through the
query execution are replaced by actual text values (e.g.,
names of the nations replacing the nation identifier in TPC-
H). A similar concept in column-store databases is known
as late materialization [8, 26].
The query rewrite version of the map-side split semijoin
technique is commonly used in HadoopDB’s implementa-
tion of TPC-H to satisfy the benchmark rules forbidding the
replication of tables. All queries that include joins with re-
gion and nation tables are implemented using the selection-
predicate-rewriting and SideDB optimizations.
3.3 Post-join Aggregation
In HadoopDB, since aggregation operations can be exe-
cuted in database engines, there is usually no need for a
MapReduce Combiner.
Still, there exists no standard way of performing post-
Reduce aggregation. While Reduce is meant for aggrega-
tion by design, it can only be applied if the repartitioning
between the Map and Reduce phases is performed on the
grouping attribute(s) specified in the query. If, however, the
partitioning is done on a join key (in order to join two differ-
ent tables), then another partitioning is needed to compute
the aggregation, since, in general, the grouping attribute is
different from the join key. The new partitioning therefore
requires another MapReduce job and all its associated over-
head.
In such situations, hash-based partial aggregation is done
at the end of each Reduce task. The grouping attribute ex-
tracted from each result of the Reduce task is used to probe
a hash table in order to update the appropriate running ag-
gregation. This procedure can save significant I/O, since
the output of Reduce tasks is written redundantly to HDFS
whereas the output of Map tasks is written only locally.
Hence, by outputting partially aggregated data instead of
raw values, we reduce the amount of data to be written to
HDFS. TPC-H queries that benefit from this technique in-
clude 5, 7, 8, and 9.
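A condensed sketch of this pattern in a Hadoop Reducer is shown below; for brevity the join itself is elided, and each incoming value is assumed to already carry the grouping attribute and the measure being summed.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class JoinThenAggregateReducer extends Reducer<Text, Text, Text, DoubleWritable> {
    private final Map<String, Double> partialSums = new HashMap<>();

    @Override
    protected void reduce(Text joinKey, Iterable<Text> tuples, Context ctx) {
        for (Text t : tuples) {
            // Assume each value carries "groupKey \t revenue" after the join.
            String[] f = t.toString().split("\t");
            partialSums.merge(f[0], Double.parseDouble(f[1]), Double::sum);
        }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        // Emit partially aggregated rows instead of raw joined tuples,
        // reducing what must be written (redundantly) to HDFS.
        for (Map.Entry<String, Double> e : partialSums.entrySet()) {
            ctx.write(new Text(e.getKey()), new DoubleWritable(e.getValue()));
        }
    }
}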
A similar technique is applied to TOP N selections, where
the list of the top N entries is maintained in an in-memory
tree map throughout the Reduce phase and outputted at
the end. In-memory data structures are also used for com-
bining an ORDER BY clause with another operator inside
the same Reduce task, again saving an extra MapReduce
job. Examples where this technique is beneficial are TPC-H
queries 2, 3, 10, 13, and 18.
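A similar sketch for the TOP N case, assuming the aggregate value serves as the sort key (ties would need extra handling in practice):

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TopNReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    private static final int N = 10;                 // illustrative limit
    private final TreeMap<Double, String> topN = new TreeMap<>();

    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context ctx) {
        double total = 0;
        for (DoubleWritable v : values) total += v.get();
        topN.put(total, key.toString());             // sorted by the aggregate
        if (topN.size() > N) topN.pollFirstEntry();  // drop the current minimum
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        // Output the retained entries once, largest aggregate first.
        for (Map.Entry<Double, String> e : topN.descendingMap().entrySet()) {
            ctx.write(new Text(e.getValue()), new DoubleWritable(e.getKey()));
        }
    }
}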
3.4 Pre-join Aggregation
Whereas in most database systems aggregations are typ-
ically performed after a join, in HadoopDB they sometimes
get transformed into partial aggregation operators and com-
puted before a join. This happens when the join cannot be

References
J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters.
G. Malewicz et al. Pregel: a system for large-scale graph processing.
C.-T. Chu et al. Map-Reduce for machine learning on multicore.
A. Pavlo et al. A comparison of approaches to large-scale data analysis.