Efficient Processing of Data Warehousing Queries
in a Split Execution Environment
Kamil Bajda-Pawlikowski (1,2), Daniel J. Abadi (1,2), Avi Silberschatz (2), Erik Paulson (3)
1 Hadapt Inc., 2 Yale University, 3 University of Wisconsin-Madison
{kbajda,dna}@hadapt.com; avi@cs.yale.edu; epaulson@cs.wisc.edu
ABSTRACT
Hadapt is a start-up company currently commercializing
the Yale University research project called HadoopDB. The
company focuses on building a platform for Big Data analyt-
ics in the cloud by introducing a storage layer optimized for
structured data and by providing a framework for executing
SQL queries efficiently.
This work considers processing data warehousing queries
over very large datasets. Our goal is to maximize perfor-
mance while, at the same time, not giving up fault tolerance
and scalability. We analyze the complexity of this problem
in the split execution environment of HadoopDB. Here, in-
coming queries are examined; parts of the query are pushed
down and executed inside the higher performing database
layer; and the rest of the query is processed in a more generic
MapReduce framework.
In this paper, we discuss in detail performance-oriented
query execution strategies for data warehouse queries in split
execution environments, with particular focus on join and
aggregation operations. The efficiency of our techniques
is demonstrated by running experiments using the TPC-
H benchmark with 3TB of data. In these experiments we
compare our results with a standard commercial parallel
database and an open-source MapReduce implementation
featuring a SQL interface (Hive). We show that HadoopDB
successfully competes with other systems.
Categories and Subject Descriptors
H.2.4 [Database Management]: Systems - Query process-
ing
General Terms
Performance, Algorithms, Experimentation
Keywords
Query Execution, MapReduce, Hadoop
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SIGMOD’11, June 12–16, 2011, Athens, Greece.
Copyright 2011 ACM 978-1-4503-0661-4/11/06 ...$10.00.
1. INTRODUCTION
MapReduce [19] is emerging as a leading framework for
performing scalable parallel analytics and data mining.
Some of the reasons for the popularity of MapReduce
include the availability of a free and open source implemen-
tation (Hadoop) [2], impressive ease-of-use experience [30],
as well as Google’s, Yahoo!’s, and Facebook’s wide usage
[19, 25] and evangelization of this technology. Moreover,
MapReduce has been shown to deliver stellar performance
on extreme-scale benchmarks [17, 3]. All these factors have
resulted in the rapid adoption of MapReduce for many
different kinds of data analysis and processing [15, 18, 32,
29, 25, 11].
Historically, the main applications of the MapReduce
framework included Web indexing, text analytics, and
graph data mining.
Now, however, as MapReduce is steadily developing into
the de facto data analysis standard, it is increasingly
employed for querying structured data, an area traditionally
dominated by relational databases in data warehouse
deployments. Even though many argue that MapReduce
is not optimal for analyzing structured data [21, 30], it is
nonetheless used increasingly frequently for that purpose
because of a growing tendency to unify the data manage-
ment platform. Thus, the standard structured data analysis
can proceed side-by-side with the complex analytics that
MapReduce is well-suited for. Moreover, data warehous-
ing in this new platform enjoys the superior scalability of
MapReduce [9] at a lower price. For example, Facebook
famously ran a proof of concept comparing several paral-
lel relational database vendors before deciding to run their
2.5 petabyte clickstream data warehouse using Hadoop [27]
instead.
Consequently, in recent years a significant amount of re-
search and commercial activity has focused on integrating
MapReduce and relational database technology [31, 9, 24,
16, 34, 33, 22, 14]. There are two approaches to this prob-
lem: (1) Starting with a parallel database system and adding
some MapReduce features [24, 16, 33], and (2) Starting with
MapReduce and adding database system technology [31, 34,
9, 22, 14]. While both options are valid routes towards the
integration, we expect that the second approach will ulti-
mately prevail. This is because while there exists no widely
available open source parallel database system, MapReduce
is offered as an open source project. Furthermore, it is ac-
companied by a plethora of free tools, as well as cluster
availability and support.
HadoopDB [9] follows the second of the approaches mentioned above. The technology developed at Yale University
is commercialized by Hadapt [1]. The research project re-
vealed that many of Hadoop’s problems with performance
on structured data can be attributed to a suboptimal stor-
age layer. The default Hadoop storage layer, HDFS, is a
distributed file system. When HDFS was replaced with mul-
tiple instances of a relational database system (one instance
per node in a shared-nothing cluster), HadoopDB outper-
formed Hadoop’s default configuration by up to an order of
magnitude. The reason for the performance improvement
can be attributed to leveraging decades’ worth of research
in the database systems community. Some optimizations
developed during this period include the careful layout of
data on disk, indexing, sorting, shared I/O, buffer manage-
ment, compression, and query optimization. By combining
the job scheduler, task coordination, and parallelization
layer of Hadoop with the storage layer of the DBMS, we
were able to retain the best features of both systems. While
achieving performance on structured data analysis compara-
ble with commercial parallel database systems, we maintained
Hadoop's fault tolerance, scalability, ability to handle het-
erogeneous node performance, and query interface flexibility.
In this paper, we describe several query execution and
storage layer strategies that we developed to improve per-
formance by yet another order of magnitude in comparison
to the original research project. As a result, HadoopDB
performs up to two orders of magnitude better than stan-
dard Hadoop. Furthermore, these modifications enabled
HadoopDB to efficiently process significantly more compli-
cated SQL queries. These include queries from the TPC-H
benchmark, the most commonly used benchmark for
comparing modern parallel database systems. The techniques
we employ include integrating with a column-store database
system (in particular, one based on the MonetDB/X100
project), introducing referential partitioning to maximize the
number of single-node joins, integrating semijoins into the
Hadoop Map phase, preparing aggregated data before
performing joins, and combining joins and aggregation in a
single Reduce phase.
Some of the strategies we discuss have been previously
used or are currently available in commercial parallel
database systems. What is interesting about these strate-
gies in the context of HadoopDB, however, is the relative
importance of the different techniques in a split query
execution environment where both relational database sys-
tems and MapReduce are responsible for query processing.
Furthermore, many commercial parallel DBMS vendors
do not publish their query execution techniques in the
research community. Therefore, while not necessarily new
to implementation, some of the techniques presented in this
paper are nevertheless new to publication.
In general, there are two heuristics that guide our opti-
mizations:
1. Database systems can process data at a faster rate
than Hadoop.
2. Each MapReduce job typically involves many I/O op-
erations and network transfers. Thus, it is important
to minimize the number of MapReduce jobs in a series
into which a SQL query is translated.
Consequently, HadoopDB attempts to push as much pro-
cessing as possible into single-node database systems and
to perform as many relational query operators as possible
in each “Map” and “Reduce” task. Our focus in this pa-
per is on the processing of SQL queries by splitting their
execution across Hadoop and DBMS. HadoopDB, however,
also retains its ability to accept queries written directly in
MapReduce.
In order to measure the relative effectiveness of our dif-
ferent query execution techniques, we selectively turn them
on and off and measure the effect on the performance of
HadoopDB for the TPC-H benchmark. Our primary com-
parison points are the first version of HadoopDB (without
these techniques), and Hive, the currently dominant SQL
interface to Hadoop. For continuity of comparison, we also
benchmark against the same commercial parallel database
system used in the original HadoopDB paper. HadoopDB
shows consistently impressive performance that positions it
as a legitimate player in the rapidly emerging market of “Big
Data” analytics.
In addition to bringing high performance SQL to Hadoop,
Hadapt adjusts on the fly to changing conditions in cloud
environments. Hadapt is the only analytical database plat-
form designed from scratch for cloud deployments. This pa-
per does not discuss the cloud-based innovations of Hadapt.
Rather, the sole focus is on the recent performance-oriented
innovations developed in the Yale HadoopDB project.
2. BACKGROUND AND RELATED WORK
2.1 Hive and Hadoop
Hive [4] is an open-source data warehousing infrastructure
built on top of Hadoop [2]. Hive accepts queries expressed
in a SQL-like language called HiveQL and executes them
against data stored in the Hadoop Distributed File System
(HDFS).
A big limitation of the current implementation of Hive
is its data storage layer. Because it is typically deployed
on top of a distributed file system, Hive is unable to use
hash-partitioning on a join key for the colocation of related
tables, a typical strategy that parallel databases exploit
to minimize data movement across nodes. Moreover, Hive
workloads are very I/O heavy due to lack of native index-
ing. Furthermore, because the system catalog lacks statis-
tics on data distribution, cost-based algorithms cannot be
implemented in Hive's optimizer. We expect that Hive's
developers will resolve these shortcomings in the future.^1
The original HadoopDB research project replaced HDFS
with many single-node database systems. Besides yielding
short-term performance benefits, this design made it easier
to implement some standard parallel database techniques.
Having achieved this, we can now focus on the more ad-
vanced split query execution techniques presented in this
paper. We describe the original HadoopDB research in more
detail in the following subsection.
2.2 HadoopDB
In this section we overview the architecture and rel-
evant query execution strategies implemented in the
HadoopDB [9, 10] project.
^1 In fact, the most recent version (0.7.0) introduced some of
the missing features. Unfortunately, it was released after we
completed our experiments.

2.2.1 HadoopDB Architecture
The central idea behind HadoopDB is to create a single
system by connecting multiple independent single-node
databases deployed across a cluster (see our previous
work [9] for more details). Figure 1 presents the architec-
ture of the system. Queries are parallelized using Hadoop,
which serves as a coordination layer. To achieve high
efficiency, performance sensitive parts of query processing
are pushed into underlying database systems. HadoopDB
thus resembles a shared-nothing parallel database where
Hadoop provides runtime scheduling and job management
that ensures scalability up to thousands of nodes.
Figure 1: The HadoopDB Architecture
The main components of HadoopDB include:
1. Database Connector that allows Hadoop jobs to access
multiple database systems by executing SQL queries
via a JDBC interface.
2. Data Loader that hash-partitions and splits data into
smaller chunks and coordinates their parallel load into
the database systems.
3. Catalog which contains both metadata about the lo-
cation of database chunks stored in the cluster and
statistics about the data.
4. Query Interface which allows queries to be submitted
via a MapReduce API or SQL.
In the original HadoopDB paper [9], the prototype was
built using PostgreSQL as the underlying DBMS layer.
By design, HadoopDB may leverage any JDBC-compliant
database system. Our solution is able to transform a
single-node DBMS into a highly scalable parallel data
analytics platform that can handle very large datasets and
provide automatic fault tolerance and load balancing. In
this paper, we demonstrate our flexibility by integrating
with a new columnar database engine described in the
following section.
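To make the Database Connector idea concrete, the following is a minimal sketch (not HadoopDB's actual code) of how a worker could pull tuples from its node-local DBMS over JDBC; the connection string, credentials, and the pushed-down SQL are illustrative assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class LocalDbScan {
    public static void main(String[] args) throws Exception {
        // Each worker connects to the database instance on its own node
        // (URL and credentials are assumptions for this sketch).
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/tpch", "hadoopdb", "secret");
        Statement stmt = conn.createStatement();
        // The SQL pushed down by the planner; selection and projection
        // happen inside the DBMS rather than in Java code.
        ResultSet rs = stmt.executeQuery(
                "SELECT l_orderkey, l_extendedprice FROM lineitem " +
                "WHERE l_shipdate <= DATE '1998-09-01'");
        while (rs.next()) {
            // In HadoopDB these tuples would be handed to the Map function;
            // here we simply print them.
            System.out.println(rs.getLong(1) + "\t" + rs.getDouble(2));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}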
2.2.2 VectorWise/X100 Database
We used an early version of the VectorWise (VW) en-
gine [7], a single-node DBMS based on the MonetDB/X100
research project [13, 35]. VW provides high performance in
analytical queries due to vectorized operations on in-cache
data and efficient I/O.
The unique feature of the VW/X100 database engine is its
ability to take advantage of modern CPU capabilities such
as SIMD instructions. This allows a data processing opera-
tion such as a predicate evaluation to be applied to several
values from a column simultaneously on a single processor.
Furthermore, in contrast to the tuple-at-a-time iterators tra-
ditionally employed by database systems, X100 processes
multiple values (typically vectors of length 1024) at once.
Moreover, VW makes an effort to keep the processed vec-
tors in cache to reduce unnecessary RAM access.
In the storage layer, VectorWise is a flexible column-store
that allows for finer-grained I/O, enabling the system to
spend time reading only those attributes which are rele-
vant to a particular query. To further reduce I/O, auto-
matic lightweight compression is applied. Finally, cluster-
ing indices and the exploitation of data correlations through
sparse MinMax indices allow even more savings in disk ac-
cess.
2.2.3 HadoopDB Query Execution
The basic strategy of implementing queries in HadoopDB
involves pushing those parts of query processing that can
be performed independently into single-node database sys-
tems by issuing SQL statements. This approach is effective
for selection, projection, and partial aggregation process-
ing that Hadoop typically performs during the Map and
Combine phases. Employing a database system for these
operations generally results in higher performance because
a DBMS provides more efficient operator implementation,
better I/O handling, and clustering/indexing.
Moreover, when tables are co-partitioned (e.g., hash par-
titioned on the join attribute), join operations can also be
processed inside the database system. The benefit here is
twofold. First, joins become local operations, which eliminates
the need to send data over the network. Sec-
ond, joins are performed inside the DBMS which typically
implements these operations very efficiently.
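As an illustration (the exact SQL below is our assumption, not taken from the paper), the statements pushed into each node-local DBMS might look as follows; table and column names follow TPC-H.

public final class PushdownExamples {

    // Selection, projection, and partial aggregation run inside the DBMS;
    // Hadoop's Reduce phase only merges the per-node partial sums.
    static final String PARTIAL_AGG =
        "SELECT l_returnflag, SUM(l_extendedprice) AS partial_revenue " +
        "FROM lineitem WHERE l_shipdate <= DATE '1998-09-01' " +
        "GROUP BY l_returnflag";

    // When customer and orders are hash-partitioned on the same key,
    // the join itself is local and can also be pushed down.
    static final String LOCAL_JOIN =
        "SELECT o_orderkey, c_name FROM orders " +
        "JOIN customer ON o_custkey = c_custkey";
}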
The initial release of HadoopDB included the implemen-
tation of Hadoop’s InputFormat interface, which allowed, in
a given job, accessing either a single table or a group of co-
partitioned tables. In other words, HadoopDB’s Database
Connector supported only streams of tuples with an identical
schema. In this paper, however, we discuss more advanced
execution plans where some joins require data redistribution
before they can be computed and therefore cannot be performed
entirely within single-node database systems. To accommo-
date such plans, we extended the Database Connector to
give Hadoop access to multiple database tables within the
Map phase of a single job. After repartitioning on the join
key, related records are sent to the Reduce phase in which
the actual join is computed.
Furthermore, in order to handle even more complicated
queries that include multi-stage jobs, we enabled HadoopDB
to consume records from a combined input consisting of data
from both database tables and HDFS files. In addition, we
enhanced HadoopDB so that, at any point during processing, jobs can issue additional SQL queries via an extension
we call SideDB (a “database task done on the side”).
Apart from the SideDB extension, all query execution in
HadoopDB beyond the Map phase is carried out inside the
Hadoop framework. To achieve high performance along the
entire execution path, further optimizations are necessary.
These are described in detail in the next section.
3. SPLIT QUERY EXECUTION
In this section we discuss four techniques that optimize
the execution of data warehouse queries across Hadoop and
single-node database systems installed on every node in a
shared-nothing network. We further discuss implementation
details within HadoopDB.
3.1 Referential Partitioning
Distributed joins are expensive, especially in Hadoop, be-
cause they require one extra MR job [30, 34, 9] to repartition
data on a join key. In general, database system developers
spend a lot of time optimizing the performance of joins which
are very common and costly operations. Typically, joins
computed within a database system will involve far fewer
reads and writes to disk than joins computed across multi-
ple MapReduce jobs inside Hadoop. Hence, for performance
reasons, HadoopDB strongly prefers to compute joins com-
pletely inside the database engine deployed on each node.
To be performed completely inside the database layer in
HadoopDB, a join must be local, i.e., each node must join
data from tables stored locally without shipping any data
over the network. When data needs to be sent across a
cluster, Hadoop takes over query processing, which means
that the join is not done inside the database engines. If
two tables are hash partitioned on the join attribute (e.g.,
both employee and department tables on department id),
then a local join is possible since each single-node database
system can compute a join on its partition of data without
considering partitions stored on other nodes.
As a rule, traditional parallel database systems prefer lo-
cal joins over repartitioned joins since the former are less
expensive. This discrepancy in cost between local and repar-
titioned joins is even greater in HadoopDB due to the per-
formance difference in join implementation between DBMS
and Hadoop. For this reason, HadoopDB is willing to sac-
rifice certain performance benefits, such as quick load time,
in exchange for local joins.
In order to push as many joins as possible into single node
database systems inside HadoopDB, we perform “aggres-
sive” hash-partitioning. Typically, database tables are hash-
partitioned on an attribute selected from a given table. This
method, however, limits the degree of co-partitioning, since
tables can be related to each other via many steps of foreign-
key/primary-key references. For example, in TPC-H, the
lineitem table contains a foreign-key to the orders table via
the order key attribute, while the orders table contains a
foreign-key to the customer table via the customer key at-
tribute. If the lineitem table could be partitioned by the
customer who made the order, then any of the straightfor-
ward join combinations of the customer, orders, and lineitem
tables would be local to each node.
Yet, since the lineitem table does not contain the customer
key attribute, direct partitioning is impossible. HadoopDB
was, therefore, extended to support referential partitioning.
Although a similarly named technique was recently made
available in Oracle 11g [23], it served a different purpose
than in our project where this partitioning scheme facilitates
joins across a shared-nothing network.
Obviously, this method can be extended to an arbitrary
number of tables referenced in a cascading way. During data
load, referential partitioning involves the additional step of
joining with a parent table to retrieve its foreign key. This,
however, is a one time cost that gets amortized quickly by
superior performance on join queries. This technique bene-
fits TPC-H queries 3, 5, 7, 8, 10, and 18, all of which need
joins between the customer, orders, and lineitem tables.
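A rough sketch of this extra load-time step follows, under the assumption of a simple modulo hash function and illustrative staging-table names (the paper does not show the Data Loader implementation):

public final class ReferentialPartitioner {

    // SQL run during load (table names are illustrative): propagate the
    // customer key from the parent table down to lineitem, which does not
    // carry it directly.
    static final String TAG_WITH_CUSTKEY =
        "SELECT l.*, o.o_custkey AS part_key " +
        "FROM lineitem_staging l JOIN orders_staging o " +
        "ON l.l_orderkey = o.o_orderkey";

    // The node that stores a row is chosen by hashing the propagated key,
    // so customer, orders, and lineitem end up co-partitioned.
    static int nodeFor(long custKey, int numNodes) {
        return (int) Math.floorMod(custKey, (long) numNodes);
    }
}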
3.2 Split MR/DB Joins
For tables that are not co-partitioned, the join is generally
performed using the MapReduce framework, usually in the
Reduce phase of a job. The Map phase reads each table
partition and, for each tuple, outputs the join attribute as
the key, which causes the tables to be automatically
repartitioned between the Map and Reduce phases.
Therefore, the same Reduce task is responsible for pro-
cessing all tuples with the same join key. Natural joins and
equi-joins require no further network communication; the
Reduce tasks simply perform the join on their partition of
data.
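The following is a minimal, self-contained sketch of such a repartitioned (Reduce-side) join using the standard Hadoop API; the tab-separated tuple format and the "L|"/"R|" table tags are assumptions made for illustration (in practice the tag would be assigned based on which table the input split belongs to).

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class RepartitionedJoin {

    public static class TagMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            // Assume tab-separated tuples whose first field is the join key
            // and whose remaining fields are already tagged "L|..." or "R|...".
            String[] fields = value.toString().split("\t", 2);
            ctx.write(new Text(fields[0]), new Text(fields[1]));
        }
    }

    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text joinKey, Iterable<Text> tuples, Context ctx)
                throws IOException, InterruptedException {
            List<String> left = new ArrayList<>();
            List<String> right = new ArrayList<>();
            for (Text t : tuples) {
                String s = t.toString();
                if (s.startsWith("L|")) left.add(s.substring(2));
                else right.add(s.substring(2));
            }
            // Emit the per-key cross product, i.e., the equi-join result.
            for (String l : left)
                for (String r : right)
                    ctx.write(joinKey, new Text(l + "\t" + r));
        }
    }
}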
The above algorithm works similarly to a partitioned par-
allel join described in parallel database literature [28, 20].
In general this method requires repartitioning both tables
across nodes. In several specific cases, however, the latter
operation is unnecessary, a situation that parallel DBMS
implementations take advantage of whenever possible. Two
common join optimizations are the directed join and the
broadcast join. The former is applicable when one of the ta-
bles is already partitioned by the join key. In this case only
the other table has to be distributed using the same parti-
tioning function. The join can proceed locally on each node.
The broadcast join is used when one table is much larger
than the other. The large table should be left in its original
location while the entire small table ought to be shipped to
every node in the cluster. Each partition of the larger table
can then be joined locally with the smaller table.
Unfortunately, implementing directed and broadcast joins
in Hadoop requires computing the join in the Map phase.
This is not a trivial task^2 since reading multiple data sets
with an algorithm that might require multiple passes does
not fit well into the Map sequential scan model. Further-
more, HDFS does not promise to keep different datasets
co-partitioned between jobs. Therefore, a Map task can-
not assume that two different datasets partitioned using the
same hash function are actually stored on the same node.
For this reason, previous work on adding specialized joins
to the MapReduce framework typically focused on the rela-
tively simple broadcast join. This algorithm is implemented
in Hive, Pig, and a recent research paper [12].^3 Since none
of the abovementioned systems implement cost-based query
optimizers, a hint must be included in the query to let the
system know that a broadcast join algorithm should be used.
^2 Unless both tables are already sorted by the join key, in
which case one can use Hadoop's merge join operator.
^3 This work goes quite a bit farther than Hive and Pig,
implementing several optimizations on top of the basic broadcast
join, though each optimization maintains the single-pass
sequential scan requirement of the larger table during the Map
phase.

The implementation of the broadcast join in these systems
is as follows. Each Map worker reads the smaller table from
HDFS and stores it in an in-memory hash table. This has
the effect of replicating the small table to each local node. A
sequential scan of the larger table follows. As in a standard
simple hash-join, the in-memory hash map is probed with
each tuple of this larger table to check for a matching key
value. The reading of both tables helps avoid the difficulties
of implementing a multi-pass algorithm. Since the join is
computed in the Map phase, it is called a Map-side join.
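A simplified sketch of this Map-side join follows; the local file name and tuple layout are assumptions, and a production version would ship the small table through Hadoop's distributed cache rather than read a fixed path.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BroadcastJoinMapper extends Mapper<Object, Text, Text, Text> {
    private final Map<String, String> smallTable = new HashMap<>();

    @Override
    protected void setup(Context ctx) throws IOException {
        // Load the broadcast (small) table into an in-memory hash table.
        try (BufferedReader in = new BufferedReader(new FileReader("small_table.tsv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.split("\t", 2);
                smallTable.put(f[0], f[1]);   // join key -> rest of tuple
            }
        }
    }

    @Override
    protected void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String[] f = value.toString().split("\t", 2);
        // Probe the hash table as in a simple hash join.
        String match = smallTable.get(f[0]);
        if (match != null) {
            ctx.write(new Text(f[0]), new Text(f[1] + "\t" + match));
        }
    }
}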
Split execution environments enable the implementation
of a variety of joins in the Map phase and reveal some in-
teresting new tradeoffs. First, take the case of the broad-
cast join. There are two ways that the latter can be imple-
mented in a split execution framework. The first way is to
use the standard Map-side join discussed above. The sec-
ond way, possible only in HadoopDB, involves writing the
smaller table to a temporary table in the database system
on each node. Then the join is computed completely inside
the DBMS and the resulting tuples are read by the Map
tasks for further processing.
The significance of the tradeoff between these two ap-
proaches depends on the DBMS software used. A partic-
ularly important factor is the cost of writing to a temporary
table and sharing this table across multiple partitions on
the same node. In general, as long as this cost is not too
high, computing the join inside the DBMS will yield better
performance than computing it in the Java code of the Map
task. This is explored further in Section 4.
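The DBMS-side variant can be sketched with plain JDBC as below; the connection details, temporary-table schema, and join query are illustrative assumptions rather than HadoopDB's actual code.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class DbSideBroadcastJoin {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/tpch", "hadoopdb", "secret");
        // Stage the broadcast rows in a temporary table on this node.
        try (Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE TEMP TABLE small_bcast (k BIGINT, v VARCHAR)");
        }
        try (PreparedStatement ins =
                 conn.prepareStatement("INSERT INTO small_bcast VALUES (?, ?)")) {
            ins.setLong(1, 42L);              // rows would come from HDFS
            ins.setString(2, "example row");
            ins.addBatch();
            ins.executeBatch();
        }
        // The join runs entirely inside the DBMS; the Map task only
        // consumes the result stream.
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT l.l_orderkey, s.v FROM lineitem l " +
                 "JOIN small_bcast s ON l.l_partkey = s.k")) {
            while (rs.next()) {
                System.out.println(rs.getLong(1) + "\t" + rs.getString(2));
            }
        }
        conn.close();
    }
}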
Another type of join enabled by split execution environ-
ments is the directed join. Here, HadoopDB runs a stan-
dard MapReduce job to repartition the second table. First
we look up in the HadoopDB catalog how the first table
was distributed and use this function to repartition the sec-
ond table. Any selection operations on the second table are
performed in the Map phase of this job. The OutputFor-
mat feature of Hadoop is then used to circumvent HDFS
and write the output of this repartitioning directly into the
database systems located on each node. HadoopDB provides
native support for keeping data co-partitioned between jobs.
Therefore, once both tables are partitioned on the same at-
tribute inside the HadoopDB storage layer, the next MapRe-
duce job can compute the join by pushing it entirely into the
database systems. The resulting tuples get fed to the Map
phase as a single stream.
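For the repartitioning step of the directed join, the key ingredient is a Partitioner that reuses the first table's hash function as recorded in the catalog; a minimal sketch is shown below (assuming the simple modulo hashing of the loader sketch above), with the database-bound OutputFormat omitted.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class DirectedJoinPartitioner extends Partitioner<LongWritable, Text> {
    @Override
    public int getPartition(LongWritable joinKey, Text tuple, int numPartitions) {
        // Must match the function used when the first table was loaded,
        // so matching keys land on the node that already stores them.
        return (int) Math.floorMod(joinKey.get(), (long) numPartitions);
    }
}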
In the experimental results presented later in this paper,
we will further explore the performance of split MR/DB
joins. This technique proved to be particularly beneficial in
TPC-H queries 11, 16, and 17.
3.2.1 Split MR/DB Semijoin
A semijoin is one more type of join that can be split into
two MapReduce jobs, the second of which computes the join
in the Map phase. Here, not only does the first MapReduce
job perform selection operations on the table, but it also
projects the join attribute. The resulting column is then
replicated as in a Map-side join. If the projected column
is very small (for example, the key from a dictionary ta-
ble or a table after applying a very selective predicate), the
Map-side join is replaced with a selection predicate using
the SQL clause ’foreignKey IN (listOfValues)’ and pushed
into the DBMS. This allows the join to be performed inside
the database system without first loading the data into a
temporary table inside the DBMS.
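A small sketch of this rewrite, with hypothetical key values and column names:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SemijoinRewrite {
    public static void main(String[] args) {
        // Keys produced by the first phase (or fetched via a SideDB query).
        List<Long> keys = Arrays.asList(12L, 57L, 3048L);
        String inList = keys.stream().map(String::valueOf)
                            .collect(Collectors.joining(", "));
        // The IN-list predicate is pushed into each node-local DBMS.
        String pushedDown =
            "SELECT * FROM lineitem WHERE l_partkey IN (" + inList + ")";
        System.out.println(pushedDown);
        // -> SELECT * FROM lineitem WHERE l_partkey IN (12, 57, 3048)
    }
}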
Furthermore, in some cases, HadoopDB’s SideDB exten-
sion can be used to entirely eliminate the first MapReduce
job for a split semijoin. At job setup, a SideDB query
extracts the projected join key column instead of running a
separate MapReduce job.
The SideDB extension is also helpful for looking up and
extracting attributes from small tables such as dictionary
tables. Such a situation typically occurs at the very end
of the query plan, right before outputting the results. For
example, integer identifiers that were carried through the
query execution are replaced by actual text values (e.g.,
names of the nations replacing the nation identifier in TPC-
H). A similar concept in column-store databases is known
as late materialization [8, 26].
The query rewrite version of the map-side split semijoin
technique is commonly used in HadoopDB’s implementa-
tion of TPC-H to satisfy the benchmark rules forbidding the
replication of tables. All queries that include joins with re-
gion and nation tables are implemented using the selection-
predicate-rewriting and SideDB optimizations.
3.3 Post-join Aggregation
In HadoopDB, since aggregation operations can be exe-
cuted in database engines, there is usually no need for a
MapReduce Combiner.
Still, there exists no standard way of performing post-
Reduce aggregation. While Reduce is meant for aggrega-
tion by design, it can only be applied if the repartitioning
between the Map and Reduce phases is performed on the
grouping attribute(s) specified in the query. If, however, the
partitioning is done on a join key (in order to join two differ-
ent tables), then another partitioning is needed to compute
the aggregation, since, in general, the grouping attribute is
different from the join key. The new partitioning therefore
requires another MapReduce job and all its associated over-
head.
In such situations, hash-based partial aggregation is done
at the end of each Reduce task. The grouping attribute ex-
tracted from each result of the Reduce task is used to probe
a hash table in order to update the appropriate running ag-
gregation. This procedure can save significant I/O, since
the output of Reduce tasks is written redundantly to HDFS
whereas the output of Map tasks is written only locally.
Hence, by outputting partially aggregated data instead of
raw values, we reduce the amount of data to be written to
HDFS. TPC-H queries that benefit from this technique in-
clude 5, 7, 8, and 9.
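A condensed sketch of this pattern in a Hadoop Reducer is shown below; for brevity the join itself is elided, and each incoming value is assumed to already carry the grouping attribute and the measure being summed.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class JoinThenAggregateReducer extends Reducer<Text, Text, Text, DoubleWritable> {
    private final Map<String, Double> partialSums = new HashMap<>();

    @Override
    protected void reduce(Text joinKey, Iterable<Text> tuples, Context ctx) {
        for (Text t : tuples) {
            // Assume each value carries "groupKey \t revenue" after the join.
            String[] f = t.toString().split("\t");
            partialSums.merge(f[0], Double.parseDouble(f[1]), Double::sum);
        }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        // Emit partially aggregated rows instead of raw joined tuples,
        // reducing what must be written (redundantly) to HDFS.
        for (Map.Entry<String, Double> e : partialSums.entrySet()) {
            ctx.write(new Text(e.getKey()), new DoubleWritable(e.getValue()));
        }
    }
}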
A similar technique is applied to TOP N selections, where
the list of the top N entries is maintained in an in-memory
tree map throughout the Reduce phase and outputted at
the end. In-memory data structures are also used for com-
bining an ORDER BY clause with another operator inside
the same Reduce task, again saving an extra MapReduce
job. Examples where this technique is beneficial are TPC-H
queries 2, 3, 10, 13, and 18.
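A similar sketch for the TOP N case, assuming the aggregate value serves as the sort key (ties would need extra handling in practice):

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TopNReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    private static final int N = 10;                 // illustrative limit
    private final TreeMap<Double, String> topN = new TreeMap<>();

    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context ctx) {
        double total = 0;
        for (DoubleWritable v : values) total += v.get();
        topN.put(total, key.toString());             // sorted by the aggregate
        if (topN.size() > N) topN.pollFirstEntry();  // drop the current minimum
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        // Output the retained entries once, largest aggregate first.
        for (Map.Entry<Double, String> e : topN.descendingMap().entrySet()) {
            ctx.write(new Text(e.getValue()), new DoubleWritable(e.getKey()));
        }
    }
}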
3.4 Pre-join Aggregation
Whereas in most database systems aggregations are typ-
ically performed after a join, in HadoopDB they sometimes
get transformed into partial aggregation operators and com-
puted before a join. This happens when the join cannot be

References
J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters.
G. Malewicz et al. Pregel: a system for large-scale graph processing.
C.-T. Chu et al. Map-Reduce for machine learning on multicore.
A. Pavlo et al. A comparison of approaches to large-scale data analysis.