A Comparison of Approaches to Large-Scale Data Analysis
Andrew Pavlo, Brown University, pavlo@cs.brown.edu
Erik Paulson, University of Wisconsin, epaulson@cs.wisc.edu
Alexander Rasin, Brown University, alexr@cs.brown.edu
Daniel J. Abadi, Yale University, dna@cs.yale.edu
David J. DeWitt, Microsoft Inc., dewitt@microsoft.com
Samuel Madden, M.I.T. CSAIL, madden@csail.mit.edu
Michael Stonebraker, M.I.T. CSAIL, stonebraker@csail.mit.edu
ABSTRACT
There is currently considerable enthusiasm around the MapReduce
(MR) paradigm for large-scale data analysis [17]. Although the
basic control flow of this framework has existed in parallel SQL
database management systems (DBMS) for over 20 years, some
have called MR a dramatically new computing model [8, 17]. In
this paper, we describe and compare both paradigms. Furthermore,
we evaluate both kinds of systems in terms of performance and de-
velopment complexity. To this end, we define a benchmark con-
sisting of a collection of tasks that we have run on an open source
version of MR as well as on two parallel DBMSs. For each task,
we measure each system’s performance for various degrees of par-
allelism on a cluster of 100 nodes. Our results reveal some inter-
esting trade-offs. Although the process to load data into and tune
the execution of parallel DBMSs took much longer than the MR
system, the observed performance of these DBMSs was strikingly
better. We speculate about the causes of the dramatic performance
difference and consider implementation concepts that future sys-
tems should take from both kinds of architectures.
Categories and Subject Descriptors
H.2.4 [Database Management]: Systems—Parallel databases
General Terms
Database Applications, Use Cases, Database Programming
1. INTRODUCTION
Recently the trade press has been filled with news of the rev-
olution of “cluster computing”. This paradigm entails harnessing
large numbers of (low-end) processors working in parallel to solve
a computing problem. In effect, this suggests constructing a data
center by lining up a large number of low-end servers instead of
deploying a smaller set of high-end servers. With this rise of in-
terest in clusters has come a proliferation of tools for programming
them. One of the earliest and best known such tools is MapReduce
(MR) [8]. MapReduce is attractive because it provides a simple
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SIGMOD’09, June 29–July 2, 2009, Providence, Rhode Island, USA.
Copyright 2009 ACM 978-1-60558-551-2/09/06 ...$5.00.
model through which users can express relatively sophisticated dis-
tributed programs, leading to significant interest in the educational
community. For example, IBM and Google have announced plans
to make a 1000 processor MapReduce cluster available to teach stu-
dents distributed programming.
Given this interest in MapReduce, it is natural to ask “Why not
use a parallel DBMS instead?” Parallel database systems (which
all share a common architectural design) have been commercially
available for nearly two decades, and there are now about a dozen in
the marketplace, including Teradata, Aster Data, Netezza, DATAl-
legro (and therefore soon Microsoft SQL Server via Project Madi-
son), Dataupia, Vertica, ParAccel, Neoview, Greenplum, DB2 (via
the Database Partitioning Feature), and Oracle (via Exadata). They
are robust, high performance computing platforms. Like MapRe-
duce, they provide a high-level programming environment and par-
allelize readily. Though it may seem that MR and parallel databases
target different audiences, it is in fact possible to write almost any
parallel processing task as either a set of database queries (possibly
using user defined functions and aggregates to filter and combine
data) or a set of MapReduce jobs. Inspired by this question, our goal
is to understand the differences between the MapReduce approach
to performing large-scale data analysis and the approach taken by
parallel database systems. The two classes of systems make differ-
ent choices in several key areas. For example, all DBMSs require
that data conform to a well-defined schema, whereas MR permits
data to be in any arbitrary format. Other differences also include
how each system provides indexing and compression optimizations,
programming models, the way in which data is distributed, and
query execution strategies.
The purpose of this paper is to consider these choices, and the
trade-offs that they entail. We begin in Section 2 with a brief review
of the two alternative classes of systems, followed by a discussion
in Section 3 of the architectural trade-offs. Then, in Section 4 we
present our benchmark consisting of a variety of tasks, one taken
from the MR paper [8], and the rest a collection of more demanding
tasks. In addition, we present the results of running the benchmark
on a 100-node cluster to execute each task. We tested the publicly
available open-source version of MapReduce, Hadoop [1], against
two parallel SQL DBMSs, Vertica [3] and a second system from a
major relational vendor. We also present results on the time each
system took to load the test data and report informally on the pro-
cedures needed to set up and tune the software for each task.
In general, the SQL DBMSs were significantly faster and re-
quired less code to implement each task, but took longer to tune and
load the data. Hence, we conclude with a discussion on the reasons
for the differences between the approaches and provide suggestions
on the best practices for any large-scale data analysis engine.
Some readers may feel that experiments conducted using 100
nodes are not interesting or representative of real world data pro-
cessing systems. We disagree with this conjecture on two points.
First, as we demonstrate in Section 4, at 100 nodes the two parallel
DBMSs range from a factor of 3.1 to 6.5 faster than MapReduce
on a variety of analytic tasks. While MR may indeed be capable
of scaling up to 1000s of nodes, the superior efficiency of mod-
ern DBMSs alleviates the need to use such massive hardware on
datasets in the range of 1–2PB (1000 nodes with 2TB of disk/node
has a total disk capacity of 2PB). For example, eBay’s Teradata con-
figuration uses just 72 nodes (two quad-core CPUs, 32GB RAM,
104 300GB disks per node) to manage approximately 2.4PB of re-
lational data. As another example, Fox Interactive Media’s ware-
house is implemented using a 40-node Greenplum DBMS. Each
node is a Sun X4500 machine with two dual-core CPUs, 48 500GB
disks, and 16 GB RAM (1PB total disk space) [7]. Since few data
sets in the world even approach a petabyte in size, it is not at all
clear how many MR users really need 1,000 nodes.
2. TWO APPROACHES TO LARGE SCALE
DATA ANALYSIS
The two classes of systems we consider in this paper run on a
“shared nothing” collection of computers [19]. That is, the sys-
tem is deployed on a collection of independent machines, each with
local disk and local main memory, connected together on a high-
speed local area network. Both systems achieve parallelism by
dividing any data set to be utilized into partitions, which are al-
located to different nodes to facilitate parallel processing. In this
section, we provide an overview of how both the MR model and
traditional parallel DBMSs operate in this environment.
2.1 MapReduce
One of the attractive qualities about the MapReduce program-
ming model is its simplicity: an MR program consists only of two
functions, called Map and Reduce, that are written by a user to
process key/value data pairs. The input data set is stored in a col-
lection of partitions in a distributed file system deployed on each
node in the cluster. The program is then injected into a distributed
processing framework and executed in a manner to be described.
The Map function reads a set of “records” from an input file,
does any desired filtering and/or transformations, and then outputs
a set of intermediate records in the form of new key/value pairs. As
the Map function produces these output records, a “split” function
partitions the records into R disjoint buckets by applying a function
to the key of each output record. This split function is typically a
hash function, though any deterministic function will suffice. Each
map bucket is written to the processing node’s local disk. The Map
function terminates having produced R output files, one for each
bucket. In general, there are multiple instances of the Map function
running on different nodes of a compute cluster. We use the term
instance to mean a unique running invocation of either the Map or
Reduce function. Each Map instance is assigned a distinct portion
of the input file by the MR scheduler to process. If there are M
such distinct portions of the input file, then there are R files on disk
storage for each of the M Map tasks, for a total of M × R files;
F_ij, 1 ≤ i ≤ M, 1 ≤ j ≤ R. The key observation is that all Map
instances use the same hash function; thus, all output records with
the same hash value are stored in the same output file.
The second phase of a MR program executes R instances of the
Reduce program, where R is typically the number of nodes. The
input for each Reduce instance R_j consists of the files F_ij, 1 ≤
i ≤ M. These files are transferred over the network from the Map
nodes’ local disks. Note that again all output records from the Map
phase with the same hash value are consumed by the same Reduce
instance, regardless of which Map instance produced the data. Each
Reduce processes or combines the records assigned to it in some
way, and then writes records to an output file (in the distributed file
system), which forms part of the computation’s final output.
The input data set exists as a collection of one or more partitions
in the distributed file system. It is the job of the MR scheduler to
decide how many Map instances to run and how to allocate them
to available nodes. Likewise, the scheduler must also decide on
the number and location of nodes running Reduce instances. The
MR central controller is responsible for coordinating the system
activities on each node. A MR program finishes execution once the
final result is written as new files in the distributed file system.
2.2 Parallel DBMSs
Database systems capable of running on clusters of shared noth-
ing nodes have existed since the late 1980s. These systems all sup-
port standard relational tables and SQL, and thus the fact that the
data is stored on multiple machines is transparent to the end-user.
Many of these systems build on the pioneering research from the
Gamma [10] and Grace [11] parallel DBMS projects. The two key
aspects that enable parallel execution are that (1) most (or even all)
tables are partitioned over the nodes in a cluster and that (2) the sys-
tem uses an optimizer that translates SQL commands into a query
plan whose execution is divided amongst multiple nodes. Because
programmers only need to specify their goal in a high level lan-
guage, they are not burdened by the underlying storage details, such
as indexing options and join strategies.
Consider a SQL command to filter the records in a table T1 based
on a predicate, along with a join to a second table T2 with an aggre-
gate computed on the result of the join. A basic sketch of how this
command is processed in a parallel DBMS consists of three phases.
Since the database will have already stored T1 on some collection
of the nodes partitioned on some attribute, the filter sub-query is
first performed in parallel at these sites similar to the filtering per-
formed in a Map function. Following this step, one of two common
parallel join algorithms are employed based on the size of data ta-
bles. For example, if the number of records in T2 is small, then the
DBMS could replicate it on all nodes when the data is first loaded.
This allows the join to execute in parallel at all nodes. Following
this, each node then computes the aggregate using its portion of the
answer to the join. A final “roll-up” step is required to compute the
final answer from these partial aggregates [9].
If the size of the data in T2 is large, then T2's contents will be
distributed across multiple nodes. If these tables are partitioned on
different attributes than those used in the join, the system will have
to hash both T2 and the filtered version of T1 on the join attribute
using a common hash function. The redistribution of both T2 and the
filtered version of T1 to the nodes is similar to the processing that
occurs between the Map and the Reduce functions. Once each node
has the necessary data, it then performs a hash join and calculates
the preliminary aggregate function. Again, a roll-up computation
must be performed as a last step to produce the final answer.
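For concreteness, a query of the shape described above might look as follows; the table and attribute names are purely illustrative and are not taken from our benchmark:

-- Illustrative only: filter T1, join it to T2, and aggregate the result.
SELECT T2.groupKey, SUM(T1.val) AS total
FROM T1, T2
WHERE T1.filterAttr > 100
  AND T1.joinKey = T2.joinKey
GROUP BY T2.groupKey;

The predicate on T1 corresponds to the Map-style filtering step, the redistribution on the join attribute to the exchange of data between the Map and Reduce phases, and the final roll-up of the partial sums to the last aggregation step described above.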
At first glance, these two approaches to data analysis and pro-
cessing have many common elements; however, there are notable
differences that we consider in the next section.
3. ARCHITECTURAL ELEMENTS
In this section, we consider aspects of the two system architec-
tures that are necessary for processing large amounts of data in a
distributed environment. One theme in our discussion is that the na-
ture of the MR model is well suited for development environments
with a small number of programmers and a limited application do-
main. This lack of constraints, however, may not be appropriate for
longer-term and larger-sized projects.

3.1 Schema Support
Parallel DBMSs require data to fit into the relational paradigm
of rows and columns. In contrast, the MR model does not require
that data files adhere to a schema defined using the relational data
model. That is, the MR programmer is free to structure their data in
any manner or even to have no structure at all.
One might think that the absence of a rigid schema automati-
cally makes MR the preferable option. For example, SQL is often
criticized for its requirement that the programmer must specify the
“shape” of the data in a data definition facility. On the other hand,
the MR programmer must often write a custom parser in order to
derive the appropriate semantics for their input records, which is at
least an equivalent amount of work. But there are also other poten-
tial problems with not using a schema for large data sets.
Whatever structure exists in MR input files must be built into
the Map and Reduce programs. Existing MR implementations pro-
vide built-in functionality to handle simple key/value pair formats,
but the programmer must explicitly write support for more com-
plex data structures, such as compound keys. This is possibly an
acceptable approach if a MR data set is not accessed by multiple
applications. If such data sharing exists, however, a second pro-
grammer must decipher the code written by the first programmer to
decide how to process the input file. A better approach, followed
by all SQL DBMSs, is to separate the schema from the application
and store it in a set of system catalogs that can be queried.
But even if the schema is separated from the application and
made available to multiple MR programs through a description fa-
cility, the developers must also agree on a single schema. This ob-
viously requires some commitment to a data model or models, and
the input files must obey this commitment as it is cumbersome to
modify data attributes once the files are created.
Once the programmers agree on the structure of data, something
or someone must ensure that any data added or modified does not
violate integrity or other high-level constraints (e.g., employee salaries
must be non-negative). Such conditions must be known and explic-
itly adhered to by all programmers modifying a particular data set;
a MR framework and its underlying distributed storage system has
no knowledge of these rules, and thus allows input data to be easily
corrupted with bad data. By again separating such constraints from
the application and enforcing them automatically by the run time
system, as is done by all SQL DBMSs, the integrity of the data is
enforced without additional work on the programmer’s behalf.
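For instance, the salary rule above can be declared once in the schema and then enforced by the DBMS for every application; the table below is hypothetical and shown only to illustrate the idea:

-- Hypothetical schema: the CHECK constraint rejects negative salaries
-- no matter which program performs the insert or update.
CREATE TABLE Employee (
    emp_id INTEGER PRIMARY KEY,
    name   VARCHAR(100) NOT NULL,
    salary DECIMAL(10,2) NOT NULL CHECK (salary >= 0)
);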
In summary, when no sharing is anticipated, the MR paradigm is
quite flexible. If sharing is needed, however, then we argue that it is
advantageous for the programmer to use a data description language
and factor schema definitions and integrity constraints out of appli-
cation programs. This information should be installed in common
system catalogs accessible to the appropriate users and applications.
3.2 Indexing
All modern DBMSs use hash or B-tree indexes to accelerate ac-
cess to data. If one is looking for a subset of records (e.g., em-
ployees with a salary greater than $100,000), then using a proper
index reduces the scope of the search dramatically. Most database
systems also support multiple indexes per table. Thus, the query
optimizer can decide which index to use for each query or whether
to simply perform a brute-force sequential search.
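As a simple illustration, the salary query above could be supported by a secondary index such as the following (the table and index names are hypothetical):

-- The optimizer may use this index for the predicate below, or it may
-- still choose a sequential scan if the predicate is not selective.
CREATE INDEX Employee_salary_idx ON Employee (salary);

SELECT emp_id, name
FROM Employee
WHERE salary > 100000;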
Because the MR model is so simple, MR frameworks do not pro-
vide built-in indexes. The programmer must implement any indexes
that they may desire to speed up access to the data inside of their
application. This is not easily accomplished, as the framework’s
data fetching mechanisms must also be instrumented to use these
indexes when pushing data to running Map instances. Once more,
this is an acceptable strategy if the indexes do not need to be shared
between multiple programmers, despite requiring every MR pro-
grammer to re-implement the same basic functionality.
If sharing is needed, however, then the specifications of what in-
dexes are present and how to use them must be transferred between
programmers. It is again preferable to store this index information
in a standard format in the system catalogs, so that programmers
can query this structure to discover such knowledge.
3.3 Programming Model
During the 1970s, the database research community engaged in a
contentious debate between the relational advocates and the Coda-
syl advocates [18]. The salient issue of this discussion was whether
a program to access data in a DBMS should be written either by:
1. Stating what you want rather than presenting an algorithm
for how to get it (Relational)
2. Presenting an algorithm for data access (Codasyl)
In the end, the former view prevailed and the last 30 years is
a testament to the value of relational database systems. Programs
in high-level languages, such as SQL, are easier to write, easier
to modify, and easier for a new person to understand. Codasyl
was criticized for being “the assembly language of DBMS access”.
We argue that MR programming is somewhat analogous to Codasyl
programming: one is forced to write algorithms in a low-level lan-
guage in order to perform record-level manipulation. On the other
hand, to many people brought up programming in procedural lan-
guages, such as C/C++ or Java, describing tasks in a declarative
language like SQL can be challenging.
Anecdotal evidence from the MR community suggests that there
is widespread sharing of MR code fragments to do common tasks,
such as joining data sets. To alleviate the burden of having to re-
implement repetitive tasks, the MR community is building high-
level languages on top of the current interface to move such func-
tionality into the run time. Pig [15] and Hive [2] are two notable
projects in this direction.
3.4 Data Distribution
The conventional wisdom for large-scale databases is to always
send the computation to the data, rather than the other way around.
In other words, one should send a small program over the network
to a node, rather than importing a large amount of data from the
node. Parallel DBMSs use knowledge of data distribution and loca-
tion to their advantage: a parallel query optimizer strives to balance
computational workloads while minimizing the amount of data trans-
mitted over the network connecting the nodes of the cluster.
Aside from the initial decision on where to schedule Map in-
stances, a MR programmer must perform these tasks manually. For
example, suppose a user writes a MR program to process a collec-
tion of documents in two parts. First, the Map function scans the
documents and creates a histogram of frequently occurring words.
The documents are then passed to a Reduce function that groups
files by their site of origin. Using this data, the user, or another
user building on the first user’s work, now wants to find sites with
a document that contains more than five occurrences of the word
‘Google’ or the word ‘IBM’. In the naive implementation of this
query, where the Map is executed over the accumulated statistics,
the filtration is done after the statistics for all documents are com-
puted and shipped to reduce workers, even though only a small sub-
set of documents satisfy the keyword filter.
In contrast, the following SQL view and select queries perform a
similar computation:

CREATE VIEW Keywords AS
SELECT siteid, docid, word, COUNT(*) AS wordcount
FROM Documents
GROUP BY siteid, docid, word;

SELECT DISTINCT siteid
FROM Keywords
WHERE (word = 'IBM' OR word = 'Google') AND wordcount > 5;
A modern DBMS would rewrite the second query such that the
view definition is substituted for the Keywords table in the FROM
clause. Then, the optimizer can push the WHERE clause in the query
down so that it is applied to the Documents table before the COUNT
is computed, substantially reducing computation. If the documents
are spread across multiple nodes, then this filter can be applied on
each node before documents belonging to the same site are grouped
together, generating much less network I/O.
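Conceptually, the rewritten query resembles the following sketch; the exact rewrite is optimizer-specific, and only the word predicate can be pushed below the aggregate (the wordcount predicate must still be evaluated after the COUNT):

SELECT DISTINCT siteid
FROM (SELECT siteid, docid, word, COUNT(*) AS wordcount
      FROM Documents
      WHERE word = 'IBM' OR word = 'Google'
      GROUP BY siteid, docid, word) AS Keywords
WHERE wordcount > 5;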
3.5 Execution Strategy
There is a potentially serious performance problem related to
MR’s handling of data transfer between Map and Reduce jobs. Re-
call that each of the N Map instances produces M output files,
each destined for a different Reduce instance. These files are writ-
ten to the local disk on the node executing each particular Map in-
stance. If N is 1000 and M is 500, the Map phase of the program
produces 500,000 local files. When the Reduce phase starts, each
of the 500 Reduce instances needs to read its 1000 input files and
must use a file-transfer protocol to “pull” each of its input files from
the nodes on which the Map instances were run. With 100s of Re-
duce instances running simultaneously, it is inevitable that two or
more Reduce instances will attempt to read their input files from
the same map node simultaneously, inducing large numbers of disk
seeks and slowing the effective disk transfer rate. This is why par-
allel database systems do not materialize their split files and instead
use a push approach to transfer data rather than a pull.
3.6 Flexibility
Despite its widespread adoption, SQL is routinely criticized for
its insufficient expressive prowess. Some believe that it was a mis-
take for the database research community in the 1970s to focus on
data sub-languages that could be embedded in any programming
language, rather than adding high-level data access to all program-
ming languages. Fortunately, new application frameworks, such as
Ruby on Rails [21] and LINQ [14], have started to reverse this sit-
uation by leveraging new programming language functionality to
implement an object-relational mapping pattern. These program-
ming environments allow developers to benefit from the robustness
of DBMS technologies without the burden of writing complex SQL.
Proponents of the MR model argue that SQL does not facilitate
the desired generality that MR provides. But almost all of the major
DBMS products (commercial and open-source) now provide sup-
port for user-defined functions, stored procedures, and user-defined
aggregates in SQL. Although this does not have the full generality
of MR, it does improve the flexibility of database systems.
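As a small and simplified example of this extensibility, a user can register a scalar function and invoke it from ordinary SQL; the sketch below uses PostgreSQL-flavored syntax, and the function name and body are purely illustrative:

-- Illustrative user-defined function: extract the host portion of a URL.
CREATE FUNCTION domain_of(url TEXT) RETURNS TEXT AS $$
    SELECT split_part(split_part(url, '//', 2), '/', 1);
$$ LANGUAGE SQL;

-- Example use: returns 'example.com'
SELECT domain_of('http://example.com/docs/page.html');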
3.7 Fault Tolerance
The MR frameworks provide a more sophisticated failure model
than parallel DBMSs. While both classes of systems use some form
of replication to deal with disk failures, MR is far more adept at
handling node failures during the execution of a MR computation.
In a MR system, if a unit of work (i.e., processing a block of data)
fails, then the MR scheduler can automatically restart the task on
an alternate node. Part of the flexibility is the result of the fact that
the output files of the Map phase are materialized locally instead of
being streamed to the nodes running the Reduce tasks. Similarly,
pipelines of MR jobs, such as the one described in Section 4.3.4,
materialize intermediate results to files each step of the way. This
differs from parallel DBMSs, which have larger granules of work
(i.e., transactions) that are restarted in the event of a failure. Part of
the reason for this approach is that DBMSs avoid saving interme-
diate results to disk whenever possible. Thus, if a single node fails
during a long running query in a DBMS, the entire query must be
completely restarted.
4. PERFORMANCE BENCHMARKS
In this section, we present our benchmark consisting of five tasks
that we use to compare the performance of the MR model with that
of parallel DBMSs. The first task is taken directly from the origi-
nal MapReduce paper [8], which its authors claim is representative of
common MR tasks. Because this task is quite simple, we also devel-
oped four additional tasks comprising more complex analytical
workloads designed to explore the trade-offs discussed in the pre-
vious section. We executed our benchmarks on a well-known MR
implementation and two parallel DBMSs.
4.1 Benchmark Environment
As we describe the details of our benchmark environment, we
note how the different data analysis systems that we test differ in
operating assumptions and discuss the ways in which we dealt with
them in order to make the experiments uniform.
4.1.1 Tested Systems
Hadoop: The Hadoop system is the most popular open-source im-
plementation of the MapReduce framework, under development
by Yahoo! and the Apache Software Foundation [1]. Unlike the
Google implementation of the original MR framework written in
C++, the core Hadoop system is written entirely in Java. For our
experiments in this paper, we use Hadoop version 0.19.0 running
on Java 1.6.0. We deployed the system with the default configura-
tion settings, except for the following changes that we found yielded
better performance without diverging from core MR fundamentals:
(1) data is stored using 256MB data blocks instead of the default
64MB, (2) each task executor JVM ran with a maximum heap size
of 512MB and the DataNode/JobTracker JVMs ran with a maxi-
mum heap size of 1024MB (for a total size of 3.5GB per node),
(3) we enabled Hadoop’s “rack awareness” feature for data locality
in the cluster, and (4) we allowed Hadoop to reuse the task JVM
executor instead of starting a new process for each Map/Reduce task.
Moreover, we configured the system to run two Map instances and
a single Reduce instance concurrently on each node.
The Hadoop framework also provides an implementation of the
Google distributed file system [12]. For each benchmark trial, we
store all input and output data in the Hadoop distributed file system
(HDFS). We used the default HDFS settings of three replicas
per block and no compression; we also tested other configura-
tions, such as using only a single replica per block as well as block-
and record-level compression, but we found that our tests almost
always executed at the same speed or worse with these features en-
abled (see Section 5.1.3). After each benchmark run finishes for a
particular node scaling level, we delete the data directories on each
node and reformat HDFS so that the next set of input data is repli-
cated uniformly across all nodes.
Hadoop uses a central job tracker and a “master” HDFS daemon
to coordinate node activities. To ensure that these daemons do not
affect the performance of worker nodes, we execute both of these
additional framework components on a separate node in the cluster.
DBMS-X: We used the latest release of DBMS-X, a parallel SQL
DBMS from a major relational database vendor that stores data in
a row-based format. The system is installed on each node and con-
figured to use 4GB shared memory segments for the buffer pool
and other temporary space. Each table is hash partitioned across
all nodes on the salient attribute for that particular table, and then
sorted and indexed on different attributes (see Sections 4.2.1 and
4.3.1). As with the Hadoop experiments, we deleted the tables in DBMS-
X and reloaded the data for each trial to ensure that the tuples were
uniformly distributed in the cluster.
By default DBMS-X does not compress data in its internal stor-
age, but it does provide the ability to compress tables using a well-
known dictionary-based scheme. We found that enabling compres-
sion reduced the execution times for almost all the benchmark tasks
by 50%, and thus we only report results with compression enabled.
In only one case did we find that using compression actually per-
formed worse. Furthermore, because all of our benchmarks are
read-only, we did not enable replication features in DBMS-X, since
this would not have improved performance and would have compli-
cated the installation process.
Vertica: The Vertica database is a parallel DBMS designed for
large data warehouses [3]. The main distinction of Vertica from
other DBMSs (including DBMS-X) is that all data is stored as columns,
rather than rows [20]. It uses a unique execution engine designed
specifically for operating on top of a column-oriented storage layer.
Unlike DBMS-X, Vertica compresses data by default since its ex-
ecutor can operate directly on compressed tables. Because dis-
abling this feature is not typical in Vertica deployments, the Ver-
tica results in this paper are generated using only compressed data.
Vertica also sorts every table by one or more attributes based on a
clustered index.
We found that the default 256MB buffer size per node performed
well in our experiments. The Vertica resource manager is respon-
sible for setting the amount of memory given to queries, but we
provide a hint to the system to expect to execute only one query at
a time. Thus, each query receives most of the maximum amount of
memory available on each node at runtime.
4.1.2 Node Configuration
All three systems were deployed on a 100-node cluster. Each
node has a single 2.40 GHz Intel Core 2 Duo processor running 64-
bit Red Hat Enterprise Linux 5 (kernel version 2.6.18) with 4GB
RAM and two 250GB SATA-I hard disks. According to hdparm,
the hard disks deliver 7GB/sec for cached reads and about 74MB/sec
for buffered reads. The nodes are connected with Cisco Catalyst
3750E-48TD switches. This switch has gigabit Ethernet ports for
each node and an internal switching fabric of 128Gbps [6]. There
are 50 nodes per switch. The switches are linked together via Cisco
StackWise Plus, which creates a 64Gbps ring between the switches.
Traffic between two nodes on the same switch is entirely local to the
switch and does not impact traffic on the ring.
4.1.3 Benchmark Execution
For each benchmark task, we describe the steps used to imple-
ment the MR program as well as provide the equivalent SQL state-
ment(s) executed by the two database systems. We executed each
task three times and report the average of the trials. Each system ex-
ecutes the benchmark tasks separately to ensure exclusive access to
the cluster’s resources. To measure the basic performance without
the overhead of coordinating parallel tasks, we first execute each
task on a single node. We then execute the task on different cluster
sizes to show how each system scales as both the amount of data
processed and available resources are increased. We only report
results using trials where all nodes are available and the system’s
software operates correctly during the benchmark execution.
We also measured the time it takes for each system to load the
test data. The results from these measurements are split between
the actual loading of the data and any additional operations after the
loading that each system performs, such as compressing or building
indexes. The initial input data on each node is stored on one of its
two locally installed disks.
Unless otherwise indicated, the final results from the queries ex-
ecuting in Vertica and DBMS-X are piped from a shell command
into a file on the disk not used by the DBMS. Although it is possi-
ble to do an equivalent operation in Hadoop, it is easier (and more
common) to store the results of a MR program into the distributed
file system. This procedure, however, is not analogous to how the
DBMSs produce their output data; rather than storing the results in
a single file, the MR program produces one output file for each Re-
duce instance and stores them in a single directory. The standard
practice is for developers then to use these output directories as a
single input unit for other MR jobs. If, however, a user wishes to
use this data in a non-MR application, they must first combine the
results into a single file and download it to the local file system.
Because of this discrepancy, we execute an extra Reduce func-
tion for each MR benchmark task that simply combines the final
output into a single file in HDFS. Our results differentiate between
the execution times for Hadoop running the actual benchmark task
versus the additional combine operation. Thus, the Hadoop results
displayed in the graphs for this paper are shown as stacked bars: the
lower portion of each bar is the execution time for just the specific
benchmark task, while the upper portion is the execution time for
the single Reduce function to combine all of the program’s output
data into a single file.
4.2 The Original MR Task
Our first benchmark task is the “Grep task” taken from the orig-
inal MapReduce paper, which the authors describe as “represen-
tative of a large subset of the real programs written by users of
MapReduce” [8]. For this task, each system must scan through a
data set of 100-byte records looking for a three-character pattern.
Each record consists of a unique key in the first 10 bytes, followed
by a 90-byte random value. The search pattern is only found in the
last 90 bytes once in every 10,000 records.
The input data is stored on each node in plain text files, with one
record per line. For the Hadoop trials, we uploaded these files unal-
tered directly into HDFS. To load the data into Vertica and DBMS-
X, we execute each system’s proprietary load commands in parallel
on each node and store the data using the following schema:
CREATE TABLE Data (
key VARCHAR(10) PRIMARY KEY,
field VARCHAR(90) );
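With this schema in place, the pattern search itself reduces to a single SQL statement of roughly the following form (the three-character pattern shown here is a placeholder):

SELECT * FROM Data WHERE field LIKE '%XYZ%';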
We execute the Grep task using two different data sets. The mea-
surements in the original MapReduce paper are based on process-
ing 1TB of data on approximately 1800 nodes, which is 5.6 million
records or roughly 535MB of data per node. For each system, we
execute the Grep task on cluster sizes of 1, 10, 25, 50, and 100
nodes. The total number of records processed for each cluster size
is therefore 5.6 million times the number of nodes. The perfor-
mance of each system not only illustrates how each system scales
as the amount of data is increased, but also allows us to (to some
extent) compare the results to the original MR system.
While our first dataset fixes the size of the data per node to be the
same as the original MR benchmark and only varies the number of
nodes, our second dataset fixes the total dataset size to be the same
as the original MR benchmark (1TB) and evenly divides the data
amongst a variable number of nodes. This task measures how well
each system scales as the number of available nodes is increased.
