A Comparison of Approaches to Large-Scale Data Analysis
Andrew Pavlo, Brown University, pavlo@cs.brown.edu
Erik Paulson, University of Wisconsin, epaulson@cs.wisc.edu
Alexander Rasin, Brown University, alexr@cs.brown.edu
Daniel J. Abadi, Yale University, dna@cs.yale.edu
David J. DeWitt, Microsoft Inc., dewitt@microsoft.com
Samuel Madden, M.I.T. CSAIL, madden@csail.mit.edu
Michael Stonebraker, M.I.T. CSAIL, stonebraker@csail.mit.edu
ABSTRACT
There is currently considerable enthusiasm around the MapReduce
(MR) paradigm for large-scale data analysis [17]. Although the
basic control flow of this framework has existed in parallel SQL
database management systems (DBMS) for over 20 years, some
have called MR a dramatically new computing model [8, 17]. In
this paper, we describe and compare both paradigms. Furthermore,
we evaluate both kinds of systems in terms of performance and de-
velopment complexity. To this end, we define a benchmark con-
sisting of a collection of tasks that we have run on an open source
version of MR as well as on two parallel DBMSs. For each task,
we measure each system’s performance for various degrees of par-
allelism on a cluster of 100 nodes. Our results reveal some inter-
esting trade-offs. Although the process to load data into and tune
the execution of parallel DBMSs took much longer than the MR
system, the observed performance of these DBMSs was strikingly
better. We speculate about the causes of the dramatic performance
difference and consider implementation concepts that future sys-
tems should take from both kinds of architectures.
Categories and Subject Descriptors
H.2.4 [Database Management]: Systems—Parallel databases
General Terms
Database Applications, Use Cases, Database Programming
1. INTRODUCTION
Recently the trade press has been filled with news of the rev-
olution of “cluster computing”. This paradigm entails harnessing
large numbers of (low-end) processors working in parallel to solve
a computing problem. In effect, this suggests constructing a data
center by lining up a large number of low-end servers instead of
deploying a smaller set of high-end servers. With this rise of in-
terest in clusters has come a proliferation of tools for programming
them. One of the earliest and best known such tools is MapReduce
(MR) [8]. MapReduce is attractive because it provides a simple
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SIGMOD’09, June 29–July 2, 2009, Providence, Rhode Island, USA.
Copyright 2009 ACM 978-1-60558-551-2/09/06 ...$5.00.
model through which users can express relatively sophisticated dis-
tributed programs, leading to significant interest in the educational
community. For example, IBM and Google have announced plans
to make a 1000 processor MapReduce cluster available to teach stu-
dents distributed programming.
Given this interest in MapReduce, it is natural to ask “Why not
use a parallel DBMS instead?” Parallel database systems (which
all share a common architectural design) have been commercially
available for nearly two decades, and there are now about a dozen in
the marketplace, including Teradata, Aster Data, Netezza, DATAl-
legro (and therefore soon Microsoft SQL Server via Project Madi-
son), Dataupia, Vertica, ParAccel, Neoview, Greenplum, DB2 (via
the Database Partitioning Feature), and Oracle (via Exadata). They
are robust, high performance computing platforms. Like MapRe-
duce, they provide a high-level programming environment and par-
allelize readily. Though it may seem that MR and parallel databases
target different audiences, it is in fact possible to write almost any
parallel processing task as either a set of database queries (possibly
using user defined functions and aggregates to filter and combine
data) or a set of MapReduce jobs. Inspired by this question, our goal
is to understand the differences between the MapReduce approach
to performing large-scale data analysis and the approach taken by
parallel database systems. The two classes of systems make differ-
ent choices in several key areas. For example, all DBMSs require
that data conform to a well-defined schema, whereas MR permits
data to be in any arbitrary format. Other differences also include
how each system provides indexing and compression optimizations,
programming models, the way in which data is distributed, and
query execution strategies.
The purpose of this paper is to consider these choices, and the
trade-offs that they entail. We begin in Section 2 with a brief review
of the two alternative classes of systems, followed by a discussion
in Section 3 of the architectural trade-offs. Then, in Section 4 we
present our benchmark consisting of a variety of tasks, one taken
from the MR paper [8], and the rest a collection of more demanding
tasks. In addition, we present the results of running the benchmark
on a 100-node cluster to execute each task. We tested the publicly
available open-source version of MapReduce, Hadoop [1], against
two parallel SQL DBMSs, Vertica [3] and a second system from a
major relational vendor. We also present results on the time each
system took to load the test data and report informally on the pro-
cedures needed to set up and tune the software for each task.
In general, the SQL DBMSs were significantly faster and re-
quired less code to implement each task, but took longer to tune and
load the data. Hence, we conclude with a discussion on the reasons
for the differences between the approaches and provide suggestions
on the best practices for any large-scale data analysis engine.
Some readers may feel that experiments conducted using 100
nodes are not interesting or representative of real world data pro-
cessing systems. We disagree with this conjecture on two points.
First, as we demonstrate in Section 4, at 100 nodes the two parallel
DBMSs range from a factor of 3.1 to 6.5 faster than MapReduce
on a variety of analytic tasks. While MR may indeed be capable
of scaling up to 1000s of nodes, the superior efficiency of mod-
ern DBMSs alleviates the need to use such massive hardware on
datasets in the range of 1–2PB (1000 nodes with 2TB of disk/node
has a total disk capacity of 2PB). For example, eBay’s Teradata con-
figuration uses just 72 nodes (two quad-core CPUs, 32GB RAM,
104 300GB disks per node) to manage approximately 2.4PB of re-
lational data. As another example, Fox Interactive Media’s ware-
house is implemented using a 40-node Greenplum DBMS. Each
node is a Sun X4500 machine with two dual-core CPUs, 48 500GB
disks, and 16 GB RAM (1PB total disk space) [7]. Since few data
sets in the world even approach a petabyte in size, it is not at all
clear how many MR users really need 1,000 nodes.
2. TWO APPROACHES TO LARGE SCALE
DATA ANALYSIS
The two classes of systems we consider in this paper run on a
“shared nothing” collection of computers [19]. That is, the sys-
tem is deployed on a collection of independent machines, each with
local disk and local main memory, connected together on a high-
speed local area network. Both systems achieve parallelism by
dividing any data set to be utilized into partitions, which are al-
located to different nodes to facilitate parallel processing. In this
section, we provide an overview of how both the MR model and
traditional parallel DBMSs operate in this environment.
2.1 MapReduce
One of the attractive qualities about the MapReduce program-
ming model is its simplicity: an MR program consists only of two
functions, called Map and Reduce, that are written by a user to
process key/value data pairs. The input data set is stored in a col-
lection of partitions in a distributed file system deployed on each
node in the cluster. The program is then injected into a distributed
processing framework and executed in a manner to be described.
The Map function reads a set of “records” from an input file,
does any desired filtering and/or transformations, and then outputs
a set of intermediate records in the form of new key/value pairs. As
the Map function produces these output records, a “split” function
partitions the records into R disjoint buckets by applying a function
to the key of each output record. This split function is typically a
hash function, though any deterministic function will suffice. Each
map bucket is written to the processing node’s local disk. The Map
function terminates having produced R output files, one for each
bucket. In general, there are multiple instances of the Map function
running on different nodes of a compute cluster. We use the term
instance to mean a unique running invocation of either the Map or
Reduce function. Each Map instance is assigned a distinct portion
of the input file by the MR scheduler to process. If there are M
such distinct portions of the input file, then there are R files on disk
storage for each of the M Map tasks, for a total of M × R files;
F_ij, 1 ≤ i ≤ M, 1 ≤ j ≤ R. The key observation is that all Map
instances use the same hash function; thus, all output records with
the same hash value are stored in the same output file.
The second phase of a MR program executes R instances of the
Reduce program, where R is typically the number of nodes. The
input for each Reduce instance R_j consists of the files F_ij, 1 ≤
i ≤ M. These files are transferred over the network from the Map
nodes’ local disks. Note that again all output records from the Map
phase with the same hash value are consumed by the same Reduce
instance, regardless of which Map instance produced the data. Each
Reduce processes or combines the records assigned to it in some
way, and then writes records to an output file (in the distributed file
system), which forms part of the computation’s final output.
The input data set exists as a collection of one or more partitions
in the distributed file system. It is the job of the MR scheduler to
decide how many Map instances to run and how to allocate them
to available nodes. Likewise, the scheduler must also decide on
the number and location of nodes running Reduce instances. The
MR central controller is responsible for coordinating the system
activities on each node. A MR program finishes execution once the
final result is written as new files in the distributed file system.
2.2 Parallel DBMSs
Database systems capable of running on clusters of shared noth-
ing nodes have existed since the late 1980s. These systems all sup-
port standard relational tables and SQL, and thus the fact that the
data is stored on multiple machines is transparent to the end-user.
Many of these systems build on the pioneering research from the
Gamma [10] and Grace [11] parallel DBMS projects. The two key
aspects that enable parallel execution are that (1) most (or even all)
tables are partitioned over the nodes in a cluster and that (2) the sys-
tem uses an optimizer that translates SQL commands into a query
plan whose execution is divided amongst multiple nodes. Because
programmers only need to specify their goal in a high level lan-
guage, they are not burdened by the underlying storage details, such
as indexing options and join strategies.
Consider a SQL command to filter the records in a table T1 based
on a predicate, along with a join to a second table T2 with an aggre-
gate computed on the result of the join. A basic sketch of how this
command is processed in a parallel DBMS consists of three phases.
Since the database will have already stored T1 on some collection
of the nodes partitioned on some attribute, the filter sub-query is
first performed in parallel at these sites similar to the filtering per-
formed in a Map function. Following this step, one of two common
parallel join algorithms are employed based on the size of data ta-
bles. For example, if the number of records in T2 is small, then the
DBMS could replicate it on all nodes when the data is first loaded.
This allows the join to execute in parallel at all nodes. Following
this, each node then computes the aggregate using its portion of the
answer to the join. A final “roll-up” step is required to compute the
final answer from these partial aggregates [9].
If the size of the data in T2 is large, then T2's contents will be
distributed across multiple nodes. If these tables are partitioned on
different attributes than those used in the join, the system will have
to hash both T2 and the filtered version of T1 on the join attribute
using a common hash function. The redistribution of both T2 and the
filtered version of T1 to the nodes is similar to the processing that
occurs between the Map and the Reduce functions. Once each node
has the necessary data, it then performs a hash join and calculates
the preliminary aggregate function. Again, a roll-up computation
must be performed as a last step to produce the final answer.
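For concreteness, a query of the shape described above might look as follows; the table and attribute names are purely illustrative and are not taken from our benchmark:

-- Illustrative only: filter T1, join it to T2, and aggregate the result.
SELECT T2.groupKey, SUM(T1.val) AS total
FROM T1, T2
WHERE T1.filterAttr > 100
  AND T1.joinKey = T2.joinKey
GROUP BY T2.groupKey;

The predicate on T1 corresponds to the Map-style filtering step, the redistribution on the join attribute to the exchange of data between the Map and Reduce phases, and the final roll-up of the partial sums to the last aggregation step described above.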
At first glance, these two approaches to data analysis and pro-
cessing have many common elements; however, there are notable
differences that we consider in the next section.
3. ARCHITECTURAL ELEMENTS
In this section, we consider aspects of the two system architec-
tures that are necessary for processing large amounts of data in a
distributed environment. One theme in our discussion is that the na-
ture of the MR model is well suited for development environments
with a small number of programmers and a limited application do-
main. This lack of constraints, however, may not be appropriate for
longer-term and larger-sized projects.

3.1 Schema Support
Parallel DBMSs require data to fit into the relational paradigm
of rows and columns. In contrast, the MR model does not require
that data files adhere to a schema defined using the relational data
model. That is, the MR programmer is free to structure their data in
any manner or even to have no structure at all.
One might think that the absence of a rigid schema automati-
cally makes MR the preferable option. For example, SQL is often
criticized for its requirement that the programmer must specify the
“shape” of the data in a data definition facility. On the other hand,
the MR programmer must often write a custom parser in order to
derive the appropriate semantics for their input records, which is at
least an equivalent amount of work. But there are also other poten-
tial problems with not using a schema for large data sets.
Whatever structure exists in MR input files must be built into
the Map and Reduce programs. Existing MR implementations pro-
vide built-in functionality to handle simple key/value pair formats,
but the programmer must explicitly write support for more com-
plex data structures, such as compound keys. This is possibly an
acceptable approach if a MR data set is not accessed by multiple
applications. If such data sharing exists, however, a second pro-
grammer must decipher the code written by the first programmer to
decide how to process the input file. A better approach, followed
by all SQL DBMSs, is to separate the schema from the application
and store it in a set of system catalogs that can be queried.
But even if the schema is separated from the application and
made available to multiple MR programs through a description fa-
cility, the developers must also agree on a single schema. This ob-
viously requires some commitment to a data model or models, and
the input files must obey this commitment as it is cumbersome to
modify data attributes once the files are created.
Once the programmers agree on the structure of data, something
or someone must ensure that any data added or modified does not
violate integrity or other high-level constraints (e.g., employee salaries
must be non-negative). Such conditions must be known and explic-
itly adhered to by all programmers modifying a particular data set;
a MR framework and its underlying distributed storage system has
no knowledge of these rules, and thus allows input data to be easily
corrupted with bad data. By again separating such constraints from
the application and enforcing them automatically by the run time
system, as is done by all SQL DBMSs, the integrity of the data is
enforced without additional work on the programmer’s behalf.
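For instance, the salary rule above can be declared once in the schema and then enforced by the DBMS for every application; the table below is hypothetical and shown only to illustrate the idea:

-- Hypothetical schema: the CHECK constraint rejects negative salaries
-- no matter which program performs the insert or update.
CREATE TABLE Employee (
    emp_id INTEGER PRIMARY KEY,
    name   VARCHAR(100) NOT NULL,
    salary DECIMAL(10,2) NOT NULL CHECK (salary >= 0)
);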
In summary, when no sharing is anticipated, the MR paradigm is
quite flexible. If sharing is needed, however, then we argue that it is
advantageous for the programmer to use a data description language
and factor schema definitions and integrity constraints out of appli-
cation programs. This information should be installed in common
system catalogs accessible to the appropriate users and applications.
3.2 Indexing
All modern DBMSs use hash or B-tree indexes to accelerate ac-
cess to data. If one is looking for a subset of records (e.g., em-
ployees with a salary greater than $100,000), then using a proper
index reduces the scope of the search dramatically. Most database
systems also support multiple indexes per table. Thus, the query
optimizer can decide which index to use for each query or whether
to simply perform a brute-force sequential search.
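As a simple illustration, the salary query above could be supported by a secondary index such as the following (the table and index names are hypothetical):

-- The optimizer may use this index for the predicate below, or it may
-- still choose a sequential scan if the predicate is not selective.
CREATE INDEX Employee_salary_idx ON Employee (salary);

SELECT emp_id, name
FROM Employee
WHERE salary > 100000;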
Because the MR model is so simple, MR frameworks do not pro-
vide built-in indexes. The programmer must implement any indexes
that they may desire to speed up access to the data inside of their
application. This is not easily accomplished, as the framework’s
data fetching mechanisms must also be instrumented to use these
indexes when pushing data to running Map instances. Once more,
this is an acceptable strategy if the indexes do not need to be shared
between multiple programmers, despite requiring every MR pro-
grammer to re-implement the same basic functionality.
If sharing is needed, however, then the specifications of what in-
dexes are present and how to use them must be transferred between
programmers. It is again preferable to store this index information
in a standard format in the system catalogs, so that programmers
can query this structure to discover such knowledge.
3.3 Programming Model
During the 1970s, the database research community engaged in a
contentious debate between the relational advocates and the Coda-
syl advocates [18]. The salient issue of this discussion was whether
a program to access data in a DBMS should be written either by:
1. Stating what you want rather than presenting an algorithm
for how to get it (Relational)
2. Presenting an algorithm for data access (Codasyl)
In the end, the former view prevailed and the last 30 years is
a testament to the value of relational database systems. Programs
in high-level languages, such as SQL, are easier to write, easier
to modify, and easier for a new person to understand. Codasyl
was criticized for being “the assembly language of DBMS access”.
We argue that MR programming is somewhat analogous to Codasyl
programming: one is forced to write algorithms in a low-level lan-
guage in order to perform record-level manipulation. On the other
hand, to many people brought up programming in procedural lan-
guages, such as C/C++ or Java, describing tasks in a declarative
language like SQL can be challenging.
Anecdotal evidence from the MR community suggests that there
is widespread sharing of MR code fragments to do common tasks,
such as joining data sets. To alleviate the burden of having to re-
implement repetitive tasks, the MR community is building high-
level languages on top of the current interface to move such func-
tionality into the run time. Pig [15] and Hive [2] are two notable
projects in this direction.
3.4 Data Distribution
The conventional wisdom for large-scale databases is to always
send the computation to the data, rather than the other way around.
In other words, one should send a small program over the network
to a node, rather than importing a large amount of data from the
node. Parallel DBMSs use knowledge of data distribution and loca-
tion to their advantage: a parallel query optimizer strives to balance
computational workloads while minimizing the amount of data trans-
mitted over the network connecting the nodes of the cluster.
Aside from the initial decision on where to schedule Map in-
stances, a MR programmer must perform these tasks manually. For
example, suppose a user writes a MR program to process a collec-
tion of documents in two parts. First, the Map function scans the
documents and creates a histogram of frequently occurring words.
The documents are then passed to a Reduce function that groups
files by their site of origin. Using this data, the user, or another
user building on the first user’s work, now wants to find sites with
a document that contains more than five occurrences of the word
‘Google’ or the word ‘IBM’. In the naive implementation of this
query, where the Map is executed over the accumulated statistics,
the filtration is done after the statistics for all documents are com-
puted and shipped to reduce workers, even though only a small sub-
set of documents satisfy the keyword filter.
In contrast, the following SQL view and select queries perform a
similar computation:

CREATE VIEW Keywords AS
SELECT siteid, docid, word, COUNT(*) AS wordcount
FROM Documents
GROUP BY siteid, docid, word;

SELECT DISTINCT siteid
FROM Keywords
WHERE (word = 'IBM' OR word = 'Google') AND wordcount > 5;
A modern DBMS would rewrite the second query such that the
view definition is substituted for the Keywords table in the FROM
clause. Then, the optimizer can push the WHERE clause in the query
down so that it is applied to the Documents table before the COUNT
is computed, substantially reducing computation. If the documents
are spread across multiple nodes, then this filter can be applied on
each node before documents belonging to the same site are grouped
together, generating much less network I/O.
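Conceptually, the rewritten query resembles the following sketch; the exact rewrite is optimizer-specific, and only the word predicate can be pushed below the aggregate (the wordcount predicate must still be evaluated after the COUNT):

SELECT DISTINCT siteid
FROM (SELECT siteid, docid, word, COUNT(*) AS wordcount
      FROM Documents
      WHERE word = 'IBM' OR word = 'Google'
      GROUP BY siteid, docid, word) AS Keywords
WHERE wordcount > 5;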
3.5 Execution Strategy
There is a potentially serious performance problem related to
MR’s handling of data transfer between Map and Reduce jobs. Re-
call that each of the N Map instances produces M output files,
each destined for a different Reduce instance. These files are writ-
ten to the local disk on the node executing each particular Map in-
stance. If N is 1000 and M is 500, the Map phase of the program
produces 500,000 local files. When the Reduce phase starts, each
of the 500 Reduce instances needs to read its 1000 input files and
must use a file-transfer protocol to “pull” each of its input files from
the nodes on which the Map instances were run. With 100s of Re-
duce instances running simultaneously, it is inevitable that two or
more Reduce instances will attempt to read their input files from
the same map node simultaneously, inducing large numbers of disk
seeks and slowing the effective disk transfer rate. This is why par-
allel database systems do not materialize their split files and instead
use a push approach to transfer data rather than a pull.
3.6 Flexibility
Despite its widespread adoption, SQL is routinely criticized for
its insufficient expressive prowess. Some believe that it was a mis-
take for the database research community in the 1970s to focus on
data sub-languages that could be embedded in any programming
language, rather than adding high-level data access to all program-
ming languages. Fortunately, new application frameworks, such as
Ruby on Rails [21] and LINQ [14], have started to reverse this sit-
uation by leveraging new programming language functionality to
implement an object-relational mapping pattern. These program-
ming environments allow developers to benefit from the robustness
of DBMS technologies without the burden of writing complex SQL.
Proponents of the MR model argue that SQL does not facilitate
the desired generality that MR provides. But almost all of the major
DBMS products (commercial and open-source) now provide sup-
port for user-defined functions, stored procedures, and user-defined
aggregates in SQL. Although this does not have the full generality
of MR, it does improve the flexibility of database systems.
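As a small and simplified example of this extensibility, a user can register a scalar function and invoke it from ordinary SQL; the sketch below uses PostgreSQL-flavored syntax, and the function name and body are purely illustrative:

-- Illustrative user-defined function: extract the host portion of a URL.
CREATE FUNCTION domain_of(url TEXT) RETURNS TEXT AS $$
    SELECT split_part(split_part(url, '//', 2), '/', 1);
$$ LANGUAGE SQL;

-- Example use: returns 'example.com'
SELECT domain_of('http://example.com/docs/page.html');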
3.7 Fault Tolerance
The MR frameworks provide a more sophisticated failure model
than parallel DBMSs. While both classes of systems use some form
of replication to deal with disk failures, MR is far more adept at
handling node failures during the execution of a MR computation.
In a MR system, if a unit of work (i.e., processing a block of data)
fails, then the MR scheduler can automatically restart the task on
an alternate node. Part of the flexibility is the result of the fact that
the output files of the Map phase are materialized locally instead of
being streamed to the nodes running the Reduce tasks. Similarly,
pipelines of MR jobs, such as the one described in Section 4.3.4,
materialize intermediate results to files each step of the way. This
differs from parallel DBMSs, which have larger granules of work
(i.e., transactions) that are restarted in the event of a failure. Part of
the reason for this approach is that DBMSs avoid saving interme-
diate results to disk whenever possible. Thus, if a single node fails
during a long running query in a DBMS, the entire query must be
completely restarted.
4. PERFORMANCE BENCHMARKS
In this section, we present our benchmark consisting of five tasks
that we use to compare the performance of the MR model with that
of parallel DBMSs. The first task is taken directly from the origi-
nal MapReduce paper [8], which its authors claim is representative of
common MR tasks. Because this task is quite simple, we also devel-
oped four additional tasks comprising more complex analytical
workloads designed to explore the trade-offs discussed in the pre-
vious section. We executed our benchmarks on a well-known MR
implementation and two parallel DBMSs.
4.1 Benchmark Environment
As we describe the details of our benchmark environment, we
note how the different data analysis systems that we test differ in
operating assumptions and discuss the ways in which we dealt with
them in order to make the experiments uniform.
4.1.1 Tested Systems
Hadoop: The Hadoop system is the most popular open-source im-
plementation of the MapReduce framework, under development
by Yahoo! and the Apache Software Foundation [1]. Unlike the
Google implementation of the original MR framework written in
C++, the core Hadoop system is written entirely in Java. For our
experiments in this paper, we use Hadoop version 0.19.0 running
on Java 1.6.0. We deployed the system with the default configura-
tion settings, except for the following changes that we found yielded
better performance without diverging from core MR fundamentals:
(1) data is stored using 256MB data blocks instead of the default
64MB, (2) each task executor JVM ran with a maximum heap size
of 512MB and the DataNode/JobTracker JVMs ran with a maxi-
mum heap size of 1024MB (for a total size of 3.5GB per node),
(3) we enabled Hadoop’s “rack awareness” feature for data locality
in the cluster, and (4) we allowed Hadoop to reuse the task JVM
executor instead of starting a new process for each Map/Reduce task.
Moreover, we configured the system to run two Map instances and
a single Reduce instance concurrently on each node.
The Hadoop framework also provides an implementation of the
Google distributed file system [12]. For each benchmark trial, we
store all input and output data in the Hadoop distributed file system
(HDFS). We used the default HDFS settings of three replicas
per block and no compression; we also tested other configura-
tions, such as using only a single replica per block as well as block-
and record-level compression, but we found that our tests almost
always executed at the same speed or worse with these features en-
abled (see Section 5.1.3). After each benchmark run finishes for a
particular node scaling level, we delete the data directories on each
node and reformat HDFS so that the next set of input data is repli-
cated uniformly across all nodes.
Hadoop uses a central job tracker and a “master” HDFS daemon
to coordinate node activities. To ensure that these daemons do not
affect the performance of worker nodes, we execute both of these
additional framework components on a separate node in the cluster.
DBMS-X: We used the latest release of DBMS-X, a parallel SQL
DBMS from a major relational database vendor that stores data in
a row-based format. The system is installed on each node and con-
figured to use 4GB shared memory segments for the buffer pool
and other temporary space. Each table is hash partitioned across
all nodes on the salient attribute for that particular table, and then
sorted and indexed on different attributes (see Sections 4.2.1 and
4.3.1). As with the Hadoop experiments, we deleted the tables in DBMS-
X and reloaded the data for each trial to ensure that the tuples were
uniformly distributed in the cluster.
By default DBMS-X does not compress data in its internal stor-
age, but it does provide the ability to compress tables using a well-
known dictionary-based scheme. We found that enabling compres-
sion reduced the execution times for almost all the benchmark tasks
by 50%, and thus we only report results with compression enabled.
In only one case did we find that using compression actually per-
formed worse. Furthermore, because all of our benchmarks are
read-only, we did not enable replication features in DBMS-X, since
this would not have improved performance and would have compli-
cated the installation process.
Vertica: The Vertica database is a parallel DBMS designed for
large data warehouses [3]. The main distinction of Vertica from
other DBMSs (including DBMS-X) is that all data is stored as columns,
rather than rows [20]. It uses a unique execution engine designed
specifically for operating on top of a column-oriented storage layer.
Unlike DBMS-X, Vertica compresses data by default since its ex-
ecutor can operate directly on compressed tables. Because dis-
abling this feature is not typical in Vertica deployments, the Ver-
tica results in this paper are generated using only compressed data.
Vertica also sorts every table by one or more attributes based on a
clustered index.
We found that the default 256MB buffer size per node performed
well in our experiments. The Vertica resource manager is respon-
sible for setting the amount of memory given to queries, but we
provide a hint to the system to expect to execute only one query at
a time. Thus, each query receives most of the maximum amount of
memory available on each node at runtime.
4.1.2 Node Configuration
All three systems were deployed on a 100-node cluster. Each
node has a single 2.40 GHz Intel Core 2 Duo processor running 64-
bit Red Hat Enterprise Linux 5 (kernel version 2.6.18) with 4GB
RAM and two 250GB SATA-I hard disks. According to hdparm,
the hard disks deliver 7GB/sec for cached reads and about 74MB/sec
for buffered reads. The nodes are connected with Cisco Catalyst
3750E-48TD switches. This switch has gigabit Ethernet ports for
each node and an internal switching fabric of 128Gbps [6]. There
are 50 nodes per switch. The switches are linked together via Cisco
StackWise Plus, which creates a 64Gbps ring between the switches.
Traffic between two nodes on the same switch is entirely local to the
switch and does not impact traffic on the ring.
4.1.3 Benchmark Execution
For each benchmark task, we describe the steps used to imple-
ment the MR program as well as provide the equivalent SQL state-
ment(s) executed by the two database systems. We executed each
task three times and report the average of the trials. Each system ex-
ecutes the benchmark tasks separately to ensure exclusive access to
the cluster’s resources. To measure the basic performance without
the overhead of coordinating parallel tasks, we first execute each
task on a single node. We then execute the task on different cluster
sizes to show how each system scales as both the amount of data
processed and available resources are increased. We only report
results using trials where all nodes are available and the system’s
software operates correctly during the benchmark execution.
We also measured the time it takes for each system to load the
test data. The results from these measurements are split between
the actual loading of the data and any additional operations after the
loading that each system performs, such as compressing or building
indexes. The initial input data on each node is stored on one of its
two locally installed disks.
Unless otherwise indicated, the final results from the queries ex-
ecuting in Vertica and DBMS-X are piped from a shell command
into a file on the disk not used by the DBMS. Although it is possi-
ble to do an equivalent operation in Hadoop, it is easier (and more
common) to store the results of a MR program into the distributed
file system. This procedure, however, is not analogous to how the
DBMSs produce their output data; rather than storing the results in
a single file, the MR program produces one output file for each Re-
duce instance and stores them in a single directory. The standard
practice is for developers then to use these output directories as a
single input unit for other MR jobs. If, however, a user wishes to
use this data in a non-MR application, they must first combine the
results into a single file and download it to the local file system.
Because of this discrepancy, we execute an extra Reduce func-
tion for each MR benchmark task that simply combines the final
output into a single file in HDFS. Our results differentiate between
the execution times for Hadoop running the actual benchmark task
versus the additional combine operation. Thus, the Hadoop results
displayed in the graphs for this paper are shown as stacked bars: the
lower portion of each bar is the execution time for just the specific
benchmark task, while the upper portion is the execution time for
the single Reduce function to combine all of the program’s output
data into a single file.
4.2 The Original MR Task
Our first benchmark task is the “Grep task” taken from the orig-
inal MapReduce paper, which the authors describe as “represen-
tative of a large subset of the real programs written by users of
MapReduce” [8]. For this task, each system must scan through a
data set of 100-byte records looking for a three-character pattern.
Each record consists of a unique key in the first 10 bytes, followed
by a 90-byte random value. The search pattern is only found in the
last 90 bytes once in every 10,000 records.
The input data is stored on each node in plain text files, with one
record per line. For the Hadoop trials, we uploaded these files unal-
tered directly into HDFS. To load the data into Vertica and DBMS-
X, we execute each system’s proprietary load commands in parallel
on each node and store the data using the following schema:
CREATE TABLE Data (
key VARCHAR(10) PRIMARY KEY,
field VARCHAR(90) );
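With this schema in place, the pattern search itself reduces to a single SQL statement of roughly the following form (the three-character pattern shown here is a placeholder):

SELECT * FROM Data WHERE field LIKE '%XYZ%';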
We execute the Grep task using two different data sets. The mea-
surements in the original MapReduce paper are based on process-
ing 1TB of data on approximately 1800 nodes, which is 5.6 million
records or roughly 535MB of data per node. For each system, we
execute the Grep task on cluster sizes of 1, 10, 25, 50, and 100
nodes. The total number of records processed for each cluster size
is therefore 5.6 million times the number of nodes. The perfor-
mance of each system not only illustrates how each system scales
as the amount of data is increased, but also allows us to (to some
extent) compare the results to the original MR system.
While our first dataset fixes the size of the data per node to be the
same as the original MR benchmark and only varies the number of
nodes, our second dataset fixes the total dataset size to be the same
as the original MR benchmark (1TB) and evenly divides the data
amongst a variable number of nodes. This task measures how well
each system scales as the number of available nodes is increased.
