Performance Evaluation of Big Data Frameworks for Large-Scale Data Analytics
Jorge Veiga, Roberto R. Expósito, Xoán C. Pardo, Guillermo L. Taboada, Juan Touriño
Computer Architecture Group, Universidade da Coruña, Spain
{jorge.veiga, rreye, pardo, taboada, juan}@udc.es
Abstract—The increasing adoption of Big Data analytics
has led to a high demand for efficient technologies in order
to manage and process large datasets. Popular MapReduce
frameworks such as Hadoop are being replaced by emerging
ones like Spark or Flink, which improve both the programming
APIs and performance. However, few works have focused
on comparing these frameworks. This paper addresses this
issue by performing a comparative evaluation of Hadoop,
Spark and Flink using representative Big Data workloads and
considering factors like performance and scalability. Moreover,
the behavior of these frameworks has been characterized by
modifying some of the main parameters of the workloads such
as HDFS block size, input data size, interconnect network or
thread configuration. The analysis of the results has shown that
replacing Hadoop with Spark or Flink can lead to a reduction
in execution times by 77% and 70% on average, respectively,
for non-sort benchmarks.
Keywords—Big Data; MapReduce; Hadoop; Spark; Flink
I. INTRODUCTION
In the last decade, Big Data analytics has been widely adopted by many organizations to obtain valuable information from the large datasets they manage. This is mainly caused by the appearance of new technologies that provide powerful functionalities to the end users, who can focus on the transformations to be performed on the data rather than on the parallelization of the algorithms.
One of these technologies is Apache Hadoop [1], an open-source implementation of the MapReduce model [2]. The success of Hadoop is mainly due to its parallelization abstraction, fault tolerance and scalable architecture, which supports both distributed storage and processing of large datasets. However, the performance of Hadoop is severely limited by the redundant memory copies and disk operations that it performs when processing the data [3]. As a result, new Big Data frameworks have appeared on the scene in the last several years, claiming to be a worthy alternative to Hadoop as they improve its performance by means of in-memory computing techniques. Apache Spark [4] and Apache Flink [5] are the ones that have attracted the most attention due to their easy-to-use programming APIs, high-level data processing operators and performance enhancements.
Although the authors of both Spark and Flink provide experimental results about their performance, there is a lack of impartial studies comparing these frameworks. This kind of analysis is extremely important to identify the strengths and weaknesses of current technologies and to help developers determine the best characteristics for future Big Data systems. Hence, this paper aims to assess the performance of Hadoop, Spark and Flink on equal terms, with the following contributions:

- A comparative performance evaluation of Hadoop, Spark and Flink with various batch and iterative workloads.
- A characterization of the impact of several experimental parameters on the overall performance.
The rest of this paper is organized as follows: Section II presents the related work. Section III briefly introduces the main characteristics of Hadoop, Spark and Flink. Section IV describes the experimental configuration and Section V analyzes the performance results. Finally, Section VI summarizes our concluding remarks and proposes future work.
II. RELATED WORK
Despite the importance of having performance studies that compare Hadoop, Spark and Flink, there are still not many publications on the subject. The authors of Spark [4] and Flink [5] have shown that their frameworks provide better performance than Hadoop using several workloads. A few impartial references compare Spark with Hadoop. In [6], several frameworks including Hadoop (1.0.3) and Spark (0.8.0) are evaluated on Amazon EC2 using iterative algorithms. Results show that Spark outperforms Hadoop by up to 48x for the PAM clustering algorithm and up to 99x for the CG linear system solver, but for the CLARA k-medoid clustering algorithm Spark is slower than Hadoop due to difficulties in handling a dataset with a large number of small objects. In [7], Hadoop (2.4.0) and Spark (1.3.0) are evaluated using a set of data analytics workloads on a 4-node cluster. Results show that Spark outperforms Hadoop by 2.5x for WordCount and 5x for K-Means and PageRank. The authors point to the efficiency of the hash-based aggregation component for the combine phase and to RDD caching as the main reasons. An exception is the Sort benchmark, for which Hadoop is 2x faster than Spark, showing a more efficient execution model for data shuffling.
It was only recently that some comparisons between Spark and Flink were published. In [8], both frameworks (versions are not mentioned) are compared on a 4-node cluster using real-world datasets. Results show that Spark outperforms Flink by up to 2x for WordCount, while Flink is better than Spark by up to 3x for K-Means, 2.5x for PageRank and 5x for a relational query. The authors conclude that the processing of some operators like groupBy or join and the pipelining of data between operators are more efficient in Flink, and that the Flink optimizer provides better performance with complex algorithms. In [9], Flink (0.9.0) and Spark (1.3.1) are evaluated on Amazon EC2 using three workloads from genomic applications over datasets of up to billions of genomic regions. Results show that Flink outperforms Spark by up to 3x for the Histogram and Mapping to Region workloads and that, contrary to the results of the relational query in [8], Spark was better by up to 4x for the Join workload. The authors conclude that concurrent execution in Flink is more efficient because it produces fewer sequential stages, and that the tuple-based pipelining of data between Flink operators is more efficient than the block-based Spark counterpart.
Finally, and although not in the scope of this work,
[10], [11] and [12] provide recent comparisons of Big Data
frameworks from the streaming point of view.
III. TECHNOLOGICAL BACKGROUND
The main characteristics of Hadoop [1], Spark [4] and
Flink [5] are explained below.
Hadoop: As the de facto standard implementation of the MapReduce model [2], Hadoop has been widely adopted by many organizations to store and compute large datasets. It mainly consists of two components: (1) the Hadoop Distributed File System (HDFS) and (2) the Hadoop MapReduce engine. The MapReduce model is based on two user-defined functions, map and reduce, which compute the data records represented by key-value pairs. The map function extracts the relevant characteristics of each pair and the reduce function operates on these characteristics to obtain the desired result. Although it can provide good scalability for batch workloads, the Hadoop MapReduce engine is strictly disk-based, incurring high disk overhead and requiring extra memory copies during data processing.
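To make the model concrete, the classic word count fits this pattern: map emits a (word, 1) pair for every word in a record, and reduce sums the values received for each key. The following sketch expresses that logic in plain Scala; it only illustrates the model, not Hadoop's actual Java API, and the sample records are made up.

```scala
// Word count expressed in the MapReduce model (illustrative sketch, not Hadoop code).
object WordCountModel {
  // map: one input record -> zero or more (key, value) pairs
  def map(line: String): Seq[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // reduce: one key and all of its values -> the final (key, value) pair
  def reduce(word: String, counts: Iterable[Int]): (String, Int) =
    (word, counts.sum)

  def main(args: Array[String]): Unit = {
    val records = Seq("to be or not to be", "be")
    // The framework groups map output by key (the shuffle) before calling reduce.
    val result = records.flatMap(line => map(line))
      .groupBy { case (word, _) => word }
      .map { case (word, pairs) => reduce(word, pairs.map(_._2)) }
    println(result) // e.g. Map(be -> 3, to -> 2, or -> 1, not -> 1)
  }
}
```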
Spark: Spark increases the variety of transformations that the user can perform over the data, while still including several operators for key-based computations (e.g., sortByKey), which makes Spark particularly suited to implement classic key-based MapReduce algorithms. The programming model of Spark is based on an abstraction called Resilient Distributed Datasets (RDDs) [13], which holds the data objects in memory to reduce the overhead caused by disk and network operations [4]. This kind of processing is especially well suited for algorithms that carry out several transformations over the same dataset, such as iterative algorithms. By storing intermediate results in memory, Spark avoids the use of HDFS between iterations and thus optimizes the performance of these workloads.
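As an illustration of that point, an iterative Spark job can mark the parsed RDD with cache() so that later passes read its partitions from memory instead of re-reading HDFS. The sketch below assumes a placeholder HDFS path and an arbitrary per-iteration computation; it is not taken from the benchmarks used in this paper.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeCachingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("iterative-caching-sketch"))

    // Parse the input once and keep the resulting RDD in memory.
    val vectors = sc.textFile("hdfs:///data/input")          // placeholder path
      .map(line => line.trim.split("\\s+").map(_.toDouble))
      .cache()                                               // reused by every iteration below

    var result = 0.0
    for (_ <- 1 to 8) {
      // Each pass reuses the cached partitions instead of re-reading HDFS.
      result = vectors.map(_.sum).reduce(_ + _)
    }
    println(result)
    sc.stop()
  }
}
```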
Table I: DAS-4 node configuration

Hardware configuration:
  CPU              2 × Intel Xeon E5620 Westmere
  CPU speed/turbo  2.4 GHz / 2.66 GHz
  #Cores           8
  Memory           24 GB DDR3
  Disk             2 × 1 TB HDD
  Network          InfiniBand (40 Gbps) & Gigabit Ethernet

Software configuration:
  OS version       CentOS release 6.6
  Kernel           2.6.32-358.18.1.el6.x86_64
  Java             Oracle JDK 1.8.0_25

Flink: Evolved from Stratosphere [5], Flink uses an approach similar to Spark to improve Hadoop performance by means of in-memory processing techniques. One of them is the use of efficient memory data structures that contain serialized data instead of Java objects, avoiding excessive garbage collection. Its programming model for batch processing is based on the notion of a DataSet, which is transformed by high-level operations (e.g., FlatMap). Unlike Spark, Flink is presented as a true streaming engine, as it is able to send data, tuple by tuple, from one operation to the next without executing computations on batches; for batch processing, batches are simply treated as finite sets of streaming data. It is worth noting that Flink includes explicit iteration operators: the bulk iterator is applied to complete DataSets, while the delta iterator is only applied to the items that changed during the last iteration.
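As an example of these operators, the following sketch uses the bulk iteration of the DataSet API, along the lines of the Monte Carlo pi estimation shipped as a Flink documentation example; delta iterations are written analogously with iterateDelta, which maintains a separate solution set and workset.

```scala
import org.apache.flink.api.scala._

object BulkIterationSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Bulk iteration: the step function is applied to the whole DataSet a fixed
    // number of times, keeping intermediate results inside the engine.
    val initial = env.fromElements(0)
    val hits = initial.iterate(10000) { current =>
      current.map { count =>
        val (x, y) = (math.random, math.random)
        count + (if (x * x + y * y < 1) 1 else 0)
      }
    }

    // Monte Carlo estimate of pi from the fraction of points inside the unit circle.
    hits.map(h => h / 10000.0 * 4).print()
  }
}
```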
IV. EXPERIMENTAL SETUP
This section presents the characteristics of the system where the evaluations have been carried out, the configuration of the frameworks and the settings for the workloads, as well as the different experiments that have been performed.
A. Testbed configuration
The evaluations have been conducted on DAS-4 [14], a
multi-core cluster interconnected via InfiniBand (IB) and
Gigabit Ethernet (GbE). Table I shows the main hardware
and software characteristics of this system. Each node has
8 cores, 24 GB of memory and 2 disks of 1 TB.
The experiments have been carried out by using the
Big Data Evaluator tool (BDEv), which is an evolution
of the MapReduce Evaluator [15]. BDEv automates the
configuration of the frameworks, the generation of the input
datasets, the execution of the experiments and the collection
of the results.
B. Frameworks
Regarding software settings, the evaluations have used stable versions of Hadoop (2.7.2), Spark (1.6.1) and Flink (1.0.2). Both Spark and Flink have been deployed in stand-alone mode with HDFS 2.7.2. The frameworks have been carefully configured according to their corresponding user guides and the characteristics of the system (e.g., number of CPU cores, memory size). Table II shows the most important parameters of the resulting configuration. The network interface of the frameworks was configured to use IP over InfiniBand (IPoIB), except in the GbE experiments (see Section V-B).

Table II: Configuration of the frameworks

Hadoop:
  HDFS block size           128 MB
  Replication factor        3
  Mapper/Reducer heap size  2.3 GB
  Mappers per node          4
  Reducers per node         4
  Shuffle parallel copies   20
  IO sort MB                600 MB
  IO sort spill percent     80%

Spark:
  HDFS block size           128 MB
  Replication factor        3
  Executor heap size        18.8 GB
  Workers per node          1
  Worker cores              8

Flink:
  HDFS block size                 128 MB
  Replication factor              3
  TaskManager heap size           18.8 GB
  TaskManagers per node           1
  TaskManager cores               8
  Network buffers per node        512
  TaskManager memory preallocate  false
  IO sort spill percent           80%

Table III: Benchmark sources

Benchmark             Characterization     Input size  Input generator    Hadoop      Spark              Flink
WordCount             CPU bound            100 GB      RandomTextWriter   Hadoop ex.  Adapted from ex.   Adapted from ex.
Grep                  CPU bound            10 GB       RandomTextWriter   Hadoop ex.  Adapted from ex.   Adapted from ex.
TeraSort              I/O bound            100 GB      TeraGen            Hadoop ex.  Adapted from [16]  Adapted from [16]
Connected Components  Iterative (8 iter.)  9 GB        DataGen            Pegasus     Graphx             Gelly
PageRank              Iterative (8 iter.)  9 GB        DataGen            Pegasus     Adapted from ex.   Adapted from ex.
K-Means               Iterative (8 iter.)  26 GB       GenKMeansDataset   Mahout      MLlib              Adapted from ex.
C. Benchmarks
Table III describes the benchmarks used in the experiments, along with their characterization as CPU bound, I/O bound (disk and network) or iterative. The size of the input datasets and the generators used for setting them up are shown in the next two columns. The table also includes the source of the benchmark codes, which have been carefully studied in order to provide a fair performance comparison. Each framework runs its own implementation of the same algorithm, taking the same input and writing the same output to HDFS; although the algorithm remains unchanged, each implementation is adapted to and optimized for the functionalities available in its framework. Further details
about each benchmark are given next.
WordCount: Counts the number of times each word appears in the input dataset. Both WordCount and its input data generator, RandomTextWriter, are provided as an example (“ex.” in the table) in the Hadoop distribution. In the case of Spark and Flink, the source code has been adapted from their examples.
Grep: Counts the matches of a regular expression in
the input dataset. It is included in the Hadoop distribution,
and in the case of Spark and Flink it has been adapted from
their examples. Its data generator is also RandomTextWriter.
TeraSort: Sorts 100-byte key-value tuples. Its
implementation, as well as the TeraGen data generator, is
included in Hadoop. However, TeraSort is not provided as
an example for Spark and Flink, and so their source codes
have been adapted from [16].
Connected Components: Iterative graph algorithm that
finds the connected components of a graph. It is included in
Pegasus [17], a graph mining system built on top of Hadoop.
In the case of Spark and Flink, Connected Components
is supported by Graphx [18] and Gelly [19], respectively,
which are graph-oriented APIs. The input dataset is set up
by using the DataGen tool, included in the HiBench [20]
benchmark suite.
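For Spark, the GraphX version reduces to loading the edge list and calling the built-in connectedComponents() operator, roughly as sketched below; the path is a placeholder and the actual benchmark code may differ in its input handling.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.GraphLoader

object ConnectedComponentsSketch {
  def run(sc: SparkContext): Unit = {
    // Load the generated edge list as a GraphX graph (placeholder path).
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///cc/edges")

    // Built-in connected components: each vertex ends up labeled with the
    // smallest vertex id of its component.
    val components = graph.connectedComponents().vertices
    components.take(10).foreach(println)
  }
}
```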
PageRank: Iterative graph algorithm which ranks elements by counting the number and quality of the links to each one. Pegasus includes PageRank for Hadoop, and the source codes for Spark and Flink have been adapted from their examples. Although there are implementations available in Graphx and Gelly, these versions did not improve the performance of the examples, and so they have not been used in the experiments. The input dataset of PageRank is also set up by DataGen.
K-Means: Iterative clustering algorithm that partitions a set of samples into K clusters. Apache Mahout [21] includes this algorithm for Hadoop and provides the dataset generator, GenKMeansDataSet, while Spark uses the efficient implementation provided by its machine learning library MLlib [22]. As the Flink machine learning library, Flink-ML, does not yet include an implementation of K-Means, its source code has been adapted from the example.
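For reference, the MLlib implementation boils down to a call to KMeans.train over a cached RDD of vectors, roughly as sketched below; the input path and the number of clusters are placeholders, as the paper does not list the exact values.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansSketch {
  def run(sc: SparkContext): Unit = {
    // Parse the generated dataset into dense vectors and cache it,
    // since K-Means performs several passes over the data.
    val points = sc.textFile("hdfs:///kmeans/input")   // placeholder path
      .map(line => Vectors.dense(line.trim.split("\\s+").map(_.toDouble)))
      .cache()

    // 8 iterations as in the experiments; the number of clusters is a placeholder.
    val model = KMeans.train(points, k = 10, maxIterations = 8)
    model.clusterCenters.foreach(println)
  }
}
```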
All these benchmarks are included in the latest version (2.2) of our BDEv tool, which is available to download at
http://bdev.des.udc.es.

D. Conducted evaluations
In order to perform a thorough experimental analysis, the
evaluations have studied two different aspects: performance
of the frameworks and impact of several configuration
parameters.
The first set of experiments, shown in Section V-A, compares the performance and the strong scalability of the frameworks. To do so, the benchmarks have been executed using 13, 25, 37 and 49 nodes, where each cluster size n means 1 master and n-1 slave nodes. The input data size of each benchmark is shown in Table III.
Section V-B analyzes the impact of some configuration parameters on the overall performance of the frameworks. These experiments have been carried out using different HDFS block sizes, input data sizes, network interconnects and thread configurations with the maximum cluster size that has been considered (i.e., 49 nodes). Three benchmarks have been selected for these experiments, WordCount, TeraSort and PageRank, which represent three types of workloads: CPU bound, I/O bound and iterative, respectively.
The HDFS block sizes that have been evaluated are 64, 128, 256 and 512 MB, using the same input data size as in Section V-A (see Table III). In the data size experiments, WordCount and TeraSort have processed 100, 150, 200 and 250 GB, whereas PageRank has processed 9, 12.5, 16 and 19.5 GB, which correspond to 15, 20, 25 and 30 million pages, respectively. The network interconnects that have been evaluated are GbE and IPoIB, configuring the frameworks to use each interface for shuffle operations and HDFS replication. These experiments (as well as the thread configuration experiments described later) have used the maximum data size considered, 250 GB for WordCount and TeraSort and 19.5 GB for PageRank, in order to maximize their computational requirements.
The thread configurations of the frameworks determine how the computational resources of each node are allocated to the Java processes and threads. On the one hand, Hadoop distributes the CPU cores between mappers and reducers, which are single-threaded processes. The #mappers/#reducers configurations evaluated are 4/4, 5/3, 6/2 and 7/1. On the other hand, Spark and Flink use Workers and TaskManagers, respectively, which are multi-threaded manager processes that run several tasks in parallel. The #managers/#cores per manager configurations evaluated are 1/8, 2/4, 4/2 and 8/1.
V. EXPERIMENTAL RESULTS
This section presents the analysis of the evaluation of Hadoop, Spark and Flink in terms of performance and scalability (Section V-A), as well as the impact of configuration parameters (Section V-B). The graphs in this section show the mean value from a minimum of 10 measurements; the observed standard deviations were not significant.
A. Performance and scalability
Execution times for all benchmarks are shown in Figure 1. As expected, these graphs demonstrate an important performance improvement of Spark and Flink over Hadoop. The comparison between Spark and Flink varies depending on the benchmark. With the maximum cluster size, Spark obtains the best results in WordCount and K-Means, while Flink is better for PageRank. Both obtain similar results for Grep, TeraSort and Connected Components.
In WordCount, Spark obtains the best results because of its API, which provides a reduceByKey() function to sum up the number of times each word appears. Flink uses a groupBy().sum() approach, which seems to be less optimized for this kind of workload. Furthermore, the CPU-bound behavior of WordCount makes the memory optimizations of Flink less significant compared to other benchmarks, and can even introduce a certain overhead when computing the results.
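The two formulations are sketched below, based on the standard WordCount examples of each framework; sc is assumed to be a SparkContext, env a Flink ExecutionEnvironment, and the surrounding job setup is omitted.

```scala
import org.apache.spark.SparkContext
import org.apache.flink.api.scala._

object WordCountVariants {
  // Spark: per-key aggregation with reduceByKey(), which combines
  // partial counts on the map side before the shuffle.
  def sparkWordCount(sc: SparkContext, input: String) =
    sc.textFile(input)
      .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

  // Flink: the equivalent logic with groupBy() on the word field
  // followed by sum() on the count field.
  def flinkWordCount(env: ExecutionEnvironment, input: String) =
    env.readTextFile(input)
      .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
      .map(word => (word, 1))
      .groupBy(0)
      .sum(1)
}
```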
In Grep, Spark and Flink widely outperform Hadoop for several reasons. The most important is the inadequacy of the MapReduce API for this benchmark. In Hadoop, the benchmark uses two MapReduce jobs: one for searching the pattern and another for sorting the results. This produces a high number of memory copies and writes to HDFS. Spark and Flink take a different approach, selecting the matching input lines by means of a filter() function, without copying them. Next, the selected lines are counted and sorted in memory. Moreover, the pattern matching in Hadoop is performed within the map() function, which uses only half of the CPU cores of each node, whereas in Spark and Flink the parallelism of all the operations is set to the total number of cores in the cluster.
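The filter-based structure is sketched below for Spark (the Flink version is analogous with DataSet operations); the regular expression and input path are placeholders, and the job setup is omitted.

```scala
import org.apache.spark.SparkContext

object GrepSketch {
  def run(sc: SparkContext): Unit = {
    val pattern = "placeholder[a-z]+regex".r                 // placeholder expression

    // Select the matching input lines with filter(), without copying them,
    // instead of chaining two MapReduce jobs as the Hadoop example does.
    val matching = sc.textFile("hdfs:///grep/input")         // placeholder path
      .filter(line => pattern.findFirstIn(line).isDefined)

    // Count and sort the selected matches in memory rather than writing
    // intermediate results to HDFS between two jobs.
    val counts = matching
      .map(line => (pattern.findFirstIn(line).get, 1))
      .reduceByKey(_ + _)
      .sortBy({ case (_, c) => c }, ascending = false)

    counts.take(10).foreach(println)
  }
}
```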
TeraSort is the benchmark that shows the smallest performance gap between Hadoop and Spark or Flink. The main reason is that Hadoop was originally intended for sorting, which is one of the core components of its MapReduce engine. Although Spark and Flink outperform Hadoop, the high scalability of Hadoop allows it to obtain competitive results, especially when using 49 nodes, while Spark and Flink are in a statistical tie. A similar benchmark, Sort, has also been evaluated in the experiments. However, its results were very similar to those of TeraSort, and so they are not shown in the graphs due to space constraints.
The performance of Spark and Flink for iterative algorithms (Figures 1d-1f) is clearly better than that of Hadoop (up to 87% improvement with 49 nodes). As mentioned in Section IV-C, both frameworks provide optimized libraries for graph algorithms, Graphx and Gelly, obtaining very similar results for Connected Components. That is not the case of PageRank, whose implementation has been derived from the examples. In this benchmark, Flink obtains the best performance mainly due to the use of delta iterations, which only process those elements that have not reached their final value. However, Spark obtains the best results for K-Means thanks to the optimized MLlib library, although it is expected that the support of K-Means in Flink-ML can bridge this performance gap.

Figure 1: Performance results. Execution times (s) versus cluster size (13, 25, 37 and 49 nodes) for Hadoop, Spark and Flink: (a) WordCount, (b) Grep, (c) TeraSort, (d) Connected Components, (e) PageRank, (f) K-Means.
To sum up, the performance results of this section show
that, excluding TeraSort, replacing Hadoop with Spark and
Flink can lead to a reduction in execution times by 77% and
70% on average, respectively, when using 49 nodes.
B. Impact of configuration parameters
This section shows how the performance of WordCount,
TeraSort and PageRank is affected when modifying some of
the configuration parameters of the experiments. Note that
those parameters which were not specifically modified keep
the values indicated in the experimental setup of Section IV
(see Table II).

References

- MapReduce: simplified data processing on large clusters (journal article). Presents MapReduce, a programming model and associated implementation for processing and generating large datasets on large clusters of commodity machines.
- MapReduce: simplified data processing on large clusters (journal article). Explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures and schedules inter-machine communication to make efficient use of the network and disks.
- Spark: cluster computing with working sets (conference paper). Shows that Spark can outperform Hadoop by 10x in iterative machine learning jobs and can be used to interactively query a 39 GB dataset with sub-second response time.
- Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing (conference paper). Presents Resilient Distributed Datasets, a distributed memory abstraction for fault-tolerant in-memory computations on large clusters, implemented and evaluated in Spark.
- PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations (conference paper). Describes PEGASUS, an open-source peta-scale graph mining library that performs typical graph mining tasks (graph diameter, per-node radius, connected components) and its core GIM-V primitive (Generalized Iterated Matrix-Vector multiplication).