Performance evaluation of big data frameworks for large-scale data analytics
Citations
An experimental survey on big data frameworks
Evaluation of distributed stream processing frameworks for IoT applications in Smart Cities
Network Intrusion Detection with a Hashing Based Apriori Algorithm Using Hadoop MapReduce
A comprehensive performance analysis of Apache Hadoop and Apache Spark for large scale data sets using HiBench
System and Architecture Level Characterization of Big Data Applications on Big and Little Core Server Architectures
References
MapReduce: simplified data processing on large clusters
Spark: cluster computing with working sets
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing
PEGASUS: A Peta-Scale Graph Mining System Implementation and Observations
Frequently Asked Questions (14)
Q2. What are the future works mentioned in the paper "Performance evaluation of big data frameworks for large-scale data analytics" ?
As future work, the authors plan to further investigate the impact of additional configuration parameters on the performance of these frameworks (e.g., spilling threshold, network buffers).
Q3. What are the network interconnects that have been evaluated?
The network interconnects that have been evaluated are GbE and IPoIB, configuring the frameworks to use each interface for shuffle operations and HDFS replications.
Q4. How does Spark perform in relational queries?
Results show that Spark outperforms Flink by up to 2x for WordCount, while Flink is better than Spark by up to 3x for K-Means, 2.5x for PageRank and 5x for a relational query.
Q5. Why does Spark obtain the best results in WordCount?
In WordCount, Spark obtains the best results because of its API, which provides a reduceByKey() function to sum up the number of times each word appears.
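As an illustration of the pattern behind Spark's reduceByKey(), the following is a minimal plain-Python sketch (not PySpark itself): each word is mapped to a count of 1, and counts are then summed per key, which is exactly the reduction reduceByKey(lambda a, b: a + b) performs in a WordCount job.

```python
from collections import defaultdict

def word_count(lines):
    """Sketch of the map + reduce-by-key pattern used in WordCount:
    map each word to (word, 1), then sum the counts per key."""
    counts = defaultdict(int)
    for line in lines:
        for word in line.split():
            # Equivalent to reduceByKey(lambda a, b: a + b) for this word
            counts[word] += 1
    return dict(counts)

print(word_count(["to be or", "not to be"]))
```

In Spark, the same reduction runs in parallel across partitions, with partial sums combined locally before the shuffle, which is what makes the reduceByKey() API efficient for this workload.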
Q6. What is the way to minimize the overhead of garbage collections?
The transparent use of persistent memory management using a custom object serializer for Flink operations minimizes the overhead of garbage collections.
Q7. How many times does Spark outperform Hadoop?
Results show that Spark outperforms Hadoop up to 48x for the PAM clustering algorithm and up to 99x for the CG linear system solver.
Q8. What is the impact of the network on the internode communications?
The impact of the network not only affects the internode communications during the shuffle phase, but also the writing operations to HDFS, which replicate the data blocks among the slave nodes.
Q9. How many times did Flink outperform Spark?
Results show that Flink outperforms Spark by up to 3x for the Histogram and Mapping to Region workloads and that, contrary to the results of the relational query in [8], Spark was better by up to 4x for the Join workload.
Q10. What makes Flink the most scalable framework for PageRank?
Flink's execution engine, along with efficient memory management that avoids major garbage collections, makes it the most scalable framework for PageRank, obtaining execution times up to 6.9x and 3.6x faster than Hadoop and Spark, respectively.
Q11. How many threads are used in the experiments?
These experiments (as well as the thread configuration experiments described later) have used the maximum data size considered, 250 GB for WordCount and TeraSort, and 19.5 GB for PageRank, in order to maximize their computational requirements.
Q12. What is the main reason why the Flink optimizer is better than Hadoop?
The authors conclude that the processing of some operators, such as groupBy or join, and the pipelining of data between operators are more efficient in Flink, and that the Flink optimizer provides better performance for complex algorithms.
Q13. What are the thread configurations of the frameworks?
The thread configurations of the frameworks determine how the computational resources of each node are allocated to the Java processes and threads.
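As an illustration (the values below are placeholders, not the settings evaluated in the paper), such thread configurations are typically set through each framework's configuration files:

```
# Spark (spark-defaults.conf): number of executor JVMs and cores per executor
spark.executor.instances  4
spark.executor.cores      8

# Flink (flink-conf.yaml): parallel task slots per TaskManager JVM
taskmanager.numberOfTaskSlots: 8
```

Together, these parameters determine how many Java processes run per node and how many threads each process may use, which is why they directly affect CPU and memory utilization.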
Q14. What is the main difference between the Hadoop MapReduce engine and the Spark engine?
Although it can provide good scalability for batch workloads, the Hadoop MapReduce engine is strictly disk-based, incurring high disk overhead and requiring extra memory copies during data processing.