A comparison of approaches to large-scale data analysis
References
MapReduce: simplified data processing on large clusters
The Google file system
Dryad: distributed data-parallel programs from sequential building blocks
Pig latin: a not-so-foreign language for data processing
Frequently Asked Questions (17)
Q2. How many records are processed in the original mapreduce paper?
The measurements in the original MapReduce paper are based on processing 1TB of data on approximately 1,800 nodes, which works out to about 5.6 million records, or roughly 535MB of data, per node.
Q3. How long does it take to read the UserVisits and Rankings tables off disk?
It takes approximately 600 seconds of raw I/O to read the UserVisits and Rankings tables off disk, and then another 300 seconds to split, parse, and deserialize the various attributes.
Q4. What is the attractive aspect of the MapReduce programming model?
One of the attractive qualities about the MapReduce programming model is its simplicity: an MR program consists only of two functions, called Map and Reduce, that are written by a user to process key/value data pairs.
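To make this concrete, here is a minimal sketch of the model in plain Python, using the canonical word-count example (the function and variable names are illustrative, not from the paper): the user supplies only `map_fn` and `reduce_fn`, while the framework handles grouping intermediate pairs by key.

```python
from collections import defaultdict

def map_fn(key, value):
    # Map: emit an intermediate (word, 1) pair for each word in the line.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce: sum all counts emitted for one word.
    yield (key, sum(values))

def run_mapreduce(records, map_fn, reduce_fn):
    # Simplified stand-in for the framework's shuffle phase:
    # group every intermediate value under its key, then reduce.
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    return dict(kv for k, vs in groups.items() for kv in reduce_fn(k, vs))

counts = run_mapreduce([(0, "a b a"), (1, "b c")], map_fn, reduce_fn)
# counts == {"a": 2, "b": 2, "c": 1}
```

In a real MR system the shuffle is distributed and sorted across nodes; this single-process version only shows the division of labor between user code and framework.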
Q5. How did MR perform the pre-aggregate before data was transmitted to the Reduce instances?
The authors also used MR’s Combine feature to perform the pre-aggregate before data is transmitted to the Reduce instances, improving the first query’s execution time by a factor of two [8].
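A combiner behaves like a map-side mini-reduce: it collapses intermediate pairs before they cross the network. A rough sketch of the idea (names are illustrative, assuming the count-style aggregation used in the benchmark queries):

```python
from collections import defaultdict

def combine(mapper_output):
    # Combiner: pre-aggregate (key, count) pairs on the map side so that
    # fewer intermediate records are transmitted to the Reduce instances.
    partial = defaultdict(int)
    for key, value in mapper_output:
        partial[key] += value
    return list(partial.items())

raw = [("url1", 1), ("url1", 1), ("url2", 1), ("url1", 1)]
combined = combine(raw)
# four intermediate records shrink to two before transmission
```

Because the combiner applies the same associative aggregation as the reducer, the final result is unchanged while network traffic drops.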
Q6. What language is more familiar to programmers than SQL?
Most programmers are more familiar with object-oriented, imperative programming than with other language technologies, such as SQL.
Q7. How many nodes do a MR system need to perform a query?
In addition, if an MR system needs 1,000 nodes to match the performance of a 100-node parallel database system, it is ten times more likely that a node will fail while a query is executing.
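The "ten times more likely" claim follows from independent node failures: with a small per-node failure probability p per query, the probability that at least one of n nodes fails is 1 - (1 - p)^n, which is approximately n·p, so a 1,000-node cluster is about 10x as failure-prone as a 100-node one. A quick check (the value of p here is an arbitrary assumption for illustration):

```python
def p_any_failure(n_nodes, p_node):
    # Probability that at least one of n independent nodes fails
    # during the query: 1 - (1 - p)^n.
    return 1 - (1 - p_node) ** n_nodes

p = 1e-4  # assumed per-node failure probability per query
ratio = p_any_failure(1000, p) / p_any_failure(100, p)
# ratio is close to 10 as long as n * p stays small
```

The approximation degrades as n·p grows, which is exactly why mid-query fault tolerance matters more on very large clusters.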
Q8. What is the probability of mid-query hardware failures in parallel DBMSs?
Since parallel DBMSs will be deployed on larger clusters over time, the probability of mid-query hardware failures will increase.
Q9. Why do programmers need to specify their goal in a high level language?
Because programmers only need to specify their goal in a high level language, they are not burdened by the underlying storage details, such as indexing options and join strategies.
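This separation is easy to see with any SQL DBMS: the query states only the goal, and the system chooses the physical plan. A small sketch using Python's built-in sqlite3 (the table and column names echo the benchmark's UserVisits schema but are assumptions here): adding an index changes nothing in the query text, only in how the DBMS may execute it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE uservisits (source_ip TEXT, ad_revenue REAL)")
conn.executemany("INSERT INTO uservisits VALUES (?, ?)",
                 [("1.1.1.1", 0.5), ("1.1.1.1", 0.25), ("2.2.2.2", 1.0)])

# Declarative goal: total revenue per source IP. The DBMS decides
# whether to scan, use an index, how to group, etc.
query = "SELECT source_ip, SUM(ad_revenue) FROM uservisits GROUP BY source_ip"
before = conn.execute(query).fetchall()

# Adding an index is a storage-level change; the query is untouched.
conn.execute("CREATE INDEX idx_ip ON uservisits (source_ip)")
after = conn.execute(query).fetchall()
# same results, potentially a different physical plan
```

In an MR program, by contrast, the equivalent change would typically require editing the Map and Reduce code itself.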
Q10. What is the approach to enforce the integrity of data?
By again separating such constraints from the application and enforcing them automatically by the run time system, as is done by all SQL DBMSs, the integrity of the data is enforced without additional work on the programmer’s behalf.
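A minimal sketch of run-time constraint enforcement, again using sqlite3 (the schema is a hypothetical example, not from the paper): the application simply issues inserts, and the DBMS rejects any row that violates the declared invariants.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Constraints are declared once, in the schema, not re-checked
# by every application that writes to the table.
conn.execute("""
    CREATE TABLE rankings (
        page_url  TEXT PRIMARY KEY,
        page_rank INTEGER NOT NULL CHECK (page_rank >= 0)
    )
""")
conn.execute("INSERT INTO rankings VALUES ('a.com', 10)")   # accepted
try:
    conn.execute("INSERT INTO rankings VALUES ('b.com', -5)")
except sqlite3.IntegrityError:
    pass  # the run-time system rejects the violating row
```

With MR, any such invariant would have to be re-implemented and re-verified inside every program that touches the files.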
Q11. What is the function that calculates the inlink count for a given key?
Given these records, the Reduce function then simply counts the number of values for a given key and outputs the URL and the calculated inlink count as the program’s final output.
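The Reduce side of this inlink-counting task is almost trivial: the shuffle has already grouped all source pages under their target URL, so the reducer only counts values. A sketch of that step in Python (function and argument names are illustrative):

```python
def reduce_inlinks(target_url, source_urls):
    # Reduce: the key is a target URL and the values are the pages
    # linking to it; the inlink count is simply the number of values.
    return (target_url, len(list(source_urls)))

result = reduce_inlinks("example.com/a", ["x.com", "y.com", "z.com"])
# result == ("example.com/a", 3)
```

All of the real work, extracting (target, source) pairs from page contents, happens in the Map function and in the framework's grouping.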
Q12. Why did the authors initially think that block-level compression would improve the performance of the Map and Reduce?
The authors initially believed that this would improve CPU-bound tasks, because the Map and Reduce tasks no longer needed to split the fields by the delimiter.
Q13. What is the main reason why the central job tracker is required to coordinate node activities?
As the total number of allocated Map tasks increases, there is additional overhead required for the central job tracker to coordinate node activities.
Q14. How did the authors find that compression reduced the execution times for almost all the benchmark tasks?
The authors found that enabling compression reduced the execution times for almost all the benchmark tasks by 50%, and thus the authors only report results with compression enabled.
Q15. What other data format options resulted in slower load and execution times?
The authors found that other data format options, such as SequenceFileInputFormat or custom Writable tuples, resulted in both slower load and execution times.
Q16. Why did the authors not enable replication in DBMS-X?
Because all of their benchmarks are read-only, the authors did not enable replication features in DBMS-X; this would not have improved performance and would have complicated the installation process.
Q17. How do the authors measure the basic performance without the overhead of coordinating parallel tasks?
To measure the basic performance without the overhead of coordinating parallel tasks, the authors first execute each task on a single node.