Efficient processing of data warehousing queries in a split execution environment
Citations
Large-scale machine learning at twitter
Split query processing in polybase
Shark: fast data analysis using coarse-grained distributed memory
A comprehensive view of Hadoop research—A systematic literature review
Making Sense of Big Data.
References
MapReduce: simplified data processing on large clusters
Pregel: a system for large-scale graph processing
Parallel database systems: the future of high performance database systems
Map-Reduce for Machine Learning on Multicore
A comparison of approaches to large-scale data analysis
Frequently Asked Questions (16)
Q2. What is the main benefit of using a database system for these operations?
Employing a database system for these operations generally results in higher performance because a DBMS provides more efficient operator implementation, better I/O handling, and clustering/indexing.
Q3. What is the way to eliminate the first MapReduce job?
In some cases, HadoopDB’s SideDB extension can be used to entirely eliminate the first MapReduce job for a split semijoin.
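To make the term concrete, here is a minimal sketch of a generic two-phase semijoin, written in plain Python over in-memory lists. It is an illustration of the general technique only, not HadoopDB's or SideDB's implementation: the function names, table contents, and column names are all hypothetical. Phase 1 applies a selective predicate to one table and keeps only the surviving join keys; phase 2 uses that key set to filter the other table.

```python
def semijoin_keys(rows, key, predicate):
    """Phase 1: apply the selective predicate and keep only the join keys."""
    return {row[key] for row in rows if predicate(row)}


def semijoin_filter(rows, key, keys):
    """Phase 2: keep rows of the other table whose join key survived phase 1."""
    return [row for row in rows if row[key] in keys]


# Hypothetical toy tables; names and values are illustrative only.
part = [
    {"p_partkey": 1, "p_brand": "Brand#23"},
    {"p_partkey": 2, "p_brand": "Brand#11"},
    {"p_partkey": 3, "p_brand": "Brand#23"},
]
lineitem = [
    {"l_partkey": 1, "l_quantity": 5},
    {"l_partkey": 2, "l_quantity": 7},
    {"l_partkey": 3, "l_quantity": 2},
    {"l_partkey": 1, "l_quantity": 9},
]

keys = semijoin_keys(part, "p_partkey", lambda r: r["p_brand"] == "Brand#23")
matches = semijoin_filter(lineitem, "l_partkey", keys)
```

In a MapReduce setting each phase would normally be a separate job. The answer above says that SideDB can sometimes remove the first of those jobs; how it does so is not described here.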
Q4. What is the main benefit of implementing join operations in HadoopDB?
When tables are co-partitioned (e.g., hash partitioned on the join attribute), join operations can also be processed inside the database system.
Q5. Why is Hive unable to use hash partitioning?
Because it is typically deployed on top of a distributed file system, Hive is unable to use hash-partitioning on a join key for the colocation of related tables — a typical strategy that parallel databases exploit to minimize data movement across nodes.
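The colocation idea behind hash partitioning on a join key can be sketched as follows. This is a simplified, single-process illustration under stated assumptions: `NUM_NODES`, the helper names, and the toy tables are hypothetical, and a real parallel database would run each local join on a separate machine. Because both tables are partitioned with the same hash function on the join key, matching rows always land on the same node, so each node can join its own partitions without moving data.

```python
from collections import defaultdict

NUM_NODES = 4  # hypothetical cluster size


def node_for(key, num_nodes=NUM_NODES):
    """Map a join key to a node; the same function is used for both tables."""
    return hash(key) % num_nodes


def partition(rows, key, num_nodes=NUM_NODES):
    """Hash-partition rows on the given key."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_for(row[key], num_nodes)].append(row)
    return nodes


def local_join(left_rows, right_rows, left_key, right_key):
    """Hash join over the rows that live on a single node."""
    index = defaultdict(list)
    for r in right_rows:
        index[r[right_key]].append(r)
    return [{**l, **r} for l in left_rows for r in index.get(l[left_key], [])]


def co_partitioned_join(left, right, left_key, right_key, num_nodes=NUM_NODES):
    """Join two tables node by node, with no cross-node data movement."""
    left_parts = partition(left, left_key, num_nodes)
    right_parts = partition(right, right_key, num_nodes)
    result = []
    for node in range(num_nodes):
        result.extend(
            local_join(left_parts.get(node, []), right_parts.get(node, []),
                       left_key, right_key)
        )
    return result


# Hypothetical toy tables; names and values are illustrative only.
orders = [{"o_orderkey": k, "o_custkey": k * 10} for k in range(1, 7)]
lineitem = [{"l_orderkey": k % 6 + 1, "l_quantity": k} for k in range(12)]
joined = co_partitioned_join(orders, lineitem, "o_orderkey", "l_orderkey")
```

The node-by-node result contains the same rows as a join over the unpartitioned tables, which is the property that lets a co-partitioned join run entirely inside each node's database.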
Q6. What version of Linux is used to run the cluster?
Each node in the cluster has a single 2.40 GHz Intel Core 2 Duo processor running 64-bit Red Hat Enterprise Linux 5 (kernel version 2.6.
Q7. What are some of the reasons for the popularity of MapReduce?
Some of the reasons for the popularity of MapReduce include the availability of a free and open source implementation (Hadoop) [2], impressive ease-of-use experience [30], as well as Google’s, Yahoo!’s, and Facebook’s wide usage [19, 25] and evangelization of this technology.
Q8. What is the reason why PostgreSQL needs to swap to disk?
Given the large number of records, PostgreSQL is not able to keep all the intermediate data in memory and therefore needs to swap to disk.
Q9. Why did the authors not enable the replication features in DBMS-X?
Because their entire benchmark is read-only, the authors did not enable the replication features in DBMS-X; doing so would have complicated the installation process rather than improved performance.
Q10. Why is Hive unable to implement cost-based algorithms?
Because the system catalog lacks statistics on data distribution, cost-based algorithms cannot be implemented in Hive’s optimizer.
Q11. What is the reason for the performance improvement?
The reason for the performance improvement can be attributed to leveraging decades’ worth of research in the database systems community.
Q12. Why is MapReduce used for data analysis?
Even though many argue that MapReduce is not optimal for analyzing structured data [21, 30], it is nonetheless used increasingly frequently for that purpose because of a growing tendency to unify the data management platform.
Q13. What is the main benefit of using HadoopDB for multi-stage jobs?
To handle even more complicated queries that include multi-stage jobs, the authors enabled HadoopDB to consume records from a combined input consisting of data from both database tables and HDFS files.
Q14. What happens when the join cannot be pushed into the database system?
This happens when the join cannot be pushed into the database system and must therefore be performed by Hadoop, which is much slower than the DBMS.
Q15. How many MB of data is used in Q17?
In Q17, however, very selective predicates are applied to the part table (69 GB of raw data), resulting in only about 6 MB of data (around 600 thousand integer identifiers).
Q16. What is the effect of the addition of a join on the total query performance?
This extra join causes the total query performance to become slower by a factor of four versus the layout with referential partitioning.