Practical Skew Handling in Parallel Joins

Open Access, Proceedings Article

David J. DeWitt, +3 more

- pp 27-40

TLDR

This work developed, implemented, and experimented with four new skew-handling parallel join algorithms. One of them, virtual processor range partitioning, was the clear winner in high-skew cases, while traditional hybrid hash join was the clear winner in lower-skew and no-skew cases.

Abstract:

We present an approach to dealing with skew in parallel joins in database systems. Our approach is easily implementable within current parallel DBMS, and performs well on skewed data without degrading the performance of the system on non-skewed data. The main idea is to use multiple algorithms, each specialized for a different degree of skew, and to use a small sample of the relations being joined to determine which algorithm is appropriate. We developed, implemented, and experimented with four new skew-handling parallel join algorithms; one, which we call virtual processor range partitioning, was the clear winner in high skew cases, while traditional hybrid hash join was the clear winner in lower skew or no skew cases. We present experimental results from an implementation of all four algorithms on the Gamma parallel database machine. To our knowledge, these are the first reported skew-handling numbers from an actual implementation.
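The selection idea in the abstract (sample the relations, then pick an algorithm suited to the observed degree of skew) can be illustrated with a minimal sketch. The abstract does not say how skew is measured or where the cutoff lies, so everything specific here is an assumption made for illustration: the function names `estimate_skew` and `choose_join_algorithm`, the metric (share of the sample taken by its most frequent join key), and the `skew_factor` threshold are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def estimate_skew(join_keys, sample_size=1000, seed=0):
    """Return the fraction of a random sample occupied by its most common key.

    Assumption: this "heaviest key share" is only one plausible skew
    measure; the paper's actual measure is not given in the abstract.
    """
    k = min(sample_size, len(join_keys))
    if k == 0:
        return 0.0
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    sample = rng.sample(join_keys, k)  # sample without replacement
    _, top_count = Counter(sample).most_common(1)[0]
    return top_count / k


def choose_join_algorithm(join_keys, num_processors, skew_factor=2.0,
                          sample_size=1000, seed=0):
    """Pick an algorithm name from a sample-based skew estimate.

    Hypothetical rule: if one key alone would exceed `skew_factor` times a
    processor's fair share (1 / num_processors) of the tuples, treat the
    input as highly skewed.
    """
    heaviest_share = estimate_skew(join_keys, sample_size, seed)
    fair_share = 1.0 / num_processors
    if heaviest_share > skew_factor * fair_share:
        return "virtual_processor_range_partitioning"
    return "hybrid_hash"


if __name__ == "__main__":
    uniform_keys = list(range(10_000))             # every key appears once
    skewed_keys = [7] * 9_000 + list(range(1_000))  # one key dominates
    print(choose_join_algorithm(uniform_keys, num_processors=8))
    print(choose_join_algorithm(skewed_keys, num_processors=8))
```

The two algorithm names returned mirror the winners reported in the abstract for the low-skew and high-skew cases; a real system would dispatch to the corresponding join operator rather than return a string.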
