Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems
Citations
478 citations
439 citations
387 citations
375 citations
Cites background or methods from "Parallelism-Aware Batch Scheduling:..."
...Grouping of threads into clusters happens in a synchronized manner across all memory controllers to better exploit bank-level parallelism [5, 14]....
[...]
...Thread-unaware scheduling policies have been shown to be low-performance and prone to starvation when multiple competing threads share the memory controller in general-purpose multicore/multithreaded systems [11, 16, 18, 4, 13, 14, 5]....
[...]
...Compared to PAR-BS [14], the best previous algorithm in terms of fairness, TCM improves system throughput and reduces maximum slowdown by 7....
[...]
...Section 7 compares TCM quantitatively with four state-of-the-art schedulers [19, 13, 14, 5]....
[...]
...Finally, as previous work has shown, it is desirable that scheduling decisions are made in a synchronized manner across all banks [5, 14, 12], so that concurrent requests of each thread are serviced in parallel, without being serialized due to interference from other threads....
[...]
367 citations
Cites background from "Parallelism-Aware Batch Scheduling:..."
...As is shown, memory bandwidth is highly variable, and depends on many factors: memory access rate, LLC residency, memory- and bank-level parallelism (MLP [13] and BLP [28]) and the ability to tolerate memory latency, for example....
[...]
References
4,019 citations
"Parallelism-Aware Batch Scheduling:..." refers to methods in this paper
...The functional front-end of the simulator is based on Pin [17] and iDNA [1]....
[...]
1,854 citations
"Parallelism-Aware Batch Scheduling:..." refers to background or methods in this paper
...To avoid this unfairness (and loss of system throughput as explained below), our ranking scheme is based on the shortest job first principle [36]: it ranks the non-intensive threads higher than the intensive ones....
[...]
...Consistent with the general machine scheduling theory [36], using the Max-Total ranking scheme to prioritize threads with fewer requests reduces the average stall time of threads within a batch....
[...]
...In the classic single-machine job-scheduling problem and many of its generalizations, shortest-job-first scheduling is optimal in that it minimizes the average job completion time [36]....
[...]
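The excerpts above cite the shortest-job-first optimality result. A minimal sketch (not the paper's implementation; job lengths are hypothetical) showing why running shorter jobs first minimizes average completion time on a single machine:

```python
def avg_completion_time(order):
    """Average completion time when jobs run back-to-back in `order`."""
    t, total = 0, 0
    for job_len in order:
        t += job_len          # this job finishes at time t
        total += t
    return total / len(order)

jobs = [8, 1, 3]              # hypothetical job lengths (arrival order)
sjf = sorted(jobs)            # shortest-job-first order

# SJF makes every short job finish before each long one starts,
# so its average completion time can never exceed the arrival order's.
assert avg_completion_time(sjf) <= avg_completion_time(jobs)
```

The same intuition carries over to PAR-BS's Max-Total ranking: prioritizing threads with fewer requests reduces the average stall time of threads within a batch.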
1,009 citations
"Parallelism-Aware Batch Scheduling:..." refers to background or methods in this paper
...For single-threaded systems, the FR-FCFS policy was shown to provide the best average performance [33, 32], significantly better than the simpler FCFS policy, which simply schedules all requests according to their arrival order, regardless of the row-buffer state....
[...]
...With a conventional parallelism-unaware DRAM scheduler (such as any previously proposed scheduler [44, 33, 32, 28, 25]), the requests can be serviced in their arrival order shown in Figure 2(top)....
[...]
...A modern memory controller employs the FR-FCFS (first-ready first-come-first-serve) scheduling policy [44, 33, 32], which prioritizes ready DRAM commands from 1) row-hit requests over others and 2) row-hit status being equal, older requests over younger ones....
[...]
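The FR-FCFS priority order quoted above can be sketched in a few lines. This is a simplified single-bank model (not the paper's implementation; the `(arrival, row)` request tuples are an assumption of this sketch): row hits beat row misses, and arrival order breaks ties.

```python
def frfcfs_pick(queue, open_row):
    """Pick the next request under FR-FCFS: prefer requests to the
    currently open row (row hits); among equals, prefer the oldest."""
    # Sort key: (is_row_miss, arrival_time); False < True, so hits win,
    # then smaller arrival times win.
    return min(queue, key=lambda req: (req[1] != open_row, req[0]))

queue = [(0, 5), (1, 7), (2, 7)]   # (arrival_time, row) per request

assert frfcfs_pick(queue, open_row=7) == (1, 7)  # row hit beats older miss
assert frfcfs_pick(queue, open_row=3) == (0, 5)  # no hits: oldest wins
```

The first assertion illustrates exactly the thread-unawareness the paper targets: a younger row-hit request is serviced before an older row-miss request.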
...A DRAM controller consists of a memory request buffer that buffers the memory requests (and their data) while they are waiting to be serviced and a (possibly two-level) scheduler that selects the next request to be serviced [33, 28, 25]....
[...]
...For a detailed description, we refer the reader to [33, 4, 25]....
[...]
784 citations
619 citations
"Parallelism-Aware Batch Scheduling:..." refers to methods in this paper
...Evaluation Metrics. We measure fairness using the unfairness index proposed in [25]: a thread's memory slowdown is the memory stall time per instruction it experiences when running together with other threads divided by the memory stall time per instruction it experiences when running alone on the same system:

MemSlowdown_i = MCPI_i^shared / MCPI_i^alone, Unfairness = max_i MemSlowdown_i / min_j MemSlowdown_j

We measure system throughput using Weighted-Speedup [37] and Hmean-Speedup [18], which balances fairness and throughput [18]:

W. Speedup = Sum_i (IPC_i^shared / IPC_i^alone), H. Speedup = NumThreads / Sum_i (IPC_i^alone / IPC_i^shared)

7.2....
[...]
...We measure system throughput using Weighted-Speedup [37] and Hmean-Speedup [18], which balances fairness and throughput [18]:...
[...]
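The three metrics quoted above are straightforward to compute. A minimal sketch, assuming per-thread shared-run and alone-run IPCs and memory slowdowns are given (the sample values are hypothetical):

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Weighted-Speedup: sum over threads of IPC_shared / IPC_alone."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

def hmean_speedup(ipc_shared, ipc_alone):
    """Hmean-Speedup: harmonic mean of per-thread speedups."""
    n = len(ipc_shared)
    return n / sum(a / s for s, a in zip(ipc_shared, ipc_alone))

def unfairness(mem_slowdowns):
    """Unfairness: max memory slowdown divided by min memory slowdown."""
    return max(mem_slowdowns) / min(mem_slowdowns)

ipc_shared = [0.5, 0.8]   # hypothetical per-thread IPCs when sharing
ipc_alone  = [1.0, 1.0]   # hypothetical per-thread IPCs when alone

assert abs(weighted_speedup(ipc_shared, ipc_alone) - 1.3) < 1e-9
assert abs(unfairness([2.0, 1.25]) - 1.6) < 1e-9
```

An unfairness of 1 means all threads slow down equally; Hmean-Speedup is dragged down by the slowest thread, which is why it balances fairness and throughput.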
..., [37, 18, 8]) are complementary to our work and can be used in conjunction with PAR-BS....
[...]
...Unfortunately, as shown in Figure 12(middle) and (right), it penalizes memory-intensive threads too much by allowing requests... [Figure 14 (Case Study II): Unfairness, Weighted-Speedup, and Hmean-Speedup under FR-FCFS, NFQ, STFM, and PAR-BS configurations, and per-application memory slowdowns for libquantum, mcf, GemsFDTD, xalancbmk, matlab, h264ref, omnetpp, and hmmer]....
[...]
...[Figure 11: sensitivity of Unfairness, Weighted-Speedup, Hmean-Speedup, and the memory slowdowns of libquantum, mcf, GemsFDTD, xalancbmk, matlab, h264ref, omnetpp, and hmmer to the cap parameter c (values 4 to 20, and no cap)]....
[...]