ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers
Yoongu Kim, Dongsu Han, Onur Mutlu, Mor Harchol-Balter
- pp 1-12
TLDR
Least-attained-service thread prioritization reduces the time the cores spend stalling and significantly improves system throughput, and ATLAS's performance benefit grows as the number of cores increases.
Abstract:
Modern chip multiprocessor (CMP) systems employ multiple memory controllers to control access to main memory. The scheduling algorithm employed by these memory controllers has a significant effect on system throughput, so choosing an efficient scheduling algorithm is important. The scheduling algorithm also needs to be scalable: as the number of cores increases, the number of memory controllers shared by the cores should also increase to provide sufficient bandwidth to feed the cores. Unfortunately, previous memory scheduling algorithms are inefficient with respect to system throughput and/or are designed for a single memory controller and do not scale well to multiple memory controllers, requiring significant fine-grained coordination among controllers. This paper proposes ATLAS (Adaptive per-Thread Least-Attained-Service memory scheduling), a fundamentally new memory scheduling technique that improves system throughput without requiring significant coordination among memory controllers. The key idea is to periodically order threads based on the service they have attained from the memory controllers so far, and prioritize those threads that have attained the least service over others in each period. The idea of favoring threads with the least attained service is borrowed from the queueing theory literature, where, in the context of a single-server queue, it is known that least-attained-service scheduling is optimal, assuming a Pareto (or any decreasing-hazard-rate) workload distribution. After verifying that our workloads have this characteristic, we show that our implementation of least-attained-service thread prioritization reduces the time the cores spend stalling and significantly improves system throughput. Furthermore, since the periods over which we accumulate the attained service are long, the controllers coordinate very infrequently to form the ordering of threads, thereby making ATLAS scalable to many controllers.
We evaluate ATLAS on a wide variety of multiprogrammed SPEC 2006 workloads and systems with 4–32 cores and 1–16 memory controllers, and compare its performance to five previously proposed scheduling algorithms. Averaged over 32 workloads on a 24-core system with 4 controllers, ATLAS improves instruction throughput by 10.8% and system throughput by 8.4% compared to PAR-BS, the best previous CMP memory scheduling algorithm. ATLAS's performance benefit increases as the number of cores increases.
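As a rough illustration of the key idea in the abstract, the following Python sketch ranks threads by the service they have attained so far, with the least-served thread first. It is a minimal sketch under stated assumptions, not the paper's implementation: the names `Controller`, `record`, `drain`, and `end_of_quantum`, the use of cycles as the unit of service, and the tie-break on thread ID are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """One memory controller (hypothetical model).

    Tracks the service it has granted to each thread in the current period.
    """
    local_service: dict = field(default_factory=dict)

    def record(self, thread_id, cycles):
        # Assumption: service is measured in cycles spent serving the thread.
        self.local_service[thread_id] = self.local_service.get(thread_id, 0) + cycles

    def drain(self):
        # Hand back this period's counters and reset them for the next period.
        counts, self.local_service = self.local_service, {}
        return counts


def end_of_quantum(controllers, total_service):
    """Aggregate per-controller counters into total_service (updated in place)
    and return thread IDs ordered from highest to lowest priority."""
    for mc in controllers:
        for tid, cycles in mc.drain().items():
            total_service[tid] = total_service.get(tid, 0) + cycles
    # Least attained service first; ties broken by thread ID (an assumption).
    return sorted(total_service, key=lambda tid: (total_service[tid], tid))


# Period 1: thread A is served heavily by both controllers.
mcs = [Controller(), Controller()]
mcs[0].record("A", 300)
mcs[1].record("A", 200)
mcs[0].record("B", 50)
mcs[1].record("C", 120)
total = {}
ranking_1 = end_of_quantum(mcs, total)   # ['B', 'C', 'A']

# Period 2: thread B becomes memory-intensive and drops to the lowest priority.
mcs[0].record("B", 600)
ranking_2 = end_of_quantum(mcs, total)   # ['C', 'A', 'B']
```

In this sketch the controllers exchange counters only once per period, inside `end_of_quantum`, which mirrors the abstract's point that long periods keep coordination among controllers infrequent.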
Citations
Journal Article
RAIDR: Retention-Aware Intelligent DRAM Refresh
TL;DR: This paper proposes RAIDR (Retention-Aware Intelligent DRAM Refresh), a low-cost mechanism that uses knowledge of cell retention times to identify and skip unnecessary refreshes. It groups DRAM rows into retention-time bins and applies a different refresh rate to each bin.
Proceedings Article
RowClone: fast and energy-efficient in-DRAM bulk data copy and initialization
Vivek Seshadri, Yoongu Kim, Chris Fallin, Donghyuk Lee, Rachata Ausavarungnirun, Gennady Pekhimenko, Yixin Luo, Onur Mutlu, Phillip B. Gibbons, Michael Kozuch, Todd C. Mowry
TL;DR: RowClone is proposed, a new and simple mechanism to perform bulk copy and initialization completely within DRAM, eliminating the need to transfer any data over the memory channel to perform such operations.
Proceedings Article
Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior
TL;DR: This paper presents a new memory scheduling algorithm that addresses system throughput and fairness separately, with the goal of achieving the best of both. Evaluated on a wide variety of multiprogrammed workloads against four previously proposed scheduling algorithms, TCM achieves both the best system throughput and the best fairness.
Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior
TL;DR: TCM dynamically groups threads with similar memory access behavior into either the latency-sensitive (memory-non-intensive) or the bandwidth-intensive (memory-intensive) cluster, and introduces a "niceness" metric that captures a thread's propensity to interfere with other threads.
Journal Article
A case for exploiting subarray-level parallelism (SALP) in DRAM
TL;DR: Three new mechanisms (SALP-1, SALP-2, and MASA) mitigate the negative impact of bank serialization by overlapping different components of the bank access latencies of multiple requests that go to different subarrays within the same bank.
References
Journal Article
Pin: building customized program analysis tools with dynamic instrumentation
Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, Kim Hazelwood
TL;DR: The goals are to provide easy-to-use, portable, transparent, and efficient instrumentation. To illustrate Pin's versatility, two Pintools in daily use for analyzing production software are described.
Journal Article
Wide area traffic: the failure of Poisson modeling
Vern Paxson, Sally Floyd
TL;DR: It is found that user-initiated TCP session arrivals, such as remote-login and file-transfer, are well-modeled as Poisson processes with fixed hourly rates, but that other connection arrivals deviate considerably from Poisson.
Journal Article
Networking named content
Van L. Jacobson, Diana K. Smetters, James D. Thornton, Michael F. Plass, Nicholas H. Briggs, R. Braynard
TL;DR: Content-Centric Networking (CCN) is presented, which uses content chunks as a primitive, decoupling location from identity, security, and access, and retrieving chunks of content by name. It simultaneously achieves scalability, security, and performance.
Journal Article
Analysis and simulation of a fair queueing algorithm
TL;DR: A fair queueing algorithm for gateways, based on an earlier suggestion by Nagle, is proposed to control congestion in datagram networks.
Journal Article
Self-similarity in World Wide Web traffic: evidence and possible causes
Mark Crovella, Azer Bestavros
TL;DR: It is shown that the self-similarity in WWW traffic can be explained based on the underlying distributions of WWW document sizes, the effects of caching and user preference in file transfer, the effect of user "think time", and the superimposition of many such transfers in a local-area network.
Related Papers (5)
Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems
Onur Mutlu, Thomas Moscibroda
Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors
Onur Mutlu, Thomas Moscibroda