Journal ArticleDOI

Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems

Onur Mutlu1, Thomas Moscibroda1
01 Jun 2008 - Vol. 36, Iss. 3, pp. 63-74
TL;DR: A parallelism-aware batch scheduler that seamlessly incorporates support for system-level thread priorities and can provide different service levels, including purely opportunistic service, to threads with different priorities, and is also simpler to implement than STFM.
Abstract: In a chip-multiprocessor (CMP) system, the DRAM system is shared among cores. In a shared DRAM system, requests from a thread can not only delay requests from other threads by causing bank/bus/row-buffer conflicts but they can also destroy other threads’ DRAM-bank-level parallelism. Requests whose latencies would otherwise have been overlapped could effectively become serialized. As a result both fairness and system throughput degrade, and some threads can starve for long time periods. This paper proposes a fundamentally new approach to designing a shared DRAM controller that provides quality of service to threads, while also improving system throughput. Our parallelism-aware batch scheduler (PAR-BS) design is based on two key ideas. First, PAR-BS processes DRAM requests in batches to provide fairness and to avoid starvation of requests. Second, to optimize system throughput, PAR-BS employs a parallelism-aware DRAM scheduling policy that aims to process requests from a thread in parallel in the DRAM banks, thereby reducing the memory-related stall-time experienced by the thread. PAR-BS seamlessly incorporates support for system-level thread priorities and can provide different service levels, including purely opportunistic service, to threads with different priorities. We evaluate the design trade-offs involved in PAR-BS and compare it to four previously proposed DRAM scheduler designs on 4-, 8-, and 16-core systems. Our evaluations show that, averaged over 100 4-core workloads, PAR-BS improves fairness by 1.11X and system throughput by 8.3% compared to the best previous scheduling technique, Stall-Time Fair Memory (STFM) scheduling. Based on simple request prioritization rules, PAR-BS is also simpler to implement than STFM.
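
The batching and ranking ideas are simple enough to sketch in code. The fragment below is a minimal illustration, not the paper's hardware implementation: the marking cap value, structure names, and per-bank bookkeeping are assumptions. It shows how a batch could be formed, how the Max-Total (shortest-job-first-style) ranking could be computed, and how that rank drives per-bank request selection.

```cpp
#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

// Sketch of the two PAR-BS ideas: (1) group outstanding requests into a batch
// so no thread's requests can be delayed indefinitely, and (2) rank threads
// within the batch so each thread's requests tend to be serviced in parallel
// across banks. Names and the cap value are illustrative.

struct Request {
    int thread_id;
    int bank;
    bool row_hit;      // request targets the currently open row of its bank
    bool marked;       // request belongs to the current batch
    uint64_t arrival;  // arrival time, used for oldest-first tie-breaking
};

struct ParBsScheduler {
    static constexpr int kMarkingCap = 5;  // marked requests per thread per bank (assumption)
    std::vector<int> rank;                 // rank[thread]; smaller value = higher priority

    // Form a new batch: mark up to kMarkingCap oldest requests per (thread, bank),
    // then rank threads by their marked load (Max-Total rule).
    void formBatch(std::vector<Request>& queue, int num_threads, int num_banks) {
        std::vector<std::vector<int>> load(num_threads, std::vector<int>(num_banks, 0));
        std::sort(queue.begin(), queue.end(),
                  [](const Request& a, const Request& b) { return a.arrival < b.arrival; });
        for (auto& r : queue) {
            if (load[r.thread_id][r.bank] < kMarkingCap) {
                r.marked = true;
                ++load[r.thread_id][r.bank];
            }
        }
        // Threads with a smaller maximum per-bank load (ties: smaller total load)
        // are "shorter jobs" and get a better (smaller) rank.
        std::vector<int> order(num_threads);
        for (int t = 0; t < num_threads; ++t) order[t] = t;
        std::sort(order.begin(), order.end(), [&](int a, int b) {
            int max_a = *std::max_element(load[a].begin(), load[a].end());
            int max_b = *std::max_element(load[b].begin(), load[b].end());
            int tot_a = 0, tot_b = 0;
            for (int v : load[a]) tot_a += v;
            for (int v : load[b]) tot_b += v;
            return std::make_tuple(max_a, tot_a) < std::make_tuple(max_b, tot_b);
        });
        rank.assign(num_threads, 0);
        for (int i = 0; i < num_threads; ++i) rank[order[i]] = i;
    }

    // Per-bank request selection: marked first, then row hits, then the
    // higher-ranked thread, then oldest.
    bool higherPriority(const Request& a, const Request& b) const {
        return std::make_tuple(!a.marked, !a.row_hit, rank[a.thread_id], a.arrival)
             < std::make_tuple(!b.marked, !b.row_hit, rank[b.thread_id], b.arrival);
    }
};
```
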
Citations
Proceedings ArticleDOI
21 Apr 2013
TL;DR: It is shown that an optimized, equal capacity STT-RAM main memory can provide performance comparable to DRAM main memory, with an average 60% reduction in main memory energy.
Abstract: In this paper, we explore the possibility of using STT-RAM technology to completely replace DRAM in main memory. Our goal is to make STT-RAM performance comparable to DRAM while providing substantial power savings. Towards this goal, we first analyze the performance and energy of STT-RAM, and then identify key optimizations that can be employed to improve its characteristics. Specifically, using partial write and row buffer write bypass, we show that STT-RAM main memory performance and energy can be significantly improved. Our experiments indicate that an optimized, equal capacity STT-RAM main memory can provide performance comparable to DRAM main memory, with an average 60% reduction in main memory energy.
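
As a rough illustration of why the partial-write optimization saves energy, the sketch below compares the write-back energy of a full line against writing only the words a per-word dirty mask identifies as modified. The line geometry and per-word energy constant are made-up values for illustration, not figures from the paper.

```cpp
#include <bitset>
#include <cstdint>

// STT-RAM write energy is roughly proportional to the amount of data actually
// written, so writing back only the dirty words of an evicted cache line
// (tracked by a per-word dirty mask) saves energy versus rewriting the line.

constexpr int kWordsPerLine = 8;             // e.g., a 64-byte line of 8-byte words (assumption)
constexpr double kEnergyPerWordWrite = 1.0;  // arbitrary energy unit (assumption)

struct DirtyTrackedLine {
    uint64_t words[kWordsPerLine];
    std::bitset<kWordsPerLine> dirty;        // set whenever a word is modified
};

double fullLineWritebackEnergy(const DirtyTrackedLine&) {
    return kWordsPerLine * kEnergyPerWordWrite;       // baseline: write every word
}

double partialWritebackEnergy(const DirtyTrackedLine& line) {
    return line.dirty.count() * kEnergyPerWordWrite;  // only dirty words touch the array
}
```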

478 citations

Proceedings ArticleDOI
01 Apr 2010
TL;DR: It is shown that the implementation of least-attained-service thread prioritization reduces the time the cores spend stalling and significantly improves system throughput, and ATLAS's performance benefit increases as the number of cores increases.
Abstract: Modern chip multiprocessor (CMP) systems employ multiple memory controllers to control access to main memory. The scheduling algorithm employed by these memory controllers has a significant effect on system throughput, so choosing an efficient scheduling algorithm is important. The scheduling algorithm also needs to be scalable — as the number of cores increases, the number of memory controllers shared by the cores should also increase to provide sufficient bandwidth to feed the cores. Unfortunately, previous memory scheduling algorithms are inefficient with respect to system throughput and/or are designed for a single memory controller and do not scale well to multiple memory controllers, requiring significant fine-grained coordination among controllers. This paper proposes ATLAS (Adaptive per-Thread Least-Attained-Service memory scheduling), a fundamentally new memory scheduling technique that improves system throughput without requiring significant coordination among memory controllers. The key idea is to periodically order threads based on the service they have attained from the memory controllers so far, and prioritize those threads that have attained the least service over others in each period. The idea of favoring threads with least-attained-service is borrowed from the queueing theory literature, where, in the context of a single-server queue, it is known that least-attained-service optimally schedules jobs, assuming a Pareto (or any decreasing hazard rate) workload distribution. After verifying that our workloads have this characteristic, we show that our implementation of least-attained-service thread prioritization reduces the time the cores spend stalling and significantly improves system throughput. Furthermore, since the periods over which we accumulate the attained service are long, the controllers coordinate very infrequently to form the ordering of threads, thereby making ATLAS scalable to many controllers. We evaluate ATLAS on a wide variety of multiprogrammed SPEC 2006 workloads and systems with 4–32 cores and 1–16 memory controllers, and compare its performance to five previously proposed scheduling algorithms. Averaged over 32 workloads on a 24-core system with 4 controllers, ATLAS improves instruction throughput by 10.8%, and system throughput by 8.4%, compared to PAR-BS, the best previous CMP memory scheduling algorithm. ATLAS's performance benefit increases as the number of cores increases.
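
The core ranking step can be sketched in a few lines. The code below is a simplified illustration only: the decay factor and the way service is measured are assumptions, and ATLAS's actual quantum handling and cross-controller coordination are more involved. At each quantum boundary, per-thread attained service is updated and threads are ordered least-service-first.

```cpp
#include <algorithm>
#include <vector>

// Least-attained-service (LAS) ranking sketch: threads that have received the
// least memory service so far get the highest priority for the next quantum.
struct AtlasRanker {
    std::vector<double> attained;  // accumulated (decayed) service per thread
    std::vector<int> rank;         // rank[thread]; smaller = higher priority
    double decay = 0.875;          // weight given to history across quanta (assumption)

    explicit AtlasRanker(int num_threads)
        : attained(num_threads, 0.0), rank(num_threads, 0) {}

    // Called once per quantum with the service (e.g., bank-busy cycles) each
    // thread received in this quantum, summed over all memory controllers.
    void newQuantum(const std::vector<double>& service_this_quantum) {
        for (size_t t = 0; t < attained.size(); ++t)
            attained[t] = decay * attained[t] + (1.0 - decay) * service_this_quantum[t];

        std::vector<int> order(attained.size());
        for (size_t t = 0; t < order.size(); ++t) order[t] = static_cast<int>(t);
        // Least attained service first => highest priority.
        std::sort(order.begin(), order.end(),
                  [&](int a, int b) { return attained[a] < attained[b]; });
        for (size_t i = 0; i < order.size(); ++i) rank[order[i]] = static_cast<int>(i);
    }
};
```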

439 citations

Patent
10 Dec 2012
TL;DR: A system, method, and computer program product for a memory system are described; the system includes a first semiconductor platform including at least one first circuit, and at least one additional semiconductor platform stacked with the first semiconductor platform and including at least one additional circuit.
Abstract: A system, method, and computer program product are provided for a memory system. The system includes a first semiconductor platform including at least one first circuit, and at least one additional semiconductor platform stacked with the first semiconductor platform and including at least one additional circuit.

387 citations

Proceedings ArticleDOI
04 Dec 2010
TL;DR: This paper presents a new memory scheduling algorithm that addresses system throughput and fairness separately with the goal of achieving the best of both, and evaluates TCM on a wide variety of multiprogrammed workloads and compares its performance to four previously proposed scheduling algorithms, finding that TCM achieves both the best system throughput and fairness.
Abstract: In a modern chip-multiprocessor system, memory is a shared resource among multiple concurrently executing threads. The memory scheduling algorithm should resolve memory contention by arbitrating memory access in such a way that competing threads progress at a relatively fast and even pace, resulting in high system throughput and fairness. Previously proposed memory scheduling algorithms are predominantly optimized for only one of these objectives: no scheduling algorithm provides the best system throughput and best fairness at the same time. This paper presents a new memory scheduling algorithm that addresses system throughput and fairness separately with the goal of achieving the best of both. The main idea is to divide threads into two separate clusters and employ different memory request scheduling policies in each cluster. Our proposal, Thread Cluster Memory scheduling (TCM), dynamically groups threads with similar memory access behavior into either the latency-sensitive (memory-non-intensive) or the bandwidth-sensitive (memory-intensive) cluster. TCM introduces three major ideas for prioritization: 1) we prioritize the latency-sensitive cluster over the bandwidth-sensitive cluster to improve system throughput, 2) we introduce a ``niceness'' metric that captures a thread's propensity to interfere with other threads, 3) we use niceness to periodically shuffle the priority order of the threads in the bandwidth-sensitive cluster to provide fair access to each thread in a way that reduces inter-thread interference. On the one hand, prioritizing memory-non-intensive threads significantly improves system throughput without degrading fairness, because such ``light'' threads only use a small fraction of the total available memory bandwidth. On the other hand, shuffling the priority order of memory-intensive threads improves fairness because it ensures no thread is disproportionately slowed down or starved. We evaluate TCM on a wide variety of multiprogrammed workloads and compare its performance to four previously proposed scheduling algorithms, finding that TCM achieves both the best system throughput and fairness. Averaged over 96 workloads on a 24-core system with 4 memory channels, TCM improves system throughput and reduces maximum slowdown by 4.6%/38.6% compared to ATLAS (previous work providing the best system throughput) and 7.6%/4.6% compared to PAR-BS (previous work providing the best fairness).
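
A toy version of the clustering and shuffling steps might look like the sketch below. It is deliberately simplified: the MPKI-based threshold and the plain random shuffle are assumptions, whereas TCM actually clusters by each thread's share of total memory bandwidth and uses a niceness-aware insertion shuffle.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Sketch of TCM's two steps: (1) split threads into a latency-sensitive and a
// bandwidth-sensitive cluster by memory intensity, and (2) periodically
// shuffle the priority order inside the bandwidth-sensitive cluster.
struct TcmClusters {
    std::vector<int> latency_sensitive;    // prioritized as a group
    std::vector<int> bandwidth_sensitive;  // shuffled among themselves
    std::mt19937 rng{42};

    // misses_per_kilo_instr[t]: memory intensity of thread t in the last quantum.
    void cluster(const std::vector<double>& misses_per_kilo_instr,
                 double cluster_threshold /* a few MPKI; assumption */) {
        latency_sensitive.clear();
        bandwidth_sensitive.clear();
        for (size_t t = 0; t < misses_per_kilo_instr.size(); ++t)
            (misses_per_kilo_instr[t] <= cluster_threshold ? latency_sensitive
                                                           : bandwidth_sensitive)
                .push_back(static_cast<int>(t));
    }

    // Called every shuffle interval: reorder only the heavy threads so none of
    // them stays at the bottom of the priority order for long.
    void shuffleHeavyThreads() {
        std::shuffle(bandwidth_sensitive.begin(), bandwidth_sensitive.end(), rng);
    }

    // Final priority order: all light threads first, then the heavy threads in
    // their current shuffled order.
    std::vector<int> priorityOrder() const {
        std::vector<int> order = latency_sensitive;
        order.insert(order.end(), bandwidth_sensitive.begin(), bandwidth_sensitive.end());
        return order;
    }
};
```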

375 citations


Cites background or methods from "Parallelism-Aware Batch Scheduling:..."

  • ...Grouping of threads into clusters happens in a synchronized manner across all memory controllers to better exploit bank-level parallelism [5, 14]....

    [...]

  • ...Thread-unaware scheduling policies have been shown to be low-performance and prone to starvation when multiple competing threads share the memory controller in general-purpose multicore/multithreaded systems [11, 16, 18, 4, 13, 14, 5]....

    [...]

  • ...Compared to PAR-BS [14], the best previous algorithm in terms of fairness, TCM improves system throughput and reduces maximum slowdown by 7.6%/4.6%....

    [...]

  • ...Section 7 compares TCM quantitatively with four state-of-the-art schedulers [19, 13, 14, 5]....

    [...]

  • ...Finally, as previous work has shown, it is desirable that scheduling decisions are made in a synchronized manner across all banks [5, 14, 12], so that concurrent requests of each thread are serviced in parallel, without being serialized due to interference from other threads....

    [...]

Proceedings ArticleDOI
14 Jun 2011
TL;DR: A large opportunity for memory power reduction is demonstrated with a simple control algorithm that adjusts memory voltage and frequency based on memory bandwidth utilization; the algorithm is evaluated in a real system.
Abstract: Energy efficiency and energy-proportional computing have become a central focus in enterprise server architecture. As thermal and electrical constraints limit system power, and datacenter operators become more conscious of energy costs, energy efficiency becomes important across the whole system. There are many proposals to scale energy at the datacenter and server level. However, one significant component of server power, the memory system, remains largely unaddressed. We propose memory dynamic voltage/frequency scaling (DVFS) to address this problem, and evaluate a simple algorithm in a real system. As we show, in a typical server platform, memory consumes 19% of system power on average while running SPEC CPU2006 workloads. While increasing core counts demand more bandwidth and drive the memory frequency upward, many workloads require much less than peak bandwidth. These workloads suffer minimal performance impact when memory frequency is reduced. When frequency is reduced, voltage can be reduced as well. We demonstrate a large opportunity for memory power reduction with a simple control algorithm that adjusts memory voltage and frequency based on memory bandwidth utilization. We evaluate memory DVFS in a real system, emulating reduced memory frequency by altering timing registers and using an analytical model to compute power reduction. With an average of 0.17% slowdown, we show 10.4% average (20.5% max) memory power reduction, yielding 2.4% average (5.2% max) whole-system energy improvement.
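
The kind of bandwidth-driven control loop the abstract alludes to can be summarized in a few lines. The frequency steps and utilization thresholds below are invented for illustration and are not the values used in the paper.

```cpp
#include <cstdio>

// Toy memory-DVFS policy: observe bandwidth utilization over an epoch and step
// the DRAM frequency (and, with it, voltage) down when utilization is low and
// up when it is high.

constexpr double kFreqStepsMhz[] = {800.0, 1066.0, 1333.0};  // assumed available steps
constexpr int kNumSteps = 3;

int nextFrequencyStep(int current_step, double utilization /* 0.0 .. 1.0 */) {
    // Hysteresis: raise frequency well before the bus saturates, lower it only
    // when there is clearly bandwidth to spare, so slowdown stays small.
    const double kLowerThreshold = 0.30;
    const double kRaiseThreshold = 0.70;
    if (utilization < kLowerThreshold && current_step > 0) return current_step - 1;
    if (utilization > kRaiseThreshold && current_step < kNumSteps - 1) return current_step + 1;
    return current_step;
}

int main() {
    int step = kNumSteps - 1;  // start at the highest frequency
    double sampled_utilization[] = {0.55, 0.20, 0.15, 0.80, 0.65};
    for (double u : sampled_utilization) {
        step = nextFrequencyStep(step, u);
        std::printf("utilization %.2f -> memory frequency %.0f MHz\n", u, kFreqStepsMhz[step]);
    }
    return 0;
}
```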

367 citations


Cites background from "Parallelism-Aware Batch Scheduling:..."

  • ...As is shown, memory bandwidth is highly variable, and depends on many factors: memory access rate, LLC residency, memory- and bank-level parallelism (MLP [13] and BLP [28]) and the ability to tolerate memory latency, for example....

    [...]

References
Journal ArticleDOI
12 Jun 2005
TL;DR: The goals are to provide easy-to-use, portable, transparent, and efficient instrumentation; to illustrate Pin's versatility, two Pintools in daily use to analyze production software are described.
Abstract: Robust and powerful software instrumentation tools are essential for program analysis tasks such as profiling, performance evaluation, and bug detection. To meet this need, we have developed a new instrumentation system called Pin. Our goals are to provide easy-to-use, portable, transparent, and efficient instrumentation. Instrumentation tools (called Pintools) are written in C/C++ using Pin's rich API. Pin follows the model of ATOM, allowing the tool writer to analyze an application at the instruction level without the need for detailed knowledge of the underlying instruction set. The API is designed to be architecture independent whenever possible, making Pintools source compatible across different architectures. However, a Pintool can access architecture-specific details when necessary. Instrumentation with Pin is mostly transparent as the application and Pintool observe the application's original, uninstrumented behavior. Pin uses dynamic compilation to instrument executables while they are running. For efficiency, Pin uses several techniques, including inlining, register re-allocation, liveness analysis, and instruction scheduling to optimize instrumentation. This fully automated approach delivers significantly better instrumentation performance than similar tools. For example, Pin is 3.3x faster than Valgrind and 2x faster than DynamoRIO for basic-block counting. To illustrate Pin's versatility, we describe two Pintools in daily use to analyze production software. Pin is publicly available for Linux platforms on four architectures: IA32 (32-bit x86), EM64T (64-bit x86), Itanium®, and ARM. In the ten months since Pin 2 was released in July 2004, there have been over 3000 downloads from its website.
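
To give a flavor of what a Pintool looks like, the sketch below follows the classic instruction-counting example distributed with Pin: it registers an instrumentation callback that inserts an analysis call before every instruction and prints the total at program exit. Build details and newer KNOB-based options are omitted.

```cpp
#include <iostream>
#include "pin.H"

// Counts every executed instruction of the instrumented application.
static UINT64 ins_count = 0;

// Analysis routine: called before every instruction at run time.
static VOID docount() { ins_count++; }

// Instrumentation routine: called once per instruction as Pin's JIT sees it.
static VOID Instruction(INS ins, VOID* v) {
    INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount, IARG_END);
}

// Called when the application exits.
static VOID Fini(INT32 code, VOID* v) {
    std::cerr << "Executed " << ins_count << " instructions" << std::endl;
}

int main(int argc, char* argv[]) {
    if (PIN_Init(argc, argv)) return 1;          // parse Pin/tool command line
    INS_AddInstrumentFunction(Instruction, 0);   // register instrumentation callback
    PIN_AddFiniFunction(Fini, 0);                // register exit callback
    PIN_StartProgram();                          // never returns
    return 0;
}
```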

4,019 citations


"Parallelism-Aware Batch Scheduling:..." refers methods in this paper

  • ...The functional front-end of the simulator is based on Pin [17] and iDNA [1]....

    [...]

Journal ArticleDOI

1,854 citations


"Parallelism-Aware Batch Scheduling:..." refers background or methods in this paper

  • ...To avoid this unfairness (and loss of system throughput as explained below), our ranking scheme is based on the shortest job first principle [36]: it ranks the non-intensive threads higher than the intensive ones....

    [...]

  • ...Consistent with the general machine scheduling theory [36], using the Max-Total ranking scheme to prioritize threads with fewer requests reduces the average stall time of threads within a batch....

    [...]

  • ...In the classic single-machine job-scheduling problem and many of its generalizations, shortest-job-first scheduling is optimal in that it minimizes the average job completion time [36]....

    [...]

Proceedings ArticleDOI
01 May 2000
TL;DR: This paper introduces memory access scheduling, a technique that improves the performance of a memory system by reordering memory references to exploit locality within the 3-D memory structure.
Abstract: The bandwidth and latency of a memory system are strongly dependent on the manner in which accesses interact with the “3-D” structure of banks, rows, and columns characteristic of contemporary DRAM chips. There is nearly an order of magnitude difference in bandwidth between successive references to different columns within a row and different rows within a bank. This paper introduces memory access scheduling, a technique that improves the performance of a memory system by reordering memory references to exploit locality within the 3-D memory structure. Conservative reordering, in which the first ready reference in a sequence is performed, improves bandwidth by 40% for traces from five media benchmarks. Aggressive reordering, in which operations are scheduled to optimize memory bandwidth, improves bandwidth by 93% for the same set of applications. Memory access scheduling is particularly important for media processors where it enables the processor to make the most efficient use of scarce memory bandwidth.
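
In its simplest per-bank form, the "first-ready" reordering this paper introduced reduces to the selection rule sketched below: prefer requests that hit the currently open row, and among those, the oldest one. This is a minimal illustration under assumed structures, not the paper's full command scheduler.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// FR-FCFS-style request selection sketch: row hits first (column accesses are
// roughly an order of magnitude cheaper than row activations), then oldest.
struct DramRequest {
    int bank;
    int row;
    uint64_t arrival;
};

// open_row[bank] holds the row currently latched in that bank's row buffer.
const DramRequest* pickNext(const std::vector<DramRequest>& queue,
                            const std::vector<int>& open_row) {
    const DramRequest* best = nullptr;
    auto key = [&](const DramRequest& r) {
        bool row_hit = (open_row[r.bank] == r.row);
        return std::make_pair(!row_hit, r.arrival);  // row hits first, then oldest
    };
    for (const auto& r : queue)
        if (!best || key(r) < key(*best)) best = &r;
    return best;  // nullptr if the queue is empty
}
```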

1,009 citations


"Parallelism-Aware Batch Scheduling:..." refers background or methods in this paper

  • ...For single-threaded systems, the FR-FCFS policy was shown to provide the best average performance [33, 32], significantly better than the simpler FCFS policy, which simply schedules all requests according to their arrival order, regardless of the row-buffer state....

    [...]

  • ...With a conventional parallelism-unaware DRAM scheduler (such as any previously proposed scheduler [44, 33, 32, 28, 25]), the requests can be serviced in their arrival order shown in Figure 2 (top)....

    [...]

  • ...A modern memory controller employs the FR-FCFS (first-ready first-come-first-serve) scheduling policy [44, 33, 32], which prioritizes ready DRAM commands from 1) row-hit requests over others and 2) row-hit status being equal, older requests over younger ones....

    [...]

  • ...A DRAM controller consists of a memory request buffer that buffers the memory requests (and their data) while they are waiting to be serviced and a (possibly two-level) scheduler that selects the next request to be serviced [33, 28, 25]....

    [...]

  • ...For a detailed description, we refer the reader to [33, 4, 25]....

    [...]

Book
01 Mar 1995
TL;DR: In this article, the authors describe the methods employed in the floating-point area of the System/360 Model 91 to exploit the existence of multiple execution units and register tagging schemes.
Abstract: This paper describes the methods employed in the floating-point area of the System/360 Model 91 to exploit the existence of multiple execution units. Basic to these techniques is a simple common data busing and register tagging scheme which permits simultaneous execution of independent instructions while preserving the essential precedences inherent in the instruction stream. The common data bus improves performance by efficiently utilizing the execution units without requiring specially optimized code. Instead, the hardware, by 'looking ahead' about eight instructions, automatically optimizes the program execution on a local basis. The application of these techniques is not limited to floating-point arithmetic or System/360 architecture. It may be used in almost any computer having multiple execution units and one or more 'accumulators'. Both of the execution units, as well as the associated storage buffers, multiple accumulators and input/output buses, are extensively checked.
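
The tagging-plus-common-data-bus mechanism can be illustrated with a small sketch (the structure names are illustrative, and real reservation stations track considerably more state): each waiting operand records the tag of its producing unit and captures the value when that tag appears on the bus.

```cpp
#include <cstdint>

// An instruction waiting in a reservation station does not name a register; it
// names the tag of the unit that will produce its operand. When a result and
// its tag are broadcast on the common data bus, every waiting station holding
// that tag captures the value and may become ready to issue.
struct Operand {
    bool ready = false;
    int tag = -1;        // producer's tag while the value is outstanding
    int64_t value = 0;   // operand value once ready
};

struct ReservationStation {
    Operand src1, src2;

    bool readyToIssue() const { return src1.ready && src2.ready; }

    // Called for every broadcast on the common data bus.
    void snoopCdb(int tag, int64_t value) {
        for (Operand* op : {&src1, &src2})
            if (!op->ready && op->tag == tag) {
                op->value = value;
                op->ready = true;
            }
    }
};
```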

784 citations

Journal ArticleDOI
12 Nov 2000
TL;DR: It is demonstrated that performance on a hardware multithreaded processor is sensitive to the set of jobs that are coscheduled by the operating system jobscheduler, and that a small sample of the possible schedules is sufficient to identify a good schedule quickly.
Abstract: Simultaneous Multithreading machines fetch and execute instructions from multiple instruction streams to increase system utilization and speed up the execution of jobs. When there are more jobs in the system than there is hardware to support simultaneous execution, the operating system scheduler must choose the set of jobs to coschedule. This paper demonstrates that performance on a hardware multithreaded processor is sensitive to the set of jobs that are coscheduled by the operating system jobscheduler. Thus, the full benefits of SMT hardware can only be achieved if the scheduler is aware of thread interactions. Here, a mechanism is presented that allows the scheduler to significantly raise the performance of SMT architectures. This is done without any advance knowledge of a workload's characteristics, using sampling to identify jobs which run well together. We demonstrate an SMT jobscheduler called SOS. SOS combines an overhead-free sample phase which collects information about various possible schedules, and a symbiosis phase which uses that information to predict which schedule will provide the best performance. We show that a small sample of the possible schedules is sufficient to identify a good schedule quickly. On a system with random job arrivals and departures, response time is improved as much as 17% over a schedule which does not incorporate symbiosis.
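
In outline, the sample/symbiosis split amounts to measuring a few candidate coschedules and then committing to the one predicted to perform best. The sketch below illustrates only that selection step; the metric name and structures are assumptions, and the real SOS predictor is richer.

```cpp
#include <vector>

// A candidate coschedule tried during the sample phase, with the throughput
// estimate observed while it ran.
struct Candidate {
    std::vector<int> coschedule;    // job ids run together on the SMT processor
    double sampled_throughput = 0;  // e.g., aggregate IPC measured during sampling
};

// Symbiosis phase: pick the candidate that sampled best; returns -1 if none.
int bestSchedule(const std::vector<Candidate>& sampled) {
    int best = -1;
    for (size_t i = 0; i < sampled.size(); ++i)
        if (best < 0 || sampled[i].sampled_throughput > sampled[best].sampled_throughput)
            best = static_cast<int>(i);
    return best;
}
```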

619 citations


"Parallelism-Aware Batch Scheduling:..." refers methods in this paper

  • ...Evaluation Metrics: We measure fairness using the unfairness index proposed in [25], where a thread's memory slowdown is its memory stall time per instruction when running together with other threads divided by the memory stall time per instruction it experiences when running alone on the same system: $\text{MemSlowdown}_i = \text{MCPI}_i^{\text{shared}} / \text{MCPI}_i^{\text{alone}}$, $\text{Unfairness} = \max_i \text{MemSlowdown}_i / \min_j \text{MemSlowdown}_j$. We measure system throughput using Weighted-Speedup [37] and Hmean-Speedup [18], which balances fairness and throughput [18]: $\text{W.Speedup} = \sum_i \text{IPC}_i^{\text{shared}} / \text{IPC}_i^{\text{alone}}$, $\text{H.Speedup} = \text{NumThreads} / \sum_i \big(\text{IPC}_i^{\text{alone}} / \text{IPC}_i^{\text{shared}}\big)$....

    [...]

  • ...We measure system throughput using Weighted-Speedup [37] and Hmean-Speedup [18], which balances fairness and throughput [18]:...

    [...]

  • ..., [37, 18, 8]) are complementary to our work and can be used in conjunction with PAR-BS....

    [...]

  • ...Unfortunately, as shown in Figure 12 (middle) and (right), it penalizes memory-intensive threads too much by allowing requests... [Figure 14 (Case Study II): Unfairness, Weighted-Speedup, and Hmean-Speedup under FR-FCFS, NFQ, STFM, and PAR-BS configurations]....

    [...]

  • ...[Figure 11: per-benchmark Memory Slowdown and Unfairness, Weighted-Speedup, and Hmean-Speedup for varying values of c]....

    [...]