Journal ArticleDOI

Symbiotic jobscheduling for a simultaneous multithreaded processor

12 Nov 2000 - Vol. 35, Iss. 11, pp. 234-244
TL;DR: It is demonstrated that performance on a hardware multithreaded processor is sensitive to the set of jobs that are coscheduled by the operating system jobscheduler, and that a small sample of the possible schedules is sufficient to identify a good schedule quickly.
Abstract: Simultaneous Multithreading machines fetch and execute instructions from multiple instruction streams to increase system utilization and speed up the execution of jobs. When there are more jobs in the system than there is hardware to support simultaneous execution, the operating system scheduler must choose the set of jobs to coschedule.

This paper demonstrates that performance on a hardware multithreaded processor is sensitive to the set of jobs that are coscheduled by the operating system jobscheduler. Thus, the full benefits of SMT hardware can only be achieved if the scheduler is aware of thread interactions. Here, a mechanism is presented that allows the scheduler to significantly raise the performance of SMT architectures. This is done without any advance knowledge of a workload's characteristics, using sampling to identify jobs which run well together.

We demonstrate an SMT jobscheduler called SOS. SOS combines an overhead-free sample phase, which collects information about various possible schedules, with a symbiosis phase, which uses that information to predict which schedule will provide the best performance. We show that a small sample of the possible schedules is sufficient to identify a good schedule quickly. On a system with random job arrivals and departures, response time is improved by as much as 17% over a scheduler that does not incorporate symbiosis.
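
The sample-then-predict loop the abstract describes can be made concrete with a short sketch. This is a hypothetical illustration, not the paper's implementation: the measure_ipc hook, the random choice of candidate coschedules, and the summed-IPC symbiosis score are all assumptions standing in for the paper's hardware-counter-based machinery.

```python
import random

def sample_phase(jobs, contexts, num_samples, measure_ipc):
    """Run a few candidate coschedules and record per-job IPC in each.
    measure_ipc(coschedule) is an assumed hook that briefly runs the given
    jobs together and returns {job: observed_ipc}."""
    samples = []
    for _ in range(num_samples):
        coschedule = tuple(random.sample(jobs, contexts))
        samples.append((coschedule, measure_ipc(coschedule)))
    return samples

def symbiosis_phase(samples):
    """Predict the best coschedule from the sampled data; here the
    symbiosis estimate is simply the summed IPC of the coscheduled jobs."""
    return max(samples, key=lambda s: sum(s[1].values()))[0]
```

A real scheduler would rerun both phases whenever jobs arrive or depart, since the best pairing depends on the current job mix.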


Citations
Journal ArticleDOI
02 Mar 2004
TL;DR: This paper examines two single-ISA heterogeneous multi-core architectures in detail, demonstrating dynamic core assignment policies that provide significant performance gains over naive assignment, and even outperform the best static assignment.
Abstract: A single-ISA heterogeneous multi-core architecture is a chip multiprocessor composed of cores of varying size, performance, and complexity. This paper demonstrates that this architecture can provide significantly higher performance in the same area than a conventional chip multiprocessor. It does so by matching the various jobs of a diverse workload to the various cores. This type of architecture covers a spectrum of workloads particularly well, providing high single-thread performance when thread parallelism is low, and high throughput when thread parallelism is high.

This paper examines two such architectures in detail, demonstrating dynamic core assignment policies that provide significant performance gains over naive assignment, and even outperform the best static assignment. It examines policies for heterogeneous architectures both with and without multithreading cores. One heterogeneous architecture we examine outperforms the comparable-area homogeneous architecture by up to 63%, and our best core assignment strategy achieves up to 31% speedup over a naive policy.
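
The dynamic core-assignment idea lends itself to a toy sketch. The greedy big/little IPC-ratio rule and the sampled-IPC inputs below are assumptions for illustration, not the paper's exact policies:

```python
def assign_cores(jobs, num_big, num_little, ipc_big, ipc_little):
    """Greedy assignment sketch: jobs that benefit most from a big core
    (highest big/little IPC ratio, e.g. sampled by briefly running each job
    on each core type) get the big cores; the rest share the little cores.
    ipc_big and ipc_little map each job to its sampled IPC on that core type."""
    ranked = sorted(jobs, key=lambda j: ipc_big[j] / ipc_little[j], reverse=True)
    return {"big": ranked[:num_big],
            "little": ranked[num_big:num_big + num_little]}
```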

647 citations

Proceedings ArticleDOI
Onur Mutlu1, Thomas Moscibroda1
01 Dec 2007
TL;DR: This paper proposes a new memory access scheduler, called the Stall-Time Fair Memory scheduler (STFM), that provides quality of service to different threads sharing the DRAM memory system and shows that STFM significantly reduces the unfairness in the DRAM system while also improving system throughput on a wide variety of workloads and systems.
Abstract: DRAM memory is a major resource shared among cores in a chip multiprocessor (CMP) system. Memory requests from different threads can interfere with each other. Existing memory access scheduling techniques try to optimize the overall data throughput obtained from the DRAM and thus do not take into account inter-thread interference. Therefore, different threads running together on the same chip can experience extremely different memory system performance: one thread can experience a severe slowdown or starvation while another is unfairly prioritized by the memory scheduler. This paper proposes a new memory access scheduler, called the Stall-Time Fair Memory scheduler (STFM), that provides quality of service to different threads sharing the DRAM memory system. The goal of the proposed scheduler is to "equalize" the DRAM-related slowdown experienced by each thread due to interference from other threads, without hurting overall system performance. As such, STFM takes into account inherent memory characteristics of each thread and does not unfairly penalize threads that use the DRAM system without interfering with other threads. We show that STFM significantly reduces the unfairness in the DRAM system while also improving system throughput (i.e., weighted speedup of threads) on a wide variety of workloads and systems. For example, averaged over 32 different workloads running on an 8-core CMP, the ratio between the highest DRAM-related slowdown and the lowest DRAM-related slowdown reduces from 5.26X to 1.4X, while the average system throughput improves by 7.6%. We qualitatively and quantitatively compare STFM to one new and three previously proposed memory access scheduling algorithms, including network fair queueing. Our results show that STFM provides the best fairness, system throughput, and scalability.
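
The "equalize slowdowns" policy reduces to a compact arbitration rule. The sketch below is a simplified reading of STFM with an assumed slowdown estimator and an illustrative unfairness threshold; the real scheduler estimates slowdowns in hardware and layers this rule over row-hit-first (FR-FCFS) scheduling:

```python
def pick_next_request(pending, est_slowdown, alpha=1.10):
    """Toy STFM-style arbitration.
    pending:      list of (thread_id, arrival_time) DRAM requests
    est_slowdown: dict thread_id -> estimated T_shared / T_alone
    If estimated unfairness (max/min slowdown) exceeds alpha, service the
    most slowed-down thread first; otherwise fall back to a
    throughput-oriented order (oldest-first here)."""
    unfairness = max(est_slowdown.values()) / min(est_slowdown.values())
    if unfairness > alpha:
        victim = max(est_slowdown, key=est_slowdown.get)
        victim_reqs = [r for r in pending if r[0] == victim]
        if victim_reqs:
            return min(victim_reqs, key=lambda r: r[1])
    return min(pending, key=lambda r: r[1])
```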

584 citations


Cites background or methods from "Symbiotic jobscheduling for a simul..."

  • ...This metric should not be used to evaluate system throughput [28, 15] since even throughput-oriented realistic systems need to consider fairness and ensure forward progress of individual threads....

  • ...Fairness Issues in Multithreaded Systems: Although fairness issues have been studied in multithreaded systems, especially at the processor level [28, 15, 7], the DRAM subsystem has received significantly less attention....

  • ...Unfairness = max_i MemSlowdown_i / min_i MemSlowdown_i. We measure overall system throughput using the weighted speedup metric [28], defined as the sum of relative IPC performances of each thread in the evaluated workload: Weighted Speedup = Σ_i (IPC_i^shared / IPC_i^alone)....

Journal ArticleDOI
Onur Mutlu1, Thomas Moscibroda1
01 Jun 2008
TL;DR: A parallelism-aware batch scheduler that seamlessly incorporates support for system-level thread priorities and can provide different service levels, including purely opportunistic service, to threads with different priorities, and is also simpler to implement than STFM.
Abstract: In a chip-multiprocessor (CMP) system, the DRAM system is shared among cores. In a shared DRAM system, requests from a thread can not only delay requests from other threads by causing bank/bus/row-buffer conflicts but they can also destroy other threads' DRAM-bank-level parallelism. Requests whose latencies would otherwise have been overlapped could effectively become serialized. As a result both fairness and system throughput degrade, and some threads can starve for long time periods.

This paper proposes a fundamentally new approach to designing a shared DRAM controller that provides quality of service to threads, while also improving system throughput. Our parallelism-aware batch scheduler (PAR-BS) design is based on two key ideas. First, PAR-BS processes DRAM requests in batches to provide fairness and to avoid starvation of requests. Second, to optimize system throughput, PAR-BS employs a parallelism-aware DRAM scheduling policy that aims to process requests from a thread in parallel in the DRAM banks, thereby reducing the memory-related stall-time experienced by the thread. PAR-BS seamlessly incorporates support for system-level thread priorities and can provide different service levels, including purely opportunistic service, to threads with different priorities.

We evaluate the design trade-offs involved in PAR-BS and compare it to four previously proposed DRAM scheduler designs on 4-, 8-, and 16-core systems. Our evaluations show that, averaged over 100 4-core workloads, PAR-BS improves fairness by 1.11X and system throughput by 8.3% compared to the best previous scheduling technique, Stall-Time Fair Memory (STFM) scheduling. Based on simple request prioritization rules, PAR-BS is also simpler to implement than STFM.
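
The two key ideas, request batching and shortest-job-first thread ranking, can be sketched in a few lines. This is a simplification under stated assumptions: request objects with thread, bank, and arrival fields, a fixed marking cap, and ranking by total marked requests (the paper's ranking also weighs the maximum per-bank load):

```python
from collections import Counter

def form_batch(pending, cap=5):
    """Mark up to `cap` oldest requests per (thread, bank) group. Marked
    requests are serviced before any unmarked ones, so no thread starves."""
    marked, per_group = set(), Counter()
    for req in sorted(pending, key=lambda r: r.arrival):
        if per_group[(req.thread, req.bank)] < cap:
            per_group[(req.thread, req.bank)] += 1
            marked.add(req)
    return marked

def rank_threads(marked):
    """Shortest-job-first ranking: threads with fewer marked requests are
    ranked higher, so their requests finish in parallel across banks and
    their memory-related stall time shrinks."""
    load = Counter(req.thread for req in marked)
    return {thread: rank
            for rank, thread in enumerate(sorted(load, key=load.get))}
```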

575 citations


Cites methods from "Symbiotic jobscheduling for a simul..."

  • ...Evaluation Metrics: We measure fairness using the unfairness index proposed in [25, ...]: a thread's memory slowdown is its memory stall time per instruction (MCPI) when running together with other threads, divided by the MCPI it experiences when running alone on the same system. MemSlowdown_i = MCPI_i^shared / MCPI_i^alone, and Unfairness = max_i MemSlowdown_i / min_j MemSlowdown_j. We measure system throughput using Weighted-Speedup [37] and Hmean-Speedup [18], which balances fairness and throughput [18]: Weighted Speedup = Σ_i (IPC_i^shared / IPC_i^alone), and Hmean Speedup = NumThreads / Σ_i (IPC_i^alone / IPC_i^shared); these metrics are sketched in code after these excerpts....

  • ...We measure system throughput using Weighted-Speedup [37] and Hmean-Speedup [18], which balances fairness and throughput [18]....

  • ...(e.g., [37, 18, 8]) are complementary to our work and can be used in conjunction with PAR-BS....

  • ...Unfortunately, as shown in Figure 12 (middle) and (right), it penalizes memory-intensive threads too much by allowing requests... [figure: Case Study II results for Unfairness, Weighted-Speedup, and Hmean-Speedup under FR-FCFS, NFQ, STFM, and PAR-BS configurations (Figure 14)]

  • [figure: per-benchmark Memory Slowdown and overall Unfairness / Weighted-Speedup / Hmean-Speedup for cap values c = 4 through c = 20 and no-c (Figure 11)]
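
The metrics quoted in these excerpts translate directly into code. A minimal sketch, assuming per-thread MCPI and IPC values from the shared and alone runs are available as dicts:

```python
def mem_slowdowns(mcpi_shared, mcpi_alone):
    """MemSlowdown_i = MCPI_i^shared / MCPI_i^alone."""
    return {t: mcpi_shared[t] / mcpi_alone[t] for t in mcpi_shared}

def unfairness(slowdowns):
    """Unfairness = max_i MemSlowdown_i / min_j MemSlowdown_j."""
    return max(slowdowns.values()) / min(slowdowns.values())

def weighted_speedup(ipc_shared, ipc_alone):
    """Weighted Speedup = sum_i IPC_i^shared / IPC_i^alone."""
    return sum(ipc_shared[t] / ipc_alone[t] for t in ipc_shared)

def hmean_speedup(ipc_shared, ipc_alone):
    """Hmean Speedup = NumThreads / sum_i (IPC_i^alone / IPC_i^shared)."""
    return len(ipc_shared) / sum(ipc_alone[t] / ipc_shared[t] for t in ipc_shared)
```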

Proceedings ArticleDOI
29 Sep 2004
TL;DR: It is found that optimizing fairness usually increases throughput, while maximizing throughput does not necessarily improve fairness, and two algorithms are proposed that optimize fairness.
Abstract: This paper presents a detailed study of fairness in cache sharing between threads in a chip multiprocessor (CMP) architecture. Prior work in CMP architectures has only studied throughput optimization techniques for a shared cache. The issue of fairness in cache sharing, and its relation to throughput, has not been studied. Fairness is a critical issue because the operating system (OS) thread scheduler's effectiveness depends on the hardware to provide fair cache sharing to co-scheduled threads. Without such hardware, serious problems, such as thread starvation and priority inversion, can arise and render the OS scheduler ineffective. This paper makes several contributions. First, it proposes and evaluates five cache fairness metrics that measure the degree of fairness in cache sharing, and shows that two of them correlate very strongly with execution-time fairness. Execution-time fairness is defined as how uniformly the execution times of co-scheduled threads are changed, where each change is relative to the execution time of the same thread running alone. Secondly, using the metrics, the paper proposes static and dynamic L2 cache partitioning algorithms that optimize fairness. The dynamic partitioning algorithm is easy to implement, requires little or no profiling, has low overhead, and does not restrict the cache replacement algorithm to LRU. The static algorithm, although requiring the cache to maintain LRU stack information, can help the OS thread scheduler to avoid cache thrashing. Finally, this paper studies the relationship between fairness and throughput in detail. We found that optimizing fairness usually increases throughput, while maximizing throughput does not necessarily improve fairness. Using a set of co-scheduled pairs of benchmarks, on average our algorithms improve fairness by a factor of 4×, while increasing the throughput by 15%, compared to a nonpartitioned shared cache.
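
The dynamic partitioning loop described above can be approximated in a few lines. A sketch under assumptions: fairness is tracked by each thread's miss increase relative to running alone (one of the paper's correlating metrics), and repartitioning moves one cache way per measurement interval:

```python
def repartition_step(ways, misses_shared, misses_alone):
    """Move one L2 way from the least-hurt thread to the most-hurt thread,
    nudging per-thread miss ratios (shared vs. alone) toward equality.
    ways: dict thread -> ways currently allocated (modified in place)."""
    ratio = {t: misses_shared[t] / misses_alone[t] for t in ways}
    most_hurt = max(ratio, key=ratio.get)
    least_hurt = min(ratio, key=ratio.get)
    if most_hurt != least_hurt and ways[least_hurt] > 1:
        ways[least_hurt] -= 1
        ways[most_hurt] += 1
    return ways
```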

544 citations


Cites background from "Symbiotic jobscheduling for a simul..."

  • ...Some studies have proposed metrics that, if optimized, balance between throughput and fairness [12, 15]....

  • ...For example, a weighted speedup, which incorporates fairness to some extent, has been proposed by Snavely et al. [15]....

  • ...Even in SMT architectures, however, the studies have only focused on either improving throughput, or improving throughput without sacrificing fairness too much [8, 17, 15, 12]....

  • ...In Simultaneous Multi-Threaded (SMT) architectures, where typically the entire cache hierarchy and many processor resources are shared, it has been observed that throughput-optimizing policies tend to favor threads that naturally have high IPC [15], hence sacrificing fairness....

Proceedings ArticleDOI
12 Feb 2005
TL;DR: Three performance models are proposed that predict the impact of cache sharing on co-scheduled threads and the most accurate model, the inductive probability model, achieves an average error of only 3.9%.
Abstract: This paper studies the impact of L2 cache sharing on threads that simultaneously share the cache, on a chip multi-processor (CMP) architecture. Cache sharing impacts threads nonuniformly, where some threads may be slowed down significantly, while others are not. This may cause severe performance problems such as sub-optimal throughput, cache thrashing, and thread starvation for threads that fail to occupy sufficient cache space to make good progress. Unfortunately, there is no existing model that allows extensive investigation of the impact of cache sharing. To allow such a study, we propose three performance models that predict the impact of cache sharing on co-scheduled threads. The input to our models is the isolated L2 cache stack distance or circular sequence profile of each thread, which can be easily obtained on-line or off-line. The output of the models is the number of extra L2 cache misses for each thread due to cache sharing. The models differ by their complexity and prediction accuracy. We validate the models against a cycle-accurate simulation that implements a dual-core CMP architecture, on fourteen pairs of mostly SPEC benchmarks. The most accurate model, the inductive probability model, achieves an average error of only 3.9%. Finally, to demonstrate the usefulness and practicality of the model, a case study that details the relationship between an application's temporal reuse behavior and its cache sharing impact is presented.
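
To make the model inputs concrete: a stack distance profile for an A-way set-associative cache records, for each access, how many distinct lines in its set were touched since the last access to the same line; accesses with distance >= A miss even without sharing. The sketch below is only in the spirit of the paper's simpler frequency-based model (not the inductive probability model): each thread's effective share of the associativity is assumed proportional to its access frequency.

```python
def predicted_extra_misses(profiles, assoc):
    """profiles: dict thread -> histogram h of length assoc + 1, where h[d]
    counts accesses with stack distance d and h[assoc] counts accesses that
    miss even in the full cache. Returns estimated extra misses per thread."""
    accesses = {t: sum(h) for t, h in profiles.items()}
    total = sum(accesses.values())
    extra = {}
    for t, h in profiles.items():
        share = max(1, round(assoc * accesses[t] / total))  # effective ways
        misses_alone = sum(h[assoc:])   # distance >= assoc
        misses_shared = sum(h[share:])  # distance >= reduced share
        extra[t] = misses_shared - misses_alone
    return extra
```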

543 citations


Cites background from "Symbiotic jobscheduling for a simul..."

  • ...rely on discovering the interaction (symbiosis) between threads in an SMT system by profiling all possible co-schedules [17, 16]....

References
Journal ArticleDOI
TL;DR: In this paper, it was shown that if the three means are finite and the corresponding stochastic processes strictly stationary, and if the arrival process is metrically transitive with nonzero mean, then L = λW.
Abstract: In a queuing process, let 1/λ be the mean time between the arrivals of two consecutive units, L be the mean number of units in the system, and W be the mean time spent by a unit in the system. It is shown that, if the three means are finite and the corresponding stochastic processes strictly stationary, and, if the arrival process is metrically transitive with nonzero mean, then L = λW.
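
As a worked example of the law: if units arrive at rate λ = 20 per second and each spends W = 0.25 seconds in the system on average, then L = λW = 5 units are in the system on average. Notably, this holds regardless of the service discipline or the shape of the arrival and service distributions, given the stationarity conditions stated above.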

2,536 citations

Proceedings ArticleDOI
01 May 1995
TL;DR: Simultaneous multithreading has the potential to achieve 4 times the throughput of a superscalar, and double that of fine-grain multi-threading, and is an attractive alternative to single-chip multiprocessors.
Abstract: This paper examines simultaneous multithreading, a technique permitting several independent threads to issue instructions to a superscalar's multiple functional units in a single cycle. We present several models of simultaneous multithreading and compare them with alternative organizations: a wide superscalar, a fine-grain multithreaded processor, and single-chip, multiple-issue multiprocessing architectures. Our results show that both (single-threaded) superscalar and fine-grain multithreaded architectures are limited in their ability to utilize the resources of a wide-issue processor. Simultaneous multithreading has the potential to achieve 4 times the throughput of a superscalar, and double that of fine-grain multithreading. We evaluate several cache configurations made possible by this type of organization and evaluate tradeoffs between them. We also show that simultaneous multithreading is an attractive alternative to single-chip multiprocessors; simultaneous multithreaded processors with a variety of organizations outperform corresponding conventional multiprocessors with similar execution resources. While simultaneous multithreading has excellent potential to increase processor utilization, it can add substantial complexity to the design. We examine many of these complexities and evaluate alternative organizations in the design space.

1,713 citations


"Symbiotic jobscheduling for a simul..." refers background in this paper

  • ...Simultaneous Multithreading (SMT) [32, 31, 18] architectures execute instructions from multiple streams of execution (threads) each cycle to increase instruction-level parallelism....

  • ...A simultaneous multithreading processor [32, 31, 18, 14, 35] holds the state of multiple threads (execution contexts) in hardware, allowing the execution of instructions from multiple threads each cycle on a wide superscalar processor....

Journal ArticleDOI
TL;DR: The nature and implementation of the file system and of the user command interface are discussed, including the ability to initiate asynchronous processes and over 100 subsystems including a dozen languages.
Abstract: UNIX is a general-purpose, multi-user, interactive operating system for the Digital Equipment Corporation PDP-11/40 and 11/45 computers. It offers a number of features seldom found even in larger operating systems, including: (1) a hierarchical file system incorporating demountable volumes; (2) compatible file, device, and inter-process I/O; (3) the ability to initiate asynchronous processes; (4) system command language selectable on a per-user basis; and (5) over 100 subsystems including a dozen languages. This paper discusses the nature and implementation of the file system and of the user command interface.

1,140 citations


"Symbiotic jobscheduling for a simul..." refers background in this paper

  • ...The scheduling discipline Multi-level Feedback, implemented in several flavors of Unix [28]....

Proceedings ArticleDOI
01 May 1996
TL;DR: This paper presents an architecture for simultaneous multithreading that minimizes the architectural impact on the conventional superscalar design, has minimal performance impact on a single thread executing alone, and achieves significant throughput gains when running multiple threads.
Abstract: Simultaneous multithreading is a technique that permits multiple independent threads to issue multiple instructions each cycle. In previous work we demonstrated the performance potential of simultaneous multithreading, based on a somewhat idealized model. In this paper we show that the throughput gains from simultaneous multithreading can be achieved without extensive changes to a conventional wide-issue superscalar, either in hardware structures or sizes. We present an architecture for simultaneous multithreading that achieves three goals: (1) it minimizes the architectural impact on the conventional superscalar design, (2) it has minimal performance impact on a single thread executing alone, and (3) it achieves significant throughput gains when running multiple threads. Our simultaneous multithreading architecture achieves a throughput of 5.4 instructions per cycle, a 2.5-fold improvement over an unmodified superscalar with similar hardware resources. This speedup is enhanced by an advantage of multithreading previously unexploited in other architectures: the ability to favor for fetch and issue those threads most efficiently using the processor each cycle, thereby providing the "best" instructions to the processor.
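
The fetch-favoring mechanism mentioned in the last sentence is commonly known as the ICOUNT heuristic; a minimal sketch, assuming a per-thread count of in-flight instructions in the pre-issue pipeline stages is available:

```python
def choose_fetch_threads(inflight_count, k=2):
    """ICOUNT-style fetch choice: each cycle, fetch from the k threads with
    the fewest instructions in the decode/rename/queue stages. Threads that
    move instructions through quickly get fetch priority, and no single
    thread can clog the instruction queues.
    inflight_count: dict thread_id -> in-flight instruction count"""
    return sorted(inflight_count, key=inflight_count.get)[:k]
```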

827 citations


"Symbiotic jobscheduling for a simul..." refers background in this paper

  • ...This organization has the potential to more than double the throughput of the processor without excessive increases in hardware [31]....

  • ...Simultaneous Multithreading (SMT) [32, 31, 18] architectures execute instructions from multiple streams of execution (threads) each cycle to increase instruction-level parallelism....

  • ...A simultaneous multithreading processor [32, 31, 18, 14, 35] holds the state of multiple threads (execution contexts) in hardware, allowing the execution of instructions from multiple threads each cycle on a wide superscalar processor....

Proceedings ArticleDOI
01 Jun 1990
TL;DR: The Tera architecture was designed with several goals in mind; it needed to be suitable for very high speed implementations, i.e., admit a short clock period and be scalable to many processors.
Abstract: The Tera architecture was designed with several major goals in mind. First, it needed to be suitable for very high speed implementations, i.e., admit a short clock period and be scalable to many processors. This goal will be achieved; a maximum configuration of the first implementation of the architecture will have 256 processors, 512 memory units, 256 I/O cache units, 256 I/O processors, and 4096 interconnection network nodes and a clock period less than 3 nanoseconds. The abstract architecture is scalable essentially without limit (although a particular implementation is not, of course). The only requirement is that the number of instruction streams increase more rapidly than the number of physical processors. Although this means that speedup is sublinear in the number of instruction streams, it can still increase linearly with the number of physical processors. The price/performance ratio of the system is unmatched, and puts Tera's high performance within economic reach. Second, it was important that the architecture be applicable to a wide spectrum of problems. Programs that do not vectorize well, perhaps because of a preponderance of scalar operations or too-frequent conditional branches, will execute efficiently as long as there is sufficient parallelism to keep the processors busy. Virtually any parallelism available in the total computational workload can be turned into speed, from operation-level parallelism within program basic blocks to multiuser time- and space-sharing. The architecture...

797 citations


"Symbiotic jobscheduling for a simul..." refers background or methods in this paper

  • ...By contrast, the Tera MTA supercomputer [3], which features fine-grain multithreading, has fewer shared system resources and less intimate interactions between threads....

  • ...The techniques described here also apply to other multithreaded architectures [3, 11, 2]; however, the SMT architecture is most interesting because threads interact at such a fine granularity in the architecture, and because it is closest to widespread commercial use, having been announced for the next Alpha processor [10]....
