Proceedings Article

A case for NUMA-aware contention management on multicore systems

TL;DR: This paper quantifies the performance effects of resource contention and remote-access latency, and proposes and evaluates a new contention-management algorithm that significantly outperforms both a previously proposed NUMA-unaware algorithm and the default Linux scheduler.
Abstract: On multicore systems, contention for shared resources occurs when memory-intensive threads are co-scheduled on cores that share parts of the memory hierarchy, such as last-level caches and memory controllers. Previous work investigated how contention could be addressed via scheduling. A contention-aware scheduler separates competing threads onto separate memory hierarchy domains to eliminate resource sharing and, as a consequence, to mitigate contention. However, all previous work on contention-aware scheduling assumed that the underlying system is UMA (uniform memory access latencies, single memory controller). Modern multicore systems, however, are NUMA, which means that they feature non-uniform memory access latencies and multiple memory controllers. We discovered that state-of-the-art contention management algorithms fail to be effective on NUMA systems and may even hurt performance relative to a default OS scheduler. In this paper we investigate the causes for this behavior and design the first contention-aware algorithm for NUMA systems.
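To make the contrast with UMA-oriented schedulers concrete, here is a minimal C++ sketch of the policy the abstract argues for (all names and thresholds are hypothetical, not the paper's implementation): rank threads by memory intensity, spread the intensive ones across memory domains, and, crucially for NUMA, migrate their memory along with them.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical per-thread profile: LLC misses per 1K instructions,
// as sampled from hardware performance counters.
struct ThreadProfile {
    int tid;
    uint64_t missesPerKInstr;  // proxy for memory intensity
};

// Core of a NUMA-aware policy: sort threads by memory intensity and
// deal them round-robin across memory domains, so the most aggressive
// threads end up on different memory controllers. The NUMA-specific
// twist is in the loop comment: a migrated thread's pages must follow
// it, or remote accesses and interconnect traffic erase the gain.
std::vector<std::vector<int>> placeThreads(std::vector<ThreadProfile> threads,
                                           int numDomains) {
    std::sort(threads.begin(), threads.end(),
              [](const ThreadProfile& a, const ThreadProfile& b) {
                  return a.missesPerKInstr > b.missesPerKInstr;
              });
    std::vector<std::vector<int>> domains(numDomains);
    for (std::size_t i = 0; i < threads.size(); ++i) {
        domains[i % numDomains].push_back(threads[i].tid);
        // On a real system: pin with sched_setaffinity(), then migrate
        // the thread's hot pages to the target node with move_pages().
    }
    return domains;
}

int main() {
    auto d = placeThreads({{1, 90}, {2, 80}, {3, 5}, {4, 3}}, 2);
    std::printf("domain 0 gets %zu threads\n", d[0].size());
}
```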
Citations
Proceedings ArticleDOI
13 Jun 2015
TL;DR: Heracles is presented, a feedback-based controller that enables the safe colocation of best-effort tasks alongside a latency-critical service; it dynamically manages multiple hardware and software isolation mechanisms to ensure that the latency-sensitive job meets its latency targets while maximizing the resources given to best-effort tasks.
Abstract: User-facing, latency-sensitive services, such as web search, underutilize their computing resources during daily periods of low traffic. Reusing those resources for other tasks is rarely done in production services, since contention for shared resources can cause latency spikes that violate the service-level objectives of latency-sensitive tasks. The resulting under-utilization hurts both the affordability and energy efficiency of large-scale datacenters. With technology scaling slowing down, it becomes important to address this opportunity. We present Heracles, a feedback-based controller that enables the safe colocation of best-effort tasks alongside a latency-critical service. Heracles dynamically manages multiple hardware and software isolation mechanisms, such as CPU, memory, and network isolation, to ensure that the latency-sensitive job meets latency targets while maximizing the resources given to best-effort tasks. We evaluate Heracles using production latency-critical and batch workloads from Google and demonstrate average server utilizations of 90% without latency violations across all the load and colocation scenarios that we evaluated.
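A minimal sketch of the kind of feedback loop the abstract describes, assuming hypothetical hooks for latency measurement and one isolation knob; thresholds and step sizes are illustrative, not Heracles' actual controller logic:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <thread>

// Hypothetical hooks: a real controller reads tail latency from the
// service's monitoring pipeline and applies the allocation through an
// isolation mechanism (e.g., cgroup cpusets, cache ways, network QoS).
static double readTailLatencyMs() { return 8.0; }  // stub measurement
static void setBestEffortCores(int n) { std::printf("best-effort cores: %d\n", n); }

// Feedback loop: grow the best-effort share while the latency-critical
// service has slack, and back off quickly when the SLO is threatened.
void controllerLoop(double sloMs, int maxCores, int iterations) {
    int beCores = 0;
    for (int i = 0; i < iterations; ++i) {
        double lat = readTailLatencyMs();
        if (lat > 0.95 * sloMs)
            beCores = std::max(0, beCores - 2);         // fast retreat
        else if (lat < 0.80 * sloMs)
            beCores = std::min(maxCores, beCores + 1);  // slow growth
        setBestEffortCores(beCores);
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
}

int main() { controllerLoop(10.0, 16, 5); }
```

The asymmetry (retreat fast, grow slowly) is the standard design choice for protecting a latency SLO while still reclaiming slack.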

464 citations


Cites background from "A case for NUMA-aware contention management on multicore systems"

  • ...In multi-socket servers, one can isolate workloads across NUMA channels [9, 73], but this approach constrains DRAM capacity allocation and address interleaving....


Proceedings ArticleDOI
Xiao Zhang, Eric S. Tune, Robert Hagmann, Rohit Jnagal, Vrigo Gokhale, John Wilkes
15 Apr 2013
TL;DR: CPI2 uses cycles-per-instruction (CPI) data obtained from hardware performance counters to identify performance interference, select the likely perpetrators, and then optionally throttle them so that the victims can return to their expected behavior.
Abstract: Performance isolation is a key challenge in cloud computing. Unfortunately, Linux has few defenses against performance interference in shared resources such as processor caches and memory buses, so applications in a cloud can experience unpredictable performance caused by other programs' behavior. Our solution, CPI2, uses cycles-per-instruction (CPI) data obtained by hardware performance counters to identify problems, select the likely perpetrators, and then optionally throttle them so that the victims can return to their expected behavior. It automatically learns normal and anomalous behaviors by aggregating data from multiple tasks in the same job. We have rolled out CPI2 to all of Google's shared compute clusters. The paper presents the analysis that led us to that outcome, including both case studies and a large-scale evaluation of its ability to solve real production issues.
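The core signal is simple to state in code. A sketch under stated assumptions: the 2-sigma cutoff is illustrative only, where CPI2 learns per-job distributions rather than using a fixed threshold.

```cpp
#include <cstdint>
#include <cstdio>

// CPI over a sampling window, from two hardware counter values.
double cpi(uint64_t cycles, uint64_t instructions) {
    return static_cast<double>(cycles) / static_cast<double>(instructions);
}

// A task is suspect when its CPI is an outlier against the distribution
// aggregated from sibling tasks of the same job; the 2-sigma cutoff
// here is illustrative, not CPI2's learned threshold.
bool isAnomalous(double sample, double jobMean, double jobStddev) {
    return sample > jobMean + 2.0 * jobStddev;
}

int main() {
    double s = cpi(12'000'000, 3'000'000);  // 4.0 CPI this window
    std::printf("anomalous: %d\n", isAnomalous(s, 1.5, 0.5));  // well above 1.5 + 2*0.5
}
```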

314 citations


Cites methods from "A case for NUMA-aware contention management on multicore systems"

  • ...This could certainly be done: for example, Zhang et al. [40] used memory reference counts to approximate memory bandwidth consumption on SMP machines; West et al. [38] used cache miss and reference counts to estimate cache occupancy of competing threads on multicore machines; VM3 [20] profiled applications' cache-misses per instruction to estimate effective cache sizes in a consolidated virtual machine environment; Cuanta [17] introduced a cache loader micro-benchmark to profile application performance under varying cache-usage pressure; and Blagodurov [7] and Zhuravlev [43] applied heuristics based on cache miss rates to guide contention-aware scheduling....



Proceedings ArticleDOI
16 Mar 2013
TL;DR: The design of Carrefour is presented, the challenges of implementing it on modern hardware are presented, and insights about hardware support that would help optimize system software on future NUMA systems are drawn.
Abstract: NUMA systems are characterized by Non-Uniform Memory Access times, where accessing data in a remote node takes longer than a local access. NUMA hardware has been built since the late 80's, and the operating systems designed for it were optimized for access locality. They co-located memory pages with the threads that accessed them, so as to avoid the cost of remote accesses. Contrary to older systems, modern NUMA hardware has much smaller remote wire delays, and so remote access costs per se are not the main concern for performance, as we discovered in this work. Instead, congestion on memory controllers and interconnects, caused by memory traffic from data-intensive applications, hurts performance a lot more. Because of that, memory placement algorithms must be redesigned to target traffic congestion. This requires an arsenal of techniques that go beyond optimizing locality. In this paper we describe Carrefour, an algorithm that addresses this goal. We implemented Carrefour in Linux and obtained performance improvements of up to 3.6× relative to the default kernel, as well as significant improvements compared to NUMA-aware patchsets available for Linux. Carrefour never hurts performance by more than 4% when memory placement cannot be improved. We present the design of Carrefour, the challenges of implementing it on modern hardware, and draw insights about hardware support that would help optimize system software on future NUMA systems.
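A sketch of the decision structure such a traffic-aware placement policy implies, choosing among page co-location, interleaving, and replication per page; the field names and thresholds are hypothetical, not Carrefour's tuned conditions:

```cpp
#include <cstdio>

// Per-page statistics a profiler might gather (fields hypothetical).
struct PageStats {
    double readFraction;  // reads / (reads + writes)
    int accessingNodes;   // distinct NUMA nodes touching the page
};

enum class Placement { LeaveAlone, CoLocate, Replicate, Interleave };

// Traffic-aware choice: do nothing unless memory-controller traffic is
// imbalanced; co-locate pages with a single user, replicate read-mostly
// shared pages on each node, interleave shared written pages to spread
// load. Thresholds are illustrative only.
Placement decide(const PageStats& p, double controllerImbalance) {
    if (controllerImbalance < 0.10) return Placement::LeaveAlone;
    if (p.accessingNodes == 1)      return Placement::CoLocate;
    if (p.readFraction > 0.95)      return Placement::Replicate;
    return Placement::Interleave;
}

int main() {
    PageStats hotSharedReadOnly{0.99, 4};
    std::printf("decision: %d\n",
                static_cast<int>(decide(hotSharedReadOnly, 0.4)));
}
```

Note how locality only enters through the CoLocate branch; the top-level test is congestion, which is the abstract's central argument.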

289 citations


Cites background from "A case for NUMA-aware contention management on multicore systems"

  • ...The most comprehensive work to date on NUMA-aware contention management is the DINO scheduler [5], which spreads memory intensive threads across memory domains and accordingly migrates the corresponding memory pages....


  • ...Some of these works were designed for UMA systems [16, 21, 33] and are inefficient on NUMA systems because they fail to address or even accentuate issues such as remote access latencies and contention on memory controllers and on the interconnect links [5]....


Proceedings ArticleDOI
12 Mar 2016
TL;DR: A new DRAM substrate, Low-Cost Inter-Linked Subarrays (LISA), whose goal is to enable fast and efficient data movement across a large range of memory at low cost, and whose combined benefit is higher than the benefit of each alone, on a variety of workloads and system configurations.
Abstract: This paper introduces a new DRAM design that enables fast and energy-efficient bulk data movement across subarrays in a DRAM chip. While bulk data movement is a key operation in many applications and operating systems, contemporary systems perform this movement inefficiently, by transferring data from DRAM to the processor, and then back to DRAM, across a narrow off-chip channel. The use of this narrow channel for bulk data movement results in high latency and energy consumption. Prior work proposed to avoid these high costs by exploiting the existing wide internal DRAM bandwidth for bulk data movement, but the limited connectivity of wires within DRAM allows fast data movement within only a single DRAM subarray. Each subarray is only a few megabytes in size, greatly restricting the range over which fast bulk data movement can happen within DRAM. We propose a new DRAM substrate, Low-Cost Inter-Linked Subarrays (LISA), whose goal is to enable fast and efficient data movement across a large range of memory at low cost. LISA adds low-cost connections between adjacent subarrays. By using these connections to interconnect the existing internal wires (bitlines) of adjacent subarrays, LISA enables wide-bandwidth data transfer across multiple subarrays with little (only 0.8%) DRAM area overhead. As a DRAM substrate, LISA is versatile, enabling an array of new applications. We describe and evaluate three such applications in detail: (1) fast inter-subarray bulk data copy, (2) in-DRAM caching using a DRAM architecture whose rows have heterogeneous access latencies, and (3) accelerated bitline precharging by linking multiple precharge units together. Our extensive evaluations show that each of LISA's three applications significantly improves performance and memory energy efficiency, and their combined benefit is higher than the benefit of each alone, on a variety of workloads and system configurations.

217 citations


Cites methods from "A case for NUMA-aware contention management on multicore systems"

  • ...While bulk data movement is a key operation in many applications and operating systems, contemporary systems perform this movement inefficiently, by transferring data from DRAM to the processor, and then back to DRAM, across a narrow off-chip channel....


Proceedings ArticleDOI
24 Jun 2017
TL;DR: It is demonstrated that package-level integration of multiple GPU modules to build larger logical GPUs can enable continuous performance scaling beyond Moore's law and meet the need for higher-performing GPUs in many domains.
Abstract: Historically, improvements in GPU-based high performance computing have been tightly coupled to transistor scaling. As Moore's law slows down, and the number of transistors per die no longer grows at historical rates, the performance curve of single monolithic GPUs will ultimately plateau. However, the need for higher performing GPUs continues to exist in many domains. To address this need, in this paper we demonstrate that package-level integration of multiple GPU modules to build larger logical GPUs can enable continuous performance scaling beyond Moore's law. Specifically, we propose partitioning GPUs into easily manufacturable basic GPU Modules (GPMs), and integrating them on package using high bandwidth and power efficient signaling technologies. We lay out the details and evaluate the feasibility of a basic Multi-Chip-Module GPU (MCM-GPU) design. We then propose three architectural optimizations that significantly improve GPM data locality and minimize sensitivity to inter-GPM bandwidth. Our evaluation shows that the optimized MCM-GPU achieves 22.8% speedup and 5x inter-GPM bandwidth reduction when compared to the basic MCM-GPU architecture. Most importantly, the optimized MCM-GPU design is 45.5% faster than the largest implementable monolithic GPU, and performs within 10% of a hypothetical (and unbuildable) monolithic GPU. Lastly, we show that our optimized MCM-GPU is 26.8% faster than an equally equipped Multi-GPU system with the same total number of SMs and DRAM bandwidth.

163 citations


Cites background from "A case for NUMA-aware contention management on multicore systems"

  • ...In a multi-core domain, existing work tries to minimize the memory access latency by thread-to-core mapping [21, 38, 51], or memory allocation policy [22, 27, 34]....


References
Journal ArticleDOI
12 Jun 2005
TL;DR: Pin aims to provide easy-to-use, portable, transparent, and efficient instrumentation; to illustrate its versatility, two Pintools in daily use to analyze production software are described.
Abstract: Robust and powerful software instrumentation tools are essential for program analysis tasks such as profiling, performance evaluation, and bug detection. To meet this need, we have developed a new instrumentation system called Pin. Our goals are to provide easy-to-use, portable, transparent, and efficient instrumentation. Instrumentation tools (called Pintools) are written in C/C++ using Pin's rich API. Pin follows the model of ATOM, allowing the tool writer to analyze an application at the instruction level without the need for detailed knowledge of the underlying instruction set. The API is designed to be architecture independent whenever possible, making Pintools source compatible across different architectures. However, a Pintool can access architecture-specific details when necessary. Instrumentation with Pin is mostly transparent as the application and Pintool observe the application's original, uninstrumented behavior. Pin uses dynamic compilation to instrument executables while they are running. For efficiency, Pin uses several techniques, including inlining, register re-allocation, liveness analysis, and instruction scheduling to optimize instrumentation. This fully automated approach delivers significantly better instrumentation performance than similar tools. For example, Pin is 3.3x faster than Valgrind and 2x faster than DynamoRIO for basic-block counting. To illustrate Pin's versatility, we describe two Pintools in daily use to analyze production software. Pin is publicly available for Linux platforms on four architectures: IA32 (32-bit x86), EM64T (64-bit x86), Itanium®, and ARM. In the ten months since Pin 2 was released in July 2004, there have been over 3000 downloads from its website.
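A minimal Pintool in the style of the instruction-counting example distributed with Pin, showing the instrumentation/analysis split the abstract describes (the instrumentation routine runs once per static instruction; the analysis routine runs on every execution):

```cpp
#include <iostream>
#include "pin.H"

static UINT64 icount = 0;

// Analysis routine: executed before every dynamic instruction.
static VOID docount() { icount++; }

// Instrumentation routine: Pin calls this once per static instruction
// discovered at runtime; we attach the analysis call to each one.
static VOID Instruction(INS ins, VOID* v) {
    INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount, IARG_END);
}

static VOID Fini(INT32 code, VOID* v) {
    std::cerr << "Executed " << icount << " instructions" << std::endl;
}

int main(int argc, char* argv[]) {
    if (PIN_Init(argc, argv)) return 1;        // parses Pin's own flags
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();                        // never returns
    return 0;
}
```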

4,019 citations


"A case for NUMA-aware contention ma..." refers methods in this paper

  • ...In order to rapidly evaluate various memory migration strategies, we designed a simulator based on a widely used binary instrumentation tool for x86 binaries called Pin [15]....


Proceedings ArticleDOI
01 May 1997
TL;DR: The motivation for building the Origin 2000 is discussed, the architecture and implementation of the multiprocessor are described, and performance results are presented for the NAS Parallel Benchmarks V2.2 and the SPLASH2 applications.
Abstract: The SGI Origin 2000 is a cache-coherent non-uniform memory access (ccNUMA) multiprocessor designed and manufactured by Silicon Graphics, Inc. The Origin system was designed from the ground up as a multiprocessor capable of scaling to both small and large processor counts without any bandwidth, latency, or cost cliffs. The Origin system consists of up to 512 nodes interconnected by a scalable Craylink network. Each node consists of one or two R10000 processors, up to 4 GB of coherent memory, and a connection to a portion of the XIO IO subsystem. This paper discusses the motivation for building the Origin 2000 and then describes its architecture and implementation. In addition, performance results are presented for the NAS Parallel Benchmarks V2.2 and the SPLASH2 applications. Finally, the Origin system is compared to other contemporary commercial ccNUMA systems.

900 citations


"A case for NUMA-aware contention ma..." refers methods in this paper

  • ...The SGI Origin 2000 system [4] implemented the following hardware-supported [13] mechanism for colocation of computation and memory....


Journal ArticleDOI
13 Mar 2010
TL;DR: This study is the first to provide a comprehensive analysis of contention-mitigating techniques that use only scheduling; it identifies a classification scheme that addresses not only contention for cache space but also contention for other shared resources, such as the memory controller, memory bus, and prefetching hardware.
Abstract: Contention for shared resources on multicore processors remains an unsolved problem in existing systems despite significant research efforts dedicated to this problem in the past. Previous solutions focused primarily on hardware techniques and software page coloring to mitigate this problem. Our goal is to investigate how and to what extent contention for shared resources can be mitigated via thread scheduling. Scheduling is an attractive tool, because it does not require extra hardware and is relatively easy to integrate into the system. Our study is the first to provide a comprehensive analysis of contention-mitigating techniques that use only scheduling. The most difficult part of the problem is to find a classification scheme for threads, which would determine how they affect each other when competing for shared resources. We provide a comprehensive analysis of such classification schemes using a newly proposed methodology that makes it possible to evaluate these schemes separately from the scheduling algorithm itself and to compare them to the optimal. As a result of this analysis we discovered a classification scheme that addresses not only contention for cache space, but contention for other shared resources, such as the memory controller, memory bus and prefetching hardware. To show the applicability of our analysis we design a new scheduling algorithm, which we prototype at user level, and demonstrate that it performs within 2% of the optimal. We also conclude that the highest impact of contention-aware scheduling techniques is not in improving performance of a workload as a whole but in improving quality of service or performance isolation for individual applications.
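The classification idea the study settles on is easy to illustrate. A simplified sketch (the paper evaluates several schemes; this shows only the miss-rate heuristic): rank threads by LLC miss rate and pair the most intensive thread with the least intensive one on each shared-cache domain.

```cpp
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

// Pair each cache-intensive thread with a non-intensive one per
// shared-cache domain, ranked by LLC miss rate (a proxy for pressure
// on cache, memory controller, bus, and prefetch hardware).
std::vector<std::pair<int, int>>
pairByMissRate(std::vector<std::pair<int, double>> t /* {tid, missRate} */) {
    std::vector<std::pair<int, int>> pairs;
    if (t.size() < 2) return pairs;
    std::sort(t.begin(), t.end(),
              [](const auto& a, const auto& b) { return a.second < b.second; });
    for (std::size_t lo = 0, hi = t.size() - 1; lo < hi; ++lo, --hi)
        pairs.emplace_back(t[lo].first, t[hi].first);  // mild with aggressive
    return pairs;
}

int main() {
    for (auto [a, b] : pairByMissRate({{1, 0.9}, {2, 0.1}, {3, 0.5}, {4, 0.7}}))
        std::printf("co-schedule %d with %d\n", a, b);
}
```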

532 citations

Proceedings ArticleDOI
21 Mar 2007
TL;DR: This paper describes the design and implementation of a scheme that schedules threads based on sharing patterns detected online using features of standard performance monitoring units (PMUs), and demonstrates reductions in cross-chip cache accesses of up to 70%.
Abstract: The major chip manufacturers have all introduced chip multiprocessing (CMP) and simultaneous multithreading (SMT) technology into their processing units. As a result, even low-end computing systems and game consoles have become shared memory multiprocessors with L1 and L2 cache sharing within a chip. Mid- and large-scale systems will have multiple processing chips and hence consist of an SMP-CMP-SMT configuration with non-uniform data sharing overheads. Current operating system schedulers are not aware of these new cache organizations, and as a result, distribute threads across processors in a way that causes many unnecessary, long-latency cross-chip cache accesses. In this paper we describe the design and implementation of a scheme to schedule threads based on sharing patterns detected online using features of standard performance monitoring units (PMUs) available in today's processing units. The primary advantage of using the PMU infrastructure is that it is fine-grained (down to the cache line) and has relatively low overhead. We have implemented our scheme in Linux running on an 8-way Power5 SMP-CMP-SMT multi-processor. For commercial multithreaded server workloads (VolanoMark, SPECjbb, and RUBiS), we are able to demonstrate reductions in cross-chip cache accesses of up to 70%. These reductions lead to application-reported performance improvements of up to 7%.
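For context, here is a minimal example of the low-overhead PMU access path such schedulers build on, using the modern Linux perf_event_open(2) interface to count LLC read misses for the calling thread. This is only an illustration of reading hardware counters from software; the paper's scheme samples data addresses per cache line on a Power5 PMU, which this portable counter read does not reproduce.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Open one counter: LLC read misses for the calling thread, any CPU.
static int openLLCMissCounter() {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HW_CACHE;
    attr.config = PERF_COUNT_HW_CACHE_LL |
                  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;
    return static_cast<int>(
        syscall(SYS_perf_event_open, &attr, 0 /* self */, -1, -1, 0));
}

int main() {
    int fd = openLLCMissCounter();
    if (fd < 0) { std::perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    // ... region of interest would run here ...
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t misses = 0;
    if (read(fd, &misses, sizeof(misses)) == sizeof(misses))
        std::printf("LLC read misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}
```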

289 citations


"A case for NUMA-aware contention ma..." refers background or methods in this paper

  • ...The DINO algorithm introduced in our work complements [19] as it is designed to mitigate contention between applications....


  • ...Many research efforts addressed efficient co-location of the computation and related memory on the same node [14, 3, 12, 19, 1, 4]....


  • ...In [19] the authors group threads of the same application that are likely to share data onto neighbouring cores to minimize the costs of data sharing between them....


  • ...However, when this assumption does not hold, DINO can be extended to predict when co-scheduling threads on the same domain is more beneficial than separating them, using techniques described in [9] or [19]....


Proceedings ArticleDOI
10 Nov 2007
TL;DR: This paper presents AMPS, an operating system scheduler that efficiently supports both SMP- and NUMA-style performance-asymmetric architectures, and shows that AMPS improves fairness and repeatability of application performance measurements.
Abstract: Recent research advocates asymmetric multi-core architectures, where cores in the same processor can have different performance. These architectures support single-threaded performance and multithreaded throughput at lower costs (e.g., die size and power). However, they also pose unique challenges to operating systems, which traditionally assume homogeneous hardware. This paper presents AMPS, an operating system scheduler that efficiently supports both SMP- and NUMA-style performance-asymmetric architectures. AMPS contains three components: asymmetry-aware load balancing, faster-core-first scheduling, and NUMA-aware migration. We have implemented AMPS in Linux kernel 2.6.16 and used CPU clock modulation to emulate performance asymmetry on an SMP and NUMA system. For various workloads, we show that AMPS achieves a median speedup of 1.16 with a maximum of 1.44 over stock Linux on the SMP, and a median of 1.07 with a maximum of 2.61 on the NUMA system. Our results also show that AMPS improves fairness and repeatability of application performance measurements.
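A simplified sketch of the faster-core-first idea named in the abstract (AMPS additionally measures core speed online and handles NUMA-aware migration, which this omits): fill fast cores before slow ones, and balance load in proportion to core performance rather than core count.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct Core {
    int id;
    double perf;  // relative speed: 1.0 = fast core, 0.5 = slow core
    double load;  // runnable threads currently assigned
};

// Faster-core-first placement: pick the core with the lowest
// performance-scaled load, preferring faster cores on ties.
int pickCore(std::vector<Core>& cores) {
    auto it = std::min_element(cores.begin(), cores.end(),
        [](const Core& a, const Core& b) {
            double la = a.load / a.perf, lb = b.load / b.perf;
            return la != lb ? la < lb : a.perf > b.perf;
        });
    it->load += 1.0;
    return it->id;
}

int main() {
    std::vector<Core> cores = {{0, 1.0, 0}, {1, 1.0, 0}, {2, 0.5, 0}};
    for (int t = 0; t < 4; ++t)
        std::printf("thread %d -> core %d\n", t, pickCore(cores));
}
```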

274 citations


"A case for NUMA-aware contention ma..." refers background in this paper

  • ...in [14] introduced AMPS, an operating system scheduler for asymmetric multicore systems that supports NUMA architectures....


  • ...Many research efforts addressed efficient co-location of the computation and related memory on the same node [14, 3, 12, 19, 1, 4]....
