Accelerating critical section execution with asymmetric multi-core architectures

doi:10.1145/1508244.1508274

Home
/
Papers
/
Accelerating critical section execution with asymmetric multi-core architectures

Proceedings Article•DOI•

Accelerating critical section execution with asymmetric multi-core architectures

M. Aater Suleman¹, Onur Mutlu², Moinuddin K. Qureshi³, Yale N. Patt¹•Institutions (3)

University of Texas at Austin¹, Carnegie Mellon University², IBM³

07 Mar 2009-Vol. 44, Iss: 3, pp 253-264

TL;DR: The proposed accelerated critical sections mechanism reduces this limitation by executing critical sections on the high-performance core of an asymmetric chip multiprocessor, which can execute them faster than the smaller cores can.

read less

Abstract: To improve the performance of a single application on Chip Multiprocessors (CMPs), the application must be split into threads which execute concurrently on multiple cores. In multi-threaded applications, critical sections are used to ensure that only one thread accesses shared data at any given time. Critical sections can serialize the execution of threads, which significantly reduces performance and scalability.This paper proposes Accelerated Critical Sections (ACS), a technique that leverages the high-performance core(s) of an Asymmetric Chip Multiprocessor (ACMP) to accelerate the execution of critical sections. In ACS, selected critical sections are executed by a high-performance core, which can execute the critical section faster than the other, smaller cores. As a result, ACS reduces serialization: it lowers the likelihood of threads waiting for a critical section to finish. Our evaluation on a set of 12 critical-section-intensive workloads shows that ACS reduces the average execution time by 34% compared to an equal-area 32T-core symmetric CMP and by 23% compared to an equal-area ACMP. Moreover, for 7 out of the 12 workloads, ACS improves scalability by increasing the number of threads at which performance saturates.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Journal Article•DOI•

Dark Silicon and the End of Multicore Scaling

[...]

Hadi Esmaeilzadeh¹, Emily Blem², R. St. Amant³, Karthikeyan Sankaralingam², Doug Burger⁴ - Show less +1 more•Institutions (4)

University of Washington¹, University of Wisconsin-Madison², University of Texas at Austin³, Microsoft⁴

01 May 2012-IEEE Micro

TL;DR: A comprehensive study that projects the speedup potential of future multicores and examines the underutilization of integration capacity-dark silicon-is timely and crucial.

...read moreread less

Abstract: A key question for the microprocessor research and design community is whether scaling multicores will provide the performance and value needed to scale down many more technology generations. To provide a quantitative answer to this question, a comprehensive study that projects the speedup potential of future multicores and examines the underutilization of integration capacity-dark silicon-is timely and crucial.

...read moreread less

1,556 citations

Proceedings Article•DOI•

Dark silicon and the end of multicore scaling

[...]

Hadi Esmaeilzadeh¹, Emily Blem², Renee St. Amant³, Karthikeyan Sankaralingam², Doug Burger⁴ - Show less +1 more•Institutions (4)

University of Washington¹, University of Wisconsin-Madison², University of Texas at Austin³, Microsoft⁴

04 Jun 2011

TL;DR: The study shows that regardless of chip organization and topology, multicore scaling is power limited to a degree not widely appreciated by the computing community.

...read moreread less

Abstract: Since 2005, processor designers have increased core counts to exploit Moore's Law scaling, rather than focusing on single-core performance. The failure of Dennard scaling, to which the shift to multicore parts is partially a response, may soon limit multicore scaling just as single-core scaling has been curtailed. This paper models multicore scaling limits by combining device scaling, single-core scaling, and multicore scaling to measure the speedup potential for a set of parallel workloads for the next five technology generations. For device scaling, we use both the ITRS projections and a set of more conservative device scaling parameters. To model single-core scaling, we combine measurements from over 150 processors to derive Pareto-optimal frontiers for area/performance and power/performance. Finally, to model multicore scaling, we build a detailed performance model of upper-bound performance and lower-bound core power. The multicore designs we study include single-threaded CPU-like and massively threaded GPU-like multicore chip organizations with symmetric, asymmetric, dynamic, and composed topologies. The study shows that regardless of chip organization and topology, multicore scaling is power limited to a degree not widely appreciated by the computing community. Even at 22 nm (just one year from now), 21% of a fixed-size chip must be powered off, and at 8 nm, this number grows to more than 50%. Through 2024, only 7.9x average speedup is possible across commonly used parallel workloads, leaving a nearly 24-fold gap from a target of doubled performance per generation.

...read moreread less

1,379 citations

Proceedings Article•DOI•

A scalable processing-in-memory accelerator for parallel graph processing

[...]

Junwhan Ahn¹, Sungpack Hong², Sungjoo Yoo¹, Onur Mutlu³, Kiyoung Choi¹ - Show less +1 more•Institutions (3)

Seoul National University¹, Oracle Corporation², Carnegie Mellon University³

13 Jun 2015

TL;DR: This work argues that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve memory-capacity-proportional performance and designs a programmable PIM accelerator for large-scale graph processing called Tesseract.

...read moreread less

Abstract: The explosion of digital data and the ever-growing need for fast data analysis have made in-memory big-data processing in computer systems increasingly important. In particular, large-scale graph processing is gaining attention due to its broad applicability from social science to machine learning. However, scalable hardware design that can efficiently process large graphs in main memory is still an open problem. Ideally, cost-effective and scalable graph processing systems can be realized by building a system whose performance increases proportionally with the sizes of graphs that can be stored in the system, which is extremely challenging in conventional systems due to severe memory bandwidth limitations. In this work, we argue that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve such an objective. The key modern enabler for PIM is the recent advancement of the 3D integration technology that facilitates stacking logic and memory dies in a single package, which was not available when the PIM concept was originally examined. In order to take advantage of such a new technology to enable memory-capacity-proportional performance, we design a programmable PIM accelerator for large-scale graph processing called Tesseract. Tesseract is composed of (1) a new hardware architecture that fully utilizes the available memory bandwidth, (2) an efficient method of communication between different memory partitions, and (3) a programming interface that reflects and exploits the unique hardware design. It also includes two hardware prefetchers specialized for memory access patterns of graph processing, which operate based on the hints provided by our programming model. Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems.

...read moreread less

718 citations

Proceedings Article•DOI•

Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior

[...]

Yoongu Kim¹, Michael K. Papamichael¹, Onur Mutlu¹, Mor Harchol-Balter¹•Institutions (1)

Carnegie Mellon University¹

04 Dec 2010

TL;DR: This paper presents a new memory scheduling algorithm that addresses system throughput and fairness separately with the goal of achieving the best of both, and evaluates TCM on a wide variety of multiprogrammed workloads and compares its performance to four previously proposed scheduling algorithms, finding that TCM achieves both the best system throughputand fairness.

...read moreread less

Abstract: In a modern chip-multiprocessor system, memory is a shared resource among multiple concurrently executing threads. The memory scheduling algorithm should resolve memory contention by arbitrating memory access in such a way that competing threads progress at a relatively fast and even pace, resulting in high system throughput and fairness. Previously proposed memory scheduling algorithms are predominantly optimized for only one of these objectives: no scheduling algorithm provides the best system throughput and best fairness at the same time. This paper presents a new memory scheduling algorithm that addresses system throughput and fairness separately with the goal of achieving the best of both. The main idea is to divide threads into two separate clusters and employ different memory request scheduling policies in each cluster. Our proposal, Thread Cluster Memory scheduling (TCM), dynamically groups threads with similar memory access behavior into either the latency-sensitive (memory-non-intensive) or the bandwidth-sensitive (memory-intensive) cluster. TCM introduces three major ideas for prioritization: 1) we prioritize the latency-sensitive cluster over the bandwidth-sensitive cluster to improve system throughput, 2) we introduce a ``niceness'' metric that captures a thread's propensity to interfere with other threads, 3) we use niceness to periodically shuffle the priority order of the threads in the bandwidth-sensitive cluster to provide fair access to each thread in a way that reduces inter-thread interference. On the one hand, prioritizing memory-non-intensive threads significantly improves system throughput without degrading fairness, because such ``light'' threads only use a small fraction of the total available memory bandwidth. On the other hand, shuffling the priority order of memory-intensive threads improves fairness because it ensures no thread is disproportionately slowed down or starved. We evaluate TCM on a wide variety of multiprogrammed workloads and compare its performance to four previously proposed scheduling algorithms, finding that TCM achieves both the best system throughput and fairness. Averaged over 96 workloads on a 24-core system with 4 memory channels, TCM improves system throughput and reduces maximum slowdown by 4.6%/38.6% compared to ATLAS (previous work providing the best system throughput) and 7.6%/4.6% compared to PAR-BS (previous work providing the best fairness).

...read moreread less

375 citations

Cites background from "Accelerating critical section execu..."

...In contrast, the execution time of the second type of multithreaded applications is determined by slow-running critical threads [22, 1, 2]....
[...]

Thr ead Cluster Memory Scheduling: Exploiting Diffe rences in Memory Access Behavior

[...]

Yoongu Kim, Michael K. Papamichael, Onur Mutlu

01 Jan 2010

TL;DR: TCM as discussed by the authors dynamically groups threads with similar memory access behavior into either the latency-sensitive (memory non-intensive) or the bandwidth-intensive (memory intensive) clusters, and introduces a ''niceness'' metric that captures a thread's propensity to interfere with other threads.

...read moreread less

358 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66

Collapse

References

PDF

Open Access

More filters

Proceedings Article•DOI•

The SPLASH-2 programs: characterization and methodological considerations

[...]

Steven Cameron Woo¹, Moriyoshi Ohara¹, Evan Torrie¹, Jaswinder Pal Singh², Anoop Gupta¹ - Show less +1 more•Institutions (2)

Stanford University¹, Princeton University²

01 May 1995

TL;DR: This paper quantitatively characterize the SPLASH-2 programs in terms of fundamental properties and architectural interactions that are important to understand them well, including the computational load balance, communication to computation ratio and traffic needs, important working set sizes, and issues related to spatial locality.

...read moreread less

Abstract: The SPLASH-2 suite of parallel applications has recently been released to facilitate the study of centralized and distributed shared-address-space multiprocessors. In this context, this paper has two goals. One is to quantitatively characterize the SPLASH-2 programs in terms of fundamental properties and architectural interactions that are important to understand them well. The properties we study include the computational load balance, communication to computation ratio and traffic needs, important working set sizes, and issues related to spatial locality, as well as how these properties scale with problem size and the number of processors. The other, related goal is methodological: to assist people who will use the programs in architectural evaluations to prune the space of application and machine parameters in an informed and meaningful way. For example, by characterizing the working sets of the applications, we describe which operating points in terms of cache size and problem size are representative of realistic situations, which are not, and which re redundant. Using SPLASH-2 as an example, we hope to convey the importance of understanding the interplay of problem size, number of processors, and working sets in designing experiments and interpreting their results.

...read moreread less

4,002 citations

Proceedings Article•DOI•

Validity of the single processor approach to achieving large scale computing capabilities

[...]

Gene Myron Amdahl¹•Institutions (1)

IBM¹

18 Apr 1967

TL;DR: In this paper, the authors argue that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers in such a manner as to permit cooperative solution.

...read moreread less

Abstract: For over a decade prophets have voiced the contention that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers in such a manner as to permit cooperative solution. Variously the proper direction has been pointed out as general purpose computers with a generalized interconnection of memories, or as specialized computers with geometrically related memory interconnections and controlled by one or more instruction streams.

...read moreread less

3,653 citations

Proceedings Article•DOI•

Transactional memory: architectural support for lock-free data structures

[...]

Maurice Herlihy, J. Eliot B. Moss¹•Institutions (1)

University of Massachusetts Amherst¹

01 May 1993

TL;DR: Simulation results show that transactional memory matches or outperforms the best known locking techniques for simple benchmarks, even in the absence of priority inversion, convoying, and deadlock.

...read moreread less

Abstract: A shared data structure is lock-free if its operations do not require mutual exclusion. If one process is interrupted in the middle of an operation, other processes will not be prevented from operating on that object. In highly concurrent systems, lock-free data structures avoid common problems associated with conventional locking techniques, including priority inversion, convoying, and difficulty of avoiding deadlock. This paper introduces transactional memory, a new multiprocessor architecture intended to make lock-free synchronization as efficient (and easy to use) as conventional techniques based on mutual exclusion. Transactional memory allows programmers to define customized read-modify-write operations that apply to multiple, independently-chosen words of memory. It is implemented by straightforward extensions to any multiprocessor cache-coherence protocol. Simulation results show that transactional memory matches or outperforms the best known locking techniques for simple benchmarks, even in the absence of priority inversion, convoying, and deadlock.

...read moreread less

2,406 citations

Journal Article•DOI•

Branch-and-Bound Methods: A Survey

[...]

Eugene L. Lawler¹, D. E. Wood¹•Institutions (1)

University of Michigan¹

01 Aug 1966-Operations Research

TL;DR: The essential features of the branch-and-bound approach to constrained optimization are described, and several specific applications are reviewed, including integer linear programming Land-Doig and Balas methods, nonlinear programming minimization of nonconvex objective functions, and the quadratic assignment problem Gilmore and Lawler methods.

...read moreread less

Abstract: The essential features of the branch-and-bound approach to constrained optimization are described, and several specific applications are reviewed. These include integer linear programming Land-Doig and Balas methods, nonlinear programming minimization of nonconvex objective functions, the traveling-salesman problem Eastman and Little, et al. methods, and the quadratic assignment problem Gilmore and Lawler methods. Computational considerations, including trade-offs between length of computation and storage requirements, are discussed and a comparison with dynamic programming is made. Various applications outside the domain of mathematical programming are also mentioned.

...read moreread less

1,915 citations

Journal Article•DOI•

Implementing remote procedure calls

[...]

Andrew D. Birrell¹, Bruce Jay Nelson¹•Institutions (1)

PARC¹

01 Feb 1984-ACM Transactions on Computer Systems

TL;DR: The overall structure of the RPC mechanism, the facilities for binding RPC clients, the transport level communication protocol, and some performance measurements are described, including some optimizations used to achieve high performance and to minimize the load on server machines that have many clients.

...read moreread less

Abstract: Remote procedure calls (RPC) appear to be a useful paradig m for providing communication across a network between programs written in a high-level language. This paper describes a package providing a remote procedure call facility, the options that face the designer of such a package, and the decisions ~we made. We describe the overall structure of our RPC mechanism, our facilities for binding RPC clients, the transport level communication protocol, and some performance measurements. We include descriptioro~ of some optimizations used to achieve high performance and to minimize the load on server machines that have many clients.

...read moreread less

1,868 citations