Journal ArticleDOI

Optimizing Replication, Communication, and Capacity Allocation in CMPs

Zeshan A. Chishti, +2 more
- Vol. 33, Iss: 2, pp 357-368
TLDR
This work proposes controlled replication, which reduces capacity pressure by forgoing extra copies in some cases and obtaining the data from an existing on-chip copy, and capacity stealing, in which private data that exceeds a core's capacity is placed in a neighboring cache with less capacity demand.
Abstract
Chip multiprocessors (CMPs) substantially increase capacity pressure on the on-chip memory hierarchy while requiring fast access. Neither private nor shared caches can provide both large capacity and fast access in CMPs. We observe that compared to symmetric multiprocessors (SMPs), CMPs change the latency-capacity tradeoff in two significant ways. We propose three novel ideas to exploit the changes: (1) Though placing copies close to requestors allows fast access for read-only sharing, the copies also reduce the already-limited on-chip capacity in CMPs. We propose controlled replication to reduce capacity pressure by not making extra copies in some cases, and obtaining the data from an existing on-chip copy. This option is not suitable for SMPs because obtaining data from another processor is expensive and capacity is not limited to on-chip storage. (2) Unlike SMPs, CMPs allow fast on-chip communication between processors for read-write sharing. Instead of incurring slow access to read-write shared data through coherence misses as do SMPs, we propose in-situ communication to provide fast access without making copies or incurring coherence misses. (3) Accessing neighbors' caches is not as expensive in CMPs as it is in SMPs. We propose capacity stealing in which private data that exceeds a core's capacity is placed in a neighboring cache with less capacity demand. To incorporate our ideas, we use a hybrid of private, per-processor tag arrays and a shared data array. Because the shared data array is slow, we employ non-uniform access and distance associativity from previous proposals to hold frequently-accessed data in regions close to the requestor. We extend the previously-proposed Non-uniform access with Replacement And Placement usIng Distance associativity (NuRAPID) to CMPs, and call our cache CMP-NuRAPID.
Our results show that for a 4-core CMP with 8 MB cache, CMP-NuRAPID improves performance by 13% over a shared cache and 8% over private caches for three commercial multithreaded workloads.
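The controlled-replication policy described in the abstract can be sketched as a small decision procedure: a core serves its first read of a shared block from an existing on-chip copy, and only replicates the block into its local slice when the block shows reuse. This is a minimal toy model, not the paper's implementation; the class and method names, and the replicate-on-second-access trigger, are illustrative assumptions.

```python
# Toy sketch of controlled replication in a CMP cache slice.
# First access to an on-chip shared block is served remotely;
# a repeat access triggers a local replica to exploit reuse.
class ControlledReplicationCache:
    def __init__(self):
        self.local = set()          # blocks replicated in this core's slice
        self.touched_once = set()   # blocks served once from a remote copy

    def read(self, block, on_chip_copies):
        """Return where the read was served from: 'local', 'remote', or 'memory'."""
        if block in self.local:
            return "local"
        if block in on_chip_copies:
            if block in self.touched_once:
                # Second access: the block appears hot, so replicate it locally.
                self.local.add(block)
            else:
                self.touched_once.add(block)
            # Either way, this access is served by the existing on-chip copy.
            return "remote"
        # Miss everywhere on chip: fetch from memory and install locally.
        self.local.add(block)
        return "memory"
```

Under this sketch, a block read once and never again costs one remote access but no extra capacity, which is the tradeoff the abstract describes for read-only sharing.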



Citations
Proceedings ArticleDOI

Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches

TL;DR: In this article, the authors propose a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources.
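The utility-based partitioning idea summarized above can be illustrated with a simplified greedy allocator: given each application's miss curve, assign cache ways one at a time to whichever application gains the largest marginal reduction in misses. This is a hedged sketch with hypothetical names; the cited mechanism uses stack-distance counters and a lookahead algorithm, which this greedy variant only approximates.

```python
def partition_ways(miss_curves, total_ways):
    """Greedily assign ways to the app with the largest marginal miss reduction.

    miss_curves[i][w] gives the misses of application i when granted w ways;
    each curve must cover 0..total_ways entries.
    """
    alloc = [0] * len(miss_curves)
    for _ in range(total_ways):
        best_app, best_gain = 0, -1
        for i, curve in enumerate(miss_curves):
            # Marginal utility of giving app i one more way.
            gain = curve[alloc[i]] - curve[alloc[i] + 1]
            if gain > best_gain:
                best_app, best_gain = i, gain
        alloc[best_app] += 1
    return alloc
```

For example, an application whose miss curve flattens quickly stops receiving ways early, leaving the remaining capacity to applications that still benefit.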
Proceedings ArticleDOI

Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0

TL;DR: This work implements two major extensions to the CACTI cache modeling tool that focus on interconnect design for a large cache, and adopts state-of-the-art design space exploration strategies for non-uniform cache access (NUCA).
Proceedings ArticleDOI

A novel architecture of the 3D stacked MRAM L2 cache for CMPs

TL;DR: This paper stacks MRAM-based L2 caches directly atop CMPs and compares them against SRAM counterparts in terms of performance and energy, and proposes two architectural techniques: a read-preemptive write buffer and an SRAM-MRAM hybrid L2 cache.
Proceedings ArticleDOI

Reactive NUCA: near-optimal block placement and replication in distributed caches

TL;DR: Reactive NUCA (R-NUCA), a distributed cache design which reacts to the class of each cache access and places blocks at the appropriate location in the cache, is proposed.
Journal ArticleDOI

Design and Management of 3D Chip Multiprocessors Using Network-in-Memory

TL;DR: A router architecture and a topology design that exploit a network architecture embedded into the L2 cache memory are proposed; the results demonstrate that a 3D L2 memory architecture performs much better than conventional two-dimensional designs across different numbers of layers and vertical connections.
References
Proceedings ArticleDOI

The SPLASH-2 programs: characterization and methodological considerations

TL;DR: This paper quantitatively characterizes the SPLASH-2 programs in terms of fundamental properties and architectural interactions that are important for understanding them well, including computational load balance, communication-to-computation ratio and traffic needs, important working-set sizes, and issues related to spatial locality.
Journal ArticleDOI

Simics: A full system simulation platform

TL;DR: Simics is a platform for full system simulation that can run actual firmware and completely unmodified kernel and driver code, and it provides both functional accuracy for running commercial workloads and sufficient timing accuracy to interface to detailed hardware models.
Book

Parallel Computer Architecture: A Hardware/Software Approach

TL;DR: This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures and provides comprehensive discussions of parallel programming for high performance and of workload-driven evaluation, based on understanding hardware-software interactions.
Proceedings ArticleDOI

Generating representative Web workloads for network and server performance evaluation

TL;DR: This paper applies a number of observations of Web server usage to create a realistic Web workload generation tool which mimics a set of real users accessing a server and addresses the technical challenges to satisfying this large set of simultaneous constraints on the properties of the reference stream.
Proceedings ArticleDOI

An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches

TL;DR: This paper proposes physical designs for these Non-Uniform Cache Architectures (NUCAs) and extends these physical designs with logical policies that allow important data to migrate toward the processor within the same level of the cache.