Proceedings ArticleDOI

Reactive NUCA: near-optimal block placement and replication in distributed caches

TLDR
Reactive NUCA (R-NUCA), a distributed cache design which reacts to the class of each cache access and places blocks at the appropriate location in the cache, is proposed.
Abstract
Increases in on-chip communication delay and the large working sets of server and scientific workloads complicate the design of the on-chip last-level cache for multicore processors. The large working sets favor a shared cache design that maximizes the aggregate cache capacity and minimizes off-chip memory requests. At the same time, the growing on-chip communication delay favors core-private caches that replicate data to minimize delays on global wires. Recent hybrid proposals offer lower average latency than conventional designs, but they address the placement requirements of only a subset of the data accessed by the application, require complex lookup and coherence mechanisms that increase latency, or fail to scale to high core counts. In this work, we observe that the cache access patterns of a range of server and scientific workloads can be classified into distinct classes, where each class is amenable to different block placement policies. Based on this observation, we propose Reactive NUCA (R-NUCA), a distributed cache design which reacts to the class of each cache access and places blocks at the appropriate location in the cache. R-NUCA cooperates with the operating system to support intelligent placement, migration, and replication without the overhead of an explicit coherence mechanism for the on-chip last-level cache. In a range of server, scientific, and multiprogrammed workloads, R-NUCA matches the performance of the best cache design for each workload, improving performance by 14% on average over competing designs and by 32% at best, while achieving performance within 5% of an ideal cache design.
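The abstract describes a classify-then-place flow: each access is assigned to a class, and each class gets its own placement policy. Below is a minimal Python sketch of that idea; the AccessClass names, the page-granularity classify() heuristic, and the place_block() policies (local placement for private data, address interleaving for shared data, clustered replication for instructions) are illustrative assumptions drawn from the abstract, not the paper's actual mechanism.

from enum import Enum, auto

class AccessClass(Enum):
    PRIVATE_DATA = auto()   # data touched by a single core
    SHARED_DATA = auto()    # read-write data touched by many cores
    INSTRUCTION = auto()    # code blocks, read-only and widely shared

def classify(accessor_cores: set, is_instruction: bool) -> AccessClass:
    # Assumption: the OS tracks, at page granularity, which cores
    # have touched a page and whether accesses are instruction fetches.
    if is_instruction:
        return AccessClass.INSTRUCTION
    if len(accessor_cores) <= 1:
        return AccessClass.PRIVATE_DATA
    return AccessClass.SHARED_DATA

def place_block(cls: AccessClass, block_addr: int, requesting_core: int,
                n_slices: int, cluster_size: int = 4) -> int:
    # Returns the index of the last-level-cache slice that holds the block.
    if cls is AccessClass.PRIVATE_DATA:
        # Keep private data in the requesting core's local slice.
        return requesting_core
    if cls is AccessClass.SHARED_DATA:
        # Address-interleave shared data across all slices, unreplicated,
        # so a block has exactly one on-chip location (no coherence needed).
        return block_addr % n_slices
    # Instructions: replicate within a small cluster of nearby slices,
    # spreading blocks across the cluster's members.
    cluster_base = (requesting_core // cluster_size) * cluster_size
    return cluster_base + (block_addr % cluster_size)

With 16 slices and 4-slice clusters, for example, core 5 would always find its private data in slice 5, shared data spread across all 16 slices, and instruction blocks served from slices 4 through 7.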



Citations
Proceedings ArticleDOI

Clearing the clouds: a study of emerging scale-out workloads on modern hardware

TL;DR: This work identifies the key micro-architectural needs of scale-out workloads, calling for a change in the trajectory of server processors that would lead to improved computational density and power efficiency in data centers.
Journal ArticleDOI

A data placement strategy in scientific cloud workflows

TL;DR: A matrix-based k-means clustering strategy for data placement in scientific cloud workflows is proposed that dynamically clusters newly generated datasets to the most appropriate data centres, based on their dependencies, during the runtime stage.
Proceedings ArticleDOI

The impact of memory subsystem resource sharing on datacenter applications

TL;DR: This paper presents a study of the importance of thread-to-core mappings for datacenter applications, since threads can be mapped to share or not to share caches and bus bandwidth, and investigates the impact of co-locating threads from multiple applications with diverse memory behavior.
Proceedings ArticleDOI

Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture

TL;DR: This work investigates and quantifies the costs and benefits of using MCMs with fine-grained chiplets for deep learning inference, an application area with large compute and on-chip storage requirements, and introduces three tiling optimizations that improve data locality.
Journal ArticleDOI

Data-oriented transaction execution

TL;DR: This paper designs DORA, a system that decomposes each transaction into smaller actions and assigns actions to threads based on which data each action is about to access; DORA attains up to 4.8x higher throughput than a state-of-the-art storage engine when running a variety of synthetic and real-world OLTP workloads.
References
Proceedings ArticleDOI

Route packets, not wires: on-chip interconnection networks

TL;DR: This paper introduces the concept of on-chip networks, sketches a simple network, and discusses some challenges in the architecture and design of these networks.
Journal ArticleDOI

Niagara: a 32-way multithreaded Sparc processor

TL;DR: The Niagara processor implements a thread-rich architecture that exploits the thread-level parallelism inherent to server applications, providing a high-performance solution for commercial server workloads while targeting low levels of power consumption.
Journal ArticleDOI

The Torus Routing Chip

TL;DR: The torus routing chip (TRC) is a self-timed chip that performs deadlock-free cut-through routing in k-ary n-cube multiprocessor interconnection networks using a new method of deadlock avoidance called virtual channels.
Proceedings ArticleDOI

An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches

TL;DR: This paper proposes physical designs for these Non-Uniform Cache Architectures (NUCAs) and extends these physical designs with logical policies that allow important data to migrate toward the processor within the same level of the cache.
Journal ArticleDOI

Larrabee: a many-core x86 architecture for visual computing

TL;DR: This article consists of a collection of slides from the author's conference presentation; topics discussed include architecture convergence, the Larrabee architecture, and the graphics pipeline.