Proceedings ArticleDOI
Reactive NUCA: near-optimal block placement and replication in distributed caches
Nikos Hardavellas, Michael Ferdman, Babak Falsafi, Anastasia Ailamaki
Vol. 37, Iss. 3, pp. 184-195
TL;DR: Reactive NUCA (R-NUCA), a distributed cache design which reacts to the class of each cache access and places blocks at the appropriate location in the cache, is proposed.
Abstract:
Increases in on-chip communication delay and the large working sets of server and scientific workloads complicate the design of the on-chip last-level cache for multicore processors. The large working sets favor a shared cache design that maximizes the aggregate cache capacity and minimizes off-chip memory requests. At the same time, the growing on-chip communication delay favors core-private caches that replicate data to minimize delays on global wires. Recent hybrid proposals offer lower average latency than conventional designs, but they address the placement requirements of only a subset of the data accessed by the application, require complex lookup and coherence mechanisms that increase latency, or fail to scale to high core counts.
In this work, we observe that the cache access patterns of a range of server and scientific workloads can be classified into distinct classes, where each class is amenable to different block placement policies. Based on this observation, we propose Reactive NUCA (R-NUCA), a distributed cache design which reacts to the class of each cache access and places blocks at the appropriate location in the cache. R-NUCA cooperates with the operating system to support intelligent placement, migration, and replication without the overhead of an explicit coherence mechanism for the on-chip last-level cache. In a range of server, scientific, and multiprogrammed workloads, R-NUCA matches the performance of the best cache design for each workload, improving performance by 14% on average over competing designs and by 32% at best, while achieving performance within 5% of an ideal cache design.
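The class-based placement described in the abstract can be sketched as follows. The three access classes (instructions, private data, shared data) follow the paper, but the function names and the toy placement rules below are a simplified illustration, not the actual hardware mechanism:

```python
# Simplified sketch of R-NUCA-style access classification and placement.
# The three classes follow the paper; the placement policy here is a toy.

def classify_access(is_instruction_fetch, sharers):
    """Classify a last-level-cache access by who uses the block."""
    if is_instruction_fetch:
        return "instruction"      # read-only code: safe to replicate near cores
    if len(sharers) <= 1:
        return "private-data"     # used by one core: place at its local slice
    return "shared-data"          # read-write shared: keep a single copy

def place_block(block_class, requester_slice, block_addr, num_slices):
    """Pick a home cache slice for a block (illustrative policy)."""
    if block_class == "shared-data":
        # Address-interleave shared data across all slices, so exactly one
        # copy exists and no coherence among LLC slices is needed for it.
        return block_addr % num_slices
    # Instructions and private data are placed at (or replicated near) the
    # requesting core's slice for low access latency.
    return requester_slice
```

The point of the classification is that each class gets the policy it benefits from: replication for read-only instructions, local placement for private data, and a single interleaved copy for shared data.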
Citations
Proceedings ArticleDOI
Clearing the clouds: a study of emerging scale-out workloads on modern hardware
Michael Ferdman, Almutaz Adileh, Onur Kocberber, Stavros Volos, Mohammad Alisafaee, Djordje Jevdjic, Cansu Kaynak, Adrian Daniel Popescu, Anastasia Ailamaki, Babak Falsafi
TL;DR: This work identifies the key micro-architectural needs of scale-out workloads, calling for a change in the trajectory of server processors that would lead to improved computational density and power efficiency in data centers.
Journal ArticleDOI
A data placement strategy in scientific cloud workflows
TL;DR: A matrix-based k-means clustering strategy for data placement in scientific cloud workflows is proposed, which, during the runtime stage, dynamically clusters newly generated datasets to the most appropriate data centres based on their dependencies.
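As a rough illustration of the idea in this summary, the sketch below represents each dataset as a binary row of a dataset-by-task dependency matrix, clusters the rows with a plain k-means, and would map each cluster to one data centre. The flat (non-matrix-optimized) k-means and all names are assumptions for illustration, not the paper's algorithm:

```python
# Toy dependency-based placement: datasets with similar task dependencies
# end up in the same cluster, i.e. the same data centre.
import random

def kmeans(rows, k, iters=20, seed=0):
    """Plain k-means over binary dependency rows (one row per dataset)."""
    rng = random.Random(seed)
    centroids = rng.sample(rows, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for r in rows:
            # Assign each dataset to the nearest centroid (squared distance).
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(r, centroids[c])))
            clusters[i].append(r)
        for c, members in enumerate(clusters):
            if members:
                # Move each centroid to the mean of its cluster's rows.
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, clusters
```

Cluster `i` would then be placed at data centre `i`, so datasets that the same tasks depend on are co-located.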
Proceedings ArticleDOI
The impact of memory subsystem resource sharing on datacenter applications
TL;DR: This paper presents a study of the importance of thread-to-core mappings for datacenter applications, since threads can be mapped to share or not share caches and bus bandwidth, and investigates the impact of co-locating threads from multiple applications with diverse memory behavior.
Proceedings ArticleDOI
Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture
Yakun Sophia Shao, Jason Clemons, Rangharajan Venkatesan, Brian Zimmer, Matthew Fojtik, Nan Jiang, Ben Keller, Alicia Klinefelter, Nathaniel Pinckney, Priyanka Raina, Stephen G. Tell, Yanqing Zhang, William J. Dally, Joel Emer, C. Thomas Gray, Brucek Khailany, Stephen W. Keckler
TL;DR: This work investigates and quantifies the costs and benefits of using MCMs with fine-grained chiplets for deep learning inference, an application area with large compute and on-chip storage requirements, and introduces three tiling optimizations that improve data locality.
Journal ArticleDOI
Data-oriented transaction execution
TL;DR: DORA, a system that decomposes each transaction into smaller actions and assigns actions to threads based on which data each action is about to access, is designed; it attains up to 4.8x higher throughput than a state-of-the-art storage engine when running a variety of synthetic and real-world OLTP workloads.
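The data-oriented decomposition in this summary can be sketched as below: each worker thread owns a disjoint partition of the data, and a transaction is broken into per-key actions routed to the owning worker's queue, so no worker ever touches another's partition. The queue/worker structure and all names are illustrative assumptions, not the actual storage-engine design:

```python
# Toy data-oriented execution: actions are routed to the thread that owns
# the data they touch, so partitions need no latching.
import queue
import threading

NUM_WORKERS = 4

class Worker(threading.Thread):
    """Owns a disjoint partition of the data; applies its actions serially."""
    def __init__(self):
        super().__init__()
        self.inbox = queue.Queue()
        self.partition = {}   # private to this worker: only it ever writes here

    def run(self):
        while True:
            action = self.inbox.get()
            if action is None:   # sentinel: shut down
                return
            key, value = action
            self.partition[key] = value

workers = [Worker() for _ in range(NUM_WORKERS)]

def owner(key):
    """Map a key to the worker that owns it (consistent within one run)."""
    return hash(key) % NUM_WORKERS

def execute(transaction_writes):
    """Decompose a transaction into actions, routed by data ownership."""
    for key, value in transaction_writes.items():
        workers[owner(key)].inbox.put((key, value))
```

Because ownership is fixed, each key is only ever modified by one thread, which is the property that lets such a design avoid centralized locking.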
References
Proceedings ArticleDOI
Route packets, not wires: on-chip interconnection networks
William J. Dally, Brian Towles
TL;DR: This paper introduces the concept of on-chip networks, sketches a simple network, and discusses some challenges in the architecture and design of these networks.
Journal ArticleDOI
Niagara: a 32-way multithreaded Sparc processor
TL;DR: The Niagara processor implements a thread-rich architecture designed to provide a high-performance solution for commercial server applications that exploits the thread-level parallelism inherent to server applications, while targeting low levels of power consumption.
Journal ArticleDOI
The Torus Routing Chip
TL;DR: The torus routing chip (TRC) is a self-timed chip that performs deadlock-free cut-through routing in k-ary n-cube multiprocessor interconnection networks using a new method of deadlock avoidance called virtual channels.
Proceedings ArticleDOI
An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches
TL;DR: This paper proposes physical designs for these Non-Uniform Cache Architectures (NUCAs) and extends these physical designs with logical policies that allow important data to migrate toward the processor within the same level of the cache.
Journal ArticleDOI
Larrabee: a many-core x86 architecture for visual computing
Larry D. Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth, Michael Abrash, Pradeep Dubey, Stephen Junkins, Adam T. Lake, Jeremy Sugerman, Robert Dale Cavin, Roger Espasa, Ed Grochowski, Toni Juan, Pat Hanrahan
TL;DR: This article consists of a collection of slides from the authors' conference presentation; topics discussed include architecture convergence, the Larrabee architecture, and the graphics pipeline.