scispace - formally typeset
Proceedings ArticleDOI

SWEL: hardware cache coherence protocols to map shared data onto shared caches

TLDR
A novel coherence protocol is proposed that greatly reduces the number of coherence operations and falls back on a simple broadcast-based snooping protocol when infrequent coherence is required, based on the premise that most blocks are either private to a core or read-only, and hence, do not require coherence.
Abstract
Snooping and directory-based coherence protocols have become the de facto standard in chip multi-processors, but neither design is without drawbacks. Snooping protocols are not scalable, while directory protocols incur directory storage overhead, frequent indirections, and are more prone to design bugs. In this paper, we propose a novel coherence protocol that greatly reduces the number of coherence operations and falls back on a simple broadcast-based snooping protocol when infrequent coherence is required. This new protocol is based on the premise that most blocks are either private to a core or read-only, and hence, do not require coherence. This will be especially true for future large-scale multi-core machines that will be used to execute message-passing workloads in the HPC domain, or multiple virtual machines for servers. In such systems, it is expected that a very small fraction of blocks will be both shared and frequently written, hence the need to optimize coherence protocols for a new common case. In our new protocol, dubbed SWEL (protocol states are Shared, Written, Exclusivity Level), the L1 cache attempts to store only private or read-only blocks, while shared and written blocks must reside at the shared L2 level. These determinations are made at runtime without software assistance. While accesses to blocks banished from the L1 become more expensive, SWEL can improve throughput because directory indirection is removed for many common write-sharing patterns. Compared to a MESI based directory implementation, we see up to 15% increased performance, a maximum degradation of 2%, and an average performance increase of 2.5% using SWEL and its derivatives. Other advantages of this strategy are reduced protocol complexity (achieved by reducing transient states) and significantly less storage overhead than traditional directory protocols.

read more

Content maybe subject to copyright    Report

Citations
More filters
Proceedings ArticleDOI

DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism

TL;DR: DeNovo is presented, a hardware architecture motivated by a disciplined shared-memory programming model that allows DeNovo to seamlessly integrate message passing-like interactions within a global address space for improved design complexity, performance, and efficiency.
Proceedings ArticleDOI

Complexity-effective multicore coherence

TL;DR: A virtually costless coherence that outperforms a MESI directory protocol while at the same time reducing shared cache and network energy consumption for 15 parallel benchmarks, on 16 cores is shown.
Patent

System and method for simplifying cache coherence using multiple write policies

TL;DR: In this article, a multi-core cache coherence system with a local/shared cache hierarchy is described. The system includes multiple processor cores, a main memory, and a local cache memory associated with each core for storing cache lines accessible only by the associated core.
Proceedings ArticleDOI

TSO-CC: Consistency directed cache coherence for TSO

TL;DR: TSO-CC does not track sharers, and instead relies on self-invalidation and detection of potential acquires using timestamps to satisfy the TSO memory consistency model lazily, and achieves average performance comparable to a MESI directory protocol.
Proceedings ArticleDOI

DeNovoND: efficient hardware support for disciplined non-determinism

TL;DR: DeNovoND is proposed, a system that supports lock-based, disciplined non-determinism, with the simplicity, performance, and energy benefits of DeNovo, and a coherence protocol that does not require transient states, invalidation traffic, or directories, and does not incur false sharing.
References
More filters
Proceedings ArticleDOI

The SPLASH-2 programs: characterization and methodological considerations

TL;DR: This paper quantitatively characterize the SPLASH-2 programs in terms of fundamental properties and architectural interactions that are important to understand them well, including the computational load balance, communication to computation ratio and traffic needs, important working set sizes, and issues related to spatial locality.
Proceedings ArticleDOI

The PARSEC benchmark suite: characterization and architectural implications

TL;DR: This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs), and shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic.
Journal ArticleDOI

The Nas Parallel Benchmarks

TL;DR: A new set of benchmarks has been developed for the performance evaluation of highly parallel supercom puters that mimic the computation and data move ment characteristics of large-scale computational fluid dynamics applications.
Journal ArticleDOI

Simics: A full system simulation platform

TL;DR: Simics is a platform for full system simulation that can run actual firmware and completely unmodified kernel and driver code, and it provides both functional accuracy for running commercial workloads and sufficient timing accuracy to interface to detailed hardware models.
Book

Parallel Computer Architecture: A Hardware/Software Approach

TL;DR: This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures and provides comprehensive discussions of parallel programming for high performance and of workload-driven evaluation, based on understanding hardware-software interactions.
Related Papers (5)