SWEL: hardware cache coherence protocols to map shared data onto shared caches

doi:10.1145/1854273.1854331

Proceedings Article

pp. 465-476

TLDR

A novel coherence protocol is proposed that greatly reduces the number of coherence operations and falls back on a simple broadcast-based snooping protocol when infrequent coherence is required, based on the premise that most blocks are either private to a core or read-only, and hence, do not require coherence.

Abstract:

Snooping and directory-based coherence protocols have become the de facto standard in chip multi-processors, but neither design is without drawbacks. Snooping protocols are not scalable, while directory protocols incur directory storage overhead and frequent indirections, and are more prone to design bugs. In this paper, we propose a novel coherence protocol that greatly reduces the number of coherence operations and falls back on a simple broadcast-based snooping protocol when infrequent coherence is required. This new protocol is based on the premise that most blocks are either private to a core or read-only, and hence do not require coherence. This will be especially true for future large-scale multi-core machines that will be used to execute message-passing workloads in the HPC domain, or multiple virtual machines for servers. In such systems, it is expected that a very small fraction of blocks will be both shared and frequently written, hence the need to optimize coherence protocols for a new common case. In our new protocol, dubbed SWEL (protocol states are Shared, Written, Exclusivity Level), the L1 cache attempts to store only private or read-only blocks, while shared and written blocks must reside at the shared L2 level. These determinations are made at runtime without software assistance. While accesses to blocks banished from the L1 become more expensive, SWEL can improve throughput because directory indirection is removed for many common write-sharing patterns. Compared to a MESI-based directory implementation, we see up to 15% increased performance, a maximum degradation of 2%, and an average performance increase of 2.5% using SWEL and its derivatives. Other advantages of this strategy are reduced protocol complexity (achieved by reducing transient states) and significantly less storage overhead than traditional directory protocols.
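The placement rule stated in the abstract (private or read-only blocks may live in L1, while blocks that are both shared and written stay at the shared L2) can be illustrated with a small sketch. This is not the paper's hardware design: the `BlockState` class, the `first_core` field, and the `"L1"`/`"L2"` encoding of the exclusivity level are hypothetical names chosen for illustration, and the meanings assigned to the Shared and Written bits are assumptions based only on the wording of the abstract.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BlockState:
    """Toy per-block bookkeeping, loosely following the S, W, and EL names.

    Assumptions (not from the source): `shared` means more than one core has
    touched the block, `written` means at least one write has occurred, and
    the exclusivity level is derived from those two bits.
    """
    shared: bool = False
    written: bool = False
    first_core: Optional[int] = None  # hypothetical helper to detect sharing

    def record_access(self, core: int, is_write: bool) -> None:
        # The first core to touch the block is remembered; any access by a
        # different core marks the block as shared.
        if self.first_core is None:
            self.first_core = core
        elif core != self.first_core:
            self.shared = True
        if is_write:
            self.written = True

    def l1_allowed(self) -> bool:
        # Private blocks (not shared) and read-only blocks (not written)
        # may be cached in L1; shared-and-written blocks may not.
        return not (self.shared and self.written)

    def exclusivity_level(self) -> str:
        # Assumed encoding: which cache level is allowed to hold the block.
        return "L1" if self.l1_allowed() else "L2"


if __name__ == "__main__":
    block = BlockState()
    block.record_access(core=0, is_write=True)   # private, written
    print(block.exclusivity_level())
    block.record_access(core=1, is_write=False)  # now shared and written
    print(block.exclusivity_level())
```

In this toy model a block never returns to L1 once it has been classified as shared and written. Whether and how the real protocol reclassifies blocks is not described in the abstract, so the sketch leaves that out.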
