Proceedings ArticleDOI

Loop-Aware Memory Prefetching Using Code Block Working Sets

TL;DR
This paper presents the code block working set (CBWS) prefetcher, which captures the working set of complete loop iterations using a single context and improves the performance of existing prefetchers when dealing with tight loops.
Abstract
Memory prefetchers predict streams of memory addresses that are likely to be accessed by recurring invocations of a static instruction. They identify an access pattern and prefetch the data that is expected to be accessed by pending invocations of that instruction. A stream, or a prefetch context, is thus typically composed of a trigger instruction and an access pattern. Recurring code blocks, such as loop iterations, may, however, include multiple memory instructions. Accurate data prefetching for recurring code blocks therefore requires tight coordination across multiple prefetch contexts. This paper presents the code block working set (CBWS) prefetcher, which captures the working set of complete loop iterations using a single context. The prefetcher is based on the observation that code block working sets are highly interdependent across tight loop iterations. Using automated annotation of tight loops, the prefetcher tracks and predicts the working sets of complete loop iterations. The proposed CBWS prefetcher is evaluated using a set of benchmarks from the SPEC CPU2006, PARSEC, SPLASH, and Parboil suites. Our evaluation shows that the CBWS prefetcher improves the performance of existing prefetchers when dealing with tight loops. For example, we show that integrating the CBWS prefetcher with the state-of-the-art spatial memory streaming (SMS) prefetcher achieves an average speedup of 1.16× (up to 4×) compared to the standalone SMS prefetcher.
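
To make the idea concrete, below is a minimal sketch of how a CBWS-style table might track and replay a loop iteration's working set. This is not the paper's design: the structure names (CbwsEntry, CbwsPrefetcher), the hook points (on_loop_head, on_access), and the base-plus-stride replay policy are assumptions introduced here for illustration. The paper itself only states that annotated tight loops are tracked as single prefetch contexts whose working sets are predicted across iterations.

// Minimal sketch of a code-block-working-set (CBWS) style prefetch table.
// NOT the paper's implementation: names, hook points, and the
// base-plus-stride replay policy are assumptions made for illustration only.
#include <cstdint>
#include <unordered_map>
#include <vector>

using Addr = uint64_t;
constexpr unsigned kBlockShift = 6;     // assume 64-byte cache blocks

struct CbwsEntry {                      // one context per annotated tight loop
    Addr base = 0;                      // first block touched this iteration
    Addr prev_base = 0;                 // first block of the previous iteration
    std::vector<int64_t> prev_offsets;  // previous iteration's working set
    std::vector<int64_t> current;       // offsets recorded this iteration
    bool recording = false;
    bool iter_started = false;
    bool have_prev = false;
};

class CbwsPrefetcher {
public:
    // Called when an annotated tight-loop head (loop_pc) is reached:
    // the working set recorded for the iteration that just ended is saved.
    void on_loop_head(Addr loop_pc) {
        CbwsEntry &e = table_[loop_pc];
        if (e.recording && !e.current.empty()) {
            e.prev_base = e.base;
            e.prev_offsets = e.current;
            e.have_prev = true;
        }
        e.current.clear();
        e.recording = true;
        e.iter_started = false;
    }

    // Called for every memory access issued from inside the loop body.
    // The first access of an iteration fixes the new base; the previous
    // iteration's offsets, shifted by the observed base-to-base stride,
    // are returned as block addresses to prefetch for the next iteration.
    std::vector<Addr> on_access(Addr loop_pc, Addr addr) {
        std::vector<Addr> prefetches;
        auto it = table_.find(loop_pc);
        if (it == table_.end() || !it->second.recording) return prefetches;
        CbwsEntry &e = it->second;
        const Addr block = addr >> kBlockShift;
        if (!e.iter_started) {
            if (e.have_prev) {
                const int64_t stride = static_cast<int64_t>(block) -
                                       static_cast<int64_t>(e.prev_base);
                for (int64_t off : e.prev_offsets)
                    prefetches.push_back(static_cast<Addr>(
                        static_cast<int64_t>(block) + stride + off));
            }
            e.base = block;
            e.iter_started = true;
        }
        e.current.push_back(static_cast<int64_t>(block) -
                            static_cast<int64_t>(e.base));
        return prefetches;
    }

private:
    std::unordered_map<Addr, CbwsEntry> table_;
};

Under this sketch, a real implementation would bound the per-loop offset storage and hand its predictions to an underlying prefetcher (such as SMS) rather than issue them directly.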


Citations
Journal ArticleDOI

A Survey of Recent Prefetching Techniques for Processor Caches

TL;DR: This article surveys several recent techniques that aim to improve the implementation and effectiveness of prefetching, and characterizes the techniques on several parameters to highlight their similarities and differences.
Proceedings ArticleDOI

Domino Temporal Data Prefetcher

TL;DR: This work identifies the lookup mechanism of existing temporal prefetchers as responsible for the large gap between what they offer and the opportunity, and proposes a practical design for the Domino prefetcher, which employs an Enhanced Index Table indexed by just a single miss address.
Proceedings ArticleDOI

Translation-Triggered Prefetching

TL;DR: TEMPO is a low-overhead hardware mechanism that boosts memory performance by exploiting the operating system's (OS) virtual memory subsystem, using page table walks that go to DRAM to trigger prefetches.
Proceedings ArticleDOI

Graph Prefetching Using Data Structure Knowledge

TL;DR: This paper presents the design of an explicitly configured prefetcher that improves performance for breadth-first searches and sequential iteration over the efficient and commonly used compressed sparse row (CSR) graph format by snooping L1 cache accesses from the core and reacting to data returned by its own prefetches.
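
For context, the access pattern that such a data-structure-aware prefetcher targets looks roughly like the loop below. The array names and the per-vertex computation are hypothetical, but the two-level indirection (row offsets into an edge array, then into per-vertex data) is the standard compressed sparse row layout whose data-dependent addresses defeat simple stride prefetchers.

// Sketch of a CSR traversal with the indirection pattern a graph prefetcher
// targets. Array names and the computation are hypothetical examples.
#include <cstdint>
#include <vector>

void bfs_like_scan(const std::vector<uint32_t> &row_offsets,  // |V|+1 entries
                   const std::vector<uint32_t> &edges,        // |E| neighbor ids
                   const std::vector<float>    &vertex_data,  // per-vertex payload
                   std::vector<float>          &out) {        // sized to |V| by caller
    const uint32_t num_vertices = static_cast<uint32_t>(row_offsets.size()) - 1;
    for (uint32_t v = 0; v < num_vertices; ++v) {
        float acc = 0.0f;
        // A configured prefetcher can read row_offsets[v + 1] ahead of time and
        // issue prefetches for edges[...] and for vertex_data[edges[...]].
        for (uint32_t i = row_offsets[v]; i < row_offsets[v + 1]; ++i)
            acc += vertex_data[edges[i]];                      // data-dependent access
        out[v] = acc;
    }
}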
Proceedings ArticleDOI

A case for richer cross-layer abstractions: bridging the semantic gap with expressive memory

TL;DR: The benefits of XMem are demonstrated using two use cases: improving the performance portability of software-based cache optimization by expressing the semantics of data locality in the optimization, and improving the performance of OS-based page placement in DRAM by leveraging the semantics of data structures and their access properties.
References
Proceedings ArticleDOI

LLVM: a compilation framework for lifelong program analysis & transformation

TL;DR: The design of the LLVM representation and compiler framework is evaluated in three ways: the size and effectiveness of the representation, including the type information it provides; compiler performance for several interprocedural problems; and illustrative examples of the benefits LLVM provides for several challenging compiler problems.
Proceedings ArticleDOI

The SPLASH-2 programs: characterization and methodological considerations

TL;DR: This paper quantitatively characterizes the SPLASH-2 programs in terms of fundamental properties and architectural interactions that are important to understand them well, including the computational load balance, communication-to-computation ratio and traffic needs, important working set sizes, and issues related to spatial locality.
Proceedings ArticleDOI

The PARSEC benchmark suite: characterization and architectural implications

TL;DR: This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs), and shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic.
Proceedings ArticleDOI

Rodinia: A benchmark suite for heterogeneous computing

TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.