Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

01 May 1990 - Vol. 18, pp. 364-373
TL;DR: In this article, hardware techniques to improve cache performance are presented: a small fully-associative cache is placed between a cache and its refill path to capture misses, and prefetched data is placed in separate stream buffers rather than in the cache.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches.

Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one cycle miss penalty, as opposed to a many cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches.

Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching.

Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams.

Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
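The interplay of the three mechanisms is easiest to see in miniature. Below is a minimal sketch in Python (all sizes and names are illustrative, and the victim cache replaces FIFO rather than LRU for brevity; this is not the paper's exact design) of a direct-mapped cache backed by a small victim cache and a single stream buffer:

    from collections import deque

    LINE = 32      # bytes per cache line (illustrative)
    SETS = 1024    # number of direct-mapped sets (illustrative)

    class VictimCacheSim:
        """Toy model: direct-mapped L1, small fully-associative victim
        cache, and one stream buffer started on each true miss."""

        def __init__(self, victim_entries=4, stream_depth=4):
            self.l1 = [None] * SETS                     # one tag per set
            self.victim = deque(maxlen=victim_entries)  # (set, tag) victims
            self.stream = deque(maxlen=stream_depth)    # prefetched line numbers

        def access(self, addr):
            line = addr // LINE
            s, tag = line % SETS, line // SETS
            if self.l1[s] == tag:
                return "hit"
            if (s, tag) in self.victim:           # one-cycle penalty: swap the
                self.victim.remove((s, tag))      # victim entry with the L1 line
                self.victim.append((s, self.l1[s]))
                self.l1[s] = tag
                return "victim hit"
            if line in self.stream:               # line was prefetched; a real
                self.stream.remove(line)          # buffer would also top itself up
                self._fill(s, tag)
                return "stream hit"
            self._fill(s, tag)                    # true miss: restart the stream
            self.stream.clear()                   # buffer at the next line
            self.stream.extend(range(line + 1, line + 1 + self.stream.maxlen))
            return "miss"

        def _fill(self, s, tag):
            if self.l1[s] is not None:                # victim caching: keep the
                self.victim.append((s, self.l1[s]))   # evicted line, not the new one
            self.l1[s] = tag

    sim = VictimCacheSim()
    # A conflicting address pair: two cold misses, then victim-cache hits.
    print([sim.access(a) for a in (0, 32768, 0, 32768)])
    # -> ['miss', 'miss', 'victim hit', 'victim hit']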


Citations
Proceedings ArticleDOI
01 Apr 1994
TL;DR: Two-level exclusive caching improves the performance of two-level caching organizations by increasing the effective associativity and capacity.
Abstract: The performance of two-level on-chip caching is investigated for a range of technology and architecture assumptions. The area and access time of each level of cache is modeled in detail. The results indicate that for most workloads, two-level cache configurations (with a set-associative second level) perform marginally better than single-level cache configurations that require the same chip area once the first-level cache sizes are 64KB or larger. Two-level configurations become even more important in systems with no off-chip cache and in systems in which the memory cells in the first-level caches are multiported and hence larger than those in the second-level cache. Finally, a new replacement policy called two-level exclusive caching is introduced. Two-level exclusive caching improves the performance of two-level caching organizations by increasing the effective associativity and capacity.

195 citations
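The exclusion mechanism itself reduces to a swap. A minimal sketch (illustrative Python; the tag maps and set indices are assumptions, not the paper's structures):

    def exclusive_fill(l1, l2, s1, s2, tag):
        """Two-level exclusive caching (sketch): on an L1 miss that hits in
        L2, the two lines trade places so no line occupies both levels at
        once, making the effective capacity roughly |L1| + |L2|. 'l1' and
        'l2' map set index -> tag."""
        assert l2[s2] == tag             # the requested line is resident in L2
        l2[s2], l1[s1] = l1[s1], tag     # swap: L1's victim down, target up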

Proceedings ArticleDOI
01 May 1997
TL;DR: A technique for dynamic analysis of program data access behavior is presented, which is then used to proactively guide the placement of data within the cache hierarchy in a location-sensitive manner and is fully compatible with existing Instruction Set Architectures.
Abstract: Improvements in main memory speeds have not kept pace with increasing processor clock frequency and improved exploitation of instruction-level parallelism. Consequently, the gap between processor and main memory performance is expected to grow, increasing the number of execution cycles spent waiting for memory accesses to complete. One solution to this growing problem is to reduce the number of cache misses by increasing the effectiveness of the cache hierarchy. In this paper we present a technique for dynamic analysis of program data access behavior, which is then used to proactively guide the placement of data within the cache hierarchy in a location-sensitive manner. We introduce the concept of a macroblock, which allows us to feasibly characterize the memory locations accessed by a program, and a Memory Address Table, which performs the dynamic reference analysis. Our technique is fully compatible with existing Instruction Set Architectures. Results from detailed simulations of several integer programs show significant speedups.

192 citations
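As a rough illustration of the idea (the names, the 1 KB macroblock size, and the threshold are hypothetical, not the paper's parameters), the dynamic analysis amounts to counting references at macroblock granularity so the table stays small enough to be feasible:

    MACROBLOCK = 1024   # bytes per macroblock (illustrative size)

    mat = {}            # Memory Address Table: macroblock id -> access count

    def record_access(addr):
        # Coarsening addresses to macroblocks keeps the table compact.
        mb = addr // MACROBLOCK
        mat[mb] = mat.get(mb, 0) + 1

    def is_hot(addr, threshold=64):
        # Placement hint: frequently touched macroblocks could be steered
        # toward the closer cache level (the paper's policy details differ).
        return mat.get(addr // MACROBLOCK, 0) >= threshold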

Proceedings ArticleDOI
01 May 2003
TL;DR: The GRP hardware-software collaboration combines the accuracy of compiler-based program analysis with the performance potential of aggressive hardware prefetching, bringing the performance gap versus a perfect L2 cache to under 20%.
Abstract: Despite large caches, main-memory access latencies still cause significant performance losses in many applications. Numerous hardware and software prefetching schemes have been proposed to tolerate these latencies. Software prefetching typically provides better prefetch accuracy than hardware, but is limited by prefetch instruction overheads and the compiler's limited ability to schedule prefetches sufficiently far in advance to cover level-two cache miss latencies. Hardware prefetching can be effective at hiding these large latencies, but generates many useless prefetches and consumes considerable memory bandwidth. In this paper, we propose a cooperative hardware-software prefetching scheme called Guided Region Prefetching (GRP), which uses compiler-generated hints encoded in load instructions to regulate an aggressive hardware prefetching engine. We compare GRP against a sophisticated pure hardware stride prefetcher and a scheduled region prefetching (SRP) engine. SRP and GRP show the best performance, with respective 22% and 21% gains over no prefetching, but SRP incurs 180% extra memory traffic, nearly tripling bandwidth requirements. GRP achieves performance close to SRP, but with a mere eighth of the extra prefetching traffic, a 23% increase over no prefetching. The GRP hardware-software collaboration thus combines the accuracy of compiler-based program analysis with the performance potential of aggressive hardware prefetching, bringing the performance gap versus a perfect L2 cache to under 20%.

188 citations
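One possible reading of the gating mechanism in miniature (illustrative Python; the hint encoding, region size, and function names are assumptions, not the paper's ISA encoding):

    def on_load(pc, addr, hinted_pcs, prefetch_queue, region_lines=8, line=64):
        """Guided Region Prefetching (sketch): 'hinted_pcs' stands in for
        the compiler-generated hints carried by load instructions; only
        hinted loads trigger the aggressive region prefetch, which is how
        GRP curbs useless prefetch traffic."""
        if pc in hinted_pcs:
            base = (addr // line) * line
            for i in range(1, region_lines + 1):
                prefetch_queue.append(base + i * line)  # rest of the region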

Journal ArticleDOI
01 Nov 1994
TL;DR: Using trace-driven simulation of applications and the operating system, it is shown that a CML buffer enables a large direct-mapped cache to perform nearly as well as a two-way set associative cache of equivalent size and speed, although with lower hardware cost and complexity.
Abstract: This paper describes a method for improving the performance of a large direct-mapped cache by reducing the number of conflict misses. Our solution consists of two components: an inexpensive hardware device called a Cache Miss Lookaside (CML) buffer that detects conflicts by recording and summarizing a history of cache misses, and a software policy within the operating system's virtual memory system that removes conflicts by dynamically remapping pages whenever large numbers of conflict misses are detected. Using trace-driven simulation of applications and the operating system, we show that a CML buffer enables a large direct-mapped cache to perform nearly as well as a two-way set associative cache of equivalent size and speed, although with lower hardware cost and complexity.

187 citations
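In outline, the hardware/software split could look like the following sketch (granularity, threshold, and the OS hook are illustrative assumptions; the paper's CML buffer is a hardware structure with its own summarization scheme):

    from collections import Counter

    PAGE = 4096
    miss_counts = Counter()   # CML-style summary: page number -> miss count
    remap_requests = []       # pages flagged for the OS to recolor/remap

    def note_miss(addr, threshold=128):
        """Record and summarize miss history; once a page's misses cross a
        threshold, ask the virtual memory system to remap it so the
        conflict disappears."""
        page = addr // PAGE
        miss_counts[page] += 1
        if miss_counts[page] >= threshold:
            miss_counts[page] = 0
            remap_requests.append(page)   # stand-in for the OS remap policy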

Proceedings ArticleDOI
01 Feb 2018
TL;DR: OuterSPACE is a highly-scalable, energy-efficient, reconfigurable design, consisting of massively parallel Single Program, Multiple Data (SPMD)-style processing units, distributed memories, high-speed crossbars and High Bandwidth Memory (HBM).
Abstract: Sparse matrices are widely used in graph and data analytics, machine learning, engineering and scientific applications. This paper describes and analyzes OuterSPACE, an accelerator targeted at applications that involve large sparse matrices. OuterSPACE is a highly-scalable, energy-efficient, reconfigurable design, consisting of massively parallel Single Program, Multiple Data (SPMD)-style processing units, distributed memories, high-speed crossbars and High Bandwidth Memory (HBM). We identify redundant memory accesses to non-zeros as a key bottleneck in traditional sparse matrix-matrix multiplication algorithms. To ameliorate this, we implement an outer product based matrix multiplication technique that eliminates redundant accesses by decoupling multiplication from accumulation. We demonstrate that traditional architectures, due to limitations in their memory hierarchies and ability to harness parallelism in the algorithm, are unable to take advantage of this reduction without incurring significant overheads. OuterSPACE is designed to specifically overcome these challenges. We simulate the key components of our architecture using gem5 on a diverse set of matrices from the University of Florida's SuiteSparse collection and the Stanford Network Analysis Project and show a mean speedup of 7.9× over Intel Math Kernel Library on a Xeon CPU, 13.0× against cuSPARSE and 14.0× against CUSP when run on an NVIDIA K40 GPU, while achieving an average throughput of 2.9 GFLOPS within a 24 W power budget in an area of 87 mm².

186 citations
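The core algorithmic idea, forming C as a sum of column-times-row outer products with accumulation deferred to a separate phase, fits in a few lines (illustrative Python with dict-based sparse formats; OuterSPACE itself realizes this in hardware):

    def outer_product_spmm(A_cols, B_rows):
        """Outer-product sparse matmul (sketch): C = sum over k of
        A[:,k] outer B[k,:]. Each nonzero is read once in the multiply
        phase; partial products are merged in a separate accumulate phase,
        which is the decoupling the paper exploits. Formats are
        illustrative: A_cols[k] = {i: value}, B_rows[k] = {j: value}."""
        partials = []                                  # multiply phase
        for k in A_cols.keys() & B_rows.keys():
            for i, a in A_cols[k].items():
                for j, b in B_rows[k].items():
                    partials.append((i, j, a * b))
        C = {}                                         # accumulate phase
        for i, j, v in partials:
            C[(i, j)] = C.get((i, j), 0) + v
        return C

    A_cols = {0: {0: 1.0, 2: 2.0}}               # A's nonzeros, by column
    B_rows = {0: {1: 3.0}}                       # B's nonzeros, by row
    print(outer_product_spmm(A_cols, B_rows))    # {(0, 1): 3.0, (2, 1): 6.0}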

References
Journal ArticleDOI
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: … design issues. Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.

1,614 citations
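One of the listed design parameters, line size together with cache size, fixes how an address decomposes for a direct-mapped cache. A small worked example (illustrative numbers, not drawn from the survey):

    def split_address(addr, line_bytes=32, num_sets=2048):
        """Illustrative arithmetic: a 64 KB direct-mapped cache with
        32-byte lines has 2048 sets, so an address splits into an offset
        (5 bits), an index (11 bits), and the tag (the remaining bits)."""
        offset = addr % line_bytes
        index  = (addr // line_bytes) % num_sets
        tag    = addr // (line_bytes * num_sets)
        return tag, index, offset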

01 Jan 1990
TL;DR: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems to conclude that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
Abstract: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems. The overall conclusion is that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.

467 citations

Journal ArticleDOI
01 Apr 1989
TL;DR: A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism; the average degree of superpipelining metric is introduced, and simulations suggest that this metric is already high for many machines.
Abstract: Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.

316 citations
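Assuming the metric is the dynamic-frequency-weighted mean operation latency in cycles (a paraphrase of the paper's definition, not its exact formulation), a toy calculation looks like this; the instruction mix below is invented purely for illustration:

    # Average degree of superpipelining (sketch): frequency-weighted mean
    # operation latency. The mix is made up, not measured data.
    mix = {                  # op: (dynamic frequency, latency in cycles)
        "alu":    (0.40, 1),
        "load":   (0.25, 2),
        "branch": (0.20, 2),
        "fp":     (0.15, 3),
    }
    degree = sum(f * lat for f, lat in mix.values())
    print(degree)            # 0.4*1 + 0.25*2 + 0.2*2 + 0.15*3 = 1.75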

Journal ArticleDOI
TL;DR: It is shown that prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
Abstract: Memory transfers due to a cache miss are costly. Prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.

315 citations
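"Prefetching all memory references" amounts to a prefetch-always policy: every access also fetches the sequentially next line, so linear scans rarely stall. A minimal sketch (illustrative Python, with the cache reduced to a set of resident line addresses):

    def access_with_prefetch(addr, cache_lines, line=64):
        """Prefetch-always (sketch): each reference demand-fetches its own
        line and unconditionally prefetches the next one, overlapping the
        transfer with CPU execution. Returns whether the access missed."""
        this_line = (addr // line) * line
        missed = this_line not in cache_lines
        cache_lines.add(this_line)           # demand fetch on a miss
        cache_lines.add(this_line + line)    # next-line prefetch
        return missed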

Proceedings ArticleDOI
17 May 1988
TL;DR: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies and a new inclusion-coherence mechanism for two-level bus-based architectures is proposed.
Abstract: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative caches which allow different block sizes at different levels of the hierarchy. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads us to propose a new inclusion-coherence mechanism for two-level bus-based architectures.

236 citations
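Enforcing inclusion reduces to one invariant, shown below in a minimal sketch (illustrative Python sets of line addresses): whenever the second level evicts a line, the first level must drop its copy too.

    def l2_evict(victim_line, l1, l2):
        """Multilevel inclusion (sketch): L2 must hold a superset of L1, so
        an L2 eviction back-invalidates any copy in L1. This invariant is
        what lets coherence traffic be filtered at L2 instead of probing L1."""
        l2.discard(victim_line)
        l1.discard(victim_line)   # preserve the subset invariant: l1 <= l2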