Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

01 May 1990-Vol. 18, pp 364-373
TL;DR: In this article, hardware techniques to improve the performance of caches are presented: miss and victim caching place a small fully-associative cache between a cache and its refill path, and stream buffers place prefetched data in a separate buffer rather than in the cache.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches. Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one cycle miss penalty, as opposed to a many cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches. Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching. Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams. Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
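
To make the miss/victim-caching idea above concrete, here is a minimal Python sketch of a direct-mapped first-level cache backed by a small fully-associative victim cache; stream buffers are omitted. All sizes, the FIFO victim replacement, and the class and variable names are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a direct-mapped L1 backed by a small fully-associative
# victim cache (miss caching would insert the requested line instead of the
# victim). Sizes, the FIFO victim replacement, and all names are illustrative
# assumptions, not details taken from the paper.
from collections import OrderedDict

class VictimCachedL1:
    def __init__(self, num_sets=64, line_bytes=32, victim_entries=4):
        self.num_sets = num_sets
        self.line_bytes = line_bytes
        self.lines = [None] * num_sets              # direct-mapped: one line per set
        self.victim = OrderedDict()                 # small fully-associative victim cache
        self.victim_entries = victim_entries
        self.hits = self.victim_hits = self.misses = 0

    def access(self, addr):
        line = addr // self.line_bytes              # block (line) address
        index = line % self.num_sets
        if self.lines[index] == line:               # ordinary L1 hit
            self.hits += 1
            return "hit"
        if line in self.victim:                     # one-cycle victim-cache hit:
            self.victim_hits += 1                   # swap the lines between L1 and victim
            del self.victim[line]
            self._evict_to_victim(index)
            self.lines[index] = line
            return "victim-hit"
        self.misses += 1                            # true miss: refill from the next level;
        self._evict_to_victim(index)                # the displaced line becomes the victim
        self.lines[index] = line
        return "miss"

    def _evict_to_victim(self, index):
        old = self.lines[index]
        if old is None:
            return
        self.victim[old] = True
        if len(self.victim) > self.victim_entries:
            self.victim.popitem(last=False)         # drop the oldest victim entry

# Two addresses that map to the same direct-mapped set now mostly hit in the victim cache.
c = VictimCachedL1()
for _ in range(100):
    c.access(0x0000)
    c.access(0x0000 + 64 * 32)                      # same index, different tag
print("L1 hits:", c.hits, "victim hits:", c.victim_hits, "misses:", c.misses)
```

After the first pass, the two conflicting lines keep swapping through the victim cache and turn into one-cycle victim hits, which is exactly the class of mapping-conflict miss the paper targets.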


Citations
Proceedings ArticleDOI
01 Jun 2004
TL;DR: This paper explores and quantifies garbage collection behavior for three whole heap collectors and generational counterparts: copying semi-space, mark-sweep, and reference counting, the canonical algorithms from which essentially all other collection algorithms are derived.
Abstract: This paper explores and quantifies garbage collection behavior for three whole heap collectors and generational counterparts: copying semi-space, mark-sweep, and reference counting, the canonical algorithms from which essentially all other collection algorithms are derived. Efficient implementations in MMTk, a Java memory management toolkit, in IBM's Jikes RVM share all common mechanisms to provide a clean experimental platform. Instrumentation separates collector and program behavior, and performance counters measure timing and memory behavior on three architectures. Our experimental design reveals key algorithmic features and how they match program characteristics to explain the direct and indirect costs of garbage collection as a function of heap size on the SPEC JVM benchmarks. For example, we find that the contiguous allocation of copying collectors attains significant locality benefits over free-list allocators. The reduced collection costs of the generational algorithms together with the locality benefit of contiguous allocation motivates a copying nursery for newly allocated objects. These benefits dominate the overheads of generational collectors compared with non-generational and no collection, disputing the myth that "no garbage collection is good garbage collection." Performance is less sensitive to the mature space collection algorithm in our benchmarks. However, the locality and pointer mutation characteristics for a given program occasionally prefer copying or mark-sweep. This study is unique in its breadth of garbage collection algorithms and its depth of analysis.
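
The locality claim for contiguous allocation is easy to illustrate: a bump-pointer allocator (as in a copying nursery) hands out adjacent addresses to objects allocated together, while a free-list allocator may scatter them. The sketch below uses invented addresses and is only a toy model of that effect, not MMTk or Jikes RVM code.

```python
# Toy illustration of the locality argument: bump-pointer (contiguous)
# allocation keeps objects allocated together at adjacent addresses, while a
# free-list allocator may scatter them. Addresses and sizes are simulated
# integers, not a real heap.

class BumpAllocator:
    """Contiguous allocation, as in a copying nursery."""
    def __init__(self):
        self.cursor = 0
    def alloc(self, size):
        addr = self.cursor
        self.cursor += size
        return addr

class FreeListAllocator:
    """Size-segregated free list seeded with scattered free blocks."""
    def __init__(self, free_blocks):
        self.free = dict(free_blocks)       # size -> list of free addresses
    def alloc(self, size):
        return self.free[size].pop()

def spread(addrs):
    """Rough locality proxy: address span touched by consecutive allocations."""
    return max(addrs) - min(addrs)

bump = BumpAllocator()
free_list = FreeListAllocator({16: [4096, 64, 8192, 1024, 32768, 256, 65536, 512]})

bump_addrs = [bump.alloc(16) for _ in range(8)]
list_addrs = [free_list.alloc(16) for _ in range(8)]

print("bump-pointer span:", spread(bump_addrs))   # 112 bytes: a few cache lines
print("free-list span:   ", spread(list_addrs))   # ~64 KB: poor spatial locality
```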

248 citations

Proceedings ArticleDOI
01 Apr 1994
TL;DR: Qualitative comparisons indicate that both schemes are able to reduce cache misses in the domain of linear array references, and an approach combining software and hardware schemes is proposed; it shows promise in reducing the memory latency with the least overhead.
Abstract: Prefetching, i.e., exploiting the overlap of processor computations with data accesses, is one of several approaches for tolerating memory latencies. Prefetching can be either hardware-based or software-directed or a combination of both. Hardware-based prefetching, requiring some support unit connected to the cache, can dynamically handle prefetches at run-time without compiler intervention. Software-directed approaches rely on compiler technology to insert explicit prefetch instructions. Mowry et al.'s software scheme [13, 14] and our hardware approach [1] are two representative schemes. In this paper, we evaluate approximations to these two schemes in the context of a shared-memory multiprocessor environment. Our qualitative comparisons indicate that both schemes are able to reduce cache misses in the domain of linear array references. When complex data access patterns are considered, the software approach has compile-time information to perform sophisticated prefetching, whereas the hardware scheme has the advantage of manipulating dynamic information. The performance results from an instruction-level simulation of four benchmarks confirm these observations. Our simulations show that the hardware scheme introduces more memory traffic into the network and that the software scheme introduces a non-negligible instruction execution overhead. An approach combining software and hardware schemes is proposed; it shows promise in reducing the memory latency with the least overhead.
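
As a rough illustration of the hardware/software contrast described above, the toy trace-driven model below compares a one-block-lookahead hardware prefetcher with compiler-style prefetch calls issued a fixed distance ahead of each use. The line size, prefetch distance, and the simple sequential access pattern are assumptions for illustration only; this is not the simulator compared in the paper.

```python
# Toy model of the two prefetching styles on a linear array walk: the
# "hardware" side fetches the next sequential line whenever a new line is
# entered (one-block lookahead), while the "software" side models compiler-
# inserted prefetch instructions issued two lines ahead of the use.

LINE = 16          # words per cache line (assumed)
N = 4096           # length of the array walk

class ToyCache:
    def __init__(self):
        self.resident = set()          # resident line numbers
        self.misses = 0
        self.prefetch_fetches = 0
    def load(self, word):              # demand access
        if word // LINE not in self.resident:
            self.misses += 1
            self.resident.add(word // LINE)
    def prefetch(self, word):          # fetch without a demand miss
        if word // LINE not in self.resident:
            self.prefetch_fetches += 1
            self.resident.add(word // LINE)

hw, sw = ToyCache(), ToyCache()

# Hardware-based: a support unit watches accesses and fetches the next line.
for i in range(N):
    hw.load(i)
    if i % LINE == 0:
        hw.prefetch(i + LINE)          # one-block sequential lookahead

# Software-directed: an explicit prefetch instruction two lines ahead of the use.
prefetch_instructions = 0
for i in range(N):
    sw.prefetch(i + 2 * LINE)          # models the compiler-inserted prefetch
    prefetch_instructions += 1         # counts the instruction-execution overhead
    sw.load(i)

print("hardware: demand misses", hw.misses, "prefetch fetches", hw.prefetch_fetches)
print("software: demand misses", sw.misses, "prefetch fetches", sw.prefetch_fetches,
      "executed prefetch instructions", prefetch_instructions)
```

Both toy schemes remove nearly all misses on the linear walk, and the software loop additionally executes one prefetch instruction per element, which is the kind of instruction-execution overhead the abstract refers to.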

238 citations

Proceedings ArticleDOI
01 May 1996
TL;DR: It is shown that processor-memory integration can be used to build competitive, scalable and cost-effective MP systems; results from execution-driven uni- and multi-processor simulations show that the benefits of lower latency and higher bandwidth can compensate for the restrictions on the size and complexity of the integrated processor.
Abstract: Current high performance computer systems use complex, large superscalar CPUs that interface to the main memory through a hierarchy of caches and interconnect systems. These CPU-centric designs invest a lot of power and chip area to bridge the widening gap between CPU and main memory speeds. Yet, many large applications do not operate well on these systems and are limited by the memory subsystem performance. This paper argues for an integrated system approach that uses less-powerful CPUs that are tightly integrated with advanced memory technologies to build competitive systems with greatly reduced cost and complexity. Based on a design study using the next generation 0.25µm, 256Mbit dynamic random-access memory (DRAM) process and on the analysis of existing machines, we show that processor-memory integration can be used to build competitive, scalable and cost-effective MP systems. We present results from execution-driven uni- and multi-processor simulations showing that the benefits of lower latency and higher bandwidth can compensate for the restrictions on the size and complexity of the integrated processor. In this system, small direct-mapped instruction caches with long lines are very effective, as are column buffer data caches augmented with a victim cache.

235 citations

Proceedings ArticleDOI
01 May 1993
TL;DR: Tradeoffs on writes that miss in the cache are investigated, and a mixture of write-through and write-back caching, called write caching, is proposed; it places a small fully-associative cache behind a write-through cache.
Abstract: This paper investigates issues involving writes and caches. First, tradeoffs on writes that miss in the cache are investigated. In particular, whether the missed cache block is fetched on a write miss, whether the missed cache block is allocated in the cache, and whether the cache line is written before hit or miss is known are considered. Depending on the combination of these policies chosen, the entire cache miss rate can vary by a factor of two on some applications. The combination of no-fetch-on-write and write-allocate can provide better performance than cache line allocation instructions. Second, tradeoffs between write-through and write-back caching when writes hit in a cache are considered. A mixture of these two alternatives, called write caching, is proposed. Write caching places a small fully-associative cache behind a write-through cache. A write cache can eliminate almost as much write traffic as a write-back cache.
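
The write-caching idea can be sketched in a few lines: a small fully-associative buffer behind a write-through cache absorbs repeated writes to the same lines, so only evictions (and a final flush) generate traffic to the next level. The entry count, line size, LRU policy, and names below are illustrative assumptions, not the paper's design.

```python
# Sketch of write caching as described above: a small fully-associative buffer
# behind a write-through cache coalesces repeated writes to the same lines.
# Entry count, line size, and the LRU policy are illustrative assumptions.
from collections import OrderedDict

class WriteCache:
    def __init__(self, entries=4, line_bytes=32):
        self.entries = entries
        self.line_bytes = line_bytes
        self.buf = OrderedDict()                    # line -> dirty (data elided)
        self.writes_to_next_level = 0

    def write(self, addr):
        line = addr // self.line_bytes
        if line in self.buf:                        # coalesce: no new traffic
            self.buf.move_to_end(line)
            return
        self.buf[line] = True
        if len(self.buf) > self.entries:
            self.buf.popitem(last=False)            # evict the LRU line...
            self.writes_to_next_level += 1          # ...as one write to the next level

    def flush(self):
        self.writes_to_next_level += len(self.buf)  # drain the remaining dirty lines
        self.buf.clear()

wc = WriteCache(entries=4)
writes = [0, 8, 16, 24, 0, 8, 40, 48, 0, 8] * 100   # hot lines rewritten repeatedly
for a in writes:
    wc.write(a)
wc.flush()
print("write traffic with write cache:       ", wc.writes_to_next_level)
print("write traffic with pure write-through:", len(writes))
```

On this (deliberately write-heavy) trace the write cache collapses a thousand write-through transactions into a handful of write-backs, which is the traffic-elimination effect the abstract describes.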

234 citations

Proceedings ArticleDOI
01 May 2000
TL;DR: A practical, fully associative, software-managed secondary cache system that provides performance competitive with or superior to traditional caches without OS or application involvement is presented.
Abstract: As DRAM access latencies approach a thousand instruction-execution times and on-chip caches grow to multiple megabytes, it is not clear that conventional cache structures continue to be appropriate. Two key features, full associativity and software management, have been used successfully in the virtual-memory domain to cope with disk access latencies. Future systems will need to employ similar techniques to deal with DRAM latencies. This paper presents a practical, fully associative, software-managed secondary cache system that provides performance competitive with or superior to traditional caches without OS or application involvement. We see this structure as the first step toward OS- and application-aware management of large on-chip caches. This paper has two primary contributions: a practical design for a fully associative memory structure, the indirect index cache (IIC), and a novel replacement algorithm, generational replacement, that is specifically designed to work with the IIC. We analyze the behavior of an IIC with generational replacement as a drop-in, transparent substitute for a conventional secondary cache. We achieve miss rate reductions from 8% to 85% relative to a 4-way associative LRU organization, matching or beating a (practically infeasible) fully associative true LRU cache. Incorporating these miss rates into a rudimentary timing model indicates that the IIC/generational replacement cache could be competitive with a conventional cache at today's DRAM latencies, and will outperform a conventional cache as these CPU-relative latencies grow.
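
The two contributions named above can be caricatured briefly: a hash table from block tags to frames gives full associativity through indirection (the indirect-index idea), and pools that promote recently referenced blocks approximate generational replacement. The real IIC and its replacement algorithm are substantially more elaborate; the frame count, pool count, and names below are assumptions for illustration.

```python
# Caricature of the IIC ideas: tag-to-frame lookup via a hash table (full
# associativity through indirection) and a pool-based, generational-flavoured
# replacement policy that promotes referenced blocks and evicts from the
# oldest, unreferenced ones. Parameters and policy details are assumptions.
from collections import deque

class TinyIIC:
    def __init__(self, frames=8, generations=3):
        self.capacity = frames
        self.tag_table = {}                              # tag -> (generation, referenced bit)
        self.pools = [deque() for _ in range(generations)]

    def access(self, tag):
        if tag in self.tag_table:
            gen, _ = self.tag_table[tag]
            self.tag_table[tag] = (gen, True)            # set the reference bit; it is
            return "hit"                                 # consulted at replacement time
        if len(self.tag_table) >= self.capacity:
            self._replace()
        self.tag_table[tag] = (0, False)                 # new blocks enter generation 0
        self.pools[0].append(tag)
        return "miss"

    def _replace(self):
        while True:
            gen = next(g for g, pool in enumerate(self.pools) if pool)
            tag = self.pools[gen].popleft()              # oldest block in the lowest pool
            _, referenced = self.tag_table[tag]
            if referenced:
                new_gen = min(gen + 1, len(self.pools) - 1)
                self.tag_table[tag] = (new_gen, False)   # promote and clear the bit
                self.pools[new_gen].append(tag)
            else:
                del self.tag_table[tag]                  # evict an unreferenced block
                return

cache = TinyIIC(frames=4)
trace = ["A", "B", "A", "C", "A", "D", "A", "E", "A", "F", "A"]
print([cache.access(t) for t in trace])                  # "A" stays resident; one-shot
                                                         # blocks are cycled out
```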

224 citations

References
Journal ArticleDOI
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: design issues. Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.
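
Several of the surveyed parameters (line size, associativity, replacement policy, demand fetch) can be explored with a tiny trace-driven simulator such as the sketch below. It models only set-associative placement with LRU replacement and demand fetching, with invented sizes and trace; it is an illustrative toy, not the survey's methodology.

```python
# Tiny trace-driven miss-ratio sketch: set-associative placement, LRU
# replacement, demand fetch. Cache size, line size, and the trace are invented.
from collections import OrderedDict

def miss_ratio(trace, cache_bytes=1024, line_bytes=32, ways=1):
    num_sets = cache_bytes // (line_bytes * ways)
    sets = [OrderedDict() for _ in range(num_sets)]    # per-set LRU order of tags
    misses = 0
    for addr in trace:
        line = addr // line_bytes
        s, tag = sets[line % num_sets], line // num_sets
        if tag in s:
            s.move_to_end(tag)                         # LRU update on a hit
        else:
            misses += 1                                # demand fetch on a miss
            s[tag] = True
            if len(s) > ways:
                s.popitem(last=False)                  # evict the LRU way
    return misses / len(trace)

# Example: two addresses that thrash a direct-mapped cache but not a 2-way one.
trace = [a for _ in range(50) for a in (0, 1024, 0, 1024)]
print("direct-mapped miss ratio:", miss_ratio(trace, ways=1))
print("2-way LRU miss ratio:    ", miss_ratio(trace, ways=2))
```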

1,614 citations

01 Jan 1990
TL;DR: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems to conclude that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
Abstract: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems. The overall conclusion is that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.

467 citations

Journal ArticleDOI
01 Apr 1989
TL;DR: A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism; the average degree of superpipelining metric is introduced, and the simulations suggest that this metric is already high for many machines.
Abstract: Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.

316 citations

Journal ArticleDOI
TL;DR: It is shown that prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
Abstract: Memory transfers due to a cache miss are costly. Prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
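
A back-of-the-envelope calculation, using assumed CPI numbers rather than anything from the paper, shows how hiding most miss-transfer time behind prefetches can translate into an effective speed increase of the quoted magnitude.

```python
# Illustrative arithmetic only (assumed numbers, not the paper's model): if a
# machine spends part of its time stalled on cache-miss memory transfers,
# hiding most of those transfers behind prefetches raises effective CPU speed.
base_cpi = 1.0            # assumed cycles per instruction with no miss stalls
miss_stall_cpi = 0.3      # assumed extra CPI spent waiting on memory transfers
prefetch_coverage = 0.8   # assumed fraction of stall time hidden by prefetching

cpi_no_prefetch = base_cpi + miss_stall_cpi
cpi_prefetch = base_cpi + miss_stall_cpi * (1 - prefetch_coverage)
print("effective speedup: %.0f%%" % (100 * (cpi_no_prefetch / cpi_prefetch - 1)))
# prints "effective speedup: 23%", in the same 10 to 25 percent ballpark
```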

315 citations

Proceedings ArticleDOI
17 May 1988
TL;DR: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies, and a new inclusion-coherence mechanism for two-level bus-based architectures is proposed.
Abstract: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative caches which allow different block sizes at different levels of the hierarchy. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads us to propose a new inclusion-coherence mechanism for two-level bus-based architectures.
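
A minimal sketch of why inclusion needs enforcement: when L1 hits filter references away from L2, L2's LRU state goes stale and it can evict a block that is still hot in L1 unless the eviction back-invalidates the L1 copy. The capacities, LRU policies, and single block size below are simplifying assumptions; the paper's conditions also cover set-associative caches and different block sizes per level.

```python
# Toy two-level hierarchy illustrating the inclusion property: every block in
# L1 must also be in L2, so an L2 eviction must back-invalidate any L1 copy.
# Because L1 hits are filtered away from L2, L2's LRU state can otherwise pick
# a victim that is still hot in L1. Sizes and policies are assumptions.
from collections import OrderedDict

class TwoLevel:
    def __init__(self, l1_entries=2, l2_entries=4, enforce_inclusion=True):
        self.l1, self.l2 = OrderedDict(), OrderedDict()
        self.l1_entries, self.l2_entries = l1_entries, l2_entries
        self.enforce_inclusion = enforce_inclusion

    def access(self, block):
        if block in self.l1:                        # L1 hit: the reference is
            self.l1.move_to_end(block)              # filtered and L2 never sees it
            return
        self._fill(self.l1, self.l1_entries, block)
        if block in self.l2:
            self.l2.move_to_end(block)
        else:
            self._fill(self.l2, self.l2_entries, block)

    def _fill(self, cache, limit, block):
        cache[block] = True
        if len(cache) > limit:
            victim, _ = cache.popitem(last=False)   # evict the LRU block
            if cache is self.l2 and self.enforce_inclusion:
                self.l1.pop(victim, None)           # back-invalidate the L1 copy

    def inclusion_holds(self):
        return set(self.l1) <= set(self.l2)

for enforce in (True, False):
    h = TwoLevel(enforce_inclusion=enforce)
    for b in ["A", "B", "A", "C", "A", "D", "A", "E"]:
        h.access(b)
    print("back-invalidation:", enforce, "-> inclusion holds:", h.inclusion_holds())
```

With back-invalidation the invariant holds; without it, L2 silently evicts the block "A" that L1 keeps hitting, which is exactly the situation that complicates coherence in a multilevel hierarchy.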

236 citations