Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

01 May 1990, Vol. 18, pp. 364-373
TL;DR: In this article, hardware techniques to improve cache performance are presented: a small fully-associative cache placed between a cache and its refill path catches conflict misses, and prefetched data is placed in stream buffers rather than in the cache itself.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches. Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one cycle miss penalty, as opposed to a many cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches. Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching. Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams. Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.
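
The victim cache described above can be illustrated with a short, trace-driven sketch. The code below is not from the paper; the line size, number of sets, victim-cache capacity, and address trace are assumptions chosen only to show how a tiny fully-associative victim cache absorbs the conflict misses of a direct-mapped cache.

    from collections import Counter, OrderedDict

    LINE_SIZE = 16        # bytes per cache line (assumed)
    DM_SETS = 64          # direct-mapped cache with 64 sets (assumed)
    VICTIM_ENTRIES = 4    # small fully-associative victim cache (assumed)

    dm_cache = {}             # set index -> tag
    victim = OrderedDict()    # line address -> None, kept in LRU order

    def access(addr):
        """Classify one load as 'hit', 'victim_hit', or 'miss'."""
        line = addr // LINE_SIZE
        index = line % DM_SETS
        tag = line // DM_SETS

        if dm_cache.get(index) == tag:
            return "hit"

        evicted_tag = dm_cache.get(index)

        # Direct-mapped miss: probe the victim cache for the requested line.
        if line in victim:
            victim.pop(line)
            if evicted_tag is not None:
                victim[evicted_tag * DM_SETS + index] = None   # displaced line becomes the new victim
            dm_cache[index] = tag
            return "victim_hit"    # roughly a one-cycle penalty instead of a full miss

        # Miss in both: fill from the next level, move the victim into the victim cache.
        if evicted_tag is not None:
            victim[evicted_tag * DM_SETS + index] = None
            if len(victim) > VICTIM_ENTRIES:
                victim.popitem(last=False)     # drop the least recently used victim entry
        dm_cache[index] = tag
        return "miss"

    # Two pairs of addresses that collide in the same direct-mapped sets: after the
    # first pass, the victim cache absorbs the conflict misses.
    trace = [a for _ in range(4) for a in (0x0000, 0x4000, 0x0010, 0x4010)]
    print(Counter(access(a) for a in trace))

On this toy trace, every access after the first pass that misses in the direct-mapped cache hits in the victim cache, which is the effect the paper measures on real benchmarks.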


Citations
Proceedings ArticleDOI
17 Sep 2005
TL;DR: This paper presents a new type of profiling analysis called variational path profiling (VPP), which pinpoints exactly where in the program there are potentially significant optimization opportunities for speedup.
Abstract: Current profiling techniques are good at identifying where time is being spent during program execution. These techniques are not as good at pinpointing exactly where in the execution there are definite opportunities a programmer can exploit with optimization. In this paper we present a new type of profiling analysis called variational path profiling (VPP). VPP pinpoints exactly where in the program there are potentially significant optimization opportunities for speedup. VPP finds the acyclic control flow paths that vary the most in execution time (the time it takes to execute each occurrence of the path). This is calculated by sampling the time it takes to execute frequent paths using hardware performance counters. The motivation for concentrating on a path with a high net variation in its execution time is that it can potentially be optimized so that most or all executions of that path have the minimal execution time seen during profiling. We present a profiling and analysis approach to find these variational paths, so that they can be communicated back to a programmer to guide optimization. Our results show that this variation accounts for a significant fraction of overall program execution time, and that a small number of paths account for a large fraction of this variation. By applying straightforward prefetching optimizations to these variational paths we see 8.5% speedups on average.
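
The ranking step this abstract describes can be sketched in a few lines, assuming the profiler has already collected per-occurrence cycle counts for each acyclic path. The dictionary of samples below is hypothetical input, not VPP's actual hardware-counter pipeline.

    # Rank acyclic paths by how much their execution time varies: the potential
    # saving for a path is roughly (observed time - minimum observed time)
    # summed over its occurrences.
    def rank_variational_paths(samples):
        """samples: {path_id: [cycles_per_occurrence, ...]} (hypothetical input)."""
        ranked = []
        for path_id, times in samples.items():
            best = min(times)
            variation = sum(t - best for t in times)   # total cycles above the best case
            ranked.append((variation, path_id))
        return sorted(ranked, reverse=True)

    # Toy data: path "B" varies a lot (e.g. cache misses on some occurrences),
    # path "A" is stable, so "B" is the better optimization target.
    samples = {"A": [100, 101, 99, 100], "B": [80, 400, 85, 390]}
    for variation, path_id in rank_variational_paths(samples):
        print(path_id, "potential saving (cycles):", variation)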

55 citations

Proceedings ArticleDOI
02 Jun 2008
TL;DR: Using HMTT, an initial implementation of the monitoring system, it is observed that burst bandwidth utilization is much larger than average bandwidth utilization (by up to 5X in desktop applications), that the stream memory accesses of many applications contribute more than 40% of L2 cache misses, and that OS virtual memory management may decrease stream accesses as seen by the memory controller (or L2 cache) by up to 30.2%.
Abstract: Memory trace analysis is an important technology for architecture research, system software (i.e., OS, compiler) optimization, and application performance improvements. Many approaches have been used to track memory traces, such as simulation, binary instrumentation and hardware snooping. However, they usually have limitations of time, accuracy and capacity. In this paper we propose a platform-independent memory trace monitoring system, which is able to track the virtual memory reference trace of full systems (including OS, VMMs, libraries, and applications). The system adopts a DIMM-snooping mechanism that uses hardware boards plugged into DIMM slots to snoop. This approach has several advantages: it is fast, complete, undistorted, and portable. Three key techniques are proposed to address the system design challenges with this mechanism: (1) To keep up with memory speeds, the DDR protocol state machine is simplified, and large FIFOs are added between the state machine and the trace transmitting logic to handle burst memory accesses; (2) To reconstruct physical-to-virtual mapping and distinguish one process's address space from others, an OS kernel module, which collects page table information, and a synchronization mechanism, which synchronizes the page table information with the memory trace, are developed; (3) To dump massive trace data, we employ a straightforward method to compress the trace and use Gigabit Ethernet and RAID to send and receive the compressed trace. We present our implementation of an initial monitoring system, named HMTT (Hyper Memory Trace Tracker). Using HMTT, we have observed that burst bandwidth utilization is much larger than average bandwidth utilization, by up to 5X in desktop applications. We have also confirmed that the stream memory accesses of many applications contribute more than 40% of L2 cache misses, and that OS virtual memory management may decrease stream accesses as seen by the memory controller (or L2 cache) by up to 30.2%. Moreover, we have evaluated OS impact on memory performance in real systems. The evaluations and case studies show the feasibility and effectiveness of our proposed monitoring mechanism and techniques.
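
The physical-to-virtual reconstruction in technique (2) can be sketched as a lookup against a page-table snapshot. The page size, snooped-record layout, and snapshot format below are illustrative assumptions; HMTT's real kernel module and trace synchronization are more involved.

    PAGE_SIZE = 4096   # assumed 4 KiB pages

    # Hypothetical page-table snapshot collected by the OS kernel module:
    # (pid, physical frame number) -> virtual page number
    snapshot = {
        (1234, 0x800): 0x7f001,
        (1234, 0x801): 0x7f002,
    }

    def to_virtual(pid, phys_addr):
        """Map one snooped physical address back to a per-process virtual address."""
        frame, offset = divmod(phys_addr, PAGE_SIZE)
        vpage = snapshot.get((pid, frame))
        if vpage is None:
            return None            # frame not in this process's snapshot
        return vpage * PAGE_SIZE + offset

    # A snooped DDR access at byte 0x10 of physical frame 0x800, attributed to pid 1234:
    print(hex(to_virtual(1234, 0x800 * PAGE_SIZE + 0x10)))   # -> 0x7f001010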

54 citations

01 Jan 2005
TL;DR: A new class of techniques named "Macroscopic Data Structure Analyses and Optimizations" is presented as a new approach to analyzing and optimizing pointer-intensive programs, and a large class of potential applications for the work in fields such as heap safety and reliability, program understanding, distributed computing, and static garbage collection is described.
Abstract: Providing high performance for pointer-intensive programs on modern architectures is an increasingly difficult problem for compilers. Pointer-intensive programs are often bound by memory latency and cache performance, but traditional approaches to these problems usually fail: pointer-intensive programs are often highly irregular, and the compiler has little control over the layout of heap allocated objects. This thesis presents a new class of techniques named “Macroscopic Data Structure Analyses and Optimizations”, which is a new approach to the problem of analyzing and optimizing pointer-intensive programs. Instead of analyzing individual load/store operations or structure definitions, this approach identifies, analyzes, and transforms entire memory structures as a unit. The foundation of the approach is an analysis named Data Structure Analysis and a transformation named Automatic Pool Allocation. Data Structure Analysis is a context-sensitive pointer analysis which identifies data structures on the heap and their important properties (such as type safety). Automatic Pool Allocation uses the results of Data Structure Analysis to segregate dynamically allocated objects on the heap, giving control over the layout of the data structure in memory to the compiler. Based on these two foundation techniques, this thesis describes several performance improving optimizations for pointer-intensive programs. First, Automatic Pool Allocation itself provides important locality improvements for the program. Once the program is pool allocated, several pool-specific optimizations can be performed to reduce inter-object padding and pool overhead. Second, we describe an aggressive technique, Automatic Pointer Compression, which reduces the size of pointers on 64-bit targets to 32-bits or less, increasing effective cache capacity and memory bandwidth for pointer-intensive programs. This thesis describes the approach, analysis, and transformation of programs with macroscopic techniques, and evaluates the net performance impact of the transformations. Finally, it describes a large class of potential applications for the work in fields such as heap safety and reliability, program understanding, distributed computing, and static garbage collection.
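
The effect of Automatic Pool Allocation, keeping all nodes of one logical data structure together so the compiler controls heap layout, can be mimicked at the library level. The pool class and linked-list example below are illustrative assumptions, not the thesis's LLVM-based transformation, which rewrites allocation sites automatically.

    # A toy pool allocator: all nodes of one data structure come from one region,
    # improving spatial locality compared with scattering them across a
    # general-purpose heap.
    class Pool:
        def __init__(self):
            self.objects = []            # stand-in for a contiguous arena

        def alloc(self, obj):
            self.objects.append(obj)     # objects of this structure stay together
            return obj

        def destroy(self):
            self.objects.clear()         # the whole structure is freed at once

    class ListNode:
        def __init__(self, value, nxt=None):
            self.value, self.next = value, nxt

    def build_list(pool, values):
        head = None
        for v in reversed(values):
            head = pool.alloc(ListNode(v, head))
        return head

    list_pool = Pool()                   # one pool per identified data structure
    head = build_list(list_pool, range(5))
    while head:
        print(head.value, end=" ")
        head = head.next
    print()
    list_pool.destroy()                  # pool-level deallocation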

53 citations

Proceedings ArticleDOI
Shlomit S. Pinter, Adi Yoaz
02 Dec 1996
TL;DR: A new hardware-based data prefetching mechanism for enhancing instruction level parallelism and improving the performance of superscalar processors is presented, and a new hardware construct, the program progress graph (PPG), is suggested as a simple extension to the branch target buffer (BTB).
Abstract: We present a new hardware-based data prefetching mechanism for enhancing instruction level parallelism and improving the performance of superscalar processors. The emphasis in our scheme is on the effective utilization of slack time and hardware resources not used for the main computation. The scheme suggests a new hardware construct, the program progress graph (PPG), as a simple extension to the branch target buffer (BTB). We use the PPG for implementing a fast pre-program counter (pre-PC) that travels only through memory reference instructions (rather than scanning all the instructions sequentially). In a single clock cycle the pre-PC extracts all the predicted memory references in some future block of instructions, to obtain early data prefetching. In addition, the PPG can be used for implementing a pre-processor and for instruction prefetching. The prefetch requests are scheduled to "range" with the core requests from the data cache, using only free time slots on the existing data cache tag ports. Employing special methods for removing prefetch requests that are already in the cache (without consuming cache-tag port bandwidth), together with a simple optimization of the cache LRU mechanism, reduces the number of prefetch requests sent to the core-cache bus and to the memory (second-level) bus. Simulation results on the SPEC92 benchmarks for the baseline architecture (32 Kbyte data cache and 12-cycle fetch latency) show an average speedup of 1.36 (CPI ratio).
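
A rough software model of the pre-PC idea: a pointer runs ahead of the main program counter but visits only memory-reference instructions, using a PPG-like table to step from one memory reference to the next. The instruction encoding, branch-free control flow, and lookahead depth below are assumptions for illustration; the actual scheme is a hardware extension of the BTB.

    # Toy instruction stream: (pc, kind, memory address or None). Only loads and
    # stores reference memory; the PPG-like table links each memory-referencing
    # pc directly to the next one, letting the pre-PC skip non-memory instructions.
    program = [
        (0, "load", 0x1000), (1, "add", None), (2, "add", None),
        (3, "load", 0x2000), (4, "store", 0x3000), (5, "add", None),
        (6, "load", 0x4000),
    ]
    mem_pcs = [pc for pc, kind, _ in program if kind in ("load", "store")]
    addr_of = {pc: addr for pc, _, addr in program}
    ppg = dict(zip(mem_pcs, mem_pcs[1:]))    # simplified, branch-free "program progress graph"

    def prefetches_ahead(current_pc, depth=2):
        """Addresses the pre-PC would request while the core still sits at current_pc."""
        pre_pc = next((pc for pc in mem_pcs if pc > current_pc), None)   # run ahead of the core
        addrs = []
        for _ in range(depth):
            if pre_pc is None:
                break
            addrs.append(addr_of[pre_pc])
            pre_pc = ppg.get(pre_pc)         # one PPG step per cycle, memory refs only
        return addrs

    print([hex(a) for a in prefetches_ahead(1)])   # while executing pc 1: prefetch 0x2000, 0x3000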

53 citations

Journal ArticleDOI
01 May 2006
TL;DR: A replacement policy is introduced into direct-mapped cache design, and accesses to underutilized cache sets are increased with the help of programmable decoders, reducing the miss rate of direct-mapped caches by balancing accesses across cache sets.
Abstract: Level one cache normally resides on a processor's critical path, which determines the clock frequency. Direct-mapped caches exhibit fast access time but poor hit rates compared with same-sized set-associative caches due to nonuniform accesses to the cache sets, which generate more conflict misses in some sets while other sets are underutilized. We propose a technique to reduce the miss rate of direct-mapped caches through balancing the accesses to cache sets. We increase the decoder length and thus reduce the accesses to heavily used sets without dynamically detecting the cache set usage information. We introduce a replacement policy to direct-mapped cache design and increase the access to the underutilized cache sets with the help of programmable decoders. On average, the proposed balanced cache, or B-Cache, achieves 64.5% and 37.8% miss rate reductions on all 26 SPEC2K benchmarks for the instruction and data caches, respectively. This translates into an average IPC improvement of 5.9%. The B-Cache consumes 10.5% more power per access but exhibits a 2% total memory-access-related energy saving due to the miss rate reductions and hence the reduction in applications' execution time. Compared with previous techniques that aim at reducing the miss rate of direct-mapped caches, our technique requires only one cycle to access all cache hits and has the same access time as a direct-mapped cache.
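
The imbalance the abstract starts from is easy to visualize: map a toy address trace onto direct-mapped sets and count accesses per set. The trace and cache geometry below are arbitrary assumptions; the point is only that a few sets absorb most accesses while others sit idle, which is the non-uniformity the B-Cache's longer, programmable decoder is designed to even out.

    from collections import Counter

    LINE_SIZE, SETS = 32, 16            # assumed toy geometry

    def set_index(addr):
        return (addr // LINE_SIZE) % SETS

    # Toy trace: a strided array walk that touches every set evenly, plus a hot
    # pair of structures that happen to collide in the same set.
    trace  = [0x1000 + 32 * i for i in range(64)]
    trace += [0x8000, 0xA000] * 50      # both map to set 0 repeatedly

    usage = Counter(set_index(a) for a in trace)
    for s in range(SETS):
        print(f"set {s:2d}: {'#' * (usage[s] // 4)} ({usage[s]})")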

53 citations

References
Journal ArticleDOI
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: This survey examines cache memory design issues. Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.
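
Several of the parameters the survey enumerates (cache size, line size, associativity, replacement policy) can be explored with a small trace-driven model. The sketch below is a generic LRU, demand-fetch, set-associative simulator with assumed parameters, not the survey's methodology.

    from collections import OrderedDict

    class Cache:
        """Set-associative cache with LRU replacement and demand fetch (assumed parameters)."""
        def __init__(self, size_bytes, line_size, assoc):
            self.line_size = line_size
            self.assoc = assoc
            self.num_sets = size_bytes // (line_size * assoc)
            self.sets = [OrderedDict() for _ in range(self.num_sets)]   # tag -> None, LRU order

        def access(self, addr):
            line = addr // self.line_size
            s = self.sets[line % self.num_sets]
            tag = line // self.num_sets
            if tag in s:
                s.move_to_end(tag)        # refresh LRU position
                return True               # hit
            if len(s) >= self.assoc:
                s.popitem(last=False)     # evict least recently used line
            s[tag] = None                 # demand fetch on miss
            return False

    def miss_ratio(cache, trace):
        return sum(not cache.access(a) for a in trace) / len(trace)

    # Two lines that collide in a direct-mapped 8 KiB cache but coexist in a 2-way one:
    trace = [0x0000, 0x2000] * 1000
    for assoc in (1, 2, 4):
        print(f"{assoc}-way: miss ratio {miss_ratio(Cache(8192, 32, assoc), trace):.3f}")

On this ping-pong trace the direct-mapped configuration misses on every access while the 2-way and 4-way configurations almost never miss, the classic conflict-miss trade-off discussed in the survey.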

1,614 citations

01 Jan 1990
TL;DR: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems, concluding that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
Abstract: This note evaluates several hardware platforms and operating systems using a set of benchmarks that test memory bandwidth and various operating system features such as kernel entry/exit and file systems. The overall conclusion is that operating system performance does not seem to be improving at the same rate as the base speed of the underlying hardware.
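
One ingredient of such benchmark suites, a memory-bandwidth test, can be approximated in a few lines. The buffer size, repetition count, and copy-based method below are assumptions for illustration; the note's actual benchmarks are native programs that also time kernel entry/exit and file-system operations.

    import time

    def copy_bandwidth(nbytes=64 * 1024 * 1024, repeats=5):
        """Crude memory-copy bandwidth estimate in MB/s (toy, not the note's benchmark)."""
        src = bytearray(nbytes)
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            dst = bytes(src)              # forces a copy of the whole buffer
            best = min(best, time.perf_counter() - t0)
        assert len(dst) == nbytes
        return nbytes / best / 1e6

    print(f"{copy_bandwidth():.0f} MB/s (copy)")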

467 citations

Journal ArticleDOI
01 Apr 1989
TL;DR: A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism; the average degree of superpipelining metric is introduced, and simulations suggest that this metric is already high for many machines.
Abstract: Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.
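
If the average degree of superpipelining is read as a dynamic-frequency-weighted average of operation latencies (an interpretation assumed here for illustration, not a definition quoted from the paper), it can be computed as below; the operation mix and latencies are invented.

    # Hypothetical dynamic operation mix: name -> (fraction of executed operations, latency in cycles).
    mix = {
        "integer ALU": (0.40, 1),
        "load":        (0.25, 2),
        "store":       (0.15, 1),
        "branch":      (0.15, 2),
        "FP add/mul":  (0.05, 3),
    }
    assert abs(sum(f for f, _ in mix.values()) - 1.0) < 1e-9
    degree = sum(f * lat for f, lat in mix.values())
    print(f"average degree of superpipelining (assumed definition): {degree:.2f}")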

316 citations

Journal ArticleDOI
TL;DR: It is shown that prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
Abstract: Memory transfers due to a cache miss are costly. Prefetching all memory references in very fast computers can increase the effective CPU speed by 10 to 25 percent.
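
The 10 to 25 percent range can be sanity-checked with simple effective-CPI arithmetic. The base CPI, references per instruction, miss rate, and miss penalty below are assumed values, not figures from the paper.

    # Effective CPI = base CPI + memory references per instruction * miss rate * miss penalty.
    base_cpi, refs_per_instr, miss_penalty = 1.0, 1.3, 10   # assumed values

    def speedup(miss_rate, prefetch_coverage):
        """Speedup when prefetching hides `prefetch_coverage` of the miss stalls."""
        cpi_no_pf = base_cpi + refs_per_instr * miss_rate * miss_penalty
        cpi_pf = base_cpi + refs_per_instr * miss_rate * (1 - prefetch_coverage) * miss_penalty
        return cpi_no_pf / cpi_pf

    for coverage in (0.5, 0.8, 1.0):
        print(f"coverage {coverage:.0%}: {speedup(0.02, coverage):.2f}x")

With these assumed numbers, hiding half to all of the miss stalls yields roughly 1.1x to 1.26x, i.e. on the order of the 10 to 25 percent quoted above.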

315 citations

Proceedings ArticleDOI
17 May 1988
TL;DR: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies, and a new inclusion-coherence mechanism for two-level bus-based architectures is proposed.
Abstract: The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative caches which allow different block sizes at different levels of the hierarchy. Three multiprocessor structures with a two-level cache hierarchy (single cache extension, multiport second-level cache, bus-based) are examined. The feasibility of imposing the inclusion property in these structures is discussed. This leads us to propose a new inclusion-coherence mechanism for two-level bus-based architectures.
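
The invariant itself is easy to state in code: every line resident in any first-level cache must lie inside a block resident in the second level. The checker below uses an assumed representation (sets of byte addresses) and only verifies the property; the paper's contribution is the conditions and coherence mechanism that keep it true under different block sizes and replacement decisions.

    def inclusion_holds(l1_caches, l2_blocks, l2_line=128):
        """True if every line resident in any L1 falls inside a resident L2 block.

        l1_caches: list of sets of L1 line start addresses (bytes), one per processor.
        l2_blocks: set of L2 block start addresses (bytes); the L2 block size is
        assumed to be a multiple of the L1 line size, as the paper allows."""
        for l1 in l1_caches:
            for line_addr in l1:
                if (line_addr // l2_line) * l2_line not in l2_blocks:
                    return False
        return True

    l1s = [{0, 32, 64}, {128}]          # two private L1s, 32-byte lines
    l2  = {0, 128}                      # shared L2, 128-byte blocks
    print(inclusion_holds(l1s, l2))     # True: all four L1 lines are covered
    print(inclusion_holds([{256}], l2)) # False: line 256 has no containing L2 block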

236 citations