Reducing Cache Pollution Through Detection and Elimination of Non-Temporal Memory Accesses
Summary
Introduction
- This paper introduces a classification of applications into four cache usage categories.
- The authors also propose a low-overhead method to automatically find the best per-instruction cache management policy.
- When an application shares a multicore with other applications, new performance considerations arise in order to maintain good system throughput.
II. MANAGING CACHES IN SOFTWARE
- Application performance on multicores is highly dependent on the activities of the other cores in the same chip due to contention for shared resources.
- The miss ratio of the non-streaming application decreases as the amount of available cache increases.
- Using r_s and δ, the authors can classify applications based on how they use the cache (sketched in code after this list).
- These applications fit their entire data set in the private cache and are therefore largely unaffected by contention for the shared cache and memory bandwidth.
- The large base miss ratio in such applications is due to memory accesses that touch data that is never reused while it resides in the cache.
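The classification can be pictured as a quadrant test on the base miss ratio r_s and the cache-size sensitivity δ. Below is a minimal sketch of that idea; the thresholds are hypothetical, and only the category names that appear in this summary ("Cache Gobblers", "Gobblers & Victims", "Victims") are taken from the text, so the fourth label is a placeholder.

```c
#include <stdio.h>

/* Hypothetical thresholds -- the paper derives its classification from
 * miss ratio curves; the exact cut-offs are not given in this summary. */
#define HIGH_BASE_MISS   0.05  /* base miss ratio r_s above this is "high" */
#define HIGH_SENSITIVITY 0.01  /* miss ratio drop per extra cache above this is "high" */

/* Classify an application from its base miss ratio (r_s) and its
 * sensitivity to additional cache (delta).  Three category names come
 * from the text; the low/low quadrant label is a placeholder grounded in
 * the "fits in the private cache" observation above. */
const char *classify(double r_s, double delta)
{
    int high_miss = r_s > HIGH_BASE_MISS;
    int sensitive = delta > HIGH_SENSITIVITY;

    if (high_miss)
        return sensitive ? "Gobblers & Victims" : "Cache Gobblers";
    else
        return sensitive ? "Victims" : "Unaffected (fits in private cache)";
}

int main(void)
{
    /* e.g. a streaming application: high base miss ratio, little
     * benefit from extra cache -> Cache Gobblers */
    printf("%s\n", classify(0.20, 0.001));
    return 0;
}
```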
III. CACHE MANAGEMENT INSTRUCTIONS
- Most modern instruction sets include instructions to manage caches.
- Many processors support at least one of these instruction classes.
- Instructions from the second category, forced cache eviction, appear in some form in most architectures.
- The ECB and WH64 instructions are in many ways similar to the caching hints in the previous category, but instead of annotating the load or store instruction, the hints are given after or, in the case of a store, before the memory accesses in question.
- The third category, non-temporal prefetches, is also included in several different ISAs.
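As a concrete illustration, all three instruction classes have counterparts among standard x86 SSE/SSE2 intrinsics. The mapping below is a sketch of that correspondence, not the authors' code; the Alpha ECB/WH64 instructions have no direct x86 equivalent, so forced eviction is shown with x86's clflush.

```c
#include <xmmintrin.h>  /* SSE: _mm_prefetch, _MM_HINT_NTA */
#include <emmintrin.h>  /* SSE2: _mm_stream_si32, _mm_clflush */

void demo(int *dst, const int *src)
{
    /* Category 1 -- caching hint on the access itself: a non-temporal
     * (streaming) store that bypasses the cache hierarchy. */
    _mm_stream_si32(dst, *src);

    /* Category 2 -- forced eviction: flush the line holding *src from
     * every cache level (analogous in spirit to Alpha's ECB). */
    _mm_clflush(src);

    /* Category 3 -- non-temporal prefetch: fetch the line with a hint
     * that it will not be reused (prefetchnta, used later in the paper). */
    _mm_prefetch((const char *)src, _MM_HINT_NTA);
}
```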
IV. LOW-OVERHEAD CACHE MODELING
- A natural starting point for modeling LRU caches is the stack distance [11].
- StatStack is a statistical cache model that models fully associative caches with LRU replacement.
- To estimate the stack distance of the second access to A in Figure 4, the authors sum the estimated likelihoods that the memory accesses executed between the two accesses to A have reuse distances whose corresponding arcs reach beyond the “Out Boundary”.
- StatStack uses this approach to estimate the stack distances of all memory accesses in a reuse distance sample, effectively estimating a stack distance distribution.
- A fourth step is included to take effects from sampled stack distances into account.
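To make the stack distance concrete, the following sketch computes exact stack distances with a naive LRU stack and derives a miss ratio as the fraction of accesses whose stack distance is at least the cache size (the quantity StatStack estimates statistically). The O(n)-per-access cost of this direct approach is exactly what StatStack's sampling avoids.

```c
#include <stdio.h>
#include <stdint.h>

#define MAX_LINES 1024  /* toy bound on distinct cache lines */

static uint64_t stack[MAX_LINES];  /* LRU stack: index 0 = MRU */
static int depth = 0;

/* Return the stack distance of an access to `line` (number of distinct
 * lines touched since the previous access to it; -1 on first touch),
 * then move the line to the MRU position. */
static int access_line(uint64_t line)
{
    int i, dist = -1;
    for (i = 0; i < depth; i++)
        if (stack[i] == line) { dist = i; break; }
    if (dist == -1) i = (depth < MAX_LINES) ? depth++ : depth - 1;
    for (; i > 0; i--) stack[i] = stack[i - 1];
    stack[0] = line;
    return dist;
}

int main(void)
{
    /* Trace A, B, C, A: the second access to A has stack distance 2. */
    uint64_t trace[] = { 1, 2, 3, 1 };
    int misses = 0, n = 4, cache_lines = 2;  /* a 2-line cache */
    for (int k = 0; k < n; k++) {
        int d = access_line(trace[k]);
        /* Miss if never seen before or stack distance >= cache size. */
        if (d < 0 || d >= cache_lines) misses++;
    }
    printf("miss ratio: %.2f\n", (double)misses / n);  /* 4/4 = 1.00 */
    return 0;
}
```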
A. A first simplified approach
- An instruction has non-temporal behavior if all forward stack distances, i.e., the number of unique cache lines accessed between this instruction and the next access to the same cache line, are larger than or equal to the size of the cache (a sketch of this test follows the list).
- Therefore, the authors can use a non-temporal instruction to bypass the entire cache hierarchy for such accesses.
- Most applications, even purely streaming ones that do not reuse data, may still exhibit short temporal reuse, e.g. spatial locality where neighboring data items on the same cache line are accessed in close succession.
- Since cache management is done at a cache line granularity, this clearly restricts the number of possible instructions that can be treated as non-temporal.
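A minimal sketch of this first-cut test, assuming a per-instruction array of sampled forward stack distances; the encoding of "no observed reuse" as a negative value is an assumption:

```c
#include <stddef.h>

/* First-cut test: an instruction is non-temporal if every sampled
 * forward stack distance is at least the cache size in lines.
 * `dists` holds sampled forward stack distances for one instruction;
 * a negative value means no further reuse was observed. */
int is_non_temporal_simple(const long *dists, size_t n, long cache_lines)
{
    for (size_t i = 0; i < n; i++)
        if (dists[i] >= 0 && dists[i] < cache_lines)
            return 0;  /* at least one reuse would have hit: keep caching */
    return 1;
}
```

Because of the spatial-locality effect noted above, few instructions pass this all-or-nothing test, which motivates the refinement in the next subsection.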
B. Refining the simple approach
- Most hardware implementations of cache management instructions allow the non-temporal data to live in parts of the cache hierarchy, such as the L1, before it is evicted to memory.
- The authors assume that whenever a non-temporal memory access touches a cache line, the cache line is installed in the MRU-position of the LRU stack, and a special bit on the cache line, the evict to memory (ETM) bit, is set.
- Whenever a normal memory access touches a cache line, the ETM bit is cleared.
- The authors thus require that at least one stack distance is greater than or equal to d_max, and that the number of stack distances that are greater than or equal to d_ETM but smaller than d_max is below some threshold t_m (sketched after this list).
- In most implementations t_m will not be a single value for all accesses, but will depend on factors such as how many additional cache hits can be created by disabling caching for a memory access.
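A sketch encoding the two conditions above; the parameter and helper names are illustrative:

```c
#include <stddef.h>

/* Refined test: the line may live in the cache levels above the ETM
 * boundary before being evicted to memory, so short reuses (stack
 * distance < d_etm) still hit.  Require at least one reuse beyond d_max
 * (a miss with or without management, so bypassing pays off) and fewer
 * than t_m reuses in [d_etm, d_max) -- each of those would be a hit that
 * the ETM bit turns into an extra miss. */
int is_non_temporal_refined(const long *dists, size_t n,
                            long d_etm, long d_max, size_t t_m)
{
    size_t introduced = 0;  /* hits that would become misses */
    int beyond_dmax = 0;    /* reuse that misses either way */

    for (size_t i = 0; i < n; i++) {
        if (dists[i] >= d_max)
            beyond_dmax = 1;
        else if (dists[i] >= d_etm)
            introduced++;
    }
    return beyond_dmax && introduced < t_m;
}
```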
C. Handling sticky ETM bits
- When the ETM bit is retained for a cache line’s entire lifetime in the cache, the conditions developed in section V-B for a memory-accessing instruction to be non-temporal are no longer sufficient.
- If instruction X sets the ETM bit on a cache line, then the ETM status applies to all subsequent reuses of the cache line as well.
- The sticky ETM bit is only a problem for non-temporal accesses that have forward reuse distances less than d_ETM.
- When Y accesses the cache line it is moved to the MRU position of the LRU stack, and the sticky ETM bit is retained.
- Therefore, instead of applying the non-temporal condition to a single instruction, the authors have to apply it to all instructions reusing the cache line accessed by the first instruction.
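A sketch of the resulting test, reusing is_non_temporal_refined from the previous sketch; the insn_profile record and the idea of a precomputed reuse chain are assumptions made for illustration:

```c
#include <stddef.h>

int is_non_temporal_refined(const long *dists, size_t n,
                            long d_etm, long d_max, size_t t_m);

/* Hypothetical per-instruction record of sampled forward stack
 * distances. */
struct insn_profile {
    const long *dists;
    size_t      n;
};

/* With a sticky ETM bit, instruction chain[0]'s hint also governs every
 * later instruction that touches the same cache line before eviction, so
 * flag it as non-temporal only if every instruction observed to reuse
 * its cache lines also passes the refined test. */
int is_non_temporal_sticky(const struct insn_profile *chain, size_t len,
                           long d_etm, long d_max, size_t t_m)
{
    for (size_t i = 0; i < len; i++)
        if (!is_non_temporal_refined(chain[i].dists, chain[i].n,
                                     d_etm, d_max, t_m))
            return 0;
    return 1;
}
```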
D. Handling sampled data
- To avoid the overhead of measuring exact stack distances, the authors use StatStack to calculate stack distances from sampled reuse distances.
- Sampled stack distances can generally be used in place of a full stack distance trace with only a small decrease in average accuracy.
- There is always a risk of missing some critical behavior: an instruction could be flagged as non-temporal even though it in fact has temporal behavior in some cases, thereby introducing unwanted cache misses.
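One way such a reuse distance sample could be collected is with a small pool of watchpoints, sketched below. The authors' actual sampler is not described in this summary and likely differs; note also that this records reuse distances (counting all intervening accesses), which StatStack then converts into stack distances.

```c
#include <stdint.h>
#include <stdio.h>

#define MAX_WATCH 64

/* One outstanding sample: the cache line picked at access `start`,
 * waiting for its next access to yield a reuse distance. */
struct watch { uint64_t line; uint64_t start; int live; };

static struct watch watches[MAX_WATCH];
static uint64_t nr_accesses;  /* memory accesses seen so far */

/* Feed one memory access; start a new sample every `period` accesses and
 * emit a reuse distance whenever a watched line is touched again.  A real
 * sampler would trap on the reuse instead of checking every access. */
void sample_access(uint64_t line, uint64_t period)
{
    for (int i = 0; i < MAX_WATCH; i++) {
        if (watches[i].live && watches[i].line == line) {
            /* accesses strictly between the pair */
            printf("reuse distance: %llu\n",
                   (unsigned long long)(nr_accesses - watches[i].start - 1));
            watches[i].live = 0;
        }
    }
    if (nr_accesses % period == 0) {
        for (int i = 0; i < MAX_WATCH; i++) {
            if (!watches[i].live) {
                watches[i] = (struct watch){ line, nr_accesses, 1 };
                break;
            }
        }
    }
    nr_accesses++;
}
```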
A. Model system
- To evaluate their model the authors used an x86-based system with an AMD Phenom II X4 920 processor based on the AMD family 10h micro-architecture.
- The processor has four cores, each with private L1 and L2 caches, and a shared L3 cache.
- According to the documentation of the prefetchnta instruction, data fetched using the non-temporal prefetch is not installed in the L2 unless it was fetched from the L2 in the first place.
- Their experiments show that this is not the case.
- The system therefore behaves like the system modeled in section V-C, where the ETM bit is sticky.
B. Benchmark preparation
- The benchmarks were first compiled normally for initial reference runs and sampling.
- Sampling was done on each benchmark running with the reference input set.
- The benchmarks were then recompiled taking this profile into account.
- The assembly output was then modified before it was passed to the assembler.
- Before each non-temporal memory access the script inserted a prefetchnta instruction to the same memory location as the original access.
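The effect of that transformation, expressed here with the equivalent compiler intrinsic rather than raw assembly; the loop is a made-up example of an access the profile flagged as non-temporal:

```c
#include <xmmintrin.h>

/* What the rewriting script achieves: each access flagged as
 * non-temporal gets a prefetchnta to the same address inserted
 * immediately before the original memory access. */
double sum_stream(const double *a, long n)
{
    double s = 0.0;
    for (long i = 0; i < n; i++) {
        /* inserted hint: fetch a[i] with the non-temporal (NTA) hint */
        _mm_prefetch((const char *)&a[i], _MM_HINT_NTA);
        s += a[i];  /* original (non-temporal) load */
    }
    return s;
}
```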
C. Algorithm parameters
- The authors model the cache behavior of their benchmarks using StatStack and a reuse distance sample with 100,000 memory access pairs per benchmark.
- This behavior lets the authors merge the two caches and treat them as one larger LRU stack where each cache level corresponds to a contiguous section of the stack (see the sketch after this list).
- In most cases this is a valid assumption, especially for large caches with a high degree of associativity.
- The authors therefore have to be more conservative when evaluating stack distances within this range.
- The authors use different, conservative values of d_ETM when calculating the number of introduced misses and when handling the stickiness of the ETM bits.
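A sketch of the merged-stack view from the list above, using the Phenom II X4 920 private cache sizes (64 KB L1 data cache and 512 KB L2 per core, 64-byte lines) as illustrative parameters:

```c
/* Because the private caches can be treated as one LRU stack, each cache
 * level occupies a contiguous range of stack positions.  The sizes below
 * are illustrative Phenom II X4 920 figures. */
#define LINE_BYTES 64L
#define L1_LINES   (64L * 1024 / LINE_BYTES)   /* positions [0, L1)     */
#define L2_LINES   (512L * 1024 / LINE_BYTES)  /* positions [L1, L1+L2) */

/* Which private cache level a given stack distance hits in
 * (0 = falls through to the shared L3 / memory side of the model). */
int hit_level(long stack_distance)
{
    if (stack_distance < L1_LINES)            return 1;
    if (stack_distance < L1_LINES + L2_LINES) return 2;
    return 0;
}
```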
D. Benchmarks
- Using the software classification introduced in section II the authors selected two benchmarks representing each category for analysis.
- Applications on the right-hand side of the map, Gobblers & Victims and Cache Gobblers, have a high base miss ratio and store a large amount of non-temporal data in the shared cache.
- Whenever there is a cache miss, a new cache line is installed and another one is replaced.
- Looking at Figure 6a the authors see that libquantum’s replacement ratio is reduced from approximately 20% to 0% in the shared cache, while the miss ratio stays at 20%.
- The authors reclassify their benchmarks based on their new replacement ratio curves; the new classification allows them to predict how applications affect each other after the non-temporal memory accesses are introduced.
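The miss ratio versus replacement ratio distinction can be sketched as follows. This simplified version takes the non-temporal flags as given and ignores that bypassing also shifts other accesses' stack distances, which the authors' modified StatStack accounts for.

```c
#include <stddef.h>

/* For a shared cache of `cache_lines` lines: every access with stack
 * distance >= cache_lines (or no prior access, encoded as negative)
 * misses, but only misses by unmanaged accesses install -- and thus
 * replace -- a line.  Managed non-temporal accesses still miss yet stop
 * polluting the cache, matching libquantum's ~20% miss ratio with a ~0%
 * replacement ratio in Figure 6a. */
void ratios(const long *dist, const int *non_temporal, size_t n,
            long cache_lines, double *miss, double *repl)
{
    size_t misses = 0, replacements = 0;
    for (size_t i = 0; i < n; i++) {
        if (dist[i] < 0 || dist[i] >= cache_lines) {
            misses++;
            if (!non_temporal[i]) replacements++;
        }
    }
    *miss = (double)misses / n;
    *repl = (double)replacements / n;
}
```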
VII. RESULTS AND ANALYSIS
- The results for runs of six different mixes of four SPEC2006 benchmarks running with the reference input set, with and without software cache management are shown in Figure 8 and Figure 9.
- Figure 9 shows five different mixes consisting of two pairs of benchmarks from different categories.
- The speedup is the improvement in IPC over the unmanaged version when running in a mix.
- Applying software cache management pushes the knee to the left, i.e. towards smaller cache sizes, and decreases the miss ratio for systems with between 4MB and 8MB of cache.
- Looking at Figure 9a, Figure 9c and Figure 9e the authors see that running together with applications from these categories causes a significant decrease in IPC compared to when running in isolation.
ACKNOWLEDGMENTS
- The authors would like to thank Kelly Shaw and David Black-Schaffer for valuable comments and insights that have helped to improve this paper.
- This work was financially supported by the CoDeR-MP and UPMARC projects.
Frequently Asked Questions
Q2. What future work is mentioned in the paper "Reducing cache pollution through detection and elimination of non-temporal memory accesses"?
Future work will explore other hardware mechanisms for handling non-temporal data hints from software and possible applications in scheduling.
Q3. How did the authors measure the cycles and instruction counts?
The authors used the performance counters in the processor to measure the cycles and instruction counts using the perf framework provided by recent Linux kernels.
Q4. What implicit assumption do the authors make about how caches can be modeled?
Since the authors are using StatStack, they have made the implicit assumption that caches can be modeled as fully associative, i.e., that conflict misses are insignificant.
Q5. What is the reason for the speedup when running with victims?
The speedup when running with applications from the two victim categories can largely be attributed to a reduction in the total bandwidth requirement of the mix.
Q6. What is the way to manage the cache for these applications?
Managing the cache for these applications is likely to improve throughput, both when they are running in isolation and in a mix with other applications.
Q7. Does hardware allow non-temporal data to bypass the entire cache hierarchy?
Most hardware implementations of cache management instructions allow the non-temporal data to live in parts of the cache hierarchy, such as the L1, before it is evicted to memory.
Q8. What does the stack distance distribution enable for LRU caches?
The stack distance distribution enables the application’s miss ratio to be computed for any given cache size, by simply computing the fraction of memory accesses with a stack distance greater than the desired cache size.
Q9. How can the authors reclassify applications based on their replacement ratios?
Using a modified StatStack implementation, the authors can reclassify applications based on their replacement ratios after applying cache management; this allows them to reason about how cache management impacts performance.
Q10. How can the authors determine whether the next access to the data used by an instruction will be a cache miss?
By looking at the forward stack distances of an instruction, the authors can easily determine whether the next access to the data used by that instruction will be a cache miss, i.e., whether the instruction is non-temporal.