TL;DR: This work presents a detailed analysis of a commercially available NUMA-multicore architecture, the Intel Nehalem, and describes two scheduling algorithms: maximum-local, which optimizes for maximum data locality, and N-MASS, which reduces data locality to avoid the performance degradation caused by cache contention.
Abstract: Multiprocessors based on processors with multiple cores usually include a non-uniform memory architecture (NUMA); even current 2-processor systems with 8 cores exhibit non-uniform memory access times. As the cores of a processor share a common cache, the issues of memory management and process mapping must be revisited. We find that optimizing only for data locality can counteract the benefits of cache contention avoidance and vice versa. Therefore, system software must take both data locality and cache contention into account to achieve good performance, and memory management cannot be decoupled from process scheduling. We present a detailed analysis of a commercially available NUMA-multicore architecture, the Intel Nehalem. We describe two scheduling algorithms: maximum-local, which optimizes for maximum data locality, and its extension, N-MASS, which reduces data locality to avoid the performance degradation caused by cache contention. N-MASS is fine-tuned to support memory management on NUMA-multicores and improves performance up to 32%, and 7% on average, over the default setup in current Linux implementations.
TL;DR: This paper experimentally analyzes the behavior of the memory controllers of a commercial multicore processor, the Intel Xeon 5520 (Nehalem), and develops a simple model to characterize the sharing of local and remote memory bandwidth.
Abstract: Modern multicore processors with an on-chip memory controller form the base for NUMA (non-uniform memory architecture) multiprocessors. Each processor accesses part of the physical memory directly and has access to the other parts via the memory controllers of other processors. These other processors are reached via the cross-processor interconnect. As a consequence, a processor's memory controller must satisfy two kinds of requests: those that are generated by the local cores and those that arrive via the interconnect from other processors. On the other hand, a core (respectively the core's cache) can obtain data from multiple sources: data can be supplied by the local memory controller or by a remote memory controller on another processor. In this paper we experimentally analyze the behavior of the memory controllers of a commercial multicore processor, the Intel Xeon 5520 (Nehalem). We develop a simple model to characterize the sharing of local and remote memory bandwidth. The uneven treatment of local and remote accesses has implications for mapping applications onto such a NUMA multicore multiprocessor. Maximizing data locality does not always minimize execution time; it may be more advantageous to allocate data on a remote processor (and then to fetch these data via the cross-processor interconnect) than to store the data of all processes in local memory (and consequently overload the on-chip memory controller).
TL;DR: Experiments with representative science applications at large scales show that resources harvested on compute nodes can be leveraged to perform useful analytics, significantly improving resource efficiency, reducing data movement costs incurred by alternate solutions, and posing negligible impact on scientific simulations.
Abstract: Severe I/O bottlenecks on High End Computing platforms call for running data analytics in situ. Demonstrating that there exist considerable resources in compute nodes un-used by typical high end scientific simulations, we leverage this fact by creating an agile runtime, termed GoldRush, that can harvest those otherwise wasted, idle resources to efficiently run in situ data analytics. GoldRush uses fine-grained scheduling to "steal" idle resources, in ways that minimize interference between the simulation and in situ analytics. This involves recognizing the potential causes of on-node resource contention and then using scheduling methods that prevent them. Experiments with representative science applications at large scales show that resources harvested on compute nodes can be leveraged to perform useful analytics, significantly improving resource efficiency, reducing data movement costs incurred by alternate solutions, and posing negligible impact on scientific simulations.
64 citations
Cites methods from "Reducing Cache Pollution Through De..."
TL;DR: QoS-Compile is presented, the first compilation approach that statically manipulates application contentiousness to enable the co-location of applications with varying QoS requirements, and as a result, can greatly improve machine utilization.
Abstract: As the class of datacenters recently coined as warehouse scale computers (WSCs) continues to leverage commodity multicore processors with increasing core counts, there is a growing need to consolidate various workloads on these machines to fully utilize their computation power. However, it is well known that when multiple applications are co-located on a multicore machine, contention for shared memory resources can cause severe cross-core performance interference. To ensure that the quality of service (QoS) of user-facing applications does not suffer from performance interference, WSC operators resort to disallowing co-location of latency-sensitive applications with other applications. This policy translates to low machine utilization and millions of dollars wasted in WSCs. This paper presents QoS-Compile, the first compilation approach that statically manipulates application contentiousness to enable the co-location of applications with varying QoS requirements, and as a result, can greatly improve machine utilization. Our technique first pinpoints an application's code regions that tend to cause contention and performance interference. QoS-Compile then transforms those regions to reduce their contentious nature. In essence, to co-locate applications of different QoS priorities, our compilation technique uses pessimizing transformations to throttle down the memory access rate of the contentious regions in low priority applications to reduce their interference with high priority applications. Our evaluation using synthetic benchmarks, SPEC benchmarks, and large-scale Google applications shows that QoS-Compile can greatly reduce contention, improve QoS of applications, and improve machine utilization. Our experiments show that our technique improves applications' QoS performance by 21% and machine utilization by 36% on average.
64 citations
Cites background from "Reducing Cache Pollution Through De..."
TL;DR: This paper investigates the impact of non-uniform memory access (NUMA) for several of Google's key web-service workloads in large-scale production WSCs and reveals surprising tradeoffs between optimizing for NUMA performance and reducing cache contention.
Abstract: Due to the complexity and the massive scale of modern warehouse scale computers (WSCs), it is challenging to quantify the performance impact of individual microarchitectural properties and the potential optimization benefits in the production environment. As a result of these challenges, there is currently a lack of understanding of the microarchitecture-workload interaction, leaving potentially significant performance on the table. This paper argues for a two-phase performance analysis methodology for optimizing WSCs that combines both an in-production investigation and an experimental load-testing study. To demonstrate the effectiveness of this two-phase approach, and to illustrate the challenges, methodologies and opportunities in optimizing modern WSCs, this paper investigates the impact of non-uniform memory access (NUMA) for several of Google's key web-service workloads in large-scale production WSCs. Leveraging a newly-designed metric and continuous large-scale profiling in live datacenters, our production analysis demonstrates that NUMA has a significant impact (10-20%) on two important web-services: Gmail backend and web-search frontend. Our carefully designed load-test further reveals surprising tradeoffs between optimizing for NUMA performance and reducing cache contention.
TL;DR: A new and efficient method of determining, in one pass of an address trace, performance measures for a large class of demand-paged, multilevel storage systems utilizing a variety of mapping schemes and replacement algorithms.
Abstract: The design of efficient storage hierarchies generally involves the repeated running of "typical" program address traces through a simulated storage system while various hierarchy design parameters are adjusted. This paper describes a new and efficient method of determining, in one pass of an address trace, performance measures for a large class of demand-paged, multilevel storage systems utilizing a variety of mapping schemes and replacement algorithms. The technique depends on an algorithm classification, called "stack algorithms," examples of which are "least frequently used," "least recently used," "optimal," and "random replacement" algorithms. The techniques yield the exact access frequency to each storage device, which can be used to estimate the overall performance of actual storage hierarchies.
1,275 citations
"Reducing Cache Pollution Through De..." refers background or methods in this paper
TL;DR: In this article, the authors propose a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources.
Abstract: This paper investigates the problem of partitioning a shared cache between multiple concurrently executing applications. The commonly used LRU policy implicitly partitions a shared cache on a demand basis, giving more cache resources to the application that has a high demand and fewer cache resources to the application that has a low demand. However, a higher demand for cache resources does not always correlate with a higher performance from additional cache resources. It is beneficial for performance to invest cache resources in the application that benefits more from the cache resources rather than in the application that has more demand for the cache resources. This paper proposes utility-based cache partitioning (UCP), a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources. The proposed mechanism monitors each application at runtime using a novel, cost-effective, hardware circuit that requires less than 2kB of storage. The information collected by the monitoring circuits is used by a partitioning algorithm to decide the amount of cache resources allocated to each application. Our evaluation, with 20 multiprogrammed workloads, shows that UCP improves performance of a dual-core system by up to 23% and on average 11% over LRU-based cache partitioning.
1,028 citations
"Reducing Cache Pollution Through De..." refers methods in this paper
TL;DR: A Dynamic Insertion Policy (DIP) is proposed to choose between BIP and the traditional LRU policy depending on which policy incurs fewer misses, and shows that DIP reduces the average MPKI of the baseline 1MB 16-way L2 cache by 21%, bridging two-thirds of the gap between LRU and OPT.
Abstract: The commonly used LRU replacement policy is susceptible to thrashing for memory-intensive workloads that have a working set greater than the available cache size. For such applications, the majority of lines traverse from the MRU position to the LRU position without receiving any cache hits, resulting in inefficient use of cache space. Cache performance can be improved if some fraction of the working set is retained in the cache so that at least that fraction of the working set can contribute to cache hits. We show that simple changes to the insertion policy can significantly reduce cache misses for memory-intensive workloads. We propose the LRU Insertion Policy (LIP), which places the incoming line in the LRU position instead of the MRU position. LIP protects the cache from thrashing and results in close to optimal hit rate for applications that have a cyclic reference pattern. We also propose the Bimodal Insertion Policy (BIP) as an enhancement of LIP that adapts to changes in the working set while maintaining the thrashing protection of LIP. We finally propose a Dynamic Insertion Policy (DIP) to choose between BIP and the traditional LRU policy depending on which policy incurs fewer misses. The proposed insertion policies do not require any change to the existing cache structure, are trivial to implement, and have a storage requirement of less than two bytes. We show that DIP reduces the average MPKI of the baseline 1MB 16-way L2 cache by 21%, bridging two-thirds of the gap between LRU and OPT.
674 citations
"Reducing Cache Pollution Through De..." refers background in this paper
TL;DR: This paper proposes Static RRIP (SRRIP), which is scan-resistant, and Dynamic RRIP (DRRIP), which is both scan-resistant and thrash-resistant; both policies require only 2 bits per cache block and integrate easily into existing LRU approximations found in modern processors.
Abstract: Practical cache replacement policies attempt to emulate optimal replacement by predicting the re-reference interval of a cache block. The commonly used LRU replacement policy always predicts a near-immediate re-reference interval on cache hits and misses. Applications that exhibit a distant re-reference interval perform badly under LRU. Such applications usually have a working set larger than the cache or have frequent bursts of references to non-temporal data (called scans). To improve the performance of such workloads, this paper proposes cache replacement using Re-reference Interval Prediction (RRIP). We propose Static RRIP (SRRIP), which is scan-resistant, and Dynamic RRIP (DRRIP), which is both scan-resistant and thrash-resistant. Both RRIP policies require only 2 bits per cache block and easily integrate into existing LRU approximations found in modern processors. Our evaluations using PC games, multimedia, server and SPEC CPU2006 workloads on a single-core processor with a 2MB last-level cache (LLC) show that both SRRIP and DRRIP outperform LRU replacement on the throughput metric by an average of 4% and 10% respectively. Our evaluations with over 1000 multi-programmed workloads on a 4-core CMP with an 8MB shared LLC show that SRRIP and DRRIP outperform LRU replacement on the throughput metric by an average of 7% and 9% respectively. We also show that RRIP outperforms LFU, the state-of-the-art scan-resistant replacement algorithm to date. For the cache configurations under study, RRIP requires 2X less hardware than LRU and 2.5X less hardware than LFU.
640 citations
"Reducing Cache Pollution Through De..." refers background in this paper
TL;DR: This work proposes a new cache management approach that combines dynamic insertion and promotion policies to provide the benefits of cache partitioning, adaptive insertion, and capacity stealing all with a single mechanism.
Abstract: Many multi-core processors employ a large last-level cache (LLC) shared among the multiple cores. Past research has demonstrated that sharing-oblivious cache management policies (e.g., LRU) can lead to poor performance and fairness when the multiple cores compete for the limited LLC capacity. Different memory access patterns can cause cache contention in different ways, and various techniques have been proposed to target some of these behaviors. In this work, we propose a new cache management approach that combines dynamic insertion and promotion policies to provide the benefits of cache partitioning, adaptive insertion, and capacity stealing all with a single mechanism. By handling multiple types of memory behaviors, our proposed technique outperforms techniques that target only either capacity partitioning or adaptive insertion.
321 citations
"Reducing Cache Pollution Through De..." refers background or methods in this paper
Q1. What are the contributions in "Reducing cache pollution through detection and elimination of non-temporal memory accesses"?
This paper introduces a classification of applications into four cache usage categories. The authors discuss how applications from different categories affect each other's performance indirectly through cache sharing and devise a scheme to optimize such sharing. The authors also propose a low-overhead method to automatically find the best per-instruction cache management policy. The authors demonstrate how the indirect cache-sharing effects of mixed workloads can be tamed by automatically altering some instructions to better manage cache resources.
Q2. What future work is mentioned in the paper "Reducing cache pollution through detection and elimination of non-temporal memory accesses"?
Future work will explore other hardware mechanisms for handling non-temporal data hints from software and possible applications in scheduling.
Q3. How did the authors measure the cycles and instruction counts?
The authors used the processor's performance counters, accessed through the perf framework provided by recent Linux kernels, to measure cycles and instruction counts.
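The measurement code itself is not shown in the paper, but a minimal sketch of counting cycles and instructions through the kernel's perf interface (the perf_event_open syscall underlying the perf framework) might look like the following; the workload placement and error handling are omitted.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>

/* Open one hardware counter for the calling process on any CPU. */
static int open_counter(unsigned long long config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    /* pid = 0 (this process), cpu = -1 (any CPU), no group, no flags */
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    int cyc = open_counter(PERF_COUNT_HW_CPU_CYCLES);
    int ins = open_counter(PERF_COUNT_HW_INSTRUCTIONS);

    ioctl(cyc, PERF_EVENT_IOC_RESET, 0);
    ioctl(ins, PERF_EVENT_IOC_RESET, 0);
    ioctl(cyc, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(ins, PERF_EVENT_IOC_ENABLE, 0);

    /* ... run the workload being measured ... */

    ioctl(cyc, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(ins, PERF_EVENT_IOC_DISABLE, 0);

    long long cycles = 0, instructions = 0;
    read(cyc, &cycles, sizeof(cycles));
    read(ins, &instructions, sizeof(instructions));
    printf("cycles=%lld instructions=%lld\n", cycles, instructions);
    return 0;
}
```

Equivalently, `perf stat -e cycles,instructions ./app` reports the same two counters from the command line.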
Q4. What implicit assumption do the authors make about how caches can be modeled?
Since the authors are using StatStack, they have made the implicit assumption that caches can be modeled as fully associative, i.e., that conflict misses are insignificant.
Q5. What is the reason for the speedup when running with victims?
The speedup when running with applications from the two victim categories can largely be attributed to a reduction in the total bandwidth requirement of the mix.
Q6. What is the benefit of managing the cache for these applications?
Managing the cache for these applications is likely to improve throughput, both when they are running in isolation and in a mix with other applications.
Q7. What is the main advantage of using a non-temporal instruction to bypass the entire cache?
Most hardware implementations of cache management instructions allow the non-temporal data to live in parts of the cache hierarchy, such as the L1, before it is evicted to memory.
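As one concrete example of such cache management instructions (an x86/SSE2 illustration, not necessarily the mechanism evaluated in the paper), non-temporal store intrinsics write data with a hint that it should not be kept in the caches; how far down the hierarchy the data may still travel before reaching memory is implementation-dependent, as noted above.

```c
#include <stddef.h>
#include <emmintrin.h>   /* SSE2: _mm_load_si128, _mm_stream_si128, _mm_sfence */

/* Copy src to dst with non-temporal (streaming) stores: the written lines
 * are hinted not to be kept in the cache.  Both buffers are assumed to be
 * 16-byte aligned and n a multiple of 16 bytes (illustrative simplification). */
void copy_nontemporal(void *dst, const void *src, size_t n)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n / 16; i++) {
        __m128i v = _mm_load_si128(&s[i]);
        _mm_stream_si128(&d[i], v);   /* non-temporal store */
    }
    _mm_sfence();   /* make the streaming stores globally visible */
}
```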
Q8. What is the stack distance distribution for LRU caches?
The stack distance distribution enables the application's miss ratio to be computed for any given cache size by simply computing the fraction of memory accesses with a stack distance greater than the desired cache size.
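A hedged sketch of that computation, assuming a fully associative LRU cache and a stack distance histogram measured in cache lines (the function name and the cold-miss handling are illustrative):

```c
#include <stddef.h>

/* hist[d] = number of accesses with LRU stack distance d, for d = 1..max_dist
 * (the array must have at least max_dist + 1 elements; index 0 is unused);
 * cold = accesses with no earlier reuse; cache_lines = cache size in lines.
 * An access hits iff its stack distance is <= cache_lines, so the miss ratio
 * is the fraction of accesses with a larger distance, plus cold misses. */
double miss_ratio(const unsigned long *hist, size_t max_dist,
                  unsigned long cold, size_t cache_lines)
{
    unsigned long long total = cold, misses = cold;
    for (size_t d = 1; d <= max_dist; d++) {
        total += hist[d];
        if (d > cache_lines)       /* distance > cache size => miss */
            misses += hist[d];
    }
    return total ? (double)misses / (double)total : 0.0;
}
```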
Q9. How can the authors reclassify applications based on their replacement ratios?
Using a modified StatStack implementation, the authors can reclassify applications based on their replacement ratios after applying cache management; this allows them to reason about how cache management impacts performance.
Q10. How can the authors determine if the next access to the data used by an instruction will be a cache miss?
By looking at the forward stack distances of an instruction, the authors can easily determine if the next access to the data used by that instruction will be a cache miss, i.e., whether the instruction is non-temporal.
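Following that reasoning, a simple illustrative rule (the threshold is a hypothetical knob, not a value from the paper) flags an instruction as non-temporal when most of its accesses have forward stack distances larger than the cache:

```c
#include <stddef.h>

/* fwd_dist[i]: forward stack distance (in cache lines) of the i-th sampled
 * access made by one static instruction; use (size_t)-1 for "never reused".
 * The instruction is flagged non-temporal if at least `threshold` of its
 * accesses would miss the next time the data is touched. */
int is_non_temporal(const size_t *fwd_dist, size_t n,
                    size_t cache_lines, double threshold)
{
    size_t misses_next = 0;
    for (size_t i = 0; i < n; i++)
        if (fwd_dist[i] > cache_lines)   /* next access to this data misses */
            misses_next++;
    return n && ((double)misses_next / (double)n >= threshold);
}
```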